Long-term data interoperability
Presented at the High Level Experts Group Meeting on e-Infrastructure for Scientific Data at the European Commission, Brussels, Belgium.
Feb 17-18, 2010
Puneet Kishor

The European Commission wanted to learn about Creative Commons' perspective on building a long-lasting scientific data infrastructure. I gave this presentation emphasizing CC's focus on true and deep interoperability of scientific data.
Creative Commons works to make knowledge sharing legal, easy and scalable
We work in the culture space (music, text, film, art), education (open educational resources, virtual textbooks), and science (biological materials transfer, data sharing, Open Access, semantic web, patents). We craft policy and legal tools to lower the barriers to knowledge sharing.
We believe that barriers to knowledge sharing are lowered by increasing interoperability.
A small menu of appropriate licenses
Some rights reserved vs. all rights reserved
On the one hand we have the traditional copyright system's "one size fits all," and on the other hand we have a plethora of licenses with no easily discernible differences. Finding both options lacking, Creative Commons created a few flexible copyright licenses that are appropriate for different uses.
We have nurtured the porting of those licenses to different jurisdictions around the world.
We have avoided writing licenses that don't work across the full range of legal regimes and types of intellectual property.
We provide a web-based license chooser

We have created the CC Network and the CC Mixter
We have explained those licenses in easy-to-understand terms, providing different versions of the same licenses that are readable by the general public, by lawyers, and by machines.
We have provided a license chooser that allows you to choose and apply a suitable license in just a few clicks of the mouse.
The CC Network includes: OpenID support allowing your CC Network profile to act as an OpenID; ability to identify your works with an official badge; ability to share your story.
ccMixter is a community music site featuring remixes licensed under Creative Commons where you can listen to, sample, mash-up, or interact with music in whatever way you want.
We make sharing scalable in the sense that a few licenses can be used by a half-billion objects on the web (probably double that now, as the number is now eight months old).
Using RDFa (Resource Description Framework in Attributes) to encode our licenses results in machine-parseable licenses.
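To make "machine-parseable" concrete, here is a minimal sketch of how a program might discover the license of a web page from its markup. It assumes the page marks its license link with `rel="license"`, and it uses only Python's standard `html.parser`, so it is not a full RDFa processor. The class name, function name, and sample page are hypothetical and written for this illustration.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a>/<link> elements whose rel includes 'license'."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "license nofollow".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rel_tokens and attr_map.get("href"):
            self.licenses.append(attr_map["href"])


def find_licenses(html):
    """Return the license URLs declared in an HTML fragment."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.licenses


# Hypothetical page fragment, written for this example.
SAMPLE_PAGE = """
<p>This dataset is licensed under a
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution 3.0 License</a>.
</p>
"""

print(find_licenses(SAMPLE_PAGE))
```

Because the license is identified by a URL rather than by free text, a crawler or search engine can filter content by license without a human reading the page.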
We believe the lessons we have learned in making data sharing easy, legal and scalable can also be applied to e-Infrastructures for scientific data in general.
Interoperability is the key concept here. It is the opposite of barriers. As barriers are lowered, interoperability increases.
Interoperability occurs at many levels.
An underlying premise of an infrastructure for data is long term preservation.
In order to ensure accessibility, interoperability has to be a key design objective.
Truly interoperable data will be technologically, semantically and legally interoperable, thereby maximizing the chances for use, and thus, the returns on investment in building the infrastructure.
Problems arise when data are treated as property rather than a shared resource. Works of creative authorship are intellectual property, and can be protected by applying licenses. Data, in particular, raw data, are naturally occurring facts that may be discovered, not created. They have to remain free for the benefit of everyone.
The idea-expression divide is a concept in copyright law which states that copyright does not protect ideas. Only the way in which an idea has been expressed is protectable by copyright.
Some courts have recognized that there are particular ideas that can only be expressed intelligibly in a limited number of ways. In these cases even the expression is unprotected, or protection is limited to verbatim copying only. This is called the merger doctrine in the United States.
This can have a chilling effect on innovation. Businesses hate uncertainty, and not knowing what they might be liable for in the future because of the license of some dataset they used today creates uncertainty.
| Original license (below) may be licensed as → | PD | BY | BY-NC | BY-NC-ND | BY-NC-SA | BY-ND | BY-SA |
|---|---|---|---|---|---|---|---|
| PD | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| BY | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| BY-NC | | | ✓ | ✓ | ✓ | | |
| BY-NC-ND | | | | | | | |
| BY-NC-SA | | | | | ✓ | | |
| BY-ND | | | | | | | |
| BY-SA | | | | | | | ✓ |
Any dataset licensed with a no-derivatives (ND) clause cannot legally be used to create new datasets.
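The logic behind the compatibility chart above can be sketched in a few lines. This is a simplified reading of the chart, not a statement of what the licenses' legal text says: each license is modelled as the set of conditions it imposes, and the three rules in `may_relicense` (a hypothetical helper name) are assumptions chosen to match the chart.

```python
# Each license is modelled as the set of conditions it imposes.
# PD (public domain) imposes none.
LICENSES = {
    "PD": frozenset(),
    "BY": frozenset({"BY"}),
    "BY-NC": frozenset({"BY", "NC"}),
    "BY-NC-ND": frozenset({"BY", "NC", "ND"}),
    "BY-NC-SA": frozenset({"BY", "NC", "SA"}),
    "BY-ND": frozenset({"BY", "ND"}),
    "BY-SA": frozenset({"BY", "SA"}),
}


def may_relicense(original, target):
    """Return True if a derivative of `original` may be released under `target`."""
    orig, new = LICENSES[original], LICENSES[target]
    if "ND" in orig:
        return False        # no derivatives allowed at all
    if "SA" in orig:
        return new == orig  # share-alike: same license only
    return orig <= new      # otherwise every original condition must be kept


# Print, for each original license, the licenses a derivative may carry.
for name in LICENSES:
    allowed = [t for t in LICENSES if may_relicense(name, t)]
    print(f"{name:9} -> {', '.join(allowed) or 'nothing'}")
```

The ND rows come out empty, which is the point made above: a no-derivatives dataset is a dead end for anyone who wants to build a new dataset from it.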
| License matrix | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| PD/CC0 | BY | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | BY-ND | BY-SA | ARR | |
| PD/CC0 | PD/CC0 | BY | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | BY-ND | BY-SA | BY-SA |
| BY | BY | BY | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | BY-ND | BY-SA | |
| BY-NC | BY-NC | BY-NC | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | |||
| BY-NC-ND | BY-NC-ND | BY-NC-ND | BY-NC-ND | BY-NC-ND | BY-NC-ND-SA | BY-NC-ND | |||
| BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | BY-NC-ND-SA | |
| BY-NC-SA | BY-NC-SA | BY-NC-SA | BY-NC-SA | BY-NC-ND-SA | BY-NC-SA | ||||
| BY-ND | BY-ND | BY-ND | BY-NC-ND | BY-NC-ND-SA | BY-ND | ||||
| BY-SA | BY-SA | BY-SA | BY-NC-ND-SA | BY-SA | |||||
| ARR | ARR | ||||||||
When disparate datasets are mixed together, the license of the resulting dataset is as open as the most restrictive license of the component sets. Hence, licensed data tend toward fewer degrees of freedom as they are mixed with other data.
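The "most restrictive wins" rule can be illustrated with a small sketch. It assumes that the license of a mixed dataset is the union of the conditions (BY, NC, ND, SA) of its components, which is how the labels in the matrix above read. The function name `combine` is hypothetical, and the sketch deliberately ignores "all rights reserved" (ARR) and any combinations that are legally incompatible.

```python
KNOWN_CONDITIONS = ["BY", "NC", "ND", "SA"]  # fixed order used when building labels


def combine(*licenses):
    """Union of the conditions of all component licenses ("most restrictive wins")."""
    conditions = set()
    for name in licenses:
        if name == "PD/CC0":
            continue  # public domain adds no conditions
        for part in name.split("-"):
            if part not in KNOWN_CONDITIONS:
                raise ValueError(f"unsupported license: {name}")
            conditions.add(part)
    if not conditions:
        return "PD/CC0"
    return "-".join(c for c in KNOWN_CONDITIONS if c in conditions)


# Conditions only accumulate as more sources are mixed in.
print(combine("PD/CC0", "BY"))
print(combine("BY", "BY-NC"))
print(combine("BY", "BY-NC", "BY-SA"))
```

Under this model no mix can ever have fewer conditions than its most encumbered component, and only a mix made entirely of PD/CC0 sources stays unencumbered.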
| License matrix | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| PD/CC0 | BY | BY-NC | BY-NC-ND | BY-NC-ND-SA | BY-NC-SA | BY-ND | BY-SA | ARR | |
| PD/CC0 | |||||||||
| BY | ✗ | ||||||||
| BY-NC | ✗ | ✗ | ✗ | ||||||
| BY-NC-ND | ✗ | ✗ | ✗ | ||||||
| BY-NC-ND-SA | ✗ | ||||||||
| BY-NC-SA | ✗ | ✗ | ✗ | ✗ | |||||
| BY-ND | ✗ | ✗ | ✗ | ✗ | |||||
| BY-SA | ✗ | ✗ | ✗ | ✗ | ✗ | ||||
| ARR | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | |
When data are treated as a networked, shared resource, users are encouraged to tap into data sources rather than copy them. This circumvents the issues triggered by copying. Note that this is applicable to large datasets which would be impractical to copy and replicate because of their large size.
The protocol is motivated by interoperability of scientific data. The volume of scientific data, and their interconnectedness, makes integration a necessity. For example, life scientists must integrate data from across biology and chemistry to comprehend disease and discover cures, and climate change scientists must integrate data from wildly diverse disciplines to understand our current state and predict the impact of new policies.
The technical challenge of such integration is significant. The forest of terms and conditions around data makes integration difficult legally. One approach might be to develop and recommend a single license: any data with this license can be integrated with any other data under this license.
But this approach, which implicitly builds on intellectual property rights and the ideas of licensing as understood in software and culture, is difficult to scale for scientific uses. There are too many databases under too many terms already, and it is unlikely that any one license or suite of licenses will have the correct mix of terms to gain critical mass and allow massive-scale machine integration of data.
Therefore we instead lay out principles for open access data and a protocol for implementing those principles, and we distribute an Open Access Data Mark and metadata for use on databases and data available under a successful implementation of the protocol.
Requesting and encouraging a behaviour such as citation through norms and terms of use is preferable to imposing it as a legal requirement based on copyright or contracts.
We are aware that different disciplines and jurisdictions call for different approaches, and that no single solution fits all. Because behaviour is requested through norms and terms of use rather than a legal construct, scientific disciplines can develop their own norms for citation, allowing for legal certainty without constraining one community to the norms of another.
Data growth data courtesy IDC. 2007. The Expanding Digital Universe. EMC Corporation
Large Hadron Collider information from WLCG Worldwide LHC Computing Grid
Find landlocked countries with population more than 15 million, sorted by population. 392,000 results.

Example courtesy Feigenbaum and Prud'hommeaux. 2009. SPARQL By Example. http://www.cambridgesemantics.com/2008/09/sparql-by-example/ accessed Feb 9, 2010.
Find landlocked countries with population more than 15 million, sorted by population. 8 results.
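```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>

# Landlocked countries with more than 15 million people, largest first
SELECT ?country_name ?population
WHERE {
    ?country a type:LandlockedCountries ;
             rdfs:label ?country_name ;
             prop:populationEstimate ?population .
    # Keep only English labels and populations above the threshold
    FILTER (
        ?population > 15000000 &&
        langMatches(lang(?country_name), "EN")
    ) .
}
ORDER BY DESC(?population)
```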
| country_name | population |
|---|---|
| Ethiopia | 82825000 |
| Uganda | 32710000 |
| Nepal | 29331000 |
| Afghanistan | 28150000 |
| Uzbekistan | 27606007 |
| Burkina Faso | 15757000 |
| Niger | 15290000 |
| Malawi | 15263000 |
Semantic interoperability allows mixing structured and non-structured data. Humans can retrieve information using familiar syntax, and computers can extract the same information programmatically, thereby increasing the payback from investments in the repository.
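As a sketch of what "programmatically" can look like, the snippet below sends a SPARQL query over HTTP and flattens the standard SPARQL JSON results format into plain rows. It uses only the Python standard library. The helper names are hypothetical, the public DBpedia endpoint is assumed to be reachable (which is not guaranteed), and the `SAMPLE` document is a hand-written fixture based on the first row of the result table above so that the parsing step can be tried without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"  # public endpoint; availability not guaranteed


def build_query_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL (the protocol's 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query})


def rows_from_results(document):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in document["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query and return the rows. Requires network access."""
    request = Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return rows_from_results(json.load(response))


# Hand-written fixture in the SPARQL JSON results format (not live data).
SAMPLE = {
    "head": {"vars": ["country_name", "population"]},
    "results": {"bindings": [
        {"country_name": {"type": "literal", "xml:lang": "en", "value": "Ethiopia"},
         "population": {"type": "literal", "value": "82825000"}},
    ]},
}

print(rows_from_results(SAMPLE))
```

Calling `run_query(DBPEDIA_ENDPOINT, query_text)` with the query text shown earlier would perform the live request; current results may differ from the 2010 table.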
The cost of semantically structuring data will drop so that it will become possible for everyone to provide and consume data easily.
But, getting there is not automatic or easy. We have to make conscious decisions today to get there tomorrow.

https://proteomecommons.org/tranche/examples/sciencecommons/choose.jsp
ProteomeCommons is a public proteomics database for annotations and other information linked to the Tranche data repository and to other resources. It provides public access to free, open-source proteomics tools and data.
The ProteomeCommons.org Tranche network is a cloud of computers to which one can upload files and from which one can download them. All files uploaded to the network are replicated several times to protect against their accidental loss. Files uploaded to the network can be of any size, can be of any file type, and can be encrypted with a passphrase of your choosing.
ProteomeCommons makes available all its data only under a CC0 waiver.
The Tropical Disease Initiative aims to provide a "kernel" for open source drug discovery. Such a kernel should allow scientists from laboratories, universities, institutes, and corporations to work together for a common cause: finding new drugs against tropical diseases such as malaria or tuberculosis.
The TDI kernel (v1.0) includes 297 potential drug targets against the 10 selected genomes and is freely and publicly accessible.
The SIDER Side Effect Resource represents an effort to aggregate dispersed public information on side effects. To our knowledge, no such resource exists in machine-readable form despite the importance of research on drugs and their effects.
The mission of the Personal Genome Project is to encourage the development of personal genomics technology and practices that: are effective, informative, and responsible; yield identifiable and improvable benefits at manageable levels of risk; and are broadly available for the good of the general public.
To achieve this mission we will build a framework for prototyping and evaluating personal genomics technology and practices at increasing scales.
The Personal Genome Project is committed to making research data from the PGP freely available to the public under a CC0 waiver.
Since 2004, WisconsinView has made aerial photography and satellite imagery of Wisconsin available to the public for free over the web. As part of the AmericaView consortium, WisconsinView supports access and use of these imagery collections through education, workforce development, and research.
Starting June 30, 2009, WisconsinView is making available all of its more than 6 Terabytes of imagery data under the CC0 Protocol provided by Creative Commons.
The MichiganView consortium makes available aerial photography and satellite imagery of Michigan to the public for free over the Web. As part of the AmericaView consortium, MichiganView supports access and use of these imagery collections through education, workforce development, and research.
Starting Jan 28, 2010, MichiganView is making available all of its more than 93 Gigabytes of Landsat 5 and 7, and NAIP imagery data in the public domain using the CC0 Waiver provided by Creative Commons.
Stating the design principles will allow one to develop an e-Infrastructure that has been built from the inside out to meet those objectives.
The only reason we put data in a computer is so we can take them out again. Data that are easier to get and work with get reused more.
It is important to have indexes against which the success of an e-Infrastructure can be measured. These indexes allow one to use resources most efficiently.
The success of an e-Infrastructure can be measured against its objectives: Is it easy to put data in? Can data be kept securely for the long term? Can private data be kept private and public data be easily accessible? Is it easy to take data out? Are the conditions under which the data may be used clear to understand and implement? Can data be retrieved programmatically?
technology + law + meaning + community working together for a successful e-Infrastructure for scientific data
A successful e-Infrastructure requires many components: technology, a legal framework, meaningful structure, and, more important than anything, a community that nourishes and uses the data.
The community that uses open data is a varied one: researchers, educators, students, government agencies, entrepreneurs, established businesses, and hackers. They don't have an established identity in common except for a common need for unencumbered data. This group has to be nurtured.
It is expensive to make data open and available, and expensive to create a long-lasting e-Infrastructure. It is even more expensive to not do it. The old adage fits perfectly: if you think education is expensive, try ignorance.