Research by Subject: Data: Open Data

Open Data Repositories

SEARCHABLE LISTS OF DATABASES

Registry of Research Data Repositories (re3data.org)

A global registry of research data repositories from different academic disciplines.

A curated, annotated list of databases, with data policies and metadata standards.

Nature Journals recommended data repositories list

A curated list that includes both generalist repositories and specialized, discipline-specific repositories. The listed repositories meet the Nature Journals requirements for data access, preservation and stability.

Examples of generalist data repositories

Other data repositories

Inter-University Consortium for Political and Social Research (ICPSR)

(Note: Bucknell is an ICPSR member institution. Link to Bucknell when creating your user account to access membership benefits.)

ICPSR maintains a searchable data archive of research in the social and behavioral sciences. It hosts specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.
ICPSR users can examine and compare variables across studies using the Social Science Variables Database (SSVD).
ICPSR also provides leadership and training in data access, curation, and methods of analysis for the social science research community, including resources for teachers and students.

"Moving Beyond the Title: Evaluating The Data You Find": An ICPSR video tutorial on finding and evaluating datasets to answer a research question (July 2020).

Kaggle

Kaggle maintains a collection of publicly available datasets and data-analysis code on a range of topics, with simple metrics of data and documentation quality, for use in data science training and machine learning projects.
Kaggle also hosts real-life data-analysis tasks and competitions around the datasets.
Users have access to a customizable Jupyter Notebooks environment.
A global online community around data science and machine learning, from students to experts.

U.S. government's open data

Home of the U.S. government's open data: https://www.data.gov/
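For readers who want to search the catalog from a script, here is a minimal Python sketch. It assumes the catalog is served through the CKAN "package_search" action at catalog.data.gov, which is not stated on this page; the function names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the data.gov catalog exposes the CKAN "package_search" action here.
CATALOG_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Return a catalog search URL for a free-text query."""
    return CATALOG_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def dataset_titles(response):
    """Extract dataset titles from a decoded package_search response."""
    if not response.get("success"):
        return []
    return [item["title"] for item in response["result"]["results"]]


def search_datasets(query, rows=5):
    """Query the catalog over the network and return matching titles."""
    with urlopen(build_search_url(query, rows), timeout=30) as reply:
        return dataset_titles(json.load(reply))
```

Calling search_datasets("air quality") would return up to five dataset titles, provided the endpoint behaves as assumed.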

Census data

Data about the American people
Searchable data from the Census survey (every 10 years; most recently in 2020), available in customizable summary tables and maps
Searchable data from the American Community Survey (yearly), available in customizable summary tables and maps
No individual-level data are provided, for privacy reasons; the geographic resolution of the available summary data is also limited to protect privacy
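The summary tables can also be requested programmatically. The sketch below is one possible approach, assuming the Census Bureau Data API endpoint for the 2020 redistricting file ("dec/pl") and its total-population variable P1_001N; neither is named on this page, so check both against the Census Bureau's own documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumptions: endpoint for the 2020 Decennial Census redistricting file,
# queried with variables such as NAME and P1_001N (total population).
CENSUS_URL = "https://api.census.gov/data/2020/dec/pl"


def build_census_url(variables, geography):
    """Build a query URL, e.g. variables=["NAME", "P1_001N"], geography="state:*"."""
    query = urlencode({"get": ",".join(variables), "for": geography}, safe=",:*")
    return f"{CENSUS_URL}?{query}"


def rows_to_records(table):
    """Convert the API's list of lists (header row first) into dictionaries."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]
```

The API returns summary rows only (one per geography), which matches the privacy note above: there are no person-level records to retrieve.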

Economic Census data

Economic data about American businesses
Searchable data from the Economic Census survey (every 5 years, most recently in 2017)

Examples of qualitative and text-based data repositories

Qualitative Data Repository

The Qualitative Data Repository (QDR) is a dedicated archive for storing and sharing digital data (and accompanying documentation) generated or collected through qualitative and multi-method research in the social sciences.
QDR provides leadership and training in managing, archiving, sharing, reusing, and citing qualitative data, and works to develop and publicize common standards and practices for these activities.

The Corpus of Contemporary American English (COCA)

Contains more than one billion words of text (25+ million words each year, 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, TV and movie subtitles, blogs, and other web pages.
Three main ways to search the corpus (including comparisons between genres and years):
- Frequency list
- Search by individual word
- Search for phrases and strings

Perseus Digital Library

Perseus Digital Library (Perseus 4.0) is a comprehensive and continually expanding digital library, with a mission to create and make accessible the full record of humanity, including linguistic sources, physical artifacts, and historical spaces.
The flagship collection covers the history, literature and culture of the Greco-Roman world, and classical Greek and Latin.
Other collections include Arabic, Germanic, 19th-Century American, and Renaissance materials.

Museum collections

Open Access at The Met

In 2017, the Metropolitan Museum of Art (The Met) made all images of public-domain works in its collection available under the Creative Commons Zero (CC0) license, which allows unrestricted use, sharing, and remixing. The change reflects The Met's commitment to increasing access to the collection in a digital age.
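The Met also publishes collection records through a public API. The following Python sketch shows one way to confirm that a record is flagged as public domain before reusing its image. It assumes the object endpoint at collectionapi.metmuseum.org and the record fields isPublicDomain and primaryImage; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: The Met Collection API object endpoint.
MET_OBJECT_URL = "https://collectionapi.metmuseum.org/public/collection/v1/objects/{object_id}"


def object_url(object_id):
    """Return the record URL for one object ID."""
    return MET_OBJECT_URL.format(object_id=object_id)


def reusable_image(record):
    """Return the image URL if the record is flagged public domain, else None."""
    if record.get("isPublicDomain") and record.get("primaryImage"):
        return record["primaryImage"]
    return None


def fetch_record(object_id):
    """Download one object record over the network."""
    with urlopen(object_url(object_id), timeout=30) as reply:
        return json.load(reply)
```

A typical use would be reusable_image(fetch_record(some_id)), which yields an image URL only for works the museum marks as public domain.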

Open Access at the Cleveland Museum of Art

The Cleveland Museum of Art became an Open Access institution in 2019. All the images of public-domain works in the collection are available under the Creative Commons Zero (CC0) license, and can be used, shared, and remixed without restrictions. In addition, portions of the collection information (metadata) for more than 61,000 artworks, both those in the public domain and those with copyright or other restrictions, are now available.

Bucknell Digital Commons

Bucknell Digital Commons, a service of Bucknell University Libraries, is an institutional repository that brings together all of Bucknell University's research and scholarship under one umbrella, with the aim of preserving and providing access to that research and scholarship.

The research and scholarly output included in Bucknell Digital Commons is selected and deposited by the individual university departments and centers on campus. The repository is an excellent vehicle for working papers or copies of published articles and conference papers, as well as presentations, senior theses, and other works not published elsewhere.

Submit your research to Bucknell Digital Commons

Most research can be submitted electronically. Click on the link above to submit your research. Some publications do not allow authors to submit directly. In these cases, you will be provided with a mail form to contact the appropriate administrator for further instruction.

Open Data Standards

Open data and content can be freely used, modified, and shared by anyone for any purpose.

The Open Definition (a project of the Open Knowledge Foundation) defines in detail the meaning of “open” with respect to knowledge, promoting a robust commons in which anyone may participate, and interoperability is maximized.

1. Open Works

An open work must satisfy the following requirements in its distribution:

Open License or Status: The work must be in the public domain or provided under an open license.

Access: The work must be provided as a whole and at no more than a reasonable one-time reproduction cost, and should be downloadable via the Internet without charge.

Machine Readability: The work must be provided in a form readily processable by a computer and where the individual elements of the work can be easily accessed and modified.

Open Format: The work must be provided in an open format. An open format is one which places no restrictions, monetary or otherwise, upon its use and can be fully processed with at least one free/libre/open-source software tool.

2. Open Licenses

A license is open if its terms satisfy the following conditions:

Required Permissions: The license must irrevocably allow use, redistribution, modification, and compilation for any purpose. The license must not restrict anyone from making use of the work in a specific field of endeavor. The license must not impose any fee arrangement, royalty, or other compensation or monetary remuneration as part of its conditions.

Acceptable Conditions: The license may require distributions of the work to: include attribution of contributors, rights holders, sponsors, and creators as long as any such prescriptions are not onerous; and to remain under the same license or a similar license; among other conditions.

Open Definition 2.1, Open Knowledge Foundation, http://opendefinition.org/od/2.1/en/, accessed on August 12, 2019.

Open Data License

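The Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0) is a Creative Commons license commonly used for scientific data. Note that its NonCommercial condition restricts one kind of use, so it does not satisfy every Required Permission of the Open Definition described above.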

Under this license, you are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

The licensor cannot revoke these freedoms as long as you follow these license terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial — You may not use the material for commercial purposes.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
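The Attribution term above asks for three things: credit, a link to the license, and a note about changes. As an illustration only, the Python sketch below assembles such a credit line. The wording of the sentence is one reasonable choice, not a form prescribed by the license, and the function name is hypothetical.

```python
LICENSE_NAME = "CC BY-NC 4.0"
LICENSE_URL = "https://creativecommons.org/licenses/by-nc/4.0/"


def attribution(title, author, source_url, modified=False):
    """Build a credit line with title, author, source, license link, and change note."""
    note = " Changes were made." if modified else ""
    return (
        f'"{title}" by {author} ({source_url}) is licensed under '
        f"{LICENSE_NAME} ({LICENSE_URL}).{note}"
    )
```

Passing modified=True appends the change notice, which covers the "indicate if changes were made" part of the term.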

Further reading: "Creative Commons Licenses: How to Choose the Best CC License" blog post (June 2021).

FAIR Data Principles

The FAIR Data Principles are intended to make digital data more Findable, Accessible, Interoperable, and Reusable, including by computational systems (since we increasingly rely on computational support to work with data).

Findable
The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.

F1. (Meta)data are assigned a globally unique and persistent identifier

F2. Data are described with rich metadata (defined by R1 below)

F3. Metadata clearly and explicitly include the identifier of the data they describe

F4. (Meta)data are registered or indexed in a searchable resource

Accessible
Once users find the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol

A1.1 The protocol is open, free, and universally implementable

A1.2 The protocol allows for an authentication and authorisation procedure, where necessary

A2. Metadata are accessible, even when the data are no longer available
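A DOI resolved over HTTPS is a familiar case of A1: a persistent identifier retrieved through an open, standard protocol. The Python sketch below prepares such a request for the DOI of the Wilkinson et al. article cited on this page. It assumes the DOI resolver honours content negotiation for the CSL JSON media type, which depends on the registration agency; the function name is illustrative.

```python
from urllib.request import Request

# Assumption: the resolver returns citation metadata for this media type.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi):
    """Prepare an HTTPS request asking the DOI resolver for machine-readable metadata."""
    return Request("https://doi.org/" + doi, headers={"Accept": CSL_JSON})
```

The prepared request can be sent with urllib.request.urlopen(metadata_request("10.1038/sdata.2016.18")); no special client or proprietary protocol is needed, which is the point of A1.1.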

Interoperable
The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (Meta)data use vocabularies that follow FAIR principles

I3. (Meta)data include qualified references to other (meta)data

Reusable
The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.

R1. (Meta)data are richly described with a plurality of accurate and relevant attributes

R1.1. (Meta)data are released with a clear and accessible data usage license

R1.2. (Meta)data are associated with detailed provenance

R1.3. (Meta)data meet domain-relevant community standards
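A few of the principles above can be spot-checked automatically on a metadata record. The Python sketch below is a rough illustration, not an official FAIR assessment: the field names (identifier, license, provenance) are hypothetical, and most principles still require human judgment.

```python
# Hypothetical mapping from a FAIR principle to the metadata field that supports it.
REQUIRED_FIELDS = {
    "F1": "identifier",   # globally unique, persistent identifier
    "R1.1": "license",    # clear data usage license
    "R1.2": "provenance", # where the data came from
}


def missing_fair_fields(metadata):
    """Return the principles whose supporting field is absent or empty."""
    return [
        principle
        for principle, field in REQUIRED_FIELDS.items()
        if not metadata.get(field)
    ]
```

An empty result means only that these three fields are filled in; it says nothing about whether their contents are accurate or meet community standards.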

Sources:

FAIR Principles. GO Fair website. Retrieved from https://www.go-fair.org/fair-principles/ on July 31, 2020.

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18.

Open Data: References and Resources

Rethinking Research Data

Kristin Briney argues for publishing both the research findings and the research data, in order to improve the quality, reproducibility, and impact of research. "Data, or it didn't happen!" TEDxUWMilwaukee, 2015. Length: 15 minutes.

Open Data: Keep Learning

Open Data: Unleashing Hidden Value (LinkedIn Learning)

An online course on LinkedIn Learning with transcripts, exercise files, and self-assessment quizzes. Governments around the world are discovering the value and responsibility in making the data they collect and store easily available to anyone who wants to access it. Making the decision to open up data sets is a strategic choice that requires detailed tactics. There are processes and technologies to make data accessible while minimizing risk. If you want to start opening up your organization's data to enable transparency and catalyze innovation, or use open data to drive analysis and make more informed decisions, this course is for you. The course introduces real-world use cases for open data, the steps you need to take to develop and operationalize an open data program, and ways to measure the value of open data.

Length: 1 hour, 10 minutes. (Free access with Bucknell login.)

Data Services Specialist

Katie Akateh

She/Her/Hers

Contact:

Bertrand Library
Research Help Area, Room 107A

ka025@bucknell.edu

Subjects: Data Services