The All of Us Research Program, part of the National Institutes of Health, is a historic effort to collect and study data from a diverse cohort of one million or more participants living in the United States. Data is collected from participants through their electronic health records, surveys, wearable devices, and biospecimens. Researchers can explore publicly available data through the data browser, or apply to access the Researcher Workbench, which provides tools for analyzing restricted data within a secure cloud platform.

For more on obtaining access to All of Us data, please visit: HSHSL All of Us LibGuide

National COVID Cohort Collaborative (N3C)

The National COVID Cohort Collaborative (N3C) Data Enclave was launched by the National Center for Advancing Translational Sciences (NCATS) and the National Center for Data to Health (CD2H), in partnership with experts from Observational Health Data Sciences and Informatics (OHDSI), PCORnet, the Accrual to Clinical Trials (ACT) network, and TriNetX. The N3C aims to aggregate, harmonize, and make accessible vast amounts of clinical data nationwide to accelerate COVID-19 research and clinical care. With the uncertainty of the COVID-19 global pandemic, the scientific community and the Clinical and Translational Science Awards (CTSA) Program created the N3C as a partnership to overcome technical, regulatory, policy, and governance barriers to harmonizing and sharing individual-level clinical data.

For more on accessing N3C data, read: National COVID Cohort Collaborative (N3C) Data Now Available to UMB Researchers

UMB HSHSL has institutional memberships with the following two data repositories:

ICPSR

ICPSR is a massive archive of high-quality, curated social, political, and behavioral data. Many smaller archives and collections live under the ICPSR umbrella. Many of these collections contain a mix of public use and restricted data (which may require additional steps to access). A few of the many collections and series of interest for health researchers include:

Qualitative Data Repository

The Qualitative Data Repository focuses on data from qualitative and mixed-methods social science research. Small but growing fast!

UMB Data Catalog

Although not a repository itself, the UMB Data Catalog facilitates the discovery of datasets and research products created or used by UMB researchers. Each dataset record includes a description, keywords, contact information, links to associated articles, and more.

NIH Repositories and Datasets

NIH Repositories

To help researchers locate an appropriate repository for sharing or accessing data, the NIH maintains lists of domain-specific and generalist data sharing repositories, each described by several properties.

NLM Dataset Catalog

The NLM Dataset Catalog is a freely available catalog of over 80,000 biomedical datasets available from various repositories. Adhering to FAIR data management principles, the Dataset Catalog allows users to search for specific biomedical datasets and navigate among biomedical datasets by linking descriptive data.

NCBI Datasets

The NCBI Datasets platform contains the National Center for Biotechnology Information's gene sequence data and metadata, including gene, genome, and taxonomy indexes. It is possible to search using a variety of identifiers, including organism names, assembly, WGS, or BioProject accessions.

NICHD Data and Specimen Hub (DASH)

The NICHD Data and Specimen Hub (DASH) is a centralized resource that allows researchers to share and access de-identified data from studies funded by NICHD. DASH also serves as a portal for requesting biospecimens from selected DASH studies.

US Healthcare Datasets

U.S. Census Bureau

Find Census data through the Census Bureau’s website.
IPUMS: Contains census and survey microdata from the US and around the world. IPUMS integration and documentation make it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community contexts.

Centers for Disease Control and Prevention (CDC)

The CDC provides access to a number of data products, including quick facts, vital statistics, disease surveillance, and large population and provider surveys, along with a number of online tools for exploring the data.

Browse and access public use datasets through the CDC’s data catalog
You can also query public use data through CDC WONDER
See a list of downloadable public data files available through the National Center for Health Statistics
Learn more about accessing restricted CDC NCHS data
Behavioral Risk Factor Surveillance System
Just need some quick numbers? Check out FastStats

Agency for Healthcare Research and Quality Data Tools

The AHRQ Data Tools website provides a centralized platform to access healthcare data from various AHRQ programs.

International Healthcare Datasets

The Demographic and Health Surveys (DHS) Program

The Demographic and Health Surveys (DHS) Program has collected, analyzed, and disseminated accurate and representative data on population, health, HIV and nutrition through more than 400 surveys in over 90 countries.

The World Health Organization Data Collections

The World Health Organization manages and maintains a wide range of data collections related to global health and well-being, as mandated by its Member States.

UNICEF Data

UNICEF Data is the global go-to for data on children. It leads the collection, validation, analysis, use and communication of the most statistically sound, internationally comparable data on the situation of children and women around the world. Browse datasets on child survival and health, child nutrition, maternal health, water and sanitation, education, child protection, HIV/AIDS, and MDG (Millennium Development Goals) monitoring.

World Bank Group - Health Nutrition and Population Statistics DataBank

World Bank Group's Health Nutrition and Population Statistics DataBank provides key health, nutrition and population statistics gathered from a variety of international and national sources. Topics include global surgery, health financing, HIV/AIDS, immunization, infectious diseases, medical resources and usage, noncommunicable diseases, nutrition, population dynamics, reproductive health, universal health coverage, and water and sanitation.

Genetics Data

GenBank®

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. The GenBank® database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank® data.

Search GenBank for sequence identifiers and annotations with Nucleotide.

Gene

Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.

Gene Expression Omnibus

Gene Expression Omnibus is a public functional genomics data repository supporting MIAME-compliant submissions of array- and sequence-based data. GEO archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community.

Gene Expression Omnibus Datasets

The GEO DataSets database stores curated gene expression DataSets, as well as original Series and Platform records in the Gene Expression Omnibus (GEO) repository. Enter search terms to locate experiments of interest. DataSet records contain additional resources including cluster tools and differential expression queries.

Database of Genomic Structural Variation (dbVar)

dbVar is NCBI's database of human genomic structural variation: large variants (>50 bp), including insertions, deletions, duplications, inversions, mobile elements, translocations, and complex variants. For more information on structural variation, see the Overview of Structural Variation page.

Database of Genotypes and Phenotypes (dbGaP)

The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans.

Database of Single Nucleotide Polymorphisms (dbSNP)

dbSNP contains human single nucleotide variations, microsatellites, and small-scale insertions and deletions along with publication, population frequency, molecular consequence, and genomic and RefSeq mapping information for both common variations and clinical mutations.

Clinical Data

Vivli

Vivli is an independent general access data repository and search engine through which individual participant-level data and metadata from clinical trials conducted by researchers in academic, industry, foundation, and non-profit entities can be identified, hosted, shared, and analyzed.

Vivli is focused on sharing individual participant-level data from completed clinical trials to serve the international research community.

YODA (Yale University Open Data Access)

The YODA Project is a Yale University project to promote open data in clinical research, in collaboration with Johnson & Johnson, Medtronic, Inc., Queen Mary University of London, and SI-BONE.

COVID-19 Data

Many sources listed in this guide contain special COVID-19 collections. Below are a few additional sources of COVID-19 data.

New York Times

The New York Times maintains a repository of data on coronavirus cases and deaths in the U.S.

CORD-19

CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It's curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research.

Health and Aging Data

Health and Retirement Study

The University of Michigan Health and Retirement Study (HRS) is a longitudinal panel study that surveys a representative sample of approximately 20,000 Americans over the age of 50, providing a comprehensive look at the changing experiences of older Americans on a range of topics including physical and mental health history and status, cognition, family structure, health care utilization and costs, financial status, and employment history and retirement plans. Interviews are supplemented with information from physical measurements, biomarker and genetic data, and linkages to administrative data.

National Health and Aging Trends Study

The National Health and Aging Trends Study contains interview data from a nationally representative sample of Medicare beneficiaries ages 65 or older.

Health Care Costs

Medicare/Medicaid

Public Use data from the Centers for Medicare and Medicaid Services can be found through their data catalog.

The Pharmaceutical Research Computing Center at the UMB School of Pharmacy can provide researchers with access to the following large claims datasets:

Medicare Chronic Conditions Warehouse (CCW) 5% Sample
Medicare Current Beneficiary Survey (MCBS)
IMS Health Pharmetrics Plus 10% Sample

Medicaid State Drug Utilization Data

State Drug Utilization Data (SDUD) has been reported by states since the start of the Medicaid Drug Rebate Program for covered outpatient drugs paid for by state Medicaid agencies. State- and national-level data are available from 1991 onward.

Datasets include quarterly number of outpatient prescriptions, total units, and reimbursement costs.

HCUP (Healthcare Cost and Utilization Project)

HCUP, a project of the Agency for Healthcare Research and Quality, is a family of administrative longitudinal databases that contain encounter-level information on inpatient stays, emergency department visits, and ambulatory surgery in U.S. hospitals.

MEPS (Medical Expenditure Panel Survey)

MEPS is a set of large-scale surveys of families and individuals, their medical providers, and employers across the United States.

Mental Health and Substance Abuse Datasets

Behavioral Risk Factor Surveillance System (BRFSS)

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

Substance Abuse and Mental Health Services Administration (SAMHSA)

SAMHSA collects data through multiple sources and surveys and provides access to public-use data files and documentation to support a better understanding of mental illness and substance use disorders in America.

The National Addiction & HIV Data Archive Program (NAHDAP)

The National Addiction & HIV Data Archive Program facilitates research on drug addiction and HIV infection by acquiring, enhancing, preserving, and sharing data produced by research grants, particularly those funded by the National Institute on Drug Abuse. NAHDAP supports secondary data analysis through technical assistance and specialized training for data depositors and data users in the drug addiction and HIV research and policy communities.

Neurology/MRI Data

OpenNEURO

OpenNEURO is a free and open platform for validating and sharing BIDS-compliant MRI, PET, MEG, EEG, and iEEG data. Browse and explore public datasets and analyses from a wide range of global contributors. The OpenNEURO collection of public datasets continues to grow as more datasets become BIDS compatible.

NeuroMorpho

NeuroMorpho.Org is a centrally curated inventory of digitally reconstructed neurons and glia associated with peer-reviewed publications. It contains contributions from over 900 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared.

Generalist Repositories

These repositories allow researchers from any discipline to upload their data. These can be a great place to find smaller datasets and datasets associated with particular research articles.

Figshare

Figshare, supported by Digital Science, is a cross-disciplinary repository where users and institutions can upload datasets. All items, collections, and projects on Figshare are fully searchable using either simple search or advanced, in-field search. Users can sort and filter the content.

Zenodo

Zenodo is an open research data repository from CERN, the European Organization for Nuclear Research. Search for records in all of Zenodo. The search field is contextualized, so if you're browsing a community you can choose between searching only the community or all of Zenodo.

Dryad

Dryad is a nonprofit repository for data underlying the international scientific and medical literature. Browse or search and filter datasets by geographical location, subject, journal, or institution.

Harvard Dataverse

Harvard Dataverse is a general-purpose data repository built on open-source software that is intended for sharing and facilitating citation of research data.

Open Science Framework (OSF)

OSF is a free, open platform to support your research and enable collaboration. Discover projects, data, materials, and collaborators on OSF that might be helpful to your own research.

For a comparison of these repositories check out these resources:

Generalist Repository Comparison Chart - Zenodo
General Repository Comparison - Fair Sharing

The NIH Office of Data Science Strategy (ODSS) announced the Generalist Repository Ecosystem Initiative (GREI)

GREI currently includes seven established generalist repositories that will work together to establish consistent metadata, develop use cases for data sharing, train and educate researchers on FAIR data and the importance of data sharing, and more.

Searching Across Repositories

Google Dataset Search

Google Dataset Search allows you to search by keyword to locate datasets across the web. Filter by date last updated, download format, usage rights, topic, and whether the dataset is free to access. It contains more than 31 million datasets from more than 4,600 internet domains. About half of these datasets come from .com domains, but .org and governmental domains are also well represented.

Mendeley Data

Mendeley Data is a data index and open research data repository from publisher Elsevier where users can search across research data from 2000+ generalist and domain-specific repositories.

Search and filter results by date range, data type, source type (article or data repository), and source.

DataCite Commons

DataCite Commons describes works, people, organizations and repositories and their connections and allows users to search for them. They are identified by persistent identifiers (PIDs)—works (DOI), people (ORCID ID), organizations (ROR ID), and repositories (re3data repository ID)—and have standard metadata that describe them and the connections to each other.

Search 30 million works, nine million people and 100,000 organizations

Finding Data: Datasets by Topic

Browse by Topic

Featured Resources

All of Us