601 West Lombard Street
Baltimore MD 21201-1512
Reference: 410-706-7996
Circulation: 410-706-7928
UMB HSHSL maintains data use agreements with the following two data enclaves:
The All of Us Research Program, part of the National Institutes of Health, is a historic effort to collect and study data from a diverse cohort of one million or more participants living in the United States. Data is collected from participants through their electronic health records, surveys, wearable devices, and biospecimens. Researchers can explore publicly available data through the data browser, or apply to access the Researcher Workbench, which provides tools for analyzing restricted data within a secure cloud platform.
For more on obtaining access to All of Us data, please visit: HSHSL All of Us LibGuide
The National COVID Cohort Collaborative (N3C) Data Enclave was launched by the National Center for Advancing Translational Sciences (NCATS) and the National Center for Data to Health (CD2H), in partnership with experts from Observational Health Data Sciences and Informatics (OHDSI), PCORnet, the Accrual to Clinical Trials (ACT) network, and TriNetX. The N3C aims to aggregate, harmonize, and make accessible vast amounts of clinical data nationwide to accelerate COVID-19 research and clinical care. With the uncertainty of the COVID-19 global pandemic, the scientific community and the Clinical and Translational Science Awards (CTSA) Program created the N3C as a partnership to overcome technical, regulatory, policy, and governance barriers to harmonizing and sharing individual-level clinical data.
For more on accessing N3C data, read: National COVID Cohort Collaborative (N3C) Data Now Available to UMB Researchers
UMB HSHSL has institutional memberships with the following two data repositories:
ICPSR is a massive archive for high-quality, curated social, political, and behavioral data. There are many smaller archives and collections living under the umbrella of ICPSR. Many of these collections contain a mix of public use and restricted data (which may require additional steps to access). A few of the many collections and series of interest for health researchers include:
The Qualitative Data Repository focuses on data from qualitative and mixed-methods social science research. Small but growing fast!
Not a repository itself, but the UMB Data Catalog facilitates the discovery of datasets and research products created or used by UMB researchers. Each dataset record includes a description, keywords, contact information, links to associated articles, and more.
To help researchers locate an appropriate repository for sharing or accessing data, the NIH maintains lists of domain-specific and generalist data sharing repositories that are each described by several properties.
The NLM Dataset Catalog is a freely available catalog of over 80,000 biomedical datasets available from various repositories. Adhering to FAIR data management principles, the Dataset Catalog allows users to search for specific biomedical datasets and navigate among biomedical datasets by linking descriptive data.
The NCBI Datasets platform contains the National Center for Biotechnology Information's gene sequence data and metadata, including gene, genome, and taxonomy indexes. It is possible to search using a variety of identifiers, including organism names, assembly, WGS, or BioProject accessions.
The NICHD Data and Specimen Hub (DASH) is a centralized resource that allows researchers to share and access de-identified data from studies funded by NICHD. DASH also serves as a portal for requesting biospecimens from selected DASH studies.
The CDC provides access to a number of data products, including quick facts, vital statistics, disease surveillance, and large population and provider surveys, along with a number of online tools for exploring the data.
The AHRQ Data Tools website provides a centralized platform to access healthcare data from various AHRQ programs.
The Demographic and Health Surveys (DHS) Program has collected, analyzed, and disseminated accurate and representative data on population, health, HIV and nutrition through more than 400 surveys in over 90 countries.
The World Health Organization manages and maintains a wide range of data collections related to global health and well-being as mandated by its Member States.
UNICEF Data is the global go-to for data on children. It leads the collection, validation, analysis, use and communication of the most statistically sound, internationally comparable data on the situation of children and women around the world. Browse datasets on child survival and health, child nutrition, maternal health, water and sanitation, education, child protection, HIV/AIDS, and MDG (Millennium Development Goals) monitoring.
World Bank Group's Health Nutrition and Population Statistics DataBank provides key health, nutrition and population statistics gathered from a variety of international and national sources. Topics include global surgery, health financing, HIV/AIDS, immunization, infectious diseases, medical resources and usage, noncommunicable diseases, nutrition, population dynamics, reproductive health, universal health coverage, and water and sanitation.
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. The GenBank® database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data.
Search GenBank for sequence identifiers and annotations with Nucleotide.
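For researchers who prefer to script a search rather than use the web form, NCBI's E-utilities offer an ESearch endpoint that accepts a database name and a query term. The sketch below only builds the request URL and does not contact NCBI. The function name, the default result count, and the sample query are illustrative assumptions, not part of this guide.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_nucleotide_search_url(term: str, max_results: int = 20) -> str:
    """Return an ESearch URL that queries the Nucleotide database.

    `term` uses the same syntax as the Nucleotide search box,
    including optional field tags such as [Organism].
    """
    params = {
        "db": "nucleotide",      # database to search
        "term": term,            # query text
        "retmax": max_results,   # maximum number of IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical example query; adjust to your own gene or organism.
    print(build_nucleotide_search_url("BRCA1[Gene Name] AND Homo sapiens[Organism]"))
```

Opening the resulting URL in a browser or an HTTP client returns a list of matching record identifiers. NCBI publishes usage guidelines for E-utilities, so check them before sending requests in bulk.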
Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.
Gene Expression Omnibus is a public functional genomics data repository supporting MIAME-compliant submissions of array- and sequence-based data. GEO archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community.
The GEO DataSets database stores curated gene expression DataSets, as well as original Series and Platform records in the Gene Expression Omnibus (GEO) repository. Enter search terms to locate experiments of interest. DataSet records contain additional resources including cluster tools and differential expression queries.
dbVar is NCBI's database of human genomic Structural Variation — large variants >50 bp including insertions, deletions, duplications, inversions, mobile elements, translocations, and complex variants. For more information on structural variation see the Overview of Structural Variation page.
The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans.
dbSNP contains human single nucleotide variations, microsatellites, and small-scale insertions and deletions along with publication, population frequency, molecular consequence, and genomic and RefSeq mapping information for both common variations and clinical mutations.
Vivli is an independent general access data repository and search engine through which individual participant-level data and metadata from clinical trials conducted by researchers in academic, industry, foundation, and non-profit entities can be identified, hosted, shared, and analyzed.
The YODA Project is a Yale University project to promote open data in clinical research, in collaboration with Johnson & Johnson, Medtronic, Inc., Queen Mary University of London, and SI-BONE.
Many sources listed in this guide contain special COVID-19 collections. Below are a few additional sources of COVID-19 data.
The New York Times maintains a repository of data on coronavirus cases and deaths in the U.S.
CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It's curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research.
The University of Michigan Health and Retirement Study (HRS) is a longitudinal panel study that surveys a representative sample of approximately 20,000 Americans over the age of 50, providing a comprehensive look at the changing experiences of older Americans on a range of topics including physical and mental health history and status, cognition, family structure, health care utilization and costs, financial status, and employment history and retirement plans. Interviews are supplemented with information from physical measurements, biomarker and genetic data, and linkages to administrative data.
The National Health and Aging Trends Study contains interview data from a nationally representative sample of Medicare beneficiaries ages 65 or older.
Public Use data from the Centers for Medicare and Medicaid Services can be found through their data catalog.
The Pharmaceutical Research Computing Center at the UMB School of Pharmacy can provide researchers with access to the following large claims datasets:
State Drug Utilization Data (SDUD) has been reported by states since the start of the Medicaid Drug Rebate Program for covered outpatient drugs paid for by state Medicaid agencies. State and national level data are available since 1991.
HCUP, a project of the Agency for Healthcare Research and Quality, is a family of administrative longitudinal databases that contain encounter-level information on inpatient stays, emergency department visits, and ambulatory surgery in U.S. hospitals.
MEPS is a set of large-scale surveys of families and individuals, their medical providers, and employers across the United States.
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
SAMHSA collects data through multiple sources and surveys and provides access to public-use data files and documentation to support a better understanding of mental illness and substance use disorders in America.
The National Addiction & HIV Data Archive Program facilitates research on drug addiction and HIV infection by acquiring, enhancing, preserving, and sharing data produced by research grants, particularly those funded by the National Institute on Drug Abuse. NAHDAP supports secondary data analysis through technical assistance and specialized training for data depositors and data users in the drug addiction and HIV research and policy communities.
OpenNeuro is a free and open platform for validating and sharing BIDS-compliant MRI, PET, MEG, EEG, and iEEG data. Browse and explore public datasets and analyses from a wide range of global contributors. The OpenNeuro collection of public datasets continues to grow as more and more become BIDS compatible.
NeuroMorpho.Org is a centrally curated inventory of digitally reconstructed neurons and glia associated with peer-reviewed publications. It contains contributions from over 900 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared.
These repositories allow researchers from any discipline to upload their data. These can be a great place to find smaller datasets and datasets associated with particular research articles.
Figshare is a cross-disciplinary repository where users and institutions can upload datasets, supported by Digital Science. All items, collections, and projects on Figshare are fully searchable using either simple search or advanced, in-field search. The content can be ordered and filtered by the users.
Zenodo is an open research data repository from CERN, the European Organization for Nuclear Research. Search for records in all of Zenodo. The search field is contextualized, so if you're browsing a community you can choose between searching only the community or all of Zenodo.
Dryad is a nonprofit repository for data underlying the international scientific and medical literature. Browse or search and filter datasets by geographical location, subject, journal, or institution.
Harvard Dataverse is a general-purpose data repository built on open-source software that is intended for sharing and facilitating citation of research data.
OSF is a free, open platform to support your research and enable collaboration. Discover projects, data, materials, and collaborators on OSF that might be helpful to your own research.
For a comparison of these repositories check out these resources:
The NIH Office of Data Science Strategy (ODSS) announced the Generalist Repository Ecosystem Initiative (GREI).
Google Dataset Search allows you to search by keyword to locate datasets across the web. Filter by date last updated, download format, usage rights, topic, and whether the dataset is free to access. It contains more than 31 million datasets from more than 4,600 internet domains. About half of these datasets come from .com domains, but .org and governmental domains are also well represented.
Mendeley Data is a data index and open research data repository from publisher Elsevier where users can search across research data from 2000+ generalist and domain-specific repositories.
DataCite Commons describes works, people, organizations and repositories and their connections and allows users to search for them. They are identified by persistent identifiers (PIDs)—works (DOI), people (ORCID ID), organizations (ROR ID), and repositories (re3data repository ID)—and have standard metadata that describe them and the connections to each other.
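When working with exported records, it can help to tell the persistent identifier types apart by their shape. The sketch below is a format check only: it does not validate checksums or confirm that an identifier resolves. The function name and the patterns are illustrative assumptions based on the commonly published formats for DOIs, ORCID iDs, and ROR IDs, and the sample values are shown for their format alone.

```python
import re

# Resolver prefixes that are often included when identifiers are written as URLs.
_PREFIXES = ("https://doi.org/", "https://orcid.org/", "https://ror.org/")

# Format-only patterns; none of these verify checksums or registration.
_PATTERNS = [
    ("DOI", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("ORCID iD", re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")),
    ("ROR ID", re.compile(r"^0[a-hj-km-np-tv-z0-9]{6}\d{2}$")),
]


def classify_pid(value: str) -> str:
    """Return 'DOI', 'ORCID iD', 'ROR ID', or 'unknown' based on the string's shape."""
    text = value.strip()
    # Drop a leading resolver URL so bare and URL forms are treated alike.
    for prefix in _PREFIXES:
        if text.lower().startswith(prefix):
            text = text[len(prefix):]
            break
    for label, pattern in _PATTERNS:
        if pattern.match(text):
            return label
    return "unknown"


if __name__ == "__main__":
    for sample in (
        "https://doi.org/10.5281/zenodo.1234567",  # hypothetical DOI
        "0000-0002-1825-0097",                     # ORCID-format string
        "https://ror.org/04rq5mt64",               # ROR-format string
    ):
        print(sample, "->", classify_pid(sample))
```

A check like this is useful for sorting a mixed column of identifiers before looking each one up in the matching registry.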