NHLBI BioData Catalyst® Powered by PIC-SURE

BDC-PIC-SURE Data Format

Additional information about the format of data in BDC-PIC-SURE


Last updated 4 months ago

BDC-PIC-SURE Data Ingestion

During ingestion, the study data files are read to generate a full data dictionary for the dataset. dbGaP has registered some studies with table (pht) and variable (phv) accessions, which give the dataset a hierarchy. Datasets without dbGaP-registered tables or variables use the form and variable names assigned by the data submitter.

| Dataset Types | Study Accession | Form Group* | Table Accession | Variable Group* | Variable Accession | Variable Name | Variable ID | Concept Path Example |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dbGaP format | phsXXXXXX | N/A | phtXXXXXX | N/A | phvXXXXXXXX | Variable Name | N/A | \phs\pht\phv\variable name |
| Example: FHS | phs000007 | N/A | pht003094 | N/A | phv00177292 | g3b0073 | N/A | \phs000007\pht003094\phv00177292\g3b0073 |
| non-dbGaP format | phsXXXXXX | Form Group Name (if included) | Form Name | Variable Group Name (if included) | N/A | Variable ID | Variable ID | \phs\variable name |
| Example: ACTIV-4a | phs002694 | Adjudication Forms - Hematological event | ADJ PE: Pulmonary embolism | N/A | N/A | CEC_ID | CEC_ID | \phs002694\CEC_ID |
| Example: RECOVER Pediatric | phs003461 | recover_pediatric_congenital | enrollment | demographics | N/A | biosex | biosex | \phs003461\recover_pediatric_congenital\enrollment\demographics\biosex\ |

*Indicates information for non-dbGaP format studies only
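The concept path layouts in the table above can be taken apart programmatically. The following is a minimal sketch, assuming only the backslash-delimited examples shown in the table; `ConceptPath` and `parse_concept_path` are hypothetical names for illustration and are not part of the PIC-SURE API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Accession patterns modeled on the identifiers shown in the table (assumption).
_PHS = re.compile(r"^phs\d+$")
_PHT = re.compile(r"^pht\d+$")
_PHV = re.compile(r"^phv\d+$")


@dataclass
class ConceptPath:
    study: str                          # phs accession
    table: Optional[str]                # pht accession (dbGaP format only)
    variable_accession: Optional[str]   # phv accession (dbGaP format only)
    groups: List[str]                   # form / variable group segments (non-dbGaP)
    variable: str                       # final path segment


def parse_concept_path(path: str) -> ConceptPath:
    """Split a backslash-delimited concept path into its components."""
    segments = [s for s in path.split("\\") if s]
    if len(segments) < 2 or not _PHS.match(segments[0]):
        raise ValueError(f"not a recognized concept path: {path!r}")
    study, middle, variable = segments[0], segments[1:-1], segments[-1]
    # dbGaP format: \phs\pht\phv\variable name
    if len(middle) == 2 and _PHT.match(middle[0]) and _PHV.match(middle[1]):
        return ConceptPath(study, middle[0], middle[1], [], variable)
    # non-dbGaP format: any intermediate segments are treated as group names
    return ConceptPath(study, None, None, middle, variable)


fhs = parse_concept_path("\\phs000007\\pht003094\\phv00177292\\g3b0073")
activ = parse_concept_path("\\phs002694\\CEC_ID")
recover = parse_concept_path(
    "\\phs003461\\recover_pediatric_congenital\\enrollment\\demographics\\biosex\\"
)
print(fhs.table, fhs.variable_accession, fhs.variable)
print(activ.variable)
print(recover.groups, recover.variable)
```

Treating every intermediate segment of a non-dbGaP path as a group name is a simplification; the real ingestion process may distinguish form groups, forms, and variable groups.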

BDC Full Data Dictionary

PIC-SURE allows robust searching of variables via their metadata. BDC-PIC-SURE metadata includes file-level data, data dictionaries, variable-level data, variables, and data values. To support this broad range of search capabilities, data dictionaries are assembled. Datasets with registered dbGaP tables and variables contain decoded data dictionaries. In a decoded data dictionary, coded values are mapped to their labels: for example, if 1 is the assigned value for Male and 2 for Female, the researcher can search for Male or Female.
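The idea of a decoded data dictionary can be illustrated with a short sketch. The participant IDs and the `decode` and `search` helpers below are hypothetical; only the 1 = Male, 2 = Female mapping comes from the example in the text.

```python
from typing import Dict, List

# Hypothetical coded values for one variable, keyed by participant ID.
coded_values: Dict[int, int] = {101: 1, 102: 2, 103: 2}

# Decoding map from the example in the text: 1 = Male, 2 = Female.
value_labels: Dict[int, str] = {1: "Male", 2: "Female"}


def decode(values: Dict[int, int], labels: Dict[int, str]) -> Dict[int, str]:
    """Replace each coded value with its label; unknown codes stay as text."""
    return {pid: labels.get(code, str(code)) for pid, code in values.items()}


def search(decoded: Dict[int, str], term: str) -> List[int]:
    """Return participant IDs whose decoded value matches the search term."""
    wanted = term.lower()
    return sorted(pid for pid, label in decoded.items() if label.lower() == wanted)


decoded = decode(coded_values, value_labels)
print(search(decoded, "female"))  # [102, 103]
```

Without the decoding step, a search for "Female" would find nothing, because the raw data contains only the code 2.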

For some studies, the data dictionaries are submitted in a programmatically readable format. For other studies, the data dictionaries are assembled into programmatically readable decoded data dictionaries, which are documented here: https://github.com/hms-dbmi/pic-sure-metadata-curation

Discover is a publicly available tool that requires no login and displays aggregate counts. While the data dictionary includes information about stigmatizing variables, the participant-level data excludes these variables, which fall into the following categories:

  • Mental health diagnoses/history/treatment

  • Illicit drug use history

  • Sexually transmitted disease diagnoses/history/treatment

  • Sexual history

  • Intellectual Achievement/Ability/Educational Attainment

  • Direct or surrogate identifiers of legal status

The data dictionary is stored in a Postgres database and information from that database is available via the PIC-SURE API. Below is a high-level diagram of the table structure of the database and how it relates to different elements of study and/or variable metadata.

PIC-SURE High Performance Data Store (HPDS)

For clinical data, datasets are stored as two files: metadata and data. The metadata file contains the internal data dictionary, high-level dataset-specific information, and file offsets for each variable's data within the data file. The data file contains data for three concepts: patient index, numerical index, and categorical index.

The table below displays the format of the variable-level data:

| PATIENT_NUM | CONCEPT_PATH | NVAL_NUM | TVAL_CHAR | TIMESTAMP |
| --- | --- | --- | --- | --- |
| Integer value used to identify the participant across data types | Flexible path of the concept based on the ontology | Numeric values | Categorical values | Timestamp associated with a concept |
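As a rough illustration of how rows in this format relate to the numerical and categorical indexes mentioned above, the sketch below groups rows by concept path. It is not the HPDS on-disk format; the `Observation` class, the `build_indexes` function, and the `phs000000` paths are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class Observation:
    """One variable-level row, mirroring the columns in the table above."""
    patient_num: int
    concept_path: str
    nval_num: Optional[float] = None   # set for numeric concepts
    tval_char: Optional[str] = None    # set for categorical concepts
    timestamp: Optional[str] = None


def build_indexes(
    rows: Iterable[Observation],
) -> Tuple[Dict[str, Dict[int, float]], Dict[str, Dict[int, str]]]:
    """Group rows by concept path into numerical and categorical lookups."""
    numerical: Dict[str, Dict[int, float]] = defaultdict(dict)
    categorical: Dict[str, Dict[int, str]] = defaultdict(dict)
    for row in rows:
        if row.nval_num is not None:
            numerical[row.concept_path][row.patient_num] = row.nval_num
        elif row.tval_char is not None:
            categorical[row.concept_path][row.patient_num] = row.tval_char
    return dict(numerical), dict(categorical)


# Placeholder rows; the study accession and paths are made up.
rows = [
    Observation(1, "\\phs000000\\age\\", nval_num=54.0),
    Observation(1, "\\phs000000\\sex\\", tval_char="Female"),
    Observation(2, "\\phs000000\\age\\", nval_num=61.0),
]
numerical, categorical = build_indexes(rows)
print(numerical["\\phs000000\\age\\"])  # {1: 54.0, 2: 61.0}
```

Keying each index by concept path keeps all values for one variable together, which is the access pattern a column-oriented store is designed for.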

For genomic data, variants that are not represented in the database are not stored. Genomic sample data is stored separately from variant annotations in HPDS. Variant annotations are stored using the same numerical index and categorical index described above, indexing variant IDs instead of patient IDs.

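The PIC-SURE team has built a pipeline to identify stigmatizing variables to ensure reproducibility and scalability. This process requires human decision-making; for example, “sex” could be associated with the patient’s gender or sexual history. The variables that have been deemed not stigmatizing are listed here: https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables/blob/new_search_conversion/stigmatizing_terms/terms_excluded.tsv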

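The list of stigmatizing variables is documented here: https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables/blob/new_search_conversion/stigmatizing_terms/stigmatizing_keywords.tsv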
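To make the combination of keyword matching and human review more concrete, here is a minimal sketch. It does not reproduce the PIC-SURE pipeline; the `screen_variables` function, the variable names, the keyword set, and the reviewed list are all invented for illustration.

```python
from typing import Iterable, List, Set, Tuple


def screen_variables(
    variables: Iterable[Tuple[str, str]],
    keywords: Set[str],
    reviewed_not_stigmatizing: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split variables into (flagged, cleared) lists.

    A variable is flagged when its name or description contains a keyword,
    unless a human reviewer has already marked it as not stigmatizing.
    """
    flagged: List[str] = []
    cleared: List[str] = []
    for name, description in variables:
        text = f"{name} {description}".lower()
        hit = any(keyword in text for keyword in keywords)
        if hit and name not in reviewed_not_stigmatizing:
            flagged.append(name)
        else:
            cleared.append(name)
    return flagged, cleared


# Invented inputs: "sex" matches both a demographic and a sexual-history variable.
variables = [
    ("biosex", "Sex of participant"),
    ("sexhx", "Sexual history questionnaire"),
    ("height", "Standing height in cm"),
]
flagged, cleared = screen_variables(variables, {"sex"}, {"biosex"})
print(flagged)  # ['sexhx']
print(cleared)  # ['biosex', 'height']
```

The reviewed list stands in for the human decision described above: the keyword "sex" alone cannot tell a demographic variable from a sexual-history variable.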

HPDS is a NoSQL database: a wide-column store, or column-oriented DBMS. HPDS was built to support biomedical informatics use cases without requiring massive clustering as datasets increase in scale; it can therefore manage arbitrarily large datasets with very little computing power. By using a flexible data model, HPDS can support different ontologies and data types, such as phenotypic (e.g., eCRF, EHR), genomic, biosample metadata, and imaging metadata. The flexible data model allows researchers to search and query across different data types at the variable-value and genomic-variant level to retrieve participant-level information, rather than at the file level.
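A query at the variable-value level can be pictured as intersecting the sets of participants that satisfy each filter. The sketch below uses plain dictionaries as stand-ins for per-concept indexes; the paths, values, and helper functions are hypothetical and do not reflect the HPDS implementation or the PIC-SURE API.

```python
from typing import Dict, Set

# Stand-in indexes keyed by concept path, then by participant ID (made-up data).
numerical_index: Dict[str, Dict[int, float]] = {
    "\\study\\age\\": {1: 54.0, 2: 61.0, 3: 47.0},
}
categorical_index: Dict[str, Dict[int, str]] = {
    "\\study\\sex\\": {1: "Female", 2: "Male", 3: "Female"},
}


def patients_in_range(
    index: Dict[str, Dict[int, float]], path: str, low: float, high: float
) -> Set[int]:
    """Participants whose numeric value for `path` lies in [low, high]."""
    return {pid for pid, value in index.get(path, {}).items() if low <= value <= high}


def patients_with_value(
    index: Dict[str, Dict[int, str]], path: str, allowed: Set[str]
) -> Set[int]:
    """Participants whose categorical value for `path` is in `allowed`."""
    return {pid for pid, value in index.get(path, {}).items() if value in allowed}


# Combine a numeric filter and a categorical filter by set intersection.
matches = patients_in_range(numerical_index, "\\study\\age\\", 50, 70) & patients_with_value(
    categorical_index, "\\study\\sex\\", {"Female"}
)
print(sorted(matches))  # [1]
```

Because each filter reads only the values for one concept, the query returns participant-level results without scanning whole files.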
