BDC-PIC-SURE Data Format

Additional information about the format of data in BDC-PIC-SURE

BDC-PIC-SURE Data Ingestion

The study data files are read to generate a full data dictionary for each dataset. dbGaP has registered some studies with table (pht) and variable (phv) accessions, which give the dataset a hierarchy. Datasets without dbGaP-registered tables or variables use the form and variable names assigned by the data submitter.

| Dataset Types | Study Accession | Form Group* | Table Accession | Variable Group* | Variable Accession | Variable Name | Variable ID | Concept Path Example |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dbGaP format | phsXXXXXX | N/A | phtXXXXXX | N/A | phvXXXXXXXX | Variable Name | N/A | \phs\pht\phv\variable name |
| Example: FHS | phs000007 | N/A | pht003094 | N/A | phv00177292 | g3b0073 | N/A | \phs000007\pht003094\phv00177292\g3b0073 |
| non-dbGaP format | phsXXXXXX | Form Group Name (if included) | Form Name | Variable Group Name (if included) | N/A | Variable ID | Variable ID | \phs\variable name |
| Example: ACTIV-4a | phs002694 | Adjudication Forms - Hematological event | ADJ PE: Pulmonary embolism | N/A | N/A | CEC_ID | CEC_ID | \phs002694\CEC_ID |

*Indicates information for non-dbGaP format studies only.
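The two concept path patterns in the table can be sketched in a few lines of Python. This is an illustration only: the function name build_concept_path and its parameters are hypothetical and are not part of any PIC-SURE API. It simply joins the accessions with backslashes in the order the table shows.

```python
def build_concept_path(study, variable, table=None, variable_accession=None):
    """Join identifiers into a backslash-delimited concept path.

    Hypothetical helper: dbGaP-format studies supply a table (pht) and a
    variable (phv) accession; non-dbGaP studies supply only the study
    accession and the variable ID.
    """
    if table and variable_accession:
        parts = [study, table, variable_accession, variable]
    else:
        parts = [study, variable]
    return "\\" + "\\".join(parts)


# dbGaP format, using the FHS example from the table
fhs_path = build_concept_path(
    "phs000007", "g3b0073", table="pht003094", variable_accession="phv00177292"
)

# non-dbGaP format, using the ACTIV-4a example from the table
activ4a_path = build_concept_path("phs002694", "CEC_ID")

print(fhs_path)
print(activ4a_path)
```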

BDC Full Data Dictionary

PIC-SURE allows robust searching of variables via their metadata. BDC PIC-SURE metadata includes file-level data, data dictionaries, variable-level data, variables, and data values. To support this broad range of search capabilities, full data dictionaries are assembled. Datasets with registered dbGaP tables and variables contain decoded data dictionaries. In a decoded data dictionary, coded values are mapped to their labels: for example, if 1 is the assigned value for Male and 2 for Female, the researcher can search for Male or Female instead of 1 or 2.
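As a minimal sketch of what decoding means in practice, the Python below applies a codebook to raw values and then looks up the code behind a searched label. The codebook contents follow the Male/Female example above; the names sex_codes, decode, and codes_for are hypothetical and do not describe how PIC-SURE implements decoding.

```python
# Hypothetical codebook for one variable: raw code -> decoded label.
sex_codes = {"1": "Male", "2": "Female"}


def decode(values, codebook):
    """Replace each raw code with its label; unknown codes pass through as text."""
    return [codebook.get(str(v), str(v)) for v in values]


def codes_for(label, codebook):
    """Return the raw codes whose decoded label matches the search term exactly."""
    wanted = label.lower()
    return [code for code, text in codebook.items() if text.lower() == wanted]


raw_values = [1, 2, 2, 1]
decoded_values = decode(raw_values, sex_codes)
print(decoded_values)
print(codes_for("Female", sex_codes))
```

An exact, case-insensitive match is used here because a substring match on "male" would also hit "Female".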

For some studies, the data dictionaries are submitted in a programmatically readable format. For other studies, the data dictionaries are assembled into programmatically readable decoded data dictionaries, which are documented here: https://github.com/hms-dbmi/pic-sure-metadata-curation

Open PIC-SURE is a publicly available website that requires no login and displays aggregate counts. Its dataset excludes stigmatizing variables from the following categories: Mental health diagnoses/history/treatment; Illicit drug use history; Sexually transmitted disease diagnoses/history/treatment; Sexual history; Intellectual Achievement/Ability/Educational Attainment; Direct or surrogate identifiers of legal status. The PIC-SURE team has built a pipeline to identify stigmatizing variables to ensure reproducibility and scalability. This process still requires human decision-making; for example, “sex” could be associated with the patient’s gender or sexual history. The list of variables that have been deemed not stigmatizing has been documented here: https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables/blob/new_search_conversion/stigmatizing_terms/terms_excluded.tsv

The list of stigmatizing variables has been documented here: https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables/blob/new_search_conversion/stigmatizing_terms/stigmatizing_keywords.tsv
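The interplay between keyword matching and human review can be illustrated with a small Python sketch. This is not the PIC-SURE team's pipeline: the keywords, the reviewed set, and the function flag_for_review are all hypothetical, and the real lists live in the TSV files linked above. The sketch only shows why a keyword hit such as "sex" leads to a review decision rather than automatic exclusion.

```python
import re

# Hypothetical keywords for illustration only; not taken from the real TSV.
KEYWORDS = ["sex", "drug"]

# Hypothetical outcome of human review: descriptions confirmed as not stigmatizing.
REVIEWED_NOT_STIGMATIZING = {"sex of participant"}


def flag_for_review(description):
    """Return 'review' when a keyword starts a word in the description, else 'keep'."""
    text = description.lower()
    if text in REVIEWED_NOT_STIGMATIZING:
        return "keep"  # a reviewer already cleared this description
    for keyword in KEYWORDS:
        if re.search(r"\b" + re.escape(keyword), text):
            return "review"  # candidate only; a person makes the final call
    return "keep"


for description in [
    "Sex of participant",
    "Age at first sexual activity",
    "Systolic blood pressure",
]:
    print(description, "->", flag_for_review(description))
```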

PIC-SURE High Performance Data Store (HPDS)

HPDS is a NoSQL database of the wide-column (column-oriented) type. It was built to support biomedical informatics use cases without requiring massive clustering as datasets increase in scale, so it can manage very large datasets with relatively little computing power. By using a flexible data model, HPDS can support different ontologies and data types, such as phenotypic (e.g., eCRF, EHR), genomic, biosample metadata, and imaging metadata. The flexible data model allows researchers to search and query across different data types at the variable value and genomic variant level to retrieve participant-level information, rather than file-level information.

For clinical data, datasets are stored as two files: metadata and data. The metadata file contains the internal data dictionary, high-level dataset-specific information, and file offsets for each variable's data within the data file. The data file contains data for three concepts: patient index, numerical index, and categorical index.
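The metadata-plus-data layout can be pictured with a toy Python example. It is a sketch under stated assumptions, not the HPDS on-disk format: the JSON serialization, the in-memory buffer, the function names write_store and read_variable, and the sample values are all hypothetical. The point it illustrates is that recording a byte offset per variable lets a reader fetch one variable without scanning the rest of the data file.

```python
import io
import json


def write_store(columns):
    """Write each variable's values into one buffer and record its byte range."""
    data = io.BytesIO()
    offsets = {}
    for concept_path, values in columns.items():
        blob = json.dumps(values).encode("utf-8")
        start = data.tell()
        data.write(blob)
        offsets[concept_path] = (start, len(blob))  # this dict plays the metadata role
    return data.getvalue(), offsets


def read_variable(data, offsets, concept_path):
    """Read a single variable by slicing the byte range stored in the metadata."""
    start, length = offsets[concept_path]
    return json.loads(data[start:start + length].decode("utf-8"))


# Made-up values keyed by patient number (as text, since JSON keys are strings).
NUMERIC_VAR = "\\phs000007\\pht003094\\phv00177292\\g3b0073"
CATEGORICAL_VAR = "\\phs002694\\CEC_ID"
columns = {
    NUMERIC_VAR: {"1": 120.0, "2": 135.5},
    CATEGORICAL_VAR: {"3": "A"},
}

data_bytes, offsets = write_store(columns)
print(read_variable(data_bytes, offsets, CATEGORICAL_VAR))
```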

The table below displays the format of the variable-level data:

| Column | Description |
| --- | --- |
| PATIENT_NUM | Integer value used to identify the participant across data types |
| CONCEPT_PATH | Flexible path of the concept based on the ontology |
| NVAL_NUM | Numeric values |
| TVAL_CHAR | Categorical values |
| TIMESTAMP | Timestamp associated with a concept |
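To make the five columns concrete, the Python sketch below loads a few rows in this shape and builds a numerical index and a categorical index from them, then answers a participant-level query. Everything beyond the column names is an assumption for illustration: the Observation class, the index layout, the concept paths, and the sample rows are hypothetical, and the sketch assumes each row carries either a numeric or a categorical value.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    patient_num: int
    concept_path: str
    nval_num: Optional[float]  # set for numeric concepts
    tval_char: Optional[str]   # set for categorical concepts
    timestamp: Optional[str]


def build_indexes(rows):
    """Group rows by concept: numeric values per patient, patient sets per category."""
    numeric = defaultdict(dict)
    categorical = defaultdict(lambda: defaultdict(set))
    for row in rows:
        if row.nval_num is not None:
            numeric[row.concept_path][row.patient_num] = row.nval_num
        elif row.tval_char is not None:
            categorical[row.concept_path][row.tval_char].add(row.patient_num)
    return numeric, categorical


AGE = "\\demo_study\\age"  # hypothetical concept paths
SEX = "\\demo_study\\sex"
rows = [
    Observation(1, AGE, 54.0, None, "2020-01-01"),
    Observation(1, SEX, None, "Female", "2020-01-01"),
    Observation(2, AGE, 47.0, None, "2020-01-02"),
    Observation(2, SEX, None, "Female", "2020-01-02"),
    Observation(3, AGE, 61.0, None, "2020-01-03"),
    Observation(3, SEX, None, "Male", "2020-01-03"),
]

numeric_index, categorical_index = build_indexes(rows)

# Participant-level query: Female participants aged 50 or older.
aged_50_plus = {p for p, value in numeric_index[AGE].items() if value >= 50}
matching_patients = aged_50_plus & categorical_index[SEX]["Female"]
print(sorted(matching_patients))
```

Because both indexes resolve to sets of PATIENT_NUM values, filters on different variables combine with plain set intersection.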

For genomic data, variants that are not represented in the database are not stored. Genomic sample data is stored separately from variant annotations in HPDS. Variant annotations are stored using the same numerical index and categorical index structures described above, but they index variant IDs instead of patient IDs.
