Format of Participant-Level Data

There are three main formats for exporting participant-level data from PIC-SURE:

  1. Dataframe or CSV

  2. Time Series

  3. PFB

1. Dataframe or CSV Format

Participant-level data exported to an analysis platform in the Dataframe or CSV format arrives as a single table. Each row represents a participant and each column represents a variable. The table includes the variables used as filters in the query, plus any variables added for export with the "Add Variable" action.

Example Table: Dataframe Format

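| patient_id | \Nhanes\demographics\SEX\ | \Nhanes\demographics\AGE\ | \Nhanes\laboratory\acrylamide\ |
|------------|---------------------------|---------------------------|--------------------------------|
| 1001       | male                      | 31                        | Yes                            |
| 1003       | male                      | 56                        | No                             |
| 1004       | female                    | 83                        | Yes                            |
| 1005       | female                    | 26                        | Yes                            |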

  • patient_id is a PIC-SURE-generated participant identifier.

  • Each column is labeled with the concept path of the variable.
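As a minimal sketch of working with this layout, the snippet below reads a Dataframe-format export with Python's standard csv module. The inline sample mirrors the example table. The helper names short_name and load_participants are hypothetical, and the assumption that the export is a comma-separated file with a header row is for illustration only.

```python
import csv
import io

# Concept paths from the example table (backslashes escaped for Python).
SEX = "\\Nhanes\\demographics\\SEX\\"
AGE = "\\Nhanes\\demographics\\AGE\\"
ACR = "\\Nhanes\\laboratory\\acrylamide\\"

# Inline sample standing in for an exported file (assumed comma-separated).
SAMPLE_CSV = "\n".join([
    ",".join(["patient_id", SEX, AGE, ACR]),
    "1001,male,31,Yes",
    "1003,male,56,No",
    "1004,female,83,Yes",
    "1005,female,26,Yes",
])


def short_name(concept_path):
    """Return the last non-empty segment of a concept path (hypothetical helper)."""
    parts = [p for p in concept_path.split("\\") if p]
    return parts[-1] if parts else concept_path


def load_participants(text):
    """Parse the export into one dict per participant, keyed by short column names."""
    reader = csv.DictReader(io.StringIO(text))
    return [{short_name(col): val for col, val in row.items()} for row in reader]


participants = load_participants(SAMPLE_CSV)
ages = [int(p["AGE"]) for p in participants]
print(participants[0])
print("mean age:", sum(ages) / len(ages))
```

Shortening column names is a convenience only. Two concept paths that end in the same segment would collide, so keep the full concept path as the key when that is a possibility.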

2. Time Series Format

The Time Series format is a long-format table: each row holds one value of one variable for one participant, along with its timestamp.

Currently, the only dataset with timestamps is the Synthea dataset.

The other datasets (NHANES and 1000 Genomes) do not contain timestamp information. For this reason, the Time Series export format is not recommended for the NHANES or 1000 Genomes datasets.

Example Table of Time Series Export

| PATIENT_NUM | CONCEPT_PATH | NVAL_NUM | TVAL_CHAR | TIMESTAMP |
|-------------|--------------|----------|-----------|-----------|
| 1001 | \Synthea\ACT Demographics\Sex\ | NaN | Female | 2025-02-14 05:00:00 |
| 1001 | \Synthea\ACT Demographics\Age\ | 10 | NaN | 2025-02-14 05:00:00 |
| 1001 | \Synthea\ACT Diagnosis ICD-10\H00-H59 Diseases of the eye and adnexa (H00-H59)\H40-H42 Glaucoma (H40-H42)\ | NaN | H42 Glaucoma in diseases classified elsewhere | 2023-08-12 12:05:31 |
| 1004 | \Synthea\ACT Demographics\Sex\ | NaN | Male | 2025-02-14 05:00:00 |
| 1004 | \Synthea\ACT Demographics\Age\ | 8 | NaN | 2025-02-14 05:00:00 |
| 1004 | \Synthea\ACT Diagnosis ICD-10\H00-H59 Diseases of the eye and adnexa (H00-H59)\H40-H42 Glaucoma (H40-H42)\ | NaN | H42 Glaucoma in diseases classified elsewhere | 2024-04-19 13:41:06 |

  • PATIENT_NUM: PIC-SURE-generated participant identifier.

  • CONCEPT_PATH: The concept path of the variable, as included in the initial query.

  • NVAL_NUM: If the variable data is numeric or continuous, this column will display the numeric value of the data. If the data is categorical, this column will show NaN.

  • TVAL_CHAR: If the variable data is categorical, this column will display the string value of the data. If the data is continuous, this column will show NaN.

  • TIMESTAMP: The timestamp associated with the data.
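The column rules above can be sketched in code. The example below groups Time Series rows by participant and concept path, taking the value from NVAL_NUM when it is populated and from TVAL_CHAR otherwise. It assumes a comma-separated export with a header row and the timestamp layout shown in the example table. The function names observation_value and group_by_participant are hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

SEX = "\\Synthea\\ACT Demographics\\Sex\\"
AGE = "\\Synthea\\ACT Demographics\\Age\\"

# A few rows from the example table, standing in for an exported file.
SAMPLE_ROWS = [
    ["PATIENT_NUM", "CONCEPT_PATH", "NVAL_NUM", "TVAL_CHAR", "TIMESTAMP"],
    ["1001", SEX, "NaN", "Female", "2025-02-14 05:00:00"],
    ["1001", AGE, "10", "NaN", "2025-02-14 05:00:00"],
    ["1004", SEX, "NaN", "Male", "2025-02-14 05:00:00"],
    ["1004", AGE, "8", "NaN", "2025-02-14 05:00:00"],
]
SAMPLE_CSV = "\n".join(",".join(row) for row in SAMPLE_ROWS)


def observation_value(row):
    """Return a float if NVAL_NUM is populated, otherwise the TVAL_CHAR string."""
    nval = row["NVAL_NUM"].strip()
    if nval and nval.lower() != "nan":
        return float(nval)
    return row["TVAL_CHAR"]


def group_by_participant(text):
    """Collect (timestamp, value) pairs per participant and concept path."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(text)):
        # Timestamp layout assumed from the example table.
        stamp = datetime.strptime(row["TIMESTAMP"], "%Y-%m-%d %H:%M:%S")
        grouped[row["PATIENT_NUM"]][row["CONCEPT_PATH"]].append(
            (stamp, observation_value(row))
        )
    return grouped


observations = group_by_participant(SAMPLE_CSV)
for stamp, value in observations["1001"][AGE]:
    print(stamp.isoformat(), value)
```

Keeping a list of (timestamp, value) pairs per concept path preserves repeated measurements, which is the main reason to choose this format over the single-table export.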

3. Portable Format for Biomedical Data (PFB)

Participant-level data brought to an analysis platform using the PFB format will be exported in a single file, comprising two tables: the data table and the data dictionary table.

The data table is labeled "pic_sure_patients_[dataset ID]" and contains the participant-level data from PIC-SURE. Each column of this table is a variable, labeled with its PIC-SURE concept path.

The data dictionary table is labeled "pic_sure_data_dictionary_[dataset ID]" and describes each exported variable, including its concept path, description, and display name. The data dictionary also includes DRS URIs, or links to the original data file, which can be used to access the files for further analysis in BDC analysis platforms.

Example Table: Data Table of PFB

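| patient_id | \Nhanes\demographics\SEX\ | \Nhanes\demographics\AGE\ | \Nhanes\laboratory\acrylamide\ |
|------------|---------------------------|---------------------------|--------------------------------|
| 1001       | male                      | 31                        | Yes                            |
| 1003       | male                      | 56                        | No                             |
| 1004       | female                    | 83                        | Yes                            |
| 1005       | female                    | 26                        | Yes                            |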

  • patient_id is a PIC-SURE-generated participant identifier.

  • Each column is labeled with the concept path of the variable.

Example Table: Data Dictionary Table of PFB

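| concept_path                   | dataset | description | display_name |
|--------------------------------|---------|-------------|--------------|
| \Nhanes\demographics\SEX\      | Nhanes  | SEX         | SEX          |
| \Nhanes\demographics\AGE\      | Nhanes  | AGE         | AGE          |
| \Nhanes\laboratory\acrylamide\ | Nhanes  | acrylamide  | acrylamide   |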

  • Each row of the data dictionary table corresponds to a column in the data table.
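To illustrate how the two tables relate, the sketch below uses the data dictionary to relabel data table columns with their display names and to check that every column has a dictionary row. It does not read a PFB file. It assumes both tables have already been loaded into lists of dicts by whatever PFB tooling is available, and the function names relabel and undocumented_columns are hypothetical.

```python
SEX = "\\Nhanes\\demographics\\SEX\\"
AGE = "\\Nhanes\\demographics\\AGE\\"
ACR = "\\Nhanes\\laboratory\\acrylamide\\"

# Rows from the example tables, assumed to be already loaded from the PFB file.
data_table = [
    {"patient_id": "1001", SEX: "male", AGE: "31", ACR: "Yes"},
    {"patient_id": "1003", SEX: "male", AGE: "56", ACR: "No"},
]
data_dictionary = [
    {"concept_path": SEX, "dataset": "Nhanes", "description": "SEX", "display_name": "SEX"},
    {"concept_path": AGE, "dataset": "Nhanes", "description": "AGE", "display_name": "AGE"},
    {"concept_path": ACR, "dataset": "Nhanes", "description": "acrylamide", "display_name": "acrylamide"},
]


def relabel(rows, dictionary):
    """Rename concept-path columns to display names; other columns are kept as-is."""
    names = {entry["concept_path"]: entry["display_name"] for entry in dictionary}
    return [{names.get(col, col): val for col, val in row.items()} for row in rows]


def undocumented_columns(rows, dictionary, id_column="patient_id"):
    """List data table columns that have no matching data dictionary row."""
    documented = {entry["concept_path"] for entry in dictionary}
    columns = set()
    for row in rows:
        columns.update(row.keys())
    return sorted(columns - documented - {id_column})


print(relabel(data_table, data_dictionary)[0])
print(undocumented_columns(data_table, data_dictionary))
```

An empty list from undocumented_columns confirms the one-to-one correspondence between data dictionary rows and data table columns described above.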
