Quick Start

If BDI-Kit is not installed yet, you can install it with:

[ ]:
! pip install bdi-kit

Then import the library:

[1]:
import bdikit as bdi
import pandas as pd

In this example, we map a sample of endometrial cancer data from Dou et al. (2020) to the format used by Dou et al. (2023).

We load our source data using Pandas:

[2]:
source_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2020_sample_part1.csv")
source_dataset.head(5)
[2]:
   Country        Histologic_type  Path_Stage_Reg_Lymph_Nodes-pN  Gender  Progesterone_Receptor  Num_full_term_pregnancies  Histologic_Grade_FIGO  FIGO_stage
0  United States  Endometrioid     pN1 (FIGO IIIC1)               Female  Cannot be determined   Unknown                    FIGO grade 3           IIIC1
1  Ukraine        Endometrioid     pN0                            Female  Positive               Unknown                    FIGO grade 3           IB
2  Ukraine        Endometrioid     pNX                            Female  Positive               2                          FIGO grade 1           IA
3  United States  Endometrioid     pNX                            Female  Cannot be determined   3                          FIGO grade 3           IVB
4  Other_specify  Endometrioid     pNX                            Female  Cannot be determined   4 or more                  FIGO grade 1           IA

Then, we load the target dataset:

[3]:
target_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2023.csv")
target_dataset.head(5)
[3]:
Idx Case_id Case_excluded Batch Plex ReporterName Aliquot_ID Group Discovery_study Age ... Follow-up_additional_surgery_for_new_tumor Follow-up_additional_treatment_radiation_therapy_for_new_tumor Follow-up_additional_treatment_pharmaceutical_therapy_for_new_tumor Follow-up_additional_treatment_immuno_for_new_tumor Follow-up_days_from_date_of_collection_to_date_of_last_contact Follow-up_cause_of_death Follow-up_days_from_date_of_initial_pathologic_diagnosis_to_date_of_death Follow-up_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor Follow-up_procedure_type_of_new_tumor Follow-up_residual_tumor_after_surgery_for_new_tumor
0 C3L-00086 C3L-00086 No b4 16.0 128N CPT0092460003 Tumor No 56 ... n/a|No|No|No|No n/a|Yes|Yes|Yes|Yes n/a|Yes|Yes|Yes|Yes n/a|No|No|No|No 330.0|701.0|1046.0|1436.0|n/a n/a|n/a|n/a|n/a|Breast Carcinoma n/a|n/a|n/a|n/a|1578.0 n/a|n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a|n/a
1 C3L-00898 C3L-00898 No b4 14.0 128C CPT0172200008 Tumor No 54 ... n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a 396.0|746.0|982.0|1600.0 n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a
2 C3L-00943 C3L-00943 No b4 15.0 130C CPT0086090003 Tumor No 63 ... n/a|n/a|n/a n/a|n/a|n/a n/a|n/a|n/a n/a|n/a|n/a 237.0|693.0|1039.0 n/a|n/a|n/a n/a|n/a|n/a n/a|n/a|n/a n/a|n/a|n/a n/a|n/a|n/a
3 C3L-01064 C3L-01064 No b3 9.0 129N CPT0113430004 Tumor No 54 ... No|No|No|No No|Yes|No|No Yes|Yes|Yes|Yes No|No|No|No 453.0|726.0|1062.0|1447.0 n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a n/a|n/a|n/a|n/a
4 C3L-01277 C3L-01277 No b4 13.0 130N CPT0093170003 Tumor No 61 ... n/a|No|No n/a|No|Yes n/a|Yes|No n/a|No|No 351.0|713.0|967.0 n/a|n/a|n/a n/a|n/a|n/a n/a|n/a|n/a n/a|n/a|n/a n/a|n/a|n/a

5 rows × 213 columns

Configuring LLM Providers

Some BDI-Kit primitives rely on Large Language Models (LLMs), and commercial LLM providers require API credentials. BDI-Kit supports multiple providers, including OpenAI, Gemini, Anthropic, Ollama, DeepInfra, and others through LiteLLM-compatible integrations.

For example, to use OpenAI models, export your API key as an environment variable (BDI-Kit will read this variable):

export OPENAI_API_KEY="your_api_key_here"
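In a notebook, the same variable can also be set from Python before calling any LLM-based primitive. The following is a minimal sketch using only the standard library; `require_api_key` is an illustrative helper and not part of BDI-Kit, and the key shown is a placeholder:

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or assign os.environ[{name!r}] first")
    return value


# Placeholder for illustration only; substitute your real key and avoid
# committing it to a shared notebook.
os.environ.setdefault("OPENAI_API_KEY", "your_api_key_here")
key = require_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error raised in the middle of a matching call.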

Schema Matching

BDI-Kit can help with automatic discovery of one-to-one matches between the attributes/columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a data model such as the GDC (Genomic Data Commons).

To do this, call the match_schema() function with the source dataset, the target, and a matching method (here, magneto_ft_llm):

[4]:
attribute_matches = bdi.match_schema(source_dataset, target_dataset, method="magneto_ft_llm")
attribute_matches
[4]:
   source_attribute               target_attribute                                   similarity
0  Num_full_term_pregnancies      Donor_information_number_of_full_term_pregnancies  0.95
1  Histologic_type                Histologic_Type                                    0.90
2  Gender                         Sex                                                0.90
3  Path_Stage_Reg_Lymph_Nodes-pN  Pathologic_staging_regional_lymph_nodes_pn         0.90
4  Country                        Participant_country                                0.90
5  Progesterone_Receptor          Ancillary_studies_progesterone_receptor            0.85
6  Histologic_Grade_FIGO          Histologic_grade                                   0.85
7  FIGO_stage                     Pathologic_staging_primary_tumor_pt                0.85
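Because the result is displayed as a table with a similarity column, lower-scoring matches can be singled out for manual review with ordinary pandas filtering. The sketch below rebuilds the table above by hand so that it runs on its own; the 0.90 cutoff is an arbitrary assumption, not a BDI-Kit recommendation:

```python
import pandas as pd

# Values copied from the match_schema() output above (not recomputed).
attribute_matches = pd.DataFrame({
    "source_attribute": [
        "Num_full_term_pregnancies", "Histologic_type", "Gender",
        "Path_Stage_Reg_Lymph_Nodes-pN", "Country", "Progesterone_Receptor",
        "Histologic_Grade_FIGO", "FIGO_stage",
    ],
    "target_attribute": [
        "Donor_information_number_of_full_term_pregnancies", "Histologic_Type", "Sex",
        "Pathologic_staging_regional_lymph_nodes_pn", "Participant_country",
        "Ancillary_studies_progesterone_receptor", "Histologic_grade",
        "Pathologic_staging_primary_tumor_pt",
    ],
    "similarity": [0.95, 0.90, 0.90, 0.90, 0.90, 0.85, 0.85, 0.85],
})

REVIEW_THRESHOLD = 0.90  # hypothetical cutoff chosen for this illustration

# Keep only the matches scoring strictly below the cutoff.
needs_review = attribute_matches[attribute_matches["similarity"] < REVIEW_THRESHOLD]
print(needs_review[["source_attribute", "target_attribute"]].to_string(index=False))
```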

Value Matching

After finding the correct attribute matches, we need to find appropriate value matches. Using match_values(), we can inspect how the source values would map to target values after harmonization. BDI-Kit implements multiple methods for value matching.

To specify a value matching approach, we can pass the method parameter.

[5]:
value_matches = bdi.match_values(source_dataset, target_dataset, attribute_matches=attribute_matches, method="llm")
bdi.view_value_matches(value_matches)


Source attribute: Num_full_term_pregnancies
Target attribute: Donor_information_number_of_full_term_pregnancies
   source_value  target_value  similarity
0  Unknown       Unknown       1.0
1  2             2             1.0
2  3             3             1.0
3  4 or more     4 or more     1.0
4  1             1             1.0
5  NaN           NaN           NaN


Source attribute: Histologic_type
Target attribute: Histologic_Type
   source_value    target_value            similarity
0  Endometrioid    Endometrioid carcinoma  1.00
1  Clear cell      Clear cell carcinoma    1.00
2  Serous          Serous carcinoma        0.98
3  Carcinosarcoma  Serous carcinoma        0.75


Source attribute: Gender
Target attribute: Sex
   source_value  target_value  similarity
0  Female        Female        1.0
1  NaN           NaN           NaN


Source attribute: Path_Stage_Reg_Lymph_Nodes-pN
Target attribute: Pathologic_staging_regional_lymph_nodes_pn
   source_value      target_value      similarity
0  pN1 (FIGO IIIC1)  pN1 (FIGO IIIC1)  1.0
1  pN0               pN0               1.0
2  pNX               pNX               1.0
3  pN2 (FIGO IIIC2)  pN2 (FIGO IIIC2)  1.0
4  NaN               NaN               NaN


Source attribute: Country
Target attribute: Participant_country
   source_value   target_value    similarity
0  United States  United States   1.0
1  Ukraine        Ukraine         1.0
2  Poland         Poland          1.0
3  Other_specify  Other: Unknown  0.9
4  NaN            NaN             NaN


Source attribute: Progesterone_Receptor
Target attribute: Ancillary_studies_progesterone_receptor
   source_value          target_value                similarity
0  Cannot be determined  Cannot be determined        1.0
1  Positive              Positive : % Not available  1.0
2  Unknown               Cannot be determined        1.0
3  Negative              Negative                    1.0
4  NaN                   NaN                         NaN


Source attribute: Histologic_Grade_FIGO
Target attribute: Histologic_grade
   source_value  target_value                  similarity
0  FIGO grade 3  G3 Poorly differentiated      1.0
1  FIGO grade 1  G1 Well differentiated        1.0
2  FIGO grade 2  G2 Moderately differentiated  1.0
3  NaN           NaN                           NaN


Source attribute: FIGO_stage
Target attribute: Pathologic_staging_primary_tumor_pt
   source_value  target_value      similarity
0  IB            pT1b (FIGO IB)    1.00
1  IA            pT1a (FIGO IA)    1.00
2  IIIA          pT3a (FIGO IIIA)  1.00
3  II            pT2 (FIGO II)     1.00
4  IIIB          pT3b (FIGO IIIB)  1.00
5  IVB           pT4 ((FIGO IVA)   0.75
6  IIIC1         pT3b (FIGO IIIB)  0.60
7  IIIC2         pT3b              0.60
8  NaN           NaN               NaN
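Value matches with a similarity below 1.0, such as the last FIGO_stage rows above, are natural candidates for manual review before materialization. The sketch below rebuilds that one table by hand as a pandas DataFrame; it makes no assumption about how match_values() structures its return value, and the variable names are illustrative:

```python
import pandas as pd

# FIGO_stage matches copied from the output above (not recomputed).
figo_matches = pd.DataFrame({
    "source_value": ["IB", "IA", "IIIA", "II", "IIIB", "IVB", "IIIC1", "IIIC2", None],
    "target_value": [
        "pT1b (FIGO IB)", "pT1a (FIGO IA)", "pT3a (FIGO IIIA)", "pT2 (FIGO II)",
        "pT3b (FIGO IIIB)", "pT4 ((FIGO IVA)", "pT3b (FIGO IIIB)", "pT3b", None,
    ],
    "similarity": [1.0, 1.0, 1.0, 1.0, 1.0, 0.75, 0.60, 0.60, float("nan")],
})

# Drop the NaN placeholder row, then keep anything that is not an exact-confidence match.
scored = figo_matches.dropna(subset=["similarity"])
uncertain = scored[scored["similarity"] < 1.0]
print(uncertain.to_string(index=False))
```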

Materializing the Harmonized Dataset

Next, we generate the harmonization specification and use it to materialize the harmonized dataset. The cell below shows the last two entries of the generated specification:

[6]:
harmonization_specification = bdi.create_harmonization_spec(value_matches)
harmonization_specification[-2:]
[6]:
[{'source_attribute': 'Histologic_Grade_FIGO',
  'target_attribute': 'Histologic_grade',
  'mapper': {'FIGO grade 3': 'G3 Poorly differentiated', 'FIGO grade 1': 'G1 Well differentiated', 'FIGO grade 2': 'G2 Moderately differentiated', nan: nan}},
 {'source_attribute': 'FIGO_stage',
  'target_attribute': 'Pathologic_staging_primary_tumor_pt',
  'mapper': {'IB': 'pT1b (FIGO IB)', 'IA': 'pT1a (FIGO IA)', 'IIIA': 'pT3a (FIGO IIIA)', 'II': 'pT2 (FIGO II)', 'IIIB': 'pT3b (FIGO IIIB)', 'IIIC1': 'pT3b (FIGO IIIB)', 'IVB': 'pT4 ((FIGO IVA)', 'IIIC2': 'pT3b (FIGO IIIB)', nan: nan}}]
[7]:
harmonized_dataset = bdi.materialize_mapping(source_dataset, harmonization_specification)
harmonized_dataset.head(5)
[7]:
Participant_country Histologic_Type Pathologic_staging_regional_lymph_nodes_pn Sex Ancillary_studies_progesterone_receptor Donor_information_number_of_full_term_pregnancies Histologic_grade Pathologic_staging_primary_tumor_pt
0 United States Endometrioid carcinoma pN1 (FIGO IIIC1) Female Cannot be determined Unknown G3 Poorly differentiated pT3b (FIGO IIIB)
1 Ukraine Endometrioid carcinoma pN0 Female Positive : % Not available Unknown G3 Poorly differentiated pT1b (FIGO IB)
2 Ukraine Endometrioid carcinoma pNX Female Positive : % Not available 2 G1 Well differentiated pT1a (FIGO IA)
3 United States Endometrioid carcinoma pNX Female Cannot be determined 3 G3 Poorly differentiated pT4 ((FIGO IVA)
4 Other: Unknown Endometrioid carcinoma pNX Female Cannot be determined 4 or more G1 Well differentiated pT1a (FIGO IA)
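Conceptually, each specification entry renames one column and translates its values through the mapper dictionary. The sketch below reproduces that idea with plain pandas for a single entry; it is not BDI-Kit's implementation, and `apply_spec` is a hypothetical helper written for illustration only:

```python
import pandas as pd


def apply_spec(df, spec):
    """Map each source column through its 'mapper' and store it under the target name."""
    out = pd.DataFrame(index=df.index)
    for entry in spec:
        # Series.map leaves values that are absent from the dict as NaN.
        out[entry["target_attribute"]] = df[entry["source_attribute"]].map(entry["mapper"])
    return out


# One entry in the same shape as the specification shown above.
spec = [{
    "source_attribute": "Histologic_Grade_FIGO",
    "target_attribute": "Histologic_grade",
    "mapper": {
        "FIGO grade 3": "G3 Poorly differentiated",
        "FIGO grade 1": "G1 Well differentiated",
        "FIGO grade 2": "G2 Moderately differentiated",
    },
}]

sample = pd.DataFrame({"Histologic_Grade_FIGO": ["FIGO grade 3", "FIGO grade 1", "FIGO grade 2"]})
harmonized = apply_spec(sample, spec)
print(harmonized)
```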

Reusing the Harmonization Specification

Instead of recomputing all matches, we can reuse the existing harmonization specification on a new version of the source dataset.

[8]:
new_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2020_sample_part2.csv")
new_dataset.head(5)
[8]:
   Country        Histologic_type  Path_Stage_Reg_Lymph_Nodes-pN  Gender  Progesterone_Receptor  Num_full_term_pregnancies  Histologic_Grade_FIGO  FIGO_stage
0  Ukraine        Endometrioid     pN0                            Female  Positive               1                          FIGO grade 2           IA
1  Other_specify  Endometrioid     pN0                            Female  Cannot be determined   1                          FIGO grade 1           IB
2  Ukraine        Endometrioid     pN0                            Female  Positive               1                          FIGO grade 2           II
3  Ukraine        Endometrioid     pNX                            Female  Positive               NaN                        FIGO grade 2           IA
4  United States  Endometrioid     pNX                            Female  Cannot be determined   NaN                        FIGO grade 1           IA
[9]:
new_harmonized_dataset = bdi.materialize_mapping(new_dataset, harmonization_specification)
new_harmonized_dataset.head(5)
[9]:
Participant_country Histologic_Type Pathologic_staging_regional_lymph_nodes_pn Sex Ancillary_studies_progesterone_receptor Donor_information_number_of_full_term_pregnancies Histologic_grade Pathologic_staging_primary_tumor_pt
0 Ukraine Endometrioid carcinoma pN0 Female Positive : % Not available 1 G2 Moderately differentiated pT1a (FIGO IA)
1 Other: Unknown Endometrioid carcinoma pN0 Female Cannot be determined 1 G1 Well differentiated pT1b (FIGO IB)
2 Ukraine Endometrioid carcinoma pN0 Female Positive : % Not available 1 G2 Moderately differentiated pT2 (FIGO II)
3 Ukraine Endometrioid carcinoma pNX Female Positive : % Not available NaN G2 Moderately differentiated pT1a (FIGO IA)
4 United States Endometrioid carcinoma pNX Female Cannot be determined NaN G1 Well differentiated pT1a (FIGO IA)
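A reused specification only knows the source values that were present when it was created, so it can be worth checking whether a new dataset contains values that no mapper covers. The sketch below is one way to do that with pandas; `unmapped_values` is a hypothetical helper, and the value "IVA" in the sample data is invented for illustration:

```python
import pandas as pd


def unmapped_values(df, spec):
    """Return {source_attribute: values found in df but missing from that entry's mapper}."""
    missing = {}
    for entry in spec:
        column = entry["source_attribute"]
        if column not in df.columns:
            continue
        seen = set(df[column].dropna().unique())
        unknown = seen - set(entry["mapper"])
        if unknown:
            missing[column] = sorted(unknown, key=str)
    return missing


# A reduced mapper in the same shape as the FIGO_stage entry shown earlier.
spec = [{
    "source_attribute": "FIGO_stage",
    "target_attribute": "Pathologic_staging_primary_tumor_pt",
    "mapper": {"IB": "pT1b (FIGO IB)", "IA": "pT1a (FIGO IA)", "II": "pT2 (FIGO II)"},
}]

new_rows = pd.DataFrame({"FIGO_stage": ["IA", "IB", "II", "IVA", None]})
gaps = unmapped_values(new_rows, spec)
print(gaps)
```

If the result is non-empty, rerunning the value matching for the affected attributes is likely safer than materializing with the old specification.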