Quick Start
If BDI-Kit is not installed yet, you can install it with:
[ ]:
! pip install bdi-kit
Then import the library:
[1]:
import bdikit as bdi
import pandas as pd
In this example, we map a sample of endometrial cancer data from Dou et al. (2020) to the format used in Dou et al. (2023).
We load our source data using Pandas:
[2]:
source_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2020_sample_part1.csv")
source_dataset.head(5)
[2]:
| | Country | Histologic_type | Path_Stage_Reg_Lymph_Nodes-pN | Gender | Progesterone_Receptor | Num_full_term_pregnancies | Histologic_Grade_FIGO | FIGO_stage |
|---|---|---|---|---|---|---|---|---|
| 0 | United States | Endometrioid | pN1 (FIGO IIIC1) | Female | Cannot be determined | Unknown | FIGO grade 3 | IIIC1 |
| 1 | Ukraine | Endometrioid | pN0 | Female | Positive | Unknown | FIGO grade 3 | IB |
| 2 | Ukraine | Endometrioid | pNX | Female | Positive | 2 | FIGO grade 1 | IA |
| 3 | United States | Endometrioid | pNX | Female | Cannot be determined | 3 | FIGO grade 3 | IVB |
| 4 | Other_specify | Endometrioid | pNX | Female | Cannot be determined | 4 or more | FIGO grade 1 | IA |
Then, we load the target dataset:
[3]:
target_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2023.csv")
target_dataset.head(5)
[3]:
| | Idx | Case_id | Case_excluded | Batch | Plex | ReporterName | Aliquot_ID | Group | Discovery_study | Age | ... | Follow-up_additional_surgery_for_new_tumor | Follow-up_additional_treatment_radiation_therapy_for_new_tumor | Follow-up_additional_treatment_pharmaceutical_therapy_for_new_tumor | Follow-up_additional_treatment_immuno_for_new_tumor | Follow-up_days_from_date_of_collection_to_date_of_last_contact | Follow-up_cause_of_death | Follow-up_days_from_date_of_initial_pathologic_diagnosis_to_date_of_death | Follow-up_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor | Follow-up_procedure_type_of_new_tumor | Follow-up_residual_tumor_after_surgery_for_new_tumor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C3L-00086 | C3L-00086 | No | b4 | 16.0 | 128N | CPT0092460003 | Tumor | No | 56 | ... | n/a\|No\|No\|No\|No | n/a\|Yes\|Yes\|Yes\|Yes | n/a\|Yes\|Yes\|Yes\|Yes | n/a\|No\|No\|No\|No | 330.0\|701.0\|1046.0\|1436.0\|n/a | n/a\|n/a\|n/a\|n/a\|Breast Carcinoma | n/a\|n/a\|n/a\|n/a\|1578.0 | n/a\|n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a\|n/a |
| 1 | C3L-00898 | C3L-00898 | No | b4 | 14.0 | 128C | CPT0172200008 | Tumor | No | 54 | ... | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | 396.0\|746.0\|982.0\|1600.0 | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a |
| 2 | C3L-00943 | C3L-00943 | No | b4 | 15.0 | 130C | CPT0086090003 | Tumor | No | 63 | ... | n/a\|n/a\|n/a | n/a\|n/a\|n/a | n/a\|n/a\|n/a | n/a\|n/a\|n/a | 237.0\|693.0\|1039.0 | n/a\|n/a\|n/a | n/a\|n/a\|n/a | n/a\|n/a\|n/a | n/a\|n/a\|n/a | n/a\|n/a\|n/a |
| 3 | C3L-01064 | C3L-01064 | No | b3 | 9.0 | 129N | CPT0113430004 | Tumor | No | 54 | ... | No\|No\|No\|No | No\|Yes\|No\|No | Yes\|Yes\|Yes\|Yes | No\|No\|No\|No | 453.0\|726.0\|1062.0\|1447.0 | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a | n/a\|n/a\|n/a\|n/a |
| 4 | C3L-01277 | C3L-01277 | No | b4 | 13.0 | 130N | CPT0093170003 | Tumor | No | 61 | ... | n/a\|No\|No | n/a\|No\|Yes | n/a\|Yes\|No | n/a\|No\|No | 351.0\|713.0\|967.0 | n/a\|n/a\|n/a | n/a\|n/a\|n/a | n/a\|n/a\|n/a | n/a\|n/a\|n/a | n/a\|n/a\|n/a |
5 rows × 213 columns
Configuring LLM Providers
Some BDI-Kit primitives rely on Large Language Models (LLMs), and commercial LLM providers require API credentials. BDI-Kit supports multiple providers, including OpenAI, Gemini, Anthropic, Ollama, DeepInfra, and others through LiteLLM-compatible integrations.
For example, to use OpenAI models, export your API key as an environment variable (BDI-Kit will read this variable):
export OPENAI_API_KEY="your_api_key_here"
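Because the key is read from the environment, it has to be visible to the Python process that runs BDI-Kit. The following is a minimal sketch, not part of BDI-Kit, that checks for the variable before an LLM-based method is called. The helper name `missing_api_keys` is hypothetical, and only `OPENAI_API_KEY` is taken from the text above.

```python
import os


# Hypothetical helper (not a BDI-Kit function): list required keys that are unset.
def missing_api_keys(required=("OPENAI_API_KEY",), env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All required credentials are set.")
```

Running a check like this first gives a clear message instead of a provider error in the middle of a matching call.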
Schema Matching
BDI-Kit can help with automatic discovery of one-to-one matches between the attributes/columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a data model such as the GDC (Genomic Data Commons).
To achieve this using BDI-Kit, we can use the match_schema() function as follows.
[4]:
attribute_matches = bdi.match_schema(source_dataset, target_dataset, method="magneto_ft_llm")
attribute_matches
[4]:
| | source_attribute | target_attribute | similarity |
|---|---|---|---|
| 0 | Num_full_term_pregnancies | Donor_information_number_of_full_term_pregnancies | 0.95 |
| 1 | Histologic_type | Histologic_Type | 0.90 |
| 2 | Gender | Sex | 0.90 |
| 3 | Path_Stage_Reg_Lymph_Nodes-pN | Pathologic_staging_regional_lymph_nodes_pn | 0.90 |
| 4 | Country | Participant_country | 0.90 |
| 5 | Progesterone_Receptor | Ancillary_studies_progesterone_receptor | 0.85 |
| 6 | Histologic_Grade_FIGO | Histologic_grade | 0.85 |
| 7 | FIGO_stage | Pathologic_staging_primary_tumor_pt | 0.85 |
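
The similarity column can be used to decide which attribute matches to accept as they are and which to check by hand. The sketch below is one possible way to do that and is not a BDI-Kit feature. It assumes the matches have been converted to a list of dictionaries (for a pandas DataFrame, `attribute_matches.to_dict("records")` would produce this shape); `split_by_similarity` and the 0.9 threshold are illustrative choices.

```python
# A few rows copied from the match_schema() output above.
attribute_records = [
    {"source_attribute": "Num_full_term_pregnancies",
     "target_attribute": "Donor_information_number_of_full_term_pregnancies",
     "similarity": 0.95},
    {"source_attribute": "Gender", "target_attribute": "Sex", "similarity": 0.90},
    {"source_attribute": "FIGO_stage",
     "target_attribute": "Pathologic_staging_primary_tumor_pt",
     "similarity": 0.85},
]


# Hypothetical helper: separate confident matches from ones needing review.
def split_by_similarity(records, threshold=0.9):
    accepted = [r for r in records if r["similarity"] >= threshold]
    to_review = [r for r in records if r["similarity"] < threshold]
    return accepted, to_review


accepted, to_review = split_by_similarity(attribute_records, threshold=0.9)
for r in to_review:
    print(f'{r["source_attribute"]} -> {r["target_attribute"]} ({r["similarity"]:.2f})')
```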
Value Matching
After finding the correct attribute matches, we need to find appropriate value matches for each matched pair of attributes. Using match_values(), we can inspect how the source values would be mapped to target values after harmonization. BDI-Kit implements multiple methods for value matching.
To specify a value matching approach, we can pass the method parameter.
[5]:
value_matches = bdi.match_values(source_dataset, target_dataset, attribute_matches=attribute_matches, method="llm")
bdi.view_value_matches(value_matches)
Source attribute: Num_full_term_pregnancies
Target attribute: Donor_information_number_of_full_term_pregnancies
| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | Unknown | Unknown | 1.0 |
| 1 | 2 | 2 | 1.0 |
| 2 | 3 | 3 | 1.0 |
| 3 | 4 or more | 4 or more | 1.0 |
| 4 | 1 | 1 | 1.0 |
| 5 | NaN | NaN | NaN |
Source attribute: Histologic_type
Target attribute: Histologic_Type
| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | Endometrioid | Endometrioid carcinoma | 1.00 |
| 1 | Clear cell | Clear cell carcinoma | 1.00 |
| 2 | Serous | Serous carcinoma | 0.98 |
| 3 | Carcinosarcoma | Serous carcinoma | 0.75 |
Source attribute: Gender
Target attribute: Sex
| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | Female | Female | 1.0 |
| 1 | NaN | NaN | NaN |
Source attribute: Path_Stage_Reg_Lymph_Nodes-pN
Target attribute: Pathologic_staging_regional_lymph_nodes_pn
| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | pN1 (FIGO IIIC1) | pN1 (FIGO IIIC1) | 1.0 |
| 1 | pN0 | pN0 | 1.0 |
| 2 | pNX | pNX | 1.0 |
| 3 | pN2 (FIGO IIIC2) | pN2 (FIGO IIIC2) | 1.0 |
| 4 | NaN | NaN | NaN |
Source attribute: Country
Target attribute: Participant_country
| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | United States | United States | 1.0 |
| 1 | Ukraine | Ukraine | 1.0 |
| 2 | Poland | Poland | 1.0 |
| 3 | Other_specify | Other: Unknown | 0.9 |
| 4 | NaN | NaN | NaN |
Source attribute: Progesterone_Receptor
Target attribute: Ancillary_studies_progesterone_receptor
| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | Cannot be determined | Cannot be determined | 1.0 |
| 1 | Positive | Positive : % Not available | 1.0 |
| 2 | Unknown | Cannot be determined | 1.0 |
| 3 | Negative | Negative | 1.0 |
| 4 | NaN | NaN | NaN |
Source attribute: Histologic_Grade_FIGO
Target attribute: Histologic_grade
| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | FIGO grade 3 | G3 Poorly differentiated | 1.0 |
| 1 | FIGO grade 1 | G1 Well differentiated | 1.0 |
| 2 | FIGO grade 2 | G2 Moderately differentiated | 1.0 |
| 3 | NaN | NaN | NaN |
Source attribute: FIGO_stage
Target attribute: Pathologic_staging_primary_tumor_pt
| | source_value | target_value | similarity |
|---|---|---|---|
| 0 | IB | pT1b (FIGO IB) | 1.00 |
| 1 | IA | pT1a (FIGO IA) | 1.00 |
| 2 | IIIA | pT3a (FIGO IIIA) | 1.00 |
| 3 | II | pT2 (FIGO II) | 1.00 |
| 4 | IIIB | pT3b (FIGO IIIB) | 1.00 |
| 5 | IVB | pT4 ((FIGO IVA) | 0.75 |
| 6 | IIIC1 | pT3b (FIGO IIIB) | 0.60 |
| 7 | IIIC2 | pT3b | 0.60 |
| 8 | NaN | NaN | NaN |
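
Several source values above are mapped to the same target value (for example, both Serous and Carcinosarcoma map to Serous carcinoma). Such many-to-one mappings lose information and are worth reviewing. The following sketch shows one way to list them. It works on plain (source, target) pairs copied from the tables above; `find_collisions` is a hypothetical helper and not part of BDI-Kit.

```python
from collections import defaultdict

# Pairs copied from the Histologic_type table above.
value_pairs = [
    ("Endometrioid", "Endometrioid carcinoma"),
    ("Clear cell", "Clear cell carcinoma"),
    ("Serous", "Serous carcinoma"),
    ("Carcinosarcoma", "Serous carcinoma"),
]


# Hypothetical helper: group source values by target and keep shared targets.
def find_collisions(pairs):
    by_target = defaultdict(list)
    for source_value, target_value in pairs:
        by_target[target_value].append(source_value)
    return {target: sources for target, sources in by_target.items() if len(sources) > 1}


collisions = find_collisions(value_pairs)
print(collisions)
```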
Materializing the Harmonized Dataset
Next, we generate the harmonization specification and use it to materialize the harmonized dataset. The cell below shows the last two entries of the generated specification:
[6]:
harmonization_specification = bdi.create_harmonization_spec(value_matches)
harmonization_specification[-2:]
[6]:
[{'source_attribute': 'Histologic_Grade_FIGO',
'target_attribute': 'Histologic_grade',
'mapper': {'FIGO grade 3': 'G3 Poorly differentiated', 'FIGO grade 1': 'G1 Well differentiated', 'FIGO grade 2': 'G2 Moderately differentiated', nan: nan}},
{'source_attribute': 'FIGO_stage',
'target_attribute': 'Pathologic_staging_primary_tumor_pt',
'mapper': {'IB': 'pT1b (FIGO IB)', 'IA': 'pT1a (FIGO IA)', 'IIIA': 'pT3a (FIGO IIIA)', 'II': 'pT2 (FIGO II)', 'IIIB': 'pT3b (FIGO IIIB)', 'IIIC1': 'pT3b (FIGO IIIB)', 'IVB': 'pT1b (FIGO IB)', 'IIIC2': 'pT3b (FIGO IIIB)', nan: nan}}]
[7]:
harmonized_dataset = bdi.materialize_mapping(source_dataset, harmonization_specification)
harmonized_dataset.head(5)
[7]:
| | Participant_country | Histologic_Type | Pathologic_staging_regional_lymph_nodes_pn | Sex | Ancillary_studies_progesterone_receptor | Donor_information_number_of_full_term_pregnancies | Histologic_grade | Pathologic_staging_primary_tumor_pt |
|---|---|---|---|---|---|---|---|---|
| 0 | United States | Endometrioid carcinoma | pN1 (FIGO IIIC1) | Female | Cannot be determined | Unknown | G3 Poorly differentiated | pT3b (FIGO IIIB) |
| 1 | Ukraine | Endometrioid carcinoma | pN0 | Female | Positive : % Not available | Unknown | G3 Poorly differentiated | pT1b (FIGO IB) |
| 2 | Ukraine | Endometrioid carcinoma | pNX | Female | Positive : % Not available | 2 | G1 Well differentiated | pT1a (FIGO IA) |
| 3 | United States | Endometrioid carcinoma | pNX | Female | Cannot be determined | 3 | G3 Poorly differentiated | pT4 ((FIGO IVA) |
| 4 | Other: Unknown | Endometrioid carcinoma | pNX | Female | Cannot be determined | 4 or more | G1 Well differentiated | pT1a (FIGO IA) |
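
Conceptually, each specification entry renames one source column and translates its values through the `mapper` dictionary. The sketch below illustrates that idea on plain Python rows. It is a simplified illustration under stated assumptions, not BDI-Kit's implementation of materialize_mapping(): `apply_spec` is a hypothetical helper, and passing unmapped values through unchanged is an assumption made here.

```python
# One entry in the same shape as the specification shown above.
spec = [
    {
        "source_attribute": "Histologic_Grade_FIGO",
        "target_attribute": "Histologic_grade",
        "mapper": {
            "FIGO grade 1": "G1 Well differentiated",
            "FIGO grade 2": "G2 Moderately differentiated",
            "FIGO grade 3": "G3 Poorly differentiated",
        },
    }
]


# Hypothetical helper: rename each mapped column and translate its values.
def apply_spec(rows, spec):
    harmonized = []
    for row in rows:
        out = {}
        for entry in spec:
            value = row.get(entry["source_attribute"])
            # Assumption: values without a mapping are kept as they are.
            out[entry["target_attribute"]] = entry["mapper"].get(value, value)
        harmonized.append(out)
    return harmonized


rows = [
    {"Histologic_Grade_FIGO": "FIGO grade 3"},
    {"Histologic_Grade_FIGO": "FIGO grade 1"},
]
print(apply_spec(rows, spec))
```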
Reusing the Harmonization Specification
Instead of recomputing all matches, we can reuse the existing harmonization specification on a new version of the source dataset.
[8]:
new_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2020_sample_part2.csv")
new_dataset.head(5)
[8]:
| | Country | Histologic_type | Path_Stage_Reg_Lymph_Nodes-pN | Gender | Progesterone_Receptor | Num_full_term_pregnancies | Histologic_Grade_FIGO | FIGO_stage |
|---|---|---|---|---|---|---|---|---|
| 0 | Ukraine | Endometrioid | pN0 | Female | Positive | 1 | FIGO grade 2 | IA |
| 1 | Other_specify | Endometrioid | pN0 | Female | Cannot be determined | 1 | FIGO grade 1 | IB |
| 2 | Ukraine | Endometrioid | pN0 | Female | Positive | 1 | FIGO grade 2 | II |
| 3 | Ukraine | Endometrioid | pNX | Female | Positive | NaN | FIGO grade 2 | IA |
| 4 | United States | Endometrioid | pNX | Female | Cannot be determined | NaN | FIGO grade 1 | IA |
[9]:
new_harmonized_dataset = bdi.materialize_mapping(new_dataset, harmonization_specification)
new_harmonized_dataset.head(5)
[9]:
| | Participant_country | Histologic_Type | Pathologic_staging_regional_lymph_nodes_pn | Sex | Ancillary_studies_progesterone_receptor | Donor_information_number_of_full_term_pregnancies | Histologic_grade | Pathologic_staging_primary_tumor_pt |
|---|---|---|---|---|---|---|---|---|
| 0 | Ukraine | Endometrioid carcinoma | pN0 | Female | Positive : % Not available | 1 | G2 Moderately differentiated | pT1a (FIGO IA) |
| 1 | Other: Unknown | Endometrioid carcinoma | pN0 | Female | Cannot be determined | 1 | G1 Well differentiated | pT1b (FIGO IB) |
| 2 | Ukraine | Endometrioid carcinoma | pN0 | Female | Positive : % Not available | 1 | G2 Moderately differentiated | pT2 (FIGO II) |
| 3 | Ukraine | Endometrioid carcinoma | pNX | Female | Positive : % Not available | NaN | G2 Moderately differentiated | pT1a (FIGO IA) |
| 4 | United States | Endometrioid carcinoma | pNX | Female | Cannot be determined | NaN | G1 Well differentiated | pT1a (FIGO IA) |
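
When a specification is reused, the new data may contain values that were not present when the matches were computed. The sketch below shows one way to detect such values before materializing. It is not a BDI-Kit feature: `unmapped_values` is a hypothetical helper, rows are plain dictionaries, missing values are represented as `None`, and the value "IVA" is made up for illustration.

```python
# Hypothetical helper: report source values that have no mapper entry.
def unmapped_values(rows, spec):
    """Return {source_attribute: sorted values missing from the mapper}."""
    missing = {}
    for entry in spec:
        column = entry["source_attribute"]
        seen = {row[column] for row in rows if row.get(column) is not None}
        unknown = sorted(seen - set(entry["mapper"]))
        if unknown:
            missing[column] = unknown
    return missing


# A reduced specification entry in the shape shown earlier.
spec = [
    {
        "source_attribute": "FIGO_stage",
        "target_attribute": "Pathologic_staging_primary_tumor_pt",
        "mapper": {"IA": "pT1a (FIGO IA)", "IB": "pT1b (FIGO IB)"},
    }
]
# "IVA" is an invented value that the mapper does not cover.
new_rows = [{"FIGO_stage": "IA"}, {"FIGO_stage": "IVA"}, {"FIGO_stage": None}]
print(unmapped_values(new_rows, spec))
```

An empty result suggests the existing specification covers the new data; otherwise, value matching could be rerun for the affected attributes only.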