Quick Start

If BDI-Kit is not installed yet, you can install it with:

[ ]:

! pip install bdi-kit

Then import the library:

[1]:

import bdikit as bdi
import pandas as pd

In this example, we map a sample of endometrial cancer data from Dou et al. 2020 to the schema of Dou et al. 2023.

We load our source data using Pandas:

[2]:

source_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2020_sample_part1.csv")
source_dataset.head(5)

[2]:

	Country	Histologic_type	Path_Stage_Reg_Lymph_Nodes-pN	Gender	Progesterone_Receptor	Num_full_term_pregnancies	Histologic_Grade_FIGO	FIGO_stage
0	United States	Endometrioid	pN1 (FIGO IIIC1)	Female	Cannot be determined	Unknown	FIGO grade 3	IIIC1
1	Ukraine	Endometrioid	pN0	Female	Positive	Unknown	FIGO grade 3	IB
2	Ukraine	Endometrioid	pNX	Female	Positive	2	FIGO grade 1	IA
3	United States	Endometrioid	pNX	Female	Cannot be determined	3	FIGO grade 3	IVB
4	Other_specify	Endometrioid	pNX	Female	Cannot be determined	4 or more	FIGO grade 1	IA

Then, we load the target dataset:

[3]:

target_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2023.csv")
target_dataset.head(5)

[3]:

	Idx	Case_id	Case_excluded	Batch	Plex	ReporterName	Aliquot_ID	Group	Discovery_study	Age	...	Follow-up_additional_surgery_for_new_tumor	Follow-up_additional_treatment_radiation_therapy_for_new_tumor	Follow-up_additional_treatment_pharmaceutical_therapy_for_new_tumor	Follow-up_additional_treatment_immuno_for_new_tumor	Follow-up_days_from_date_of_collection_to_date_of_last_contact	Follow-up_cause_of_death	Follow-up_days_from_date_of_initial_pathologic_diagnosis_to_date_of_death	Follow-up_days_from_date_of_initial_pathologic_diagnosis_to_date_of_additional_surgery_for_new_tumor	Follow-up_procedure_type_of_new_tumor	Follow-up_residual_tumor_after_surgery_for_new_tumor
0	C3L-00086	C3L-00086	No	b4	16.0	128N	CPT0092460003	Tumor	No	56	...	n/a|No|No|No|No	n/a|Yes|Yes|Yes|Yes	n/a|Yes|Yes|Yes|Yes	n/a|No|No|No|No	330.0|701.0|1046.0|1436.0|n/a	n/a|n/a|n/a|n/a|Breast Carcinoma	n/a|n/a|n/a|n/a|1578.0	n/a|n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a|n/a
1	C3L-00898	C3L-00898	No	b4	14.0	128C	CPT0172200008	Tumor	No	54	...	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	396.0|746.0|982.0|1600.0	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a
2	C3L-00943	C3L-00943	No	b4	15.0	130C	CPT0086090003	Tumor	No	63	...	n/a|n/a|n/a	n/a|n/a|n/a	n/a|n/a|n/a	n/a|n/a|n/a	237.0|693.0|1039.0	n/a|n/a|n/a	n/a|n/a|n/a	n/a|n/a|n/a	n/a|n/a|n/a	n/a|n/a|n/a
3	C3L-01064	C3L-01064	No	b3	9.0	129N	CPT0113430004	Tumor	No	54	...	No|No|No|No	No|Yes|No|No	Yes|Yes|Yes|Yes	No|No|No|No	453.0|726.0|1062.0|1447.0	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a	n/a|n/a|n/a|n/a
4	C3L-01277	C3L-01277	No	b4	13.0	130N	CPT0093170003	Tumor	No	61	...	n/a|No|No	n/a|No|Yes	n/a|Yes|No	n/a|No|No	351.0|713.0|967.0	n/a|n/a|n/a	n/a|n/a|n/a	n/a|n/a|n/a	n/a|n/a|n/a	n/a|n/a|n/a

5 rows × 213 columns

Configuring LLM Providers

Some BDI-Kit primitives rely on Large Language Models (LLMs), and commercial LLM providers require API credentials. BDI-Kit supports multiple providers, including OpenAI, Gemini, Anthropic, Ollama, DeepInfra, and others through LiteLLM-compatible integrations.

For example, to use OpenAI models, export your API key as an environment variable (BDI-Kit will read this variable):

export OPENAI_API_KEY="your_api_key_here"
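
When working inside a notebook, the same variable can also be set from Python before calling any LLM-based primitive. The sketch below uses only the standard library; the helper name `ensure_api_key` is hypothetical and is not part of BDI-Kit.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Set var_name in the process environment if it is missing.

    Hypothetical helper for illustration; not part of BDI-Kit.
    """
    if not os.environ.get(var_name):
        # Prompt without echoing, so the key never appears in notebook output.
        os.environ[var_name] = prompt_fn(f"Enter {var_name}: ")
    return var_name in os.environ
```

Calling `ensure_api_key()` once at the top of the notebook leaves an already exported key untouched and prompts only when the variable is absent.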

Schema Matching

BDI-Kit can help with automatic discovery of one-to-one matches between the attributes/columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a data model such as the GDC (Genomic Data Commons).

To do this, we call the match_schema() function as follows.

[4]:

attribute_matches = bdi.match_schema(source_dataset, target_dataset, method="magneto_ft_llm")
attribute_matches

[4]:

	source_attribute	target_attribute	similarity
0	Num_full_term_pregnancies	Donor_information_number_of_full_term_pregnancies	0.95
1	Histologic_type	Histologic_Type	0.90
2	Gender	Sex	0.90
3	Path_Stage_Reg_Lymph_Nodes-pN	Pathologic_staging_regional_lymph_nodes_pn	0.90
4	Country	Participant_country	0.90
5	Progesterone_Receptor	Ancillary_studies_progesterone_receptor	0.85
6	Histologic_Grade_FIGO	Histologic_grade	0.85
7	FIGO_stage	Pathologic_staging_primary_tumor_pt	0.85
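
Because the result is a regular pandas DataFrame, it can be checked with ordinary pandas operations. The sketch below rebuilds three rows of the output above as stand-in data, verifies that the matches are one-to-one, and derives a source-to-target column lookup. The variable names `is_one_to_one` and `column_map` are illustrative only.

```python
import pandas as pd

# Three rows copied from the output above, used here as stand-in data.
attribute_matches = pd.DataFrame(
    {
        "source_attribute": ["Gender", "Country", "FIGO_stage"],
        "target_attribute": [
            "Sex",
            "Participant_country",
            "Pathologic_staging_primary_tumor_pt",
        ],
        "similarity": [0.90, 0.90, 0.85],
    }
)

# One-to-one means no source or target attribute appears more than once.
is_one_to_one = (
    attribute_matches["source_attribute"].is_unique
    and attribute_matches["target_attribute"].is_unique
)

# Plain dictionary from source column name to target column name.
column_map = dict(
    zip(attribute_matches["source_attribute"], attribute_matches["target_attribute"])
)
```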

Value Matching

After finding the correct attribute matches, we need to find appropriate value matches. Using match_values(), we can inspect how the values of each matched source attribute would map to target values after harmonization. BDI-Kit implements multiple methods for discovering value matches.

To specify a value matching approach, we can pass the method parameter.

[5]:

value_matches = bdi.match_values(source_dataset, target_dataset, attribute_matches=attribute_matches, method="llm")
bdi.view_value_matches(value_matches)

Source attribute: Num_full_term_pregnancies
Target attribute: Donor_information_number_of_full_term_pregnancies

	source_value	target_value	similarity
0	Unknown	Unknown	1.0
1	2	2	1.0
2	3	3	1.0
3	4 or more	4 or more	1.0
4	1	1	1.0
5	NaN	NaN	NaN

Source attribute: Histologic_type
Target attribute: Histologic_Type

	source_value	target_value	similarity
0	Endometrioid	Endometrioid carcinoma	1.00
1	Clear cell	Clear cell carcinoma	1.00
2	Serous	Serous carcinoma	0.98
3	Carcinosarcoma	Serous carcinoma	0.75

Source attribute: Gender
Target attribute: Sex

	source_value	target_value	similarity
0	Female	Female	1.0
1	NaN	NaN	NaN

Source attribute: Path_Stage_Reg_Lymph_Nodes-pN
Target attribute: Pathologic_staging_regional_lymph_nodes_pn

	source_value	target_value	similarity
0	pN1 (FIGO IIIC1)	pN1 (FIGO IIIC1)	1.0
1	pN0	pN0	1.0
2	pNX	pNX	1.0
3	pN2 (FIGO IIIC2)	pN2 (FIGO IIIC2)	1.0
4	NaN	NaN	NaN

Source attribute: Country
Target attribute: Participant_country

	source_value	target_value	similarity
0	United States	United States	1.0
1	Ukraine	Ukraine	1.0
2	Poland	Poland	1.0
3	Other_specify	Other: Unknown	0.9
4	NaN	NaN	NaN

Source attribute: Progesterone_Receptor
Target attribute: Ancillary_studies_progesterone_receptor

	source_value	target_value	similarity
0	Cannot be determined	Cannot be determined	1.0
1	Positive	Positive : % Not available	1.0
2	Unknown	Cannot be determined	1.0
3	Negative	Negative	1.0
4	NaN	NaN	NaN

Source attribute: Histologic_Grade_FIGO
Target attribute: Histologic_grade

	source_value	target_value	similarity
0	FIGO grade 3	G3 Poorly differentiated	1.0
1	FIGO grade 1	G1 Well differentiated	1.0
2	FIGO grade 2	G2 Moderately differentiated	1.0
3	NaN	NaN	NaN

Source attribute: FIGO_stage
Target attribute: Pathologic_staging_primary_tumor_pt

	source_value	target_value	similarity
0	IB	pT1b (FIGO IB)	1.00
1	IA	pT1a (FIGO IA)	1.00
2	IIIA	pT3a (FIGO IIIA)	1.00
3	II	pT2 (FIGO II)	1.00
4	IIIB	pT3b (FIGO IIIB)	1.00
5	IVB	pT4 ((FIGO IVA)	0.75
6	IIIC1	pT3b (FIGO IIIB)	0.60
7	IIIC2	pT3b	0.60
8	NaN	NaN	NaN
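
Matches with a low similarity score are worth a manual look before harmonizing. The sketch below copies the FIGO_stage rows from the output above into a DataFrame and lists the source values whose score falls under a cutoff. The 0.8 threshold and the names `figo_matches` and `needs_review` are assumptions made for this illustration, not BDI-Kit defaults.

```python
import pandas as pd

# FIGO_stage rows copied from the output above (the NaN row is omitted).
figo_matches = pd.DataFrame(
    {
        "source_value": ["IB", "IA", "IIIA", "II", "IIIB", "IVB", "IIIC1", "IIIC2"],
        "target_value": [
            "pT1b (FIGO IB)",
            "pT1a (FIGO IA)",
            "pT3a (FIGO IIIA)",
            "pT2 (FIGO II)",
            "pT3b (FIGO IIIB)",
            "pT4 ((FIGO IVA)",
            "pT3b (FIGO IIIB)",
            "pT3b",
        ],
        "similarity": [1.00, 1.00, 1.00, 1.00, 1.00, 0.75, 0.60, 0.60],
    }
)

REVIEW_THRESHOLD = 0.8  # assumed cutoff, chosen only for this illustration

# Keep only the rows scoring below the cutoff.
needs_review = figo_matches[figo_matches["similarity"] < REVIEW_THRESHOLD]
print(needs_review["source_value"].tolist())  # ['IVB', 'IIIC1', 'IIIC2']
```

Here the three flagged values are exactly the ones whose proposed targets name a different FIGO stage than the source value.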

Materializing the Harmonized Dataset

Next, we generate the harmonization specification and use it to materialize the harmonized dataset. The cell below shows the last two entries of the generated specification:

[6]:

harmonization_specification = bdi.create_harmonization_spec(value_matches)
harmonization_specification[-2:]

[6]:

[{'source_attribute': 'Histologic_Grade_FIGO',
  'target_attribute': 'Histologic_grade',
  'mapper': {'FIGO grade 3': 'G3 Poorly differentiated', 'FIGO grade 1': 'G1 Well differentiated', 'FIGO grade 2': 'G2 Moderately differentiated', nan: nan}},
 {'source_attribute': 'FIGO_stage',
  'target_attribute': 'Pathologic_staging_primary_tumor_pt',
  'mapper': {'IB': 'pT1b (FIGO IB)', 'IA': 'pT1a (FIGO IA)', 'IIIA': 'pT3a (FIGO IIIA)', 'II': 'pT2 (FIGO II)', 'IIIB': 'pT3b (FIGO IIIB)', 'IIIC1': 'pT3b (FIGO IIIB)', 'IVB': 'pT1b (FIGO IB)', 'IIIC2': 'pT3b (FIGO IIIB)', nan: nan}}]
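
Each entry pairs a source column with a target column and a value dictionary. To make that structure concrete, the sketch below applies one entry by hand with pandas `Series.map`. It illustrates what the specification encodes and is not necessarily how `materialize_mapping` is implemented; the `source` frame is made-up stand-in data.

```python
import pandas as pd

# One specification entry, abbreviated from the output above.
spec_entry = {
    "source_attribute": "Histologic_Grade_FIGO",
    "target_attribute": "Histologic_grade",
    "mapper": {
        "FIGO grade 3": "G3 Poorly differentiated",
        "FIGO grade 1": "G1 Well differentiated",
        "FIGO grade 2": "G2 Moderately differentiated",
    },
}

# Stand-in source data with one missing value.
source = pd.DataFrame({"Histologic_Grade_FIGO": ["FIGO grade 3", "FIGO grade 1", None]})

# Translate each source value through the mapper and store the result under
# the target column name. Values without a mapper entry become NaN.
translated = source[spec_entry["source_attribute"]].map(spec_entry["mapper"])
harmonized = pd.DataFrame({spec_entry["target_attribute"]: translated})
```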

[7]:

harmonized_dataset = bdi.materialize_mapping(source_dataset, harmonization_specification)
harmonized_dataset.head(5)

[7]:

	Participant_country	Histologic_Type	Pathologic_staging_regional_lymph_nodes_pn	Sex	Ancillary_studies_progesterone_receptor	Donor_information_number_of_full_term_pregnancies	Histologic_grade	Pathologic_staging_primary_tumor_pt
0	United States	Endometrioid carcinoma	pN1 (FIGO IIIC1)	Female	Cannot be determined	Unknown	G3 Poorly differentiated	pT3b (FIGO IIIB)
1	Ukraine	Endometrioid carcinoma	pN0	Female	Positive : % Not available	Unknown	G3 Poorly differentiated	pT1b (FIGO IB)
2	Ukraine	Endometrioid carcinoma	pNX	Female	Positive : % Not available	2	G1 Well differentiated	pT1a (FIGO IA)
3	United States	Endometrioid carcinoma	pNX	Female	Cannot be determined	3	G3 Poorly differentiated	pT4 ((FIGO IVA)
4	Other: Unknown	Endometrioid carcinoma	pNX	Female	Cannot be determined	4 or more	G1 Well differentiated	pT1a (FIGO IA)

Reusing the Harmonization Specification

Instead of recomputing all matches, we can reuse the existing harmonization specification on a new version of the source dataset.

[8]:

new_dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/refs/heads/devel/examples/datasets/dou_2020_sample_part2.csv")
new_dataset.head(5)

[8]:

	Country	Histologic_type	Path_Stage_Reg_Lymph_Nodes-pN	Gender	Progesterone_Receptor	Num_full_term_pregnancies	Histologic_Grade_FIGO	FIGO_stage
0	Ukraine	Endometrioid	pN0	Female	Positive	1	FIGO grade 2	IA
1	Other_specify	Endometrioid	pN0	Female	Cannot be determined	1	FIGO grade 1	IB
2	Ukraine	Endometrioid	pN0	Female	Positive	1	FIGO grade 2	II
3	Ukraine	Endometrioid	pNX	Female	Positive	NaN	FIGO grade 2	IA
4	United States	Endometrioid	pNX	Female	Cannot be determined	NaN	FIGO grade 1	IA

[9]:

new_harmonized_dataset = bdi.materialize_mapping(new_dataset, harmonization_specification)
new_harmonized_dataset.head(5)

[9]:

	Participant_country	Histologic_Type	Pathologic_staging_regional_lymph_nodes_pn	Sex	Ancillary_studies_progesterone_receptor	Donor_information_number_of_full_term_pregnancies	Histologic_grade	Pathologic_staging_primary_tumor_pt
0	Ukraine	Endometrioid carcinoma	pN0	Female	Positive : % Not available	1	G2 Moderately differentiated	pT1a (FIGO IA)
1	Other: Unknown	Endometrioid carcinoma	pN0	Female	Cannot be determined	1	G1 Well differentiated	pT1b (FIGO IB)
2	Ukraine	Endometrioid carcinoma	pN0	Female	Positive : % Not available	1	G2 Moderately differentiated	pT2 (FIGO II)
3	Ukraine	Endometrioid carcinoma	pNX	Female	Positive : % Not available	NaN	G2 Moderately differentiated	pT1a (FIGO IA)
4	United States	Endometrioid carcinoma	pNX	Female	Cannot be determined	NaN	G1 Well differentiated	pT1a (FIGO IA)
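
A reused specification only covers values that were present when it was built. Before applying it to new data, it can help to list any values the mappers do not know. The helper `unmapped_values` below is hypothetical, not a BDI-Kit function. It assumes each specification entry is a dictionary with `source_attribute` and `mapper` keys, as in the output shown earlier, and the small `spec` and `new_rows` objects are trimmed stand-ins.

```python
import pandas as pd


def unmapped_values(dataset, spec):
    """Return {source_attribute: sorted values absent from that entry's mapper}.

    Hypothetical helper for illustration; not part of BDI-Kit.
    """
    missing = {}
    for entry in spec:
        column = entry["source_attribute"]
        mapper = entry.get("mapper")
        # Skip entries without a value dictionary or without a matching column.
        if not isinstance(mapper, dict) or column not in dataset.columns:
            continue
        observed = set(dataset[column].dropna().unique())
        unknown = observed - set(mapper)
        if unknown:
            missing[column] = sorted(unknown)
    return missing


# Trimmed stand-in specification and data.
spec = [
    {
        "source_attribute": "FIGO_stage",
        "target_attribute": "Pathologic_staging_primary_tumor_pt",
        "mapper": {"IA": "pT1a (FIGO IA)", "IB": "pT1b (FIGO IB)"},
    }
]
new_rows = pd.DataFrame({"FIGO_stage": ["IA", "IB", "II"]})

print(unmapped_values(new_rows, spec))  # {'FIGO_stage': ['II']}
```

An empty dictionary means every observed value has a mapper entry, so the specification can be reused as-is.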