Getting Started

If BDI-Kit is not installed yet, you can install it with:

[ ]:

! pip install bdi-kit

Then import the library:

[1]:

import bdikit as bdi
import pandas as pd

In this example, we are mapping data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/32059776/) to the GDC format.

[2]:

dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/devel/examples/datasets/dou_2020.csv")

columns = [
    "Country",
    "Histologic_type",
    "FIGO_stage",
    "BMI",
    "Age",
    "Race",
    "Ethnicity",
    "Gender",
    "Tumor_Focality",
    "Tumor_Size_cm",
]

dataset[columns].head(10)

[2]:

	Country	Histologic_type	FIGO_stage	BMI	Age	Race	Ethnicity	Gender	Tumor_Focality	Tumor_Size_cm
0	United States	Endometrioid	IA	38.88	64.0	White	Not-Hispanic or Latino	Female	Unifocal	2.9
1	United States	Endometrioid	IA	39.76	58.0	White	Not-Hispanic or Latino	Female	Unifocal	3.5
2	United States	Endometrioid	IA	51.19	50.0	White	Not-Hispanic or Latino	Female	Unifocal	4.5
3	NaN	Carcinosarcoma	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	United States	Endometrioid	IA	32.69	75.0	White	Not-Hispanic or Latino	Female	Unifocal	3.5
5	United States	Serous	IA	20.28	63.0	White	Not-Hispanic or Latino	Female	Unifocal	6.0
6	United States	Endometrioid	IA	55.67	50.0	White	Not-Hispanic or Latino	Female	Unifocal	4.5
7	Other_specify	Endometrioid	IA	25.68	60.0	White	Not-Hispanic or Latino	Female	Unifocal	5.0
8	United States	Serous	IIIA	21.57	83.0	White	Not-Hispanic or Latino	Female	Unifocal	4.0
9	United States	Endometrioid	IA	34.26	69.0	White	Not-Hispanic or Latino	Female	Unifocal	5.2

Matching the table schema to GDC standard vocabulary

BDI-Kit offers a suite of functions to help with data harmonization tasks. For instance, it can help with automatic discovery of one-to-one mappings between the attributes/columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a standard data vocabulary such as the GDC (Genomic Data Commons).
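
To make the idea of attribute matching concrete, here is a minimal sketch that scores pairs of column names by string similarity and keeps the best-scoring target for each source column. It is only an illustration: it uses Python's standard difflib module rather than any BDI-Kit matcher, the helper names name_similarity and best_match are hypothetical, and the target list is a small hand-picked subset.

```python
from difflib import SequenceMatcher


def name_similarity(source: str, target: str) -> float:
    """Similarity in [0, 1] between two attribute names, ignoring case."""
    return SequenceMatcher(None, source.lower(), target.lower()).ratio()


def best_match(source: str, targets: list[str]) -> tuple[str, float]:
    """Return the target name with the highest similarity to `source`."""
    scored = [(target, name_similarity(source, target)) for target in targets]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical, hand-picked subset of target attribute names.
targets = ["bmi", "ethnicity", "figo_stage", "tumor_focality"]

for column in ["BMI", "Ethnicity", "FIGO_stage", "Tumor_Focality"]:
    target, score = best_match(column, targets)
    print(f"{column} -> {target} ({score:.2f})")
```

Name-only comparison breaks down quickly on real schemas (short names such as Age can score higher against unrelated targets), which is one reason to rely on a dedicated matcher and to verify its suggestions.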

To do this with BDI-Kit, we call the match_schema() function, which matches the source attributes against the GDC vocabulary schema:

[3]:

attribute_matches = bdi.match_schema(dataset[columns], target="gdc", method="magneto_ft_bp")
attribute_matches

[3]:

	source_attribute	target_attribute	similarity
0	BMI	bmi	1.000000
1	Ethnicity	ethnicity	1.000000
2	Gender	gender	1.000000
3	FIGO_stage	figo_stage	1.000000
4	Tumor_Focality	tumor_focality	1.000000
5	Race	race	1.000000
6	Age	age_at_index	0.988827
7	Country	country_of_birth	0.957011
8	Tumor_Size_cm	tumor_length_measurement	0.905819
9	Histologic_type	histologic_progression_type	0.727179

Generating a harmonized table

After discovering a schema mapping, we can generate a new table (DataFrame) using the new attribute names from the GDC standard vocabulary.

To do so using BDI-Kit, we can use the function materialize_mapping() as follows. Note that the column headers have been renamed to the target schema.

[4]:

bdi.materialize_mapping(dataset, attribute_matches)

[4]:

	bmi	ethnicity	gender	figo_stage	tumor_focality	race	age_at_index	country_of_birth	tumor_length_measurement	histologic_progression_type
0	38.88	Not-Hispanic or Latino	Female	IA	Unifocal	White	64.0	United States	2.9	Endometrioid
1	39.76	Not-Hispanic or Latino	Female	IA	Unifocal	White	58.0	United States	3.5	Endometrioid
2	51.19	Not-Hispanic or Latino	Female	IA	Unifocal	White	50.0	United States	4.5	Endometrioid
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Carcinosarcoma
4	32.69	Not-Hispanic or Latino	Female	IA	Unifocal	White	75.0	United States	3.5	Endometrioid
...	...	...	...	...	...	...	...	...	...	...
99	29.40	NaN	Female	IA	Unifocal	NaN	75.0	Ukraine	4.2	Endometrioid
100	35.42	NaN	Female	II	Unifocal	NaN	74.0	Ukraine	1.5	Endometrioid
101	24.32	Not-Hispanic or Latino	Female	II	Unifocal	Black or African American	85.0	United States	3.8	Serous
102	34.06	NaN	Female	IA	Unifocal	NaN	70.0	Ukraine	5.0	Serous
103	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Ukraine	NaN	Serous

104 rows × 10 columns
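
Conceptually, applying a schema mapping amounts to keeping the matched source columns and renaming them to their target names. The plain-Python sketch below shows that idea on two illustrative records; apply_schema_mapping is a hypothetical helper, not part of BDI-Kit, and it makes no claim about how materialize_mapping() is implemented.

```python
def apply_schema_mapping(rows, matches):
    """Keep only matched source columns and rename them to their targets.

    rows: list of dicts, one per record, keyed by source column name.
    matches: list of (source_attribute, target_attribute) pairs.
    """
    return [
        {target: row.get(source) for source, target in matches}
        for row in rows
    ]


# Two illustrative records; Tumor_Site has no match and is therefore dropped.
rows = [
    {"Country": "United States", "Age": 64.0, "Tumor_Site": "Anterior endometrium"},
    {"Country": "Ukraine", "Age": 75.0, "Tumor_Site": "Other, specify"},
]
matches = [("Age", "age_at_index"), ("Country", "country_of_birth")]

harmonized = apply_schema_mapping(rows, matches)
print(harmonized[0])
```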

Generating a harmonized table with value mappings

BDI-Kit can also help with translation of the values from the source table to the target standard format.

To this end, BDI-Kit provides the function match_values(), which automatically creates value mappings for each string column. The output of match_values() can be passed to materialize_mapping(), which materializes the final target table using both schema and value mappings.

[5]:

value_mappings = bdi.match_values(dataset, target="gdc", attribute_matches=attribute_matches, method="tfidf")
bdi.materialize_mapping(dataset, value_mappings)

[5]:

	ethnicity	gender	tumor_focality	race	country_of_birth	figo_stage	histologic_progression_type
0	not hispanic or latino	female	Unifocal	white	United States	Stage IA	NaN
1	not hispanic or latino	female	Unifocal	white	United States	Stage IA	NaN
2	not hispanic or latino	female	Unifocal	white	United States	Stage IA	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	not hispanic or latino	female	Unifocal	white	United States	Stage IA	NaN
...	...	...	...	...	...	...	...
99	NaN	female	Unifocal	NaN	Ukraine	Stage IA	NaN
100	NaN	female	Unifocal	NaN	Ukraine	Stage III	NaN
101	not hispanic or latino	female	Unifocal	black or african american	United States	Stage III	NaN
102	NaN	female	Unifocal	NaN	Ukraine	Stage IA	NaN
103	NaN	NaN	NaN	NaN	Ukraine	NaN	NaN

104 rows × 7 columns

Verifying the schema mappings

Sometimes the automatically generated mappings are incorrect, or you may want to verify them individually. To verify the suggested attribute matches, you can use BDI-Kit and BDIViz, which offer additional APIs to visualize the data and make modifications when necessary.

For this example, we will use the column Histologic_type. We can start by exploring the columns most similar to Histologic_type.

For this, we can use the rank_schema_matches() function. Here, we notice that primary_diagnosis, ranked third, is a plausible target column.

[6]:

hist_type_matches = bdi.rank_schema_matches(dataset, target="gdc", attributes=["Histologic_type"])
hist_type_matches

[6]:

	source_attribute	target_attribute	similarity
0	Histologic_type	sample_type	0.554611
1	Histologic_type	history_of_tumor_type	0.542955
2	Histologic_type	primary_diagnosis	0.526662
3	Histologic_type	morphologic_architectural_pattern	0.478859
4	Histologic_type	viral_hepatitis_serologies	0.469105
5	Histologic_type	analyte_type_id	0.462921
6	Histologic_type	histone_variant	0.461407
7	Histologic_type	cog_rhabdomyosarcoma_risk_group	0.425261
8	Histologic_type	tumor_descriptor	0.419358
9	Histologic_type	specimen_type	0.419077

Viewing the attribute domains

To verify that primary_diagnosis is a good target attribute, we view and compare the domains of each attribute using the preview_domain() function. For the source table, it returns the list of unique values in the source attribute. For the GDC target, it returns the list of unique valid values that an attribute can have.

Here we see that the values seem to be related.

[7]:

bdi.preview_domain(dataset, "Histologic_type")

[7]:

	value_name
0	Endometrioid
1	Carcinosarcoma
2	Serous
3	Clear cell

[8]:

bdi.preview_domain("gdc", "primary_diagnosis")

[8]:

	value_name	value_description	attribute_description
0	Abdominal desmoid	An insidious poorly circumscribed neoplasm ari...	Text term used to describe the patient's histo...
1	Abdominal fibromatosis	An insidious poorly circumscribed neoplasm ari...
2	Achromic nevus	A benign nevus characterized by the absence of...
3	Acidophil adenocarcinoma	A malignant epithelial neoplasm of the anterio...
4	Acidophil adenoma	An epithelial neoplasm of the anterior pituita...
...	...	...	...
2620	Wolffian duct tumor	An epithelial neoplasm of the female reproduct...
2621	Xanthofibroma	A benign neoplasm composed of fibroblastic spi...
2622	Yolk sac tumor	A non-seminomatous malignant germ cell tumor c...
2623	Unknown	Not known, not observed, not recorded, or refu...
2624	Not Reported	Not provided or available.

2625 rows × 3 columns

Since primary_diagnosis looks like a correct match for Histologic_type, we can modify the attribute_matches variable directly.

[9]:

attribute_matches.loc[attribute_matches["source_attribute"] == "Histologic_type", "target_attribute"] = "primary_diagnosis"
attribute_matches

[9]:

	source_attribute	target_attribute	similarity
0	BMI	bmi	1.000000
1	Ethnicity	ethnicity	1.000000
2	Gender	gender	1.000000
3	FIGO_stage	figo_stage	1.000000
4	Tumor_Focality	tumor_focality	1.000000
5	Race	race	1.000000
6	Age	age_at_index	0.988827
7	Country	country_of_birth	0.957011
8	Tumor_Size_cm	tumor_length_measurement	0.905819
9	Histologic_type	primary_diagnosis	0.727179

Finding correct value mappings

After finding the correct column, we need to find appropriate value mappings. Using match_values(), we can inspect the candidate value mappings for this attribute pair before harmonizing the data.

BDI-Kit implements multiple methods for value mapping discovery, including:

edit_distance - Computes value similarities using Levenshtein’s edit distance measure.
tfidf - A method based on tf-idf importance weighting computed over character n-grams.
embedding - Uses BERT word embeddings to compute “semantic similarity” between the values.
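
As a rough illustration of the first method in the list, the sketch below computes the Levenshtein edit distance with the classic dynamic-programming recurrence and turns it into a similarity in [0, 1]. The function names and the normalization (one minus the distance divided by the longer length) are assumptions chosen for illustration; BDI-Kit may normalize differently, so the numbers from this sketch are not expected to reproduce the scores shown in this tutorial.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1]; this normalization is one common choice."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(edit_similarity("Serous", "Serous carcinoma, NOS"))
```

Because edit distance only looks at characters, a short source value can end up closer to an unrelated short term than to the clinically correct, longer one, which is worth keeping in mind when reviewing its suggestions.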

To specify a value mapping approach, we can pass the method parameter.

[10]:

bdi.match_values(
    dataset, target="gdc", attribute_matches=("Histologic_type", "primary_diagnosis"), method="edit_distance"
)

[10]:

	source_attribute	target_attribute	source_value	target_value	similarity
0	Histologic_type	primary_diagnosis	Carcinosarcoma	Carcinosarcoma, NOS	0.848485
1	Histologic_type	primary_diagnosis	Clear cell	Clear cell adenoma	0.714286
2	Histologic_type	primary_diagnosis	Endometrioid	Endometrioid adenoma, NOS	0.648649
3	Histologic_type	primary_diagnosis	Serous	Neuronevus	0.625000

[11]:

bdi.match_values(
    dataset, target="gdc", attribute_matches=("Histologic_type", "primary_diagnosis"), method="tfidf"
)

[11]:

	source_attribute	target_attribute	source_value	target_value	similarity
0	Histologic_type	primary_diagnosis	Carcinosarcoma	Carcinosarcoma, NOS	0.969
1	Histologic_type	primary_diagnosis	Endometrioid	Endometrioid adenoma, NOS	0.897
2	Histologic_type	primary_diagnosis	Clear cell	Clear cell adenoma	0.853
3	Histologic_type	primary_diagnosis	Serous	Serous carcinoma, NOS	0.755

[12]:

bdi.match_values(
    dataset, target="gdc", attribute_matches=("Histologic_type", "primary_diagnosis"), method="embedding"
)

[12]:

	source_attribute	target_attribute	source_value	target_value	similarity
0	Histologic_type	primary_diagnosis	Carcinosarcoma	Carcinofibroma	0.897
1	Histologic_type	primary_diagnosis	Clear cell	Clear cell carcinoma	0.773
2	Histologic_type	primary_diagnosis	Endometrioid	Endometrioid cystadenocarcinoma	0.755
3	Histologic_type	primary_diagnosis	Serous	Myoma	0.647

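None of the three methods yields a fully satisfactory mapping (for example, two of them match Serous to the unrelated terms Neuronevus and Myoma), so we define the value mapping for Histologic_type manually:

[13]: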

hist_type_vmap = pd.DataFrame(
    columns=["source_value", "target_value"],
    data=[
        ("Carcinosarcoma", "Carcinosarcoma, NOS"),
        ("Clear cell", "Clear cell adenocarcinoma, NOS"),
        ("Endometrioid", "Endometrioid carcinoma"),
        ("Serous", "Serous cystadenocarcinoma"),
    ],
)
hist_type_vmap

[13]:

	source_value	target_value
0	Carcinosarcoma	Carcinosarcoma, NOS
1	Clear cell	Clear cell adenocarcinoma, NOS
2	Endometrioid	Endometrioid carcinoma
3	Serous	Serous cystadenocarcinoma

Verifying multiple value mappings at once

Besides verifying value mappings individually, you can also do it for all column mappings at once.

[14]:

mappings = bdi.match_values(
    dataset,
    target="gdc",
    attribute_matches=attribute_matches,
    method="tfidf",
    output_format="list"
)

for mapping in mappings:
    print(f"{mapping.attrs['source_attribute']} => {mapping.attrs['target_attribute']}")
    display(mapping)
    print("")

Ethnicity => ethnicity

	source_value	target_value	similarity
0	Hispanic or Latino	hispanic or latino	1.000
1	Not reported	not reported	1.000
2	Not-Hispanic or Latino	not hispanic or latino	0.936
3	NaN	NaN	NaN


Gender => gender

	source_value	target_value	similarity
0	Female	female	1.0
1	NaN	NaN	NaN


Tumor_Focality => tumor_focality

	source_value	target_value	similarity
0	Unifocal	Unifocal	1.0
1	Multifocal	Multifocal	1.0
2	NaN	NaN	NaN


Race => race

	source_value	target_value	similarity
0	White	white	1.0
1	White	white	1.0
2	Asian	asian	1.0
3	Not Reported	not reported	1.0
4	Black or African American	black or african american	1.0
5	NaN	NaN	NaN


Country => country_of_birth

	source_value	target_value	similarity
0	United States	United States	1.0
1	Ukraine	Ukraine	1.0
2	Poland	Poland	1.0
3	NaN	NaN	NaN
4	Other_specify	NaN	NaN


Histologic_type => primary_diagnosis

	source_value	target_value	similarity
0	Carcinosarcoma	Carcinosarcoma, NOS	0.969
1	Endometrioid	Endometrioid adenoma, NOS	0.897
2	Clear cell	Clear cell adenoma	0.853
3	Serous	Serous carcinoma, NOS	0.755


FIGO_stage => figo_stage

	source_value	target_value	similarity
0	IIIC2	Stage IIIC2	0.890
1	IIIC1	Stage IIIC1	0.890
2	IVB	Stage IVB	0.856
3	IIIB	Stage IIIB	0.850
4	IIIA	Stage IIIA	0.823
5	II	Stage III	0.686
6	IB	Stage IB	0.651
7	IA	Stage IA	0.591
8	NaN	NaN	NaN

Fixing remaining value mappings

We need to fix a few value mappings:

Tumor_Site

For Tumor_Site, given that this dataset is about endometrial cancer, all values must be mapped to “Endometrium”. So instead of fixing each mapping individually, we will write a custom function that returns “Endometrium” regardless of the input value. Later, we will show how to use this function to transform the dataset.

[15]:

bdi.match_values(
    dataset, target="gdc", attribute_matches=("Tumor_Site", "tissue_or_organ_of_origin"), method="tfidf"
)

[15]:

	source_attribute	target_attribute	source_value	target_value	similarity
0	Tumor_Site	tissue_or_organ_of_origin	Anterior endometrium	Endometrium	0.852
1	Tumor_Site	tissue_or_organ_of_origin	Posterior endometrium	Endometrium	0.823
2	Tumor_Site	tissue_or_organ_of_origin	Other, specify	Other specified parts of pancreas	0.543
3	Tumor_Site	tissue_or_organ_of_origin	NaN	NaN	NaN

[16]:

# Custom mapping function that will be used to map the values of the 'Tumor_Site' column
def map_tumor_site(source_value):
    return "Endometrium"

Combining custom user mappings with suggested mappings

Before generating the final harmonized dataset, we can combine the automatically generated value mappings with the fixed mappings provided by the user. To do so, we use the bdi.create_harmonization_spec() function, which takes a list of mappings (e.g., generated automatically) and a list of “user-defined mapping overrides”. The overrides are combined with the first list and take precedence whenever the two conflict.

In our example below, all mappings specified in the variable user_mappings will override the mappings in value_mappings generated by the bdi.match_values() function.
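
The precedence rule can be pictured as a keyed merge in which user entries are inserted first and suggested entries only fill the remaining gaps. The sketch below is a plain-Python illustration, not BDI-Kit's implementation: combine_mappings is a hypothetical helper, it assumes a conflict means that both lists contain the same (source_attribute, target_attribute) pair, and it represents value matches as simple dicts rather than DataFrames.

```python
def combine_mappings(suggested, overrides):
    """Merge two lists of mapping dicts; entries in `overrides` win on conflict.

    ASSUMPTION: a conflict means both lists contain an entry for the same
    (source_attribute, target_attribute) pair.
    """
    def key(mapping):
        return (mapping["source_attribute"], mapping["target_attribute"])

    combined = {key(m): m for m in overrides}
    for mapping in suggested:
        combined.setdefault(key(mapping), mapping)  # keep the override if present
    return list(combined.values())


# Illustrative inputs; "matches" is simplified to a dict for this sketch.
suggested = [
    {"source_attribute": "Gender", "target_attribute": "gender",
     "matches": {"Female": "female"}},
    {"source_attribute": "Histologic_type", "target_attribute": "primary_diagnosis",
     "matches": {"Serous": "Serous carcinoma, NOS"}},
]
overrides = [
    {"source_attribute": "Histologic_type", "target_attribute": "primary_diagnosis",
     "matches": {"Serous": "Serous cystadenocarcinoma"}},
]

for mapping in combine_mappings(suggested, overrides):
    print(mapping)
```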

[17]:

from math import ceil

user_mappings = [
    {
        # When no value mapping is needed, specifying the source and target is enough
        "source_attribute": "BMI",
        "target_attribute": "bmi",
    },
    {
        "source_attribute": "Tumor_Size_cm",
        "target_attribute": "tumor_largest_dimension_diameter",
    },
    {
        # mapper can be a custom Python function
        "source_attribute": "Tumor_Site",
        "target_attribute": "tissue_or_organ_of_origin",
        "mapper": map_tumor_site,
    },
    {
        # Lambda functions can also be used as mappers
        "source_attribute": "Age",
        "target_attribute": "days_to_birth",
        "mapper": lambda age: -age * 365.25,
    },
    {
        "source_attribute": "Age",
        "target_attribute": "age_at_diagnosis",
        "mapper": lambda age: float("nan") if pd.isnull(age) else ceil(age*365.25),
    },
    {
        # We can also use a data frame to specify value mappings using the `matches` attribute
        "source_attribute": "Histologic_type",
        "target_attribute": "primary_diagnosis",
        "matches": hist_type_vmap
    }
]

harmonization_spec = bdi.create_harmonization_spec(value_mappings, user_mappings)
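
The two Age mappers above convert years to days using 365.25 days per year. The standalone sketch below reproduces that arithmetic and shows why only the second mapper needs a missing-value guard. It uses math.isnan in place of pd.isnull to stay within the standard library, and the function names are hypothetical.

```python
from math import ceil, isnan

DAYS_PER_YEAR = 365.25  # average year length, as used by the mappers


def to_days_to_birth(age):
    # NaN propagates through multiplication, so no guard is needed here.
    return -age * DAYS_PER_YEAR


def to_age_at_diagnosis(age):
    # ceil() raises ValueError on NaN, hence the explicit guard.
    return float("nan") if isnan(age) else ceil(age * DAYS_PER_YEAR)


for age in (64.0, 58.0, float("nan")):
    print(age, to_days_to_birth(age), to_age_at_diagnosis(age))
```

For an age of 58, the product is 21184.5 days, so days_to_birth is -21184.5 while age_at_diagnosis is rounded up to 21185.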

Finally, we generate the harmonized dataset, with the user-defined value mappings.

[18]:

harmonized_dataset = bdi.materialize_mapping(dataset, harmonization_spec)
harmonized_dataset

[18]:

	bmi	tumor_largest_dimension_diameter	tissue_or_organ_of_origin	days_to_birth	age_at_diagnosis	primary_diagnosis	ethnicity	gender	tumor_focality	race	country_of_birth	figo_stage	histologic_progression_type
0	38.88	2.9	Endometrium	-23376.00	23376.0	Endometrioid carcinoma	not hispanic or latino	female	Unifocal	white	United States	Stage IA	NaN
1	39.76	3.5	Endometrium	-21184.50	21185.0	Endometrioid carcinoma	not hispanic or latino	female	Unifocal	white	United States	Stage IA	NaN
2	51.19	4.5	Endometrium	-18262.50	18263.0	Endometrioid carcinoma	not hispanic or latino	female	Unifocal	white	United States	Stage IA	NaN
3	NaN	NaN	Endometrium	NaN	NaN	Carcinosarcoma, NOS	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	32.69	3.5	Endometrium	-27393.75	27394.0	Endometrioid carcinoma	not hispanic or latino	female	Unifocal	white	United States	Stage IA	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...
99	29.40	4.2	Endometrium	-27393.75	27394.0	Endometrioid carcinoma	NaN	female	Unifocal	NaN	Ukraine	Stage IA	NaN
100	35.42	1.5	Endometrium	-27028.50	27029.0	Endometrioid carcinoma	NaN	female	Unifocal	NaN	Ukraine	Stage III	NaN
101	24.32	3.8	Endometrium	-31046.25	31047.0	Serous cystadenocarcinoma	not hispanic or latino	female	Unifocal	black or african american	United States	Stage III	NaN
102	34.06	5.0	Endometrium	-25567.50	25568.0	Serous cystadenocarcinoma	NaN	female	Unifocal	NaN	Ukraine	Stage IA	NaN
103	NaN	NaN	Endometrium	NaN	NaN	Serous cystadenocarcinoma	NaN	NaN	NaN	NaN	Ukraine	NaN	NaN

104 rows × 13 columns

For comparison, here is what our original data looked like:

[19]:

original_columns = [m["source_attribute"] for m in harmonization_spec]
dataset[original_columns]

[19]:

	BMI	Tumor_Size_cm	Tumor_Site	Age	Age	Histologic_type	Ethnicity	Gender	Tumor_Focality	Race	Country	FIGO_stage	Histologic_type
0	38.88	2.9	Anterior endometrium	64.0	64.0	Endometrioid	Not-Hispanic or Latino	Female	Unifocal	White	United States	IA	Endometrioid
1	39.76	3.5	Posterior endometrium	58.0	58.0	Endometrioid	Not-Hispanic or Latino	Female	Unifocal	White	United States	IA	Endometrioid
2	51.19	4.5	Other, specify	50.0	50.0	Endometrioid	Not-Hispanic or Latino	Female	Unifocal	White	United States	IA	Endometrioid
3	NaN	NaN	NaN	NaN	NaN	Carcinosarcoma	NaN	NaN	NaN	NaN	NaN	NaN	Carcinosarcoma
4	32.69	3.5	Other, specify	75.0	75.0	Endometrioid	Not-Hispanic or Latino	Female	Unifocal	White	United States	IA	Endometrioid
...	...	...	...	...	...	...	...	...	...	...	...	...	...
99	29.40	4.2	Other, specify	75.0	75.0	Endometrioid	NaN	Female	Unifocal	NaN	Ukraine	IA	Endometrioid
100	35.42	1.5	Other, specify	74.0	74.0	Endometrioid	NaN	Female	Unifocal	NaN	Ukraine	II	Endometrioid
101	24.32	3.8	Other, specify	85.0	85.0	Serous	Not-Hispanic or Latino	Female	Unifocal	Black or African American	United States	II	Serous
102	34.06	5.0	Other, specify	70.0	70.0	Serous	NaN	Female	Unifocal	NaN	Ukraine	IA	Serous
103	NaN	NaN	NaN	NaN	NaN	Serous	NaN	NaN	NaN	NaN	Ukraine	NaN	Serous

104 rows × 13 columns