Getting Started

If BDI-Kit is not installed yet, you can install it with:

[ ]:
! pip install bdi-kit

Then import the library:

[1]:
import bdikit as bdi
import pandas as pd

In this example, we are mapping data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/32059776/) to the GDC format.

[2]:
dataset = pd.read_csv("https://raw.githubusercontent.com/VIDA-NYU/bdi-kit/devel/examples/datasets/dou_2020.csv")

columns = [
    "Country",
    "Histologic_type",
    "FIGO_stage",
    "BMI",
    "Age",
    "Race",
    "Ethnicity",
    "Gender",
    "Tumor_Focality",
    "Tumor_Size_cm",
]

dataset[columns].head(10)
[2]:
Country Histologic_type FIGO_stage BMI Age Race Ethnicity Gender Tumor_Focality Tumor_Size_cm
0 United States Endometrioid IA 38.88 64.0 White Not-Hispanic or Latino Female Unifocal 2.9
1 United States Endometrioid IA 39.76 58.0 White Not-Hispanic or Latino Female Unifocal 3.5
2 United States Endometrioid IA 51.19 50.0 White Not-Hispanic or Latino Female Unifocal 4.5
3 NaN Carcinosarcoma NaN NaN NaN NaN NaN NaN NaN NaN
4 United States Endometrioid IA 32.69 75.0 White Not-Hispanic or Latino Female Unifocal 3.5
5 United States Serous IA 20.28 63.0 White Not-Hispanic or Latino Female Unifocal 6.0
6 United States Endometrioid IA 55.67 50.0 White Not-Hispanic or Latino Female Unifocal 4.5
7 Other_specify Endometrioid IA 25.68 60.0 White Not-Hispanic or Latino Female Unifocal 5.0
8 United States Serous IIIA 21.57 83.0 White Not-Hispanic or Latino Female Unifocal 4.0
9 United States Endometrioid IA 34.26 69.0 White Not-Hispanic or Latino Female Unifocal 5.2

Matching the table schema to GDC standard vocabulary

BDI-Kit offers a suite of functions to help with data harmonization tasks. For instance, it can help with automatic discovery of one-to-one mappings between the attributes/columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a standard data vocabulary such as the GDC (Genomic Data Commons).

To achieve this using BDI-Kit, we can use the match_schema() function to match attributes to the GDC vocabulary schema as follows.

[3]:
attribute_matches = bdi.match_schema(dataset[columns], target="gdc", method="magneto_ft_bp")
attribute_matches
[3]:
source_attribute target_attribute similarity
0 BMI bmi 1.000000
1 Ethnicity ethnicity 1.000000
2 Gender gender 1.000000
3 FIGO_stage figo_stage 1.000000
4 Tumor_Focality tumor_focality 1.000000
5 Race race 1.000000
6 Age age_at_index 0.988827
7 Country country_of_birth 0.957011
8 Tumor_Size_cm tumor_length_measurement 0.905819
9 Histologic_type histologic_progression_type 0.727179
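
To build intuition for what a schema matcher produces, here is a minimal sketch that ranks target names by plain string similarity using Python's standard difflib module. It is not the magneto_ft_bp method used above, and rank_targets is a hypothetical helper, not part of BDI-Kit. It only illustrates the shape of the result: each source column paired with a target and a similarity score.

```python
from difflib import SequenceMatcher


def rank_targets(source_column, target_columns):
    """Return (target, similarity) pairs, most similar name first.

    Hypothetical helper: compares lowercased names only and ignores the data.
    """
    scored = [
        (target, SequenceMatcher(None, source_column.lower(), target.lower()).ratio())
        for target in target_columns
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# A tiny, hand-picked subset of target names used only for this illustration
targets = ["gender", "race", "tumor_focality", "figo_stage"]

for column in ["Gender", "Tumor_Focality"]:
    best_target, similarity = rank_targets(column, targets)[0]
    print(f"{column} => {best_target} ({similarity:.3f})")
```

Name-only matching like this breaks down when source and target names differ substantially, which is one reason to verify the suggested matches before using them.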

Generating a harmonized table

After discovering a schema mapping, we can generate a new table (DataFrame) using the new attribute names from the GDC standard vocabulary.

To do so using BDI-Kit, we can use the function materialize_mapping() as follows. Note that the column headers have been renamed to the target schema.

[4]:
bdi.materialize_mapping(dataset, attribute_matches)
[4]:
bmi ethnicity gender figo_stage tumor_focality race age_at_index country_of_birth tumor_length_measurement histologic_progression_type
0 38.88 Not-Hispanic or Latino Female IA Unifocal White 64.0 United States 2.9 Endometrioid
1 39.76 Not-Hispanic or Latino Female IA Unifocal White 58.0 United States 3.5 Endometrioid
2 51.19 Not-Hispanic or Latino Female IA Unifocal White 50.0 United States 4.5 Endometrioid
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN Carcinosarcoma
4 32.69 Not-Hispanic or Latino Female IA Unifocal White 75.0 United States 3.5 Endometrioid
... ... ... ... ... ... ... ... ... ... ...
99 29.40 NaN Female IA Unifocal NaN 75.0 Ukraine 4.2 Endometrioid
100 35.42 NaN Female II Unifocal NaN 74.0 Ukraine 1.5 Endometrioid
101 24.32 Not-Hispanic or Latino Female II Unifocal Black or African American 85.0 United States 3.8 Serous
102 34.06 NaN Female IA Unifocal NaN 70.0 Ukraine 5.0 Serous
103 NaN NaN NaN NaN NaN NaN NaN Ukraine NaN Serous

104 rows × 10 columns

Generating a harmonized table with value mappings

BDI-Kit can also help with translation of the values from the source table to the target standard format.

To this end, BDI-Kit provides the function match_values(), which automatically creates value mappings for each string column. The output of match_values() can be passed to materialize_mapping(), which materializes the final target table using both the schema and value mappings.

[5]:
value_mappings = bdi.match_values(dataset, target="gdc", attribute_matches=attribute_matches, method="tfidf")
bdi.materialize_mapping(dataset, value_mappings)
[5]:
ethnicity gender tumor_focality race country_of_birth figo_stage histologic_progression_type
0 not hispanic or latino female Unifocal white United States Stage IA NaN
1 not hispanic or latino female Unifocal white United States Stage IA NaN
2 not hispanic or latino female Unifocal white United States Stage IA NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 not hispanic or latino female Unifocal white United States Stage IA NaN
... ... ... ... ... ... ... ...
99 NaN female Unifocal NaN Ukraine Stage IA NaN
100 NaN female Unifocal NaN Ukraine Stage III NaN
101 not hispanic or latino female Unifocal black or african american United States Stage III NaN
102 NaN female Unifocal NaN Ukraine Stage IA NaN
103 NaN NaN NaN NaN Ukraine NaN NaN

104 rows × 7 columns
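
Conceptually, materializing a mapping amounts to renaming each mapped column and looking up each source value in its value mapping. The following plain-Python sketch shows that idea on rows stored as dictionaries. The function apply_mappings and the layout of spec are assumptions made for illustration; they do not reflect how BDI-Kit implements materialize_mapping().

```python
def apply_mappings(rows, spec):
    """Rename columns and translate values.

    spec maps each source column to (target column, {source value: target value}).
    Values without an entry in the value mapping become None.
    """
    harmonized = []
    for row in rows:
        new_row = {}
        for source, (target, value_map) in spec.items():
            new_row[target] = value_map.get(row.get(source))
        harmonized.append(new_row)
    return harmonized


rows = [
    {"Gender": "Female", "Race": "White"},
    {"Gender": None, "Race": "Asian"},
]
spec = {
    "Gender": ("gender", {"Female": "female"}),
    "Race": ("race", {"White": "white"}),
}

result = apply_mappings(rows, spec)
print(result)
```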

Verifying the schema mappings

Sometimes the automatically generated mappings are incorrect, or you may want to verify them individually. To verify the suggested attribute matches, you can use BDI-Kit and BDIViz, which offer additional APIs to visualize the data and make modifications when necessary.

For this example, we will use the column Histologic_type. We can start by exploring the columns most similar to Histologic_type.

For this, we can use the rank_schema_matches() function. Here, we notice that primary_diagnosis could be a potential target column.

[6]:
hist_type_matches = bdi.rank_schema_matches(dataset, target="gdc", attributes=["Histologic_type"])
hist_type_matches
[6]:
source_attribute target_attribute similarity
0 Histologic_type sample_type 0.554611
1 Histologic_type history_of_tumor_type 0.542955
2 Histologic_type primary_diagnosis 0.526662
3 Histologic_type morphologic_architectural_pattern 0.478859
4 Histologic_type viral_hepatitis_serologies 0.469105
5 Histologic_type analyte_type_id 0.462921
6 Histologic_type histone_variant 0.461407
7 Histologic_type cog_rhabdomyosarcoma_risk_group 0.425261
8 Histologic_type tumor_descriptor 0.419358
9 Histologic_type specimen_type 0.419077

Viewing the attribute domains

To verify that primary_diagnosis is a good target attribute, we view and compare the domains of each attribute using the preview_domain() function. For the source table, it returns the list of unique values in the source attribute. For the GDC target, it returns the list of unique valid values that an attribute can have.

Here we see that the values seem to be related.

[7]:
bdi.preview_domain(dataset, "Histologic_type")
[7]:
value_name
0 Endometrioid
1 Carcinosarcoma
2 Serous
3 Clear cell
[8]:
bdi.preview_domain("gdc", "primary_diagnosis")
[8]:
value_name value_description attribute_description
0 Abdominal desmoid An insidious poorly circumscribed neoplasm ari... Text term used to describe the patient's histo...
1 Abdominal fibromatosis An insidious poorly circumscribed neoplasm ari...
2 Achromic nevus A benign nevus characterized by the absence of...
3 Acidophil adenocarcinoma A malignant epithelial neoplasm of the anterio...
4 Acidophil adenoma An epithelial neoplasm of the anterior pituita...
... ... ... ...
2620 Wolffian duct tumor An epithelial neoplasm of the female reproduct...
2621 Xanthofibroma A benign neoplasm composed of fibroblastic spi...
2622 Yolk sac tumor A non-seminomatous malignant germ cell tumor c...
2623 Unknown Not known, not observed, not recorded, or refu...
2624 Not Reported Not provided or available.

2625 rows × 3 columns

Since primary_diagnosis looks like a correct match for Histologic_type, we can modify the attribute_matches variable directly.

[9]:
attribute_matches.loc[attribute_matches["source_attribute"] == "Histologic_type", "target_attribute"] = "primary_diagnosis"
attribute_matches
[9]:
source_attribute target_attribute similarity
0 BMI bmi 1.000000
1 Ethnicity ethnicity 1.000000
2 Gender gender 1.000000
3 FIGO_stage figo_stage 1.000000
4 Tumor_Focality tumor_focality 1.000000
5 Race race 1.000000
6 Age age_at_index 0.988827
7 Country country_of_birth 0.957011
8 Tumor_Size_cm tumor_length_measurement 0.905819
9 Histologic_type primary_diagnosis 0.727179

Finding correct value mappings

After finding the correct column, we need to find appropriate value mappings. Using match_values(), we can inspect what the possible value mappings for this would look like after the harmonization.

BDI-Kit implements multiple methods for value mapping discovery, including:

  • edit_distance - Computes value similarities using the Levenshtein edit distance measure.

  • tfidf - A method based on tf-idf importance weighting computed over character n-grams.

  • embedding - Uses BERT word embeddings to compute “semantic similarity” between the values.

To specify a value mapping approach, we can pass the method parameter.

[10]:
bdi.match_values(
    dataset, target="gdc", attribute_matches=("Histologic_type", "primary_diagnosis"), method="edit_distance"
)
[10]:
source_attribute target_attribute source_value target_value similarity
0 Histologic_type primary_diagnosis Carcinosarcoma Carcinosarcoma, NOS 0.848485
1 Histologic_type primary_diagnosis Clear cell Clear cell adenoma 0.714286
2 Histologic_type primary_diagnosis Endometrioid Endometrioid adenoma, NOS 0.648649
3 Histologic_type primary_diagnosis Serous Neuronevus 0.625000
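
To make the edit-distance scores easier to interpret, here is a small sketch of a normalized similarity based on an insertion/deletion edit distance (a Levenshtein variant in which a substitution counts as one deletion plus one insertion). For the four pairs shown above it yields the same scores, but whether BDI-Kit computes them exactly this way is an assumption. The function names are hypothetical.

```python
def indel_distance(a, b):
    """Minimum number of single-character insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1])  # characters match: no extra cost
            else:
                current.append(1 + min(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    total_length = len(a) + len(b)
    if total_length == 0:
        return 1.0
    return 1 - indel_distance(a, b) / total_length


print(round(indel_similarity("Carcinosarcoma", "Carcinosarcoma, NOS"), 6))
print(round(indel_similarity("Serous", "Neuronevus"), 6))
```

Because the measure only counts shared characters in order, a short value such as Serous can score well against an unrelated term such as Neuronevus.
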
[11]:
bdi.match_values(
    dataset, target="gdc", attribute_matches=("Histologic_type", "primary_diagnosis"), method="tfidf"
)
[11]:
source_attribute target_attribute source_value target_value similarity
0 Histologic_type primary_diagnosis Carcinosarcoma Carcinosarcoma, NOS 0.969
1 Histologic_type primary_diagnosis Endometrioid Endometrioid adenoma, NOS 0.897
2 Histologic_type primary_diagnosis Clear cell Clear cell adenoma 0.853
3 Histologic_type primary_diagnosis Serous Serous carcinoma, NOS 0.755
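
The tf-idf approach can be sketched with the standard library alone: split each value into character n-grams, weight them by inverse document frequency, and compare vectors with cosine similarity. The trigram size, the smoothing formula, and all function names below are assumptions chosen for illustration, so the scores will not necessarily match BDI-Kit's.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase the text and split it into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(documents, n=3):
    """Smoothed inverse document frequency for every n-gram in the documents."""
    doc_freq = Counter(g for doc in documents for g in set(char_ngrams(doc, n)))
    total = len(documents)
    idf = {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}
    unseen = math.log(1 + total) + 1  # weight for n-grams absent from all documents
    return idf, unseen


def tfidf_vector(text, idf, unseen, n=3):
    """Sparse tf-idf vector stored as {n-gram: weight}."""
    counts = Counter(char_ngrams(text, n))
    return {g: tf * idf.get(g, unseen) for g, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(g, 0.0) for g, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def best_match(source_value, target_values, n=3):
    """Return the (target value, similarity) pair with the highest cosine similarity."""
    idf, unseen = fit_idf(target_values, n)
    query = tfidf_vector(source_value, idf, unseen, n)
    scored = [(t, cosine(query, tfidf_vector(t, idf, unseen, n))) for t in target_values]
    return max(scored, key=lambda pair: pair[1])


# A tiny, hand-picked candidate list used only for this illustration
targets = ["Serous carcinoma, NOS", "Neuronevus", "Myoma"]
match, score = best_match("Serous", targets)
print(match, round(score, 3))
```

Unlike the edit-distance measure, this one rewards shared contiguous substrings, so Serous no longer matches Neuronevus, which shares no trigram with it.
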
[12]:
bdi.match_values(
    dataset, target="gdc", attribute_matches=("Histologic_type", "primary_diagnosis"), method="embedding"
)
[12]:
source_attribute target_attribute source_value target_value similarity
0 Histologic_type primary_diagnosis Carcinosarcoma Carcinofibroma 0.897
1 Histologic_type primary_diagnosis Clear cell Clear cell carcinoma 0.773
2 Histologic_type primary_diagnosis Endometrioid Endometrioid cystadenocarcinoma 0.755
3 Histologic_type primary_diagnosis Serous Myoma 0.647

The suggestions differ between methods, and some are clearly wrong (for example, edit_distance maps Serous to Neuronevus and embedding maps it to Myoma), so we define the value mapping for Histologic_type manually:

[13]:
hist_type_vmap = pd.DataFrame(
    columns=["source_value", "target_value"],
    data=[
        ("Carcinosarcoma", "Carcinosarcoma, NOS"),
        ("Clear cell", "Clear cell adenocarcinoma, NOS"),
        ("Endometrioid", "Endometrioid carcinoma"),
        ("Serous", "Serous cystadenocarcinoma"),
    ],
)
hist_type_vmap
[13]:
source_value target_value
0 Carcinosarcoma Carcinosarcoma, NOS
1 Clear cell Clear cell adenocarcinoma, NOS
2 Endometrioid Endometrioid carcinoma
3 Serous Serous cystadenocarcinoma

Verifying multiple value mappings at once

Besides verifying value mappings individually, you can also do it for all column mappings at once.

[14]:
mappings = bdi.match_values(
    dataset,
    target="gdc",
    attribute_matches=attribute_matches,
    method="tfidf",
    output_format="list"
)

for mapping in mappings:
    print(f"{mapping.attrs['source_attribute']} => {mapping.attrs['target_attribute']}")
    display(mapping)
    print("")
Ethnicity => ethnicity
source_value target_value similarity
0 Hispanic or Latino hispanic or latino 1.000
1 Not reported not reported 1.000
2 Not-Hispanic or Latino not hispanic or latino 0.936
3 NaN NaN NaN

Gender => gender
source_value target_value similarity
0 Female female 1.0
1 NaN NaN NaN

Tumor_Focality => tumor_focality
source_value target_value similarity
0 Unifocal Unifocal 1.0
1 Multifocal Multifocal 1.0
2 NaN NaN NaN

Race => race
source_value target_value similarity
0 White white 1.0
1 White white 1.0
2 Asian asian 1.0
3 Not Reported not reported 1.0
4 Black or African American black or african american 1.0
5 NaN NaN NaN

Country => country_of_birth
source_value target_value similarity
0 United States United States 1.0
1 Ukraine Ukraine 1.0
2 Poland Poland 1.0
3 NaN NaN NaN
4 Other_specify NaN NaN

Histologic_type => primary_diagnosis
source_value target_value similarity
0 Carcinosarcoma Carcinosarcoma, NOS 0.969
1 Endometrioid Endometrioid adenoma, NOS 0.897
2 Clear cell Clear cell adenoma 0.853
3 Serous Serous carcinoma, NOS 0.755

FIGO_stage => figo_stage
source_value target_value similarity
0 IIIC2 Stage IIIC2 0.890
1 IIIC1 Stage IIIC1 0.890
2 IVB Stage IVB 0.856
3 IIIB Stage IIIB 0.850
4 IIIA Stage IIIA 0.823
5 II Stage III 0.686
6 IB Stage IB 0.651
7 IA Stage IA 0.591
8 NaN NaN NaN

Fixing remaining value mappings

We need to fix a few value mappings:

  • Tumor_Site

For Tumor_Site, given that this dataset is about endometrial cancer, all values must be mapped to “Endometrium”. So instead of fixing each mapping individually, we will write a custom function that returns “Endometrium” regardless of the input value. Later, we will show how to use this function to transform the dataset.

[15]:
bdi.match_values(
    dataset, target="gdc", attribute_matches=("Tumor_Site", "tissue_or_organ_of_origin"), method="tfidf"
)
[15]:
source_attribute target_attribute source_value target_value similarity
0 Tumor_Site tissue_or_organ_of_origin Anterior endometrium Endometrium 0.852
1 Tumor_Site tissue_or_organ_of_origin Posterior endometrium Endometrium 0.823
2 Tumor_Site tissue_or_organ_of_origin Other, specify Other specified parts of pancreas 0.543
3 Tumor_Site tissue_or_organ_of_origin NaN NaN NaN
[16]:
# Custom mapping function that will be used to map the values of the 'Tumor_Site' column
def map_tumor_site(source_value):
    return "Endometrium"

Combining custom user mappings with suggested mappings

Before generating the final harmonized dataset, we can combine the automatically generated value mappings with the fixed mappings provided by the user. To do so, we use the bdi.create_harmonization_spec() function, which takes a list of mappings (e.g., generated automatically) and a list of “user-defined mapping overrides”. The two lists are combined, and the overrides take precedence whenever they conflict.

In our example below, all mappings specified in the variable user_mappings will override the mappings in value_mappings generated by the bdi.match_values() function.

[17]:
from math import ceil

user_mappings = [
    {
        # When no value mapping is needed, specifying the source and target is enough
        "source_attribute": "BMI",
        "target_attribute": "bmi",
    },
    {
        "source_attribute": "Tumor_Size_cm",
        "target_attribute": "tumor_largest_dimension_diameter",
    },
    {
        # mapper can be a custom Python function
        "source_attribute": "Tumor_Site",
        "target_attribute": "tissue_or_organ_of_origin",
        "mapper": map_tumor_site,
    },
    {
        # Lambda functions can also be used as mappers
        "source_attribute": "Age",
        "target_attribute": "days_to_birth",
        "mapper": lambda age: -age * 365.25,
    },
    {
        "source_attribute": "Age",
        "target_attribute": "age_at_diagnosis",
        "mapper": lambda age: float("nan") if pd.isnull(age) else ceil(age*365.25),
    },
    {
        # We can also use a data frame to specify value mappings using the `matches` attribute
        "source_attribute": "Histologic_type",
        "target_attribute": "primary_diagnosis",
        "matches": hist_type_vmap
    }
]

harmonization_spec = bdi.create_harmonization_spec(value_mappings, user_mappings)
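
One way to picture the precedence rule is a merge in which user overrides are placed first and suggested mappings only fill in what the user did not specify. The sketch below uses plain dictionaries and keys each mapping on its (source_attribute, target_attribute) pair. Both the keying and the function combine_mappings are assumptions for illustration, not a description of BDI-Kit internals.

```python
def combine_mappings(suggested, overrides):
    """Merge two lists of mapping dicts; overrides win on conflicts.

    Assumption: two mappings conflict when they share the same
    (source_attribute, target_attribute) pair.
    """
    def key(mapping):
        return (mapping["source_attribute"], mapping["target_attribute"])

    combined = {key(m): m for m in overrides}
    for mapping in suggested:
        combined.setdefault(key(mapping), mapping)  # keep the override if present
    return list(combined.values())


suggested = [
    {"source_attribute": "Gender", "target_attribute": "gender", "origin": "suggested"},
    {"source_attribute": "BMI", "target_attribute": "bmi", "origin": "suggested"},
]
overrides = [
    {"source_attribute": "BMI", "target_attribute": "bmi", "origin": "user"},
]

spec = combine_mappings(suggested, overrides)
print([(m["source_attribute"], m["origin"]) for m in spec])
```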

Finally, we generate the harmonized dataset, with the user-defined value mappings.

[18]:
harmonized_dataset = bdi.materialize_mapping(dataset, harmonization_spec)
harmonized_dataset
[18]:
bmi tumor_largest_dimension_diameter tissue_or_organ_of_origin days_to_birth age_at_diagnosis primary_diagnosis ethnicity gender tumor_focality race country_of_birth figo_stage histologic_progression_type
0 38.88 2.9 Endometrium -23376.00 23376.0 Endometrioid carcinoma not hispanic or latino female Unifocal white United States Stage IA NaN
1 39.76 3.5 Endometrium -21184.50 21185.0 Endometrioid carcinoma not hispanic or latino female Unifocal white United States Stage IA NaN
2 51.19 4.5 Endometrium -18262.50 18263.0 Endometrioid carcinoma not hispanic or latino female Unifocal white United States Stage IA NaN
3 NaN NaN Endometrium NaN NaN Carcinosarcoma, NOS NaN NaN NaN NaN NaN NaN NaN
4 32.69 3.5 Endometrium -27393.75 27394.0 Endometrioid carcinoma not hispanic or latino female Unifocal white United States Stage IA NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
99 29.40 4.2 Endometrium -27393.75 27394.0 Endometrioid carcinoma NaN female Unifocal NaN Ukraine Stage IA NaN
100 35.42 1.5 Endometrium -27028.50 27029.0 Endometrioid carcinoma NaN female Unifocal NaN Ukraine Stage III NaN
101 24.32 3.8 Endometrium -31046.25 31047.0 Serous cystadenocarcinoma not hispanic or latino female Unifocal black or african american United States Stage III NaN
102 34.06 5.0 Endometrium -25567.50 25568.0 Serous cystadenocarcinoma NaN female Unifocal NaN Ukraine Stage IA NaN
103 NaN NaN Endometrium NaN NaN Serous cystadenocarcinoma NaN NaN NaN NaN Ukraine NaN NaN

104 rows × 13 columns
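
The two Age mappers can be checked by hand against the rows above. The sketch below restates them as named functions (hypothetical names) and uses math.isnan in place of pd.isnull so that it runs without pandas. An age of 58 years gives 58 × 365.25 = 21184.5 days, which is negated for days_to_birth and rounded up for age_at_diagnosis.

```python
import math

DAYS_PER_YEAR = 365.25


def to_days_to_birth(age_years):
    """Days from the index date back to birth, expressed as a negative number."""
    return -age_years * DAYS_PER_YEAR


def to_age_at_diagnosis(age_years):
    """Age in whole days, rounded up; missing ages stay missing (NaN)."""
    if math.isnan(age_years):
        return float("nan")
    return math.ceil(age_years * DAYS_PER_YEAR)


for age in (64.0, 58.0):
    print(age, to_days_to_birth(age), to_age_at_diagnosis(age))
```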

For comparison, here is what our original data looked like:

[19]:
original_columns = [m["source_attribute"] for m in harmonization_spec]
dataset[original_columns]
[19]:
BMI Tumor_Size_cm Tumor_Site Age Age Histologic_type Ethnicity Gender Tumor_Focality Race Country FIGO_stage Histologic_type
0 38.88 2.9 Anterior endometrium 64.0 64.0 Endometrioid Not-Hispanic or Latino Female Unifocal White United States IA Endometrioid
1 39.76 3.5 Posterior endometrium 58.0 58.0 Endometrioid Not-Hispanic or Latino Female Unifocal White United States IA Endometrioid
2 51.19 4.5 Other, specify 50.0 50.0 Endometrioid Not-Hispanic or Latino Female Unifocal White United States IA Endometrioid
3 NaN NaN NaN NaN NaN Carcinosarcoma NaN NaN NaN NaN NaN NaN Carcinosarcoma
4 32.69 3.5 Other, specify 75.0 75.0 Endometrioid Not-Hispanic or Latino Female Unifocal White United States IA Endometrioid
... ... ... ... ... ... ... ... ... ... ... ... ... ...
99 29.40 4.2 Other, specify 75.0 75.0 Endometrioid NaN Female Unifocal NaN Ukraine IA Endometrioid
100 35.42 1.5 Other, specify 74.0 74.0 Endometrioid NaN Female Unifocal NaN Ukraine II Endometrioid
101 24.32 3.8 Other, specify 85.0 85.0 Serous Not-Hispanic or Latino Female Unifocal Black or African American United States II Serous
102 34.06 5.0 Other, specify 70.0 70.0 Serous NaN Female Unifocal NaN Ukraine IA Serous
103 NaN NaN NaN NaN NaN Serous NaN NaN NaN NaN Ukraine NaN Serous

104 rows × 13 columns