Getting Started

First, import the bdikit library.

[1]:
import bdikit as bdi
import pandas as pd

In this example, we are mapping data from Dou et al. (https://pubmed.ncbi.nlm.nih.gov/37567170/) to the GDC format.

[2]:
dataset = pd.read_csv("./datasets/dou.csv")

columns = [
    "Country",
    "Histologic_type",
    "FIGO_stage",
    "BMI",
    "Age",
    "Race",
    "Ethnicity",
    "Gender",
    "Tumor_Focality",
    "Tumor_Size_cm",
]

dataset[columns].head(10)
[2]:
Country Histologic_type FIGO_stage BMI Age Race Ethnicity Gender Tumor_Focality Tumor_Size_cm
0 United States Endometrioid IA 38.88 64.0 White Not-Hispanic or Latino Female Unifocal 2.9
1 United States Endometrioid IA 39.76 58.0 White Not-Hispanic or Latino Female Unifocal 3.5
2 United States Endometrioid IA 51.19 50.0 White Not-Hispanic or Latino Female Unifocal 4.5
3 NaN Carcinosarcoma NaN NaN NaN NaN NaN NaN NaN NaN
4 United States Endometrioid IA 32.69 75.0 White Not-Hispanic or Latino Female Unifocal 3.5
5 United States Serous IA 20.28 63.0 White Not-Hispanic or Latino Female Unifocal 6.0
6 United States Endometrioid IA 55.67 50.0 White Not-Hispanic or Latino Female Unifocal 4.5
7 Other_specify Endometrioid IA 25.68 60.0 White Not-Hispanic or Latino Female Unifocal 5.0
8 United States Serous IIIA 21.57 83.0 White Not-Hispanic or Latino Female Unifocal 4.0
9 United States Endometrioid IA 34.26 69.0 White Not-Hispanic or Latino Female Unifocal 5.2

Matching the table schema to GDC standard vocabulary

bdi-kit offers a suite of functions to help with data harmonization tasks. For instance, it can help with automatic discovery of one-to-one mappings between the columns in the input (source) dataset and a target dataset schema. The target schema can be either another table or a standard data vocabulary such as the GDC (Genomic Data Commons).

To achieve this using bdi-kit, we can use the match_schema() function to match columns to the GDC vocabulary schema as follows.

[3]:
column_mappings = bdi.match_schema(dataset[columns], target="gdc", method="two_phase")
column_mappings
  0%|          | 0/10 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
100%|██████████| 10/10 [00:01<00:00,  9.84it/s]
Table features extracted from 10 columns
100%|██████████| 734/734 [01:03<00:00, 11.51it/s]
Table features extracted from 734 columns
[3]:
source target
0 Country country_of_birth
1 Histologic_type dysplasia_type
2 FIGO_stage figo_stage
3 BMI hpv_positive_type
4 Age weight
5 Race race
6 Ethnicity ethnicity
7 Gender gender
8 Tumor_Focality tumor_focality
9 Tumor_Size_cm tumor_depth
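
For intuition about what a schema matcher produces, the sketch below pairs each source column with the most similar target name using plain string similarity from Python's difflib. This is only an illustration: it is not the two_phase method used above, and both the helper name match_by_name and the short target list are assumptions made for this example.

```python
from difflib import SequenceMatcher


def match_by_name(source_columns, target_columns):
    """Pair each source column with the target whose name is most similar.

    Hypothetical helper for illustration; it only compares lowercase names.
    """
    mapping = {}
    for source in source_columns:
        mapping[source] = max(
            target_columns,
            key=lambda target: SequenceMatcher(
                None, source.lower(), target.lower()
            ).ratio(),
        )
    return mapping


# A handful of GDC-style names (a tiny assumed subset of the real vocabulary).
targets = ["figo_stage", "race", "ethnicity", "gender", "bmi", "tumor_focality"]
name_matches = match_by_name(["FIGO_stage", "Race", "Gender", "BMI"], targets)
print(name_matches)
```

Name similarity alone cannot use the column values, so it has no way to tell whether a column such as Age holds years or days.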

Generating a harmonized table

After discovering a schema mapping, we can generate a new table (DataFrame) using the new column names from the GDC standard vocabulary.

To do so using bdi-kit, we can use the function materialize_mapping() as follows. Note that the column headers have been renamed to the target schema.

[4]:
bdi.materialize_mapping(dataset, column_mappings)
[4]:
country_of_birth dysplasia_type figo_stage hpv_positive_type weight race ethnicity gender tumor_focality tumor_depth
0 United States Endometrioid IA 38.88 64.0 White Not-Hispanic or Latino Female Unifocal 2.9
1 United States Endometrioid IA 39.76 58.0 White Not-Hispanic or Latino Female Unifocal 3.5
2 United States Endometrioid IA 51.19 50.0 White Not-Hispanic or Latino Female Unifocal 4.5
3 NaN Carcinosarcoma NaN NaN NaN NaN NaN NaN NaN NaN
4 United States Endometrioid IA 32.69 75.0 White Not-Hispanic or Latino Female Unifocal 3.5
... ... ... ... ... ... ... ... ... ... ...
99 Ukraine Endometrioid IA 29.40 75.0 NaN NaN Female Unifocal 4.2
100 Ukraine Endometrioid II 35.42 74.0 NaN NaN Female Unifocal 1.5
101 United States Serous II 24.32 85.0 Black or African American Not-Hispanic or Latino Female Unifocal 3.8
102 Ukraine Serous IA 34.06 70.0 NaN NaN Female Unifocal 5.0
103 Ukraine Serous NaN NaN NaN NaN NaN NaN NaN NaN

104 rows × 10 columns
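
Conceptually, materializing a schema mapping without value mappings amounts to selecting the mapped source columns and renaming them. A minimal sketch on plain dictionaries follows; rename_columns is a hypothetical helper and not part of bdi-kit, and the two rows are small samples in the shape of this dataset.

```python
def rename_columns(rows, column_mapping):
    """Keep only mapped columns and rename them to their target names."""
    return [
        {target: row.get(source) for source, target in column_mapping.items()}
        for row in rows
    ]


rows = [
    {"Country": "United States", "FIGO_stage": "IA", "Tumor_Site": "Anterior endometrium"},
    {"Country": "Ukraine", "FIGO_stage": "II", "Tumor_Site": "Other, specify"},
]
mapping = {"Country": "country_of_birth", "FIGO_stage": "figo_stage"}
renamed = rename_columns(rows, mapping)
print(renamed[0])
```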

Generating a harmonized table with value mappings

bdi-kit can also help with translation of the values from the source table to the target standard format.

To this end, bdi-kit provides the function match_values(), which automatically creates value mappings for each string column. The output of match_values() can be fed to materialize_mapping(), which materializes the final target table using both schema and value mappings.

[5]:
value_mappings = bdi.match_values(dataset, column_mapping=column_mappings, target="gdc", method="tfidf")
bdi.materialize_mapping(dataset, value_mappings)
[5]:
country_of_birth dysplasia_type figo_stage race ethnicity gender tumor_focality
0 United States None Stage IA white not hispanic or latino female Unifocal
1 United States None Stage IA white not hispanic or latino female Unifocal
2 United States None Stage IA white not hispanic or latino female Unifocal
3 NaN Esophageal Mucosa Columnar Dysplasia NaN NaN NaN NaN NaN
4 United States None Stage IA white not hispanic or latino female Unifocal
... ... ... ... ... ... ... ...
99 Ukraine None Stage IA NaN NaN female Unifocal
100 Ukraine None Stage III NaN NaN female Unifocal
101 United States None Stage III black or african american not hispanic or latino female Unifocal
102 Ukraine None Stage IA NaN NaN female Unifocal
103 Ukraine None NaN NaN NaN NaN NaN

104 rows × 7 columns
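
The effect of a value mapping on a single column can be pictured as a dictionary lookup. In the sketch below, apply_value_mapping is a hypothetical helper; it assumes that values without a match become None, which is consistent with the dysplasia_type column in the output above.

```python
def apply_value_mapping(values, value_map):
    """Translate each value through value_map; unmatched values become None."""
    return [value_map.get(value) for value in values]


figo_map = {"IA": "Stage IA", "IIIA": "Stage IIIA"}
translated = apply_value_mapping(["IA", "IIIA", "IB"], figo_map)
print(translated)  # ['Stage IA', 'Stage IIIA', None]
```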

Verifying the schema mappings

Sometimes the automatically generated mappings may be incorrect, or you may want to verify them individually. To verify the suggested column mappings, you can use bdi-kit and bdi-viz, which offers additional APIs to visualize the data and make modifications when necessary.

For this example, we will use the column Histologic_type. We can start by exploring the columns most similar to Histologic_type.

For this, we can use the top_matches() function. In the output below, primary_diagnosis (ranked fourth, with a similarity of about 0.574) stands out as a plausible target column.

[6]:
hist_type_matches = bdi.top_matches(dataset, columns=["Histologic_type"], target="gdc")
hist_type_matches
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 1/1 [00:00<00:00, 14.70it/s]
Table features extracted from 1 columns
100%|██████████| 734/734 [00:58<00:00, 12.48it/s]
Table features extracted from 734 columns

[6]:
source target similarity
0 Histologic_type described_cases 0.589956
1 Histologic_type slide_images 0.587552
2 Histologic_type history_of_tumor_type 0.574640
3 Histologic_type primary_diagnosis 0.573583
4 Histologic_type additional_pathology_findings 0.562278
5 Histologic_type pathology_details 0.562007
6 Histologic_type pathology_reports 0.547307
7 Histologic_type relationship_primary_diagnosis 0.524285
8 Histologic_type diagnoses 0.519854
9 Histologic_type family_histories 0.516649

Viewing the column domains

To verify that primary_diagnosis is a good target column, we view and compare the domains of each column using the preview_domain() function. For the source table, it returns the list of unique values in the source column. For the GDC target, it returns the list of unique valid values that a column can have.

Here we see that the values seem to be related.

[7]:
bdi.preview_domain(dataset, "Histologic_type")
[7]:
value_name
0 Endometrioid
1 Carcinosarcoma
2 Serous
3 Clear cell
[8]:
bdi.preview_domain("gdc", "primary_diagnosis")
[8]:
value_name value_description column_description
0 Abdominal desmoid An insidious poorly circumscribed neoplasm ari... Text term used to describe the patient's histo...
1 Abdominal fibromatosis An insidious poorly circumscribed neoplasm ari...
2 Achromic nevus A benign nevus characterized by the absence of...
3 Acidophil adenocarcinoma A malignant epithelial neoplasm of the anterio...
4 Acidophil adenoma An epithelial neoplasm of the anterior pituita...
... ... ... ...
2620 Wolffian duct tumor An epithelial neoplasm of the female reproduct...
2621 Xanthofibroma A benign neoplasm composed of fibroblastic spi...
2622 Yolk sac tumor A non-seminomatous malignant germ cell tumor c...
2623 Unknown Not known, not observed, not recorded, or refu...
2624 Not Reported Not provided or available.

2625 rows × 3 columns
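
Visual inspection can be complemented with a quick check that lists, for each source value, the target values that start with it. The sketch below uses a handful of primary_diagnosis values that appear in this notebook rather than the full domain, and candidates_by_prefix is a hypothetical helper, not a bdi-kit function.

```python
def candidates_by_prefix(source_values, target_values):
    """For each source value, list target values that start with it (case-insensitive)."""
    return {
        source: [t for t in target_values if t.lower().startswith(source.lower())]
        for source in source_values
    }


source_domain = ["Endometrioid", "Carcinosarcoma", "Serous", "Clear cell"]
# Small sample of the GDC primary_diagnosis domain, not the full list.
target_sample = [
    "Carcinosarcoma, NOS",
    "Clear cell adenoma",
    "Endometrioid adenoma, NOS",
    "Serous carcinoma, NOS",
    "Yolk sac tumor",
]
overlap = candidates_by_prefix(source_domain, target_sample)
for source, found in overlap.items():
    print(f"{source}: {found}")
```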

Since primary_diagnosis looks like a correct match for Histologic_type, we can modify the column_mappings variable directly.

[8]:
column_mappings.loc[column_mappings["source"] == "Histologic_type", "target"] = "primary_diagnosis"
column_mappings
[8]:
source target
0 Country country_of_birth
1 Histologic_type primary_diagnosis
2 FIGO_stage figo_stage
3 BMI hpv_positive_type
4 Age weight
5 Race race
6 Ethnicity ethnicity
7 Gender gender
8 Tumor_Focality tumor_focality
9 Tumor_Size_cm tumor_depth

Finding correct value mappings

After finding the correct column, we need to find appropriate value mappings. Using match_values(), we can inspect the candidate value mappings for this column pair before harmonizing the data.

bdi-kit implements multiple methods for value mapping discovery, including:

  • edit_distance - Computes value similarities using Levenshtein’s edit distance measure.

  • tfidf - A method based on tf-idf importance weighting computed over character n-grams.

  • embedding - Uses BERT word embeddings to compute “semantic similarity” between the values.

To specify a value mapping approach, we can pass the method parameter.

[9]:
bdi.match_values(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="edit_distance"
)
[9]:
source target similarity
0 Carcinosarcoma Carcinosarcoma, NOS 0.848485
1 Clear cell Clear cell adenoma 0.714286
2 Endometrioid Stromal endometriosis 0.666667
3 Serous Neuronevus 0.625000
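
As background for the edit_distance method, here is a minimal sketch of Levenshtein distance with a simple normalization to a similarity between 0 and 1. The normalization (dividing by the length of the longer string) is an assumption chosen for illustration, so the scores will not necessarily match the ones bdi-kit reports above.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(edit_similarity("Serous", "Serous"))  # 1.0
```

Pure edit distance favors strings of similar length, which may be one reason a short value such as Serous ends up paired with an unrelated term in the output above.
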
[10]:
bdi.match_values(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="tfidf"
)
[10]:
source target similarity
0 Carcinosarcoma Carcinosarcoma, NOS 0.969
1 Endometrioid Endometrioid adenoma, NOS 0.897
2 Clear cell Clear cell adenoma 0.853
3 Serous Serous carcinoma, NOS 0.755
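
To illustrate the idea behind the tfidf method, the sketch below builds tf-idf vectors over character trigrams and ranks candidates by cosine similarity. It is a from-scratch approximation under assumed choices (trigrams, smoothed idf, lowercase text), not bdi-kit's implementation, so the scores will differ from those shown above.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase character n-grams of a string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(documents, n=3):
    """Build one sparse tf-idf vector (dict) per document over character n-grams."""
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    # Iterating a Counter yields each distinct n-gram once per document.
    document_frequency = Counter(gram for count in counts for gram in count)
    total = len(documents)
    return [
        {
            gram: tf * (math.log((1 + total) / (1 + document_frequency[gram])) + 1)
            for gram, tf in count.items()
        }
        for count in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


candidates = ["Serous carcinoma, NOS", "Neuronevus", "Clear cell adenoma"]
vectors = tfidf_vectors(["Serous"] + candidates)
query, candidate_vectors = vectors[0], vectors[1:]
best = max(
    zip(candidates, candidate_vectors),
    key=lambda pair: cosine(query, pair[1]),
)[0]
print(best)  # Serous carcinoma, NOS
```

Because the comparison is based on shared character sequences rather than overall string length, Neuronevus shares no trigram with Serous and scores zero here.
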
[11]:
bdi.match_values(
    dataset, column_mapping=("Histologic_type", "primary_diagnosis"), target="gdc", method="embedding"
)
[11]:
source target similarity
0 Carcinosarcoma Carcinofibroma 0.919
1 Endometrioid Endometrioid cystadenocarcinoma 0.810
2 Clear cell Clear cell carcinoma 0.760
3 Serous Serous cystoma 0.661
[12]:
hist_type_vmap = pd.DataFrame(
    columns=["source", "target"],
    data=[
        ("Carcinosarcoma", "Carcinosarcoma, NOS"),
        ("Clear cell", "Clear cell adenocarcinoma, NOS"),
        ("Endometrioid", "Endometrioid carcinoma"),
        ("Serous", "Serous cystadenocarcinoma"),
    ],
)
hist_type_vmap
[12]:
source target
0 Carcinosarcoma Carcinosarcoma, NOS
1 Clear cell Clear cell adenocarcinoma, NOS
2 Endometrioid Endometrioid carcinoma
3 Serous Serous cystadenocarcinoma

Verifying multiple value mappings at once

Besides verifying value mappings individually, you can also do it for all column mappings at once.

[13]:
mappings = bdi.match_values(
    dataset,
    column_mapping=column_mappings,
    target="gdc",
    method="tfidf",
)

for mapping in mappings:
    print(f"{mapping.attrs['source']} => {mapping.attrs['target']}")
    display(mapping)
    print("")
Country => country_of_birth
source target similarity
0 United States United States 1.0
1 Ukraine Ukraine 1.0
2 Poland Poland 1.0
3 nan None NaN
4 Other_specify None NaN

Histologic_type => primary_diagnosis
source target similarity
0 Carcinosarcoma Carcinosarcoma, NOS 0.969
1 Endometrioid Endometrioid adenoma, NOS 0.897
2 Clear cell Clear cell adenoma 0.853
3 Serous Serous carcinoma, NOS 0.755

FIGO_stage => figo_stage
source target similarity
0 IIIC2 Stage IIIC2 0.889
1 IIIC1 Stage IIIC1 0.889
2 IVB Stage IVB 0.854
3 IIIB Stage IIIB 0.849
4 IIIA Stage IIIA 0.822
5 II Stage III 0.687
6 IB Stage IB 0.649
7 IA Stage IA 0.586
8 nan Unknown 0.350

Race => race
source target similarity
0 White white 1.000
1 Asian asian 1.000
2 Not Reported not reported 1.000
3 Black or African American black or african american 1.000
4 nan american indian or alaska native 0.359

Ethnicity => ethnicity
source target similarity
0 Hispanic or Latino hispanic or latino 1.000
1 Not-Hispanic or Latino not hispanic or latino 0.935
2 Not reported not hispanic or latino 0.268
3 nan None NaN

Gender => gender
source target similarity
0 Female female 1.00
1 nan unknown 0.29

Tumor_Focality => tumor_focality
source target similarity
0 Unifocal Unifocal 1.0
1 Multifocal Multifocal 1.0
2 nan None NaN

Fixing remaining value mappings

We need to fix a few value mappings:

  • Race

  • Ethnicity

  • Tumor_Site

For race, we need to fix: nan -> american indian or alaska native.

[14]:
race_vmap = bdi.match_values(
    dataset,
    column_mapping=("Race", "race"),
    target="gdc",
    method="tfidf",
)
race_vmap
[14]:
source target similarity
0 White white 1.000
1 Asian asian 1.000
2 Not Reported not reported 1.000
3 Black or African American black or african american 1.000
4 nan american indian or alaska native 0.359
[15]:
race_vmap = race_vmap[race_vmap["similarity"] >= 1.0]
race_vmap
[15]:
source target similarity
0 White white 1.0
1 Asian asian 1.0
2 Not Reported not reported 1.0
3 Black or African American black or african american 1.0

For Ethnicity, we need to fix: Not reported -> not hispanic or latino.

[16]:
ethinicity_vmap = bdi.match_values(
    dataset,
    column_mapping=("Ethnicity", "ethnicity"),
    target="gdc",
    method="tfidf",
)
ethinicity_vmap

[16]:
source target similarity
0 Hispanic or Latino hispanic or latino 1.000
1 Not-Hispanic or Latino not hispanic or latino 0.935
2 Not reported not hispanic or latino 0.268
3 nan None NaN
[17]:
ethinicity_vmap = ethinicity_vmap[ethinicity_vmap["similarity"] > 0.9]
ethinicity_vmap
[17]:
source target similarity
0 Hispanic or Latino hispanic or latino 1.000
1 Not-Hispanic or Latino not hispanic or latino 0.935

For Tumor_Site, given that this dataset is about endometrial cancer, all values must be mapped to “Endometrium”. So instead of fixing each mapping individually, we will write a custom function that returns “Endometrium” regardless of the input value. Later, we will show how to use this function to transform the dataset.

[18]:
bdi.match_values(
    dataset, column_mapping=("Tumor_Site", "tissue_or_organ_of_origin"), target="gdc", method="tfidf"
)
[18]:
source target similarity
0 Anterior endometrium Endometrium 0.852
1 Posterior endometrium Endometrium 0.823
2 Other, specify Other specified parts of pancreas 0.543
3 nan Anal canal 0.301
[19]:
# Custom mapping function that will be used to map the values of the 'Tumor_Site' column
def map_tumor_site(source_value):
    return "Endometrium"

Combining custom user mappings with suggested mappings

Before generating the final harmonized dataset, we can combine the automatically generated value mappings with the fixed mappings provided by the user. To do so, we use the bdi.merge_mappings() function, which takes a list of mappings (e.g., generated automatically) and a list of “user-defined mapping overrides”. The overrides are combined with the first list and take precedence whenever the two conflict.

In our example below, all mappings specified in the variable user_mappings will override the mappings in value_mappings generated by the bdi.match_values() function.

[20]:
from math import ceil

user_mappings = [
    {
        # When no value mapping is needed, specifying the source and target is enough
        "source": "BMI",
        "target": "bmi",
    },
    {
        "source": "Tumor_Size_cm",
        "target": "tumor_largest_dimension_diameter",
    },
    {
        # mapper can be a custom Python function
        "source": "Tumor_Site",
        "target": "tissue_or_organ_of_origin",
        "mapper": map_tumor_site,
    },
    {
        # Lambda functions can also be used as mappers
        "source": "Age",
        "target": "days_to_birth",
        "mapper": lambda age: -age * 365.25,
    },
    {
        "source": "Age",
        "target": "age_at_diagnosis",
        "mapper": lambda age: float("nan") if pd.isnull(age) else ceil(age*365.25),
    },
    {
        # We can also use a data frame to specify value mappings using the `matches` attribute
        "source": "Histologic_type",
        "target": "primary_diagnosis",
        "matches": hist_type_vmap
    },
    # For dataframes that contain the 'source' and 'target' columns as attributes,
    # such as the ones returned by the match_values() function, we can directly
    # use them as mappings
    ethinicity_vmap,
    race_vmap,
]


harmonization_spec = bdi.merge_mappings(value_mappings, user_mappings)
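
The precedence rule can be pictured with a small sketch. It assumes that two mappings conflict when they share the same (source, target) pair, which may not be exactly how bdi-kit resolves conflicts, and it uses plain dictionaries for the matches instead of data frames. The helper merge_with_overrides is hypothetical.

```python
def merge_with_overrides(suggested, overrides):
    """Combine two lists of mapping dicts; overrides win on conflicts.

    Assumption: two mappings conflict when they share the same
    (source, target) pair. bdi-kit may define conflicts differently.
    """
    merged = {(m["source"], m["target"]): m for m in suggested}
    for mapping in overrides:
        merged[(mapping["source"], mapping["target"])] = mapping
    return list(merged.values())


suggested = [
    {"source": "Race", "target": "race", "matches": {"nan": "american indian or alaska native"}},
    {"source": "Gender", "target": "gender", "matches": {"Female": "female"}},
]
overrides = [
    {"source": "Race", "target": "race", "matches": {"White": "white"}},
    {"source": "BMI", "target": "bmi"},
]
spec = merge_with_overrides(suggested, overrides)
print([(m["source"], m["target"]) for m in spec])
```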

Finally, we generate the harmonized dataset, with the user-defined value mappings.

[21]:
harmonized_dataset = bdi.materialize_mapping(dataset, harmonization_spec)
harmonized_dataset
[21]:
tissue_or_organ_of_origin bmi days_to_birth age_at_diagnosis tumor_largest_dimension_diameter country_of_birth primary_diagnosis figo_stage race ethnicity gender tumor_focality
0 Endometrium 38.88 -23376.00 23376.0 2.9 United States Endometrioid adenoma, NOS Stage IA white not hispanic or latino female Unifocal
1 Endometrium 39.76 -21184.50 21185.0 3.5 United States Endometrioid adenoma, NOS Stage IA white not hispanic or latino female Unifocal
2 Endometrium 51.19 -18262.50 18263.0 4.5 United States Endometrioid adenoma, NOS Stage IA white not hispanic or latino female Unifocal
3 Endometrium NaN NaN NaN NaN NaN Carcinosarcoma, NOS NaN NaN NaN NaN NaN
4 Endometrium 32.69 -27393.75 27394.0 3.5 United States Endometrioid adenoma, NOS Stage IA white not hispanic or latino female Unifocal
... ... ... ... ... ... ... ... ... ... ... ... ...
99 Endometrium 29.40 -27393.75 27394.0 4.2 Ukraine Endometrioid adenoma, NOS Stage IA NaN NaN female Unifocal
100 Endometrium 35.42 -27028.50 27029.0 1.5 Ukraine Endometrioid adenoma, NOS Stage III NaN NaN female Unifocal
101 Endometrium 24.32 -31046.25 31047.0 3.8 United States Serous carcinoma, NOS Stage III black or african american not hispanic or latino female Unifocal
102 Endometrium 34.06 -25567.50 25568.0 5.0 Ukraine Serous carcinoma, NOS Stage IA NaN NaN female Unifocal
103 Endometrium NaN NaN NaN NaN Ukraine Serous carcinoma, NOS NaN NaN NaN NaN NaN

104 rows × 12 columns
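
The two Age mappers can be checked by hand: an age of 58 years corresponds to 58 × 365.25 = 21184.5 days, which gives -21184.5 for days_to_birth and, after rounding up, 21185 for age_at_diagnosis, matching row 1 above. The sketch below restates both conversions as named functions (the function names are illustrative) and makes the missing-value handling explicit.

```python
import math

DAYS_PER_YEAR = 365.25


def days_to_birth(age_years):
    """Age expressed as a negative day count; NaN propagates through the arithmetic."""
    return -age_years * DAYS_PER_YEAR


def age_at_diagnosis(age_years):
    """Age in whole days, rounded up; NaN is returned unchanged."""
    if math.isnan(age_years):
        return float("nan")
    return math.ceil(age_years * DAYS_PER_YEAR)


print(days_to_birth(58.0), age_at_diagnosis(58.0))  # -21184.5 21185
```

The explicit NaN check matters for the second function because math.ceil raises an error on NaN, whereas plain multiplication simply yields NaN.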

For comparison, here is what our original data looked like:

[22]:
original_columns = [m["source"] for m in harmonization_spec]
dataset[original_columns]
[22]:
Tumor_Site BMI Age Age Tumor_Size_cm Country Histologic_type FIGO_stage Race Ethnicity Gender Tumor_Focality
0 Anterior endometrium 38.88 64.0 64.0 2.9 United States Endometrioid IA White Not-Hispanic or Latino Female Unifocal
1 Posterior endometrium 39.76 58.0 58.0 3.5 United States Endometrioid IA White Not-Hispanic or Latino Female Unifocal
2 Other, specify 51.19 50.0 50.0 4.5 United States Endometrioid IA White Not-Hispanic or Latino Female Unifocal
3 NaN NaN NaN NaN NaN NaN Carcinosarcoma NaN NaN NaN NaN NaN
4 Other, specify 32.69 75.0 75.0 3.5 United States Endometrioid IA White Not-Hispanic or Latino Female Unifocal
... ... ... ... ... ... ... ... ... ... ... ... ...
99 Other, specify 29.40 75.0 75.0 4.2 Ukraine Endometrioid IA NaN NaN Female Unifocal
100 Other, specify 35.42 74.0 74.0 1.5 Ukraine Endometrioid II NaN NaN Female Unifocal
101 Other, specify 24.32 85.0 85.0 3.8 United States Serous II Black or African American Not-Hispanic or Latino Female Unifocal
102 Other, specify 34.06 70.0 70.0 5.0 Ukraine Serous IA NaN NaN Female Unifocal
103 NaN NaN NaN NaN NaN Ukraine Serous NaN NaN NaN NaN NaN

104 rows × 12 columns