bdikit.api (Module)

match_schema(source: pandas.DataFrame | BaseStandard, target: pandas.DataFrame | BaseStandard | str | None = 'gdc', method: str | BaseSchemaMatcher | None = 'magneto_ft_bp', source_context: Dict[str, Any] | None = None, target_context: Dict[str, Any] | None = None, method_args: Dict[str, Any] | None = None, standard_args: Dict[str, Any] | None = None, use_cache: bool | None = True) pandas.DataFrame

Performs schema matching between the source table and the given target schema. The target is either a DataFrame or a string naming a standard data model supported by the library. Currently, only the GDC (Genomic Data Commons) and Synapse standard data models are supported.

Parameters:
  • source (Union[pd.DataFrame, BaseStandard]) – The source table or standard data model to be mapped.

  • target (Union[pd.DataFrame, BaseStandard, str], optional) – The target table or standard data model. Defaults to “gdc”.

  • method (Union[str, BaseSchemaMatcher], optional) – The method to use for schema matching. Defaults to “magneto_ft_bp”.

  • source_context (Dict[str, Any], optional) – The context for the source dataset, which can include additional information, such as descriptions or metadata.

  • target_context (Dict[str, Any], optional) – The context for the target dataset, which can include additional information, such as descriptions or metadata.

  • method_args (Dict[str, Any], optional) – The additional arguments of the method for schema matching.

  • standard_args (Dict[str, Any], optional) – The additional arguments of the standard data model (target).

  • use_cache (bool, optional) – Whether to cache the match results. Defaults to True.

Returns:

A DataFrame containing the match results, with columns “source_attribute”, “target_attribute” and “similarity”.

Return type:

pd.DataFrame

Raises:

ValueError – If the method is neither a string nor an instance of BaseSchemaMatcher.
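
A minimal usage sketch, assuming bdikit and pandas are installed. The helper name match_against_gdc and the EXPECTED_COLUMNS constant are made up for illustration and are not part of the library. The import sits inside the function so that the snippet can be loaded even where bdikit is absent.

```python
# Columns that the documentation above says match_schema() returns.
EXPECTED_COLUMNS = ("source_attribute", "target_attribute", "similarity")


def match_against_gdc(source_df):
    """Match the columns of source_df against the GDC standard data model.

    Hypothetical helper: only match_schema() itself comes from bdikit.api.
    """
    # Imported lazily so this snippet can be loaded without bdikit installed.
    from bdikit.api import match_schema

    matches = match_schema(source_df, target="gdc", method="magneto_ft_bp")

    # Sanity check against the documented result columns.
    missing = [name for name in EXPECTED_COLUMNS if name not in matches.columns]
    if missing:
        raise RuntimeError(f"result is missing expected columns: {missing}")
    return matches
```

With a small source table such as pd.DataFrame({"Age": [63, 71], "Gender": ["F", "M"]}), the call match_against_gdc(df) should return one DataFrame of matches. The actual target attributes and scores depend on the chosen method and are not shown here.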

rank_schema_matches(source: pandas.DataFrame | BaseStandard, target: pandas.DataFrame | BaseStandard | str | None = 'gdc', attributes: List[str] | None = None, top_k: int | None = 10, method: str | BaseTopkSchemaMatcher | None = 'magneto_ft_bp', source_context: Dict[str, Any] | None = None, target_context: Dict[str, Any] | None = None, method_args: Dict[str, Any] | None = None, standard_args: Dict[str, Any] | None = None, use_cache: bool | None = True) pandas.DataFrame

Returns the top-k matches between the source and target tables.

Parameters:
  • source (Union[pd.DataFrame, BaseStandard]) – The source table or standard data model to be mapped.

  • target (Union[pd.DataFrame, BaseStandard, str], optional) – The target table or standard data model. Defaults to “gdc”.

  • attributes (List[str], optional) – The list of attributes/columns to consider for matching. Defaults to None.

  • top_k (int, optional) – The number of top matches to return. Defaults to 10.

  • method (Union[str, BaseTopkSchemaMatcher], optional) – The method to use for schema matching. Defaults to “magneto_ft_bp”.

  • source_context (Dict[str, Any], optional) – The context for the source dataset, which can include additional information, such as descriptions or metadata.

  • target_context (Dict[str, Any], optional) – The context for the target dataset, which can include additional information, such as descriptions or metadata.

  • method_args (Dict[str, Any], optional) – The additional arguments of the method for schema matching.

  • standard_args (Dict[str, Any], optional) – The additional arguments of the standard data model.

  • use_cache (bool, optional) – Whether to cache the match results. Defaults to True.

Returns:

A DataFrame containing the top-k matches between the source and target tables.

Return type:

pd.DataFrame
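
To make "top-k" concrete, the sketch below ranks candidate matches in plain Python. It illustrates the shape of a top-k result only and is not the library's matching algorithm. The function name top_k_matches and the similarity scores in the usage note are invented for the example.

```python
import heapq
from collections import defaultdict


def top_k_matches(scored_pairs, top_k=10):
    """Keep the top_k highest-scoring targets for each source attribute.

    scored_pairs: iterable of (source_attribute, target_attribute, similarity).
    Returns a list of tuples, grouped by source attribute in first-seen order
    and sorted by descending similarity within each group.
    """
    by_source = defaultdict(list)
    for source, target, score in scored_pairs:
        by_source[source].append((source, target, score))

    ranked = []
    for candidates in by_source.values():
        # nlargest returns the candidates sorted from highest to lowest score.
        ranked.extend(heapq.nlargest(top_k, candidates, key=lambda row: row[2]))
    return ranked
```

For instance, with three scored candidates for "age" and top_k=2, only the two highest-scoring candidates for "age" are kept.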

match_values(source: pandas.DataFrame | BaseStandard, target: pandas.DataFrame | BaseStandard | str | None, attribute_matches: Tuple[str, str] | pandas.DataFrame, method: str | BaseValueMatcher | None = 'tfidf', source_context: Dict[str, Any] | None = None, target_context: Dict[str, Any] | None = None, method_args: Dict[str, Any] | None = None, standard_args: Dict[str, Any] | None = None, output_format: str | None = 'dataframe', use_cache: bool | None = True) pandas.DataFrame | List[pandas.DataFrame]

Finds matches between attribute values from the source dataset and attribute values of the target domain (a pd.DataFrame or a standard dictionary such as ‘gdc’) using the method provided in method.

Parameters:
  • source (pd.DataFrame) – The source dataset containing the attributes/columns to be matched.

  • target (Union[str, pd.DataFrame]) – The target domain to match the values to. It can be either a DataFrame or a standard vocabulary name.

  • attribute_matches (Union[Tuple[str, str], pd.DataFrame]) –

    A tuple or a DataFrame containing the mappings between source and target attributes.

    • If a tuple is provided, it should contain two strings where the first is the source attribute and the second is the target attribute.

    • If a DataFrame is provided, it should contain ‘source_attribute’ and ‘target_attribute’ columns, where each row specifies an attribute mapping.

  • method (Union[str, BaseValueMatcher], optional) – The name of the method, or a matcher instance, to use for value matching. Defaults to “tfidf”.

  • source_context (Dict[str, Any], optional) – The context for the source dataset, which can include additional information, such as descriptions or metadata.

  • target_context (Dict[str, Any], optional) – The context for the target dataset, which can include additional information, such as descriptions or metadata.

  • method_args (Dict[str, Any], optional) – The additional arguments of the method for value matching.

  • standard_args (Dict[str, Any], optional) – The additional arguments of the standard data model.

  • output_format (str, optional) – The format of the output. If “dataframe”, a single DataFrame is returned. If “list”, a list of DataFrames is returned. Defaults to “dataframe”.

  • use_cache (bool, optional) – Whether to use caching for the value matching results. Defaults to True.

Returns:

A DataFrame or a List of DataFrames containing the results of value matching between the source and target values.

Return type:

Union[pd.DataFrame, List[pd.DataFrame]]

Raises:
  • ValueError – If the attribute_matches DataFrame does not contain ‘source_attribute’ and ‘target_attribute’ columns.

  • ValueError – If the target is neither a DataFrame nor a standard vocabulary name.

  • ValueError – If the source attribute is not present in the source dataset.
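
The two accepted forms of attribute_matches and the documented checks can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the function name normalize_attribute_matches is hypothetical, and the DataFrame form is represented as a list of row dictionaries (the shape produced by DataFrame.to_dict("records")) so that the snippet has no dependencies.

```python
def normalize_attribute_matches(attribute_matches, source_columns):
    """Return a list of (source_attribute, target_attribute) pairs.

    attribute_matches: a 2-tuple of strings, or a list of row dicts.
    source_columns: the column names available in the source dataset.
    """
    if isinstance(attribute_matches, tuple):
        if len(attribute_matches) != 2:
            raise ValueError("a tuple must be (source_attribute, target_attribute)")
        pairs = [attribute_matches]
    else:
        pairs = []
        for row in attribute_matches:
            if "source_attribute" not in row or "target_attribute" not in row:
                raise ValueError(
                    "rows need 'source_attribute' and 'target_attribute'"
                )
            pairs.append((row["source_attribute"], row["target_attribute"]))

    # Mirrors the documented error for unknown source attributes.
    for source, _target in pairs:
        if source not in source_columns:
            raise ValueError(f"source attribute {source!r} is not in the source dataset")
    return pairs
```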

rank_value_matches(source: pandas.DataFrame, target: str | pandas.DataFrame, attribute_matches: Tuple[str, str] | pandas.DataFrame, top_k: int | None = 5, method: str | BaseTopkValueMatcher | None = 'tfidf', source_context: Dict[str, Any] | None = None, target_context: Dict[str, Any] | None = None, method_args: Dict[str, Any] | None = None, standard_args: Dict[str, Any] | None = None, output_format: str | None = 'dataframe', use_cache: bool | None = True) pandas.DataFrame | List[pandas.DataFrame]

Finds top value matches between attribute values from the source dataset and attribute values of the target domain (a pd.DataFrame or a standard dictionary such as ‘gdc’) using the method provided in method.

Parameters:
  • source (pd.DataFrame) – The source dataset containing the attributes to be matched.

  • target (Union[str, pd.DataFrame]) – The target domain to match the values to. It can be either a DataFrame or a standard vocabulary name.

  • attribute_matches (Union[Tuple[str, str], pd.DataFrame]) –

    A tuple or a DataFrame containing the mappings between source and target attributes.

    • If a tuple is provided, it should contain two strings where the first is the source attribute and the second is the target attribute.

    • If a DataFrame is provided, it should contain ‘source_attribute’ and ‘target_attribute’ columns, where each row specifies an attribute mapping.

  • top_k (int, optional) – The number of top matches to return. Defaults to 5.

  • method (Union[str, BaseTopkValueMatcher], optional) – The name of the method, or a matcher instance, to use for value matching. Defaults to “tfidf”.

  • source_context (Dict[str, Any], optional) – The context for the source dataset, which can include additional information, such as descriptions or metadata.

  • target_context (Dict[str, Any], optional) – The context for the target dataset, which can include additional information, such as descriptions or metadata.

  • method_args (Dict[str, Any], optional) – The additional arguments of the method for value matching.

  • standard_args (Dict[str, Any], optional) – The additional arguments of the standard vocabulary.

  • output_format (str, optional) – The format of the output. If “dataframe”, a single DataFrame is returned. If “list”, a list of DataFrames is returned. Defaults to “dataframe”.

  • use_cache (bool, optional) – Whether to use caching for the value matching results. Defaults to True.

Returns:

A DataFrame or a List of DataFrames containing the results of value matching between the source and target values.

Return type:

Union[pd.DataFrame, List[pd.DataFrame]]

Raises:
  • ValueError – If the attribute_matches DataFrame does not contain ‘source_attribute’ and ‘target_attribute’ columns.

  • ValueError – If the target is neither a DataFrame nor a standard vocabulary name.

  • ValueError – If the source attribute is not present in the source dataset.

view_value_matches(matches: pandas.DataFrame | List[pandas.DataFrame], edit: bool = False)

Shows the value match results grouped by source and target attribute matches.

Parameters:
  • matches (Union[pd.DataFrame, List[pd.DataFrame]]) – The value match results obtained by the method match_values() or rank_value_matches().

  • edit (bool, optional) – Whether or not to edit the values within the DataFrame. Editable mode works only in Jupyter notebooks.

preview_domain(dataset: pandas.DataFrame | BaseStandard | str, attribute: str, limit: int | None = None, standard_args: Dict[str, Any] | None = None) pandas.DataFrame

Preview the domain, i.e. set of unique values, attribute description and value description (if applicable) of the given attribute of the source or target dataset.

Parameters:
  • dataset (Union[pd.DataFrame, BaseStandard, str]) – The dataset or standard data model name containing the attribute to preview. If a BaseStandard or string is provided, the domain is retrieved from the standard data model. If a DataFrame is provided, the domain is retrieved from that DataFrame.

  • attribute (str) – The name of the attribute whose domain is shown.

  • limit (int, optional) – The maximum number of unique values to include in the preview. Defaults to None.

  • standard_args (Dict[str, Any], optional) – The additional arguments of the standard data model.

Returns:

A DataFrame containing the unique domain values (or a sample of them if the parameter limit was specified), attribute description and value description (if applicable).

Return type:

pd.DataFrame

evaluate_schema_matches(source: pandas.DataFrame, target: str | pandas.DataFrame, schema_matches: pandas.DataFrame, standard_args: Dict[str, Any] | None = None) pandas.DataFrame

Evaluates the schema matches by providing a response and an explanation for each match.

Parameters:
  • source (pd.DataFrame) – The source dataset.

  • target (Union[str, pd.DataFrame]) – The target dataset or standard data model name.

  • schema_matches (pd.DataFrame) – The DataFrame containing the schema matches with columns ‘source_attribute’, ‘target_attribute’, and ‘similarity’.

  • standard_args (Dict[str, Any], optional) – Additional arguments for the standard data model.

Returns:

A DataFrame containing the evaluated matches with additional columns ‘response’ and ‘explanation’.

Return type:

pd.DataFrame

evaluate_value_matches(source: pandas.DataFrame, target: str | pandas.DataFrame, value_matches: pandas.DataFrame, standard_args: Dict[str, Any] | None = None) pandas.DataFrame

Evaluates the value matches by providing a response and an explanation for each match.

Parameters:
  • source (pd.DataFrame) – The source dataset.

  • target (Union[str, pd.DataFrame]) – The target dataset or standard data model name.

  • value_matches (pd.DataFrame) – The DataFrame containing the value matches with columns ‘source_attribute’, ‘target_attribute’, ‘source_value’, ‘target_value’, and ‘similarity’.

  • standard_args (Dict[str, Any], optional) – Additional arguments for the standard data model.

Returns:

A DataFrame containing the evaluated matches with additional columns ‘response’ and ‘explanation’.

Return type:

pd.DataFrame

create_harmonization_spec(mappings: MappingSpecLike, user_mappings: MappingSpecLike | None = None) List

Creates a “data harmonization” plan based on the provided schema and/or value mappings. These mappings can either be computed by the library’s functions or provided by the user. If the user mappings are provided (using the user_mappings parameter), they will take precedence over the mappings provided in the first parameter.

Parameters:
  • mappings (MappingSpecLike) – The schema and/or value mappings used to create the data harmonization plan. It can be a DataFrame, a list of dictionaries or a list of DataFrames.

  • user_mappings (MappingSpecLike, optional) – The user-provided mappings, which take precedence over those given in mappings. It can be a DataFrame, a list of dictionaries or a list of DataFrames. Defaults to None.

Returns:

The data harmonization plan that can be used as input to the materialize_mapping() function. Concretely, the harmonization plan is a list of dictionaries, where each dictionary contains the source column, target column, and mapper object that will be used to transform the input to the output data.

Return type:

List

Raises:

ValueError – If there are duplicate mappings for the same source and target columns.
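
The precedence rule and the duplicate check described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation. It assumes both inputs are lists of dictionaries and that a conflict means the same (source_attribute, target_attribute) pair. The function name merge_with_precedence is hypothetical.

```python
def merge_with_precedence(mappings, user_mappings=None):
    """Combine two lists of mapping dicts; user entries win on conflicts."""

    def index_by_pair(specs):
        indexed = {}
        for spec in specs:
            key = (spec["source_attribute"], spec["target_attribute"])
            if key in indexed:
                # Two entries for the same pair inside one list are ambiguous.
                raise ValueError(f"duplicate mapping for {key}")
            indexed[key] = spec
        return indexed

    merged = index_by_pair(mappings)
    # Entries from user_mappings replace computed entries with the same key.
    merged.update(index_by_pair(user_mappings or []))
    return list(merged.values())
```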

merge_mappings(mappings: MappingSpecLike, user_mappings: MappingSpecLike | None = None) List

Deprecated: Use create_harmonization_spec() instead.

This function is deprecated and will be removed in a future version. Please use create_harmonization_spec() which provides the same functionality.

materialize_mapping(input_table: pandas.DataFrame, mapping_spec: MappingSpecLike) pandas.DataFrame

Takes an input DataFrame and a target mapping specification and returns a new DataFrame created according to the given target mapping specification. The mapping specification is a list of dictionaries, where each dictionary defines one attribute/column in the output table and how it is created. It includes the names of the input (source) and output (target) attributes and the value mapper used to transform the values of the input attribute into the target output attribute.

Parameters:
  • input_table (pd.DataFrame) – The input (source) DataFrame.

  • mapping_spec (MappingSpecLike) – The target mapping specification. It can be a DataFrame, a list of dictionaries or a list of DataFrames.

Returns:

A DataFrame, which is created according to the target mapping specifications.

Return type:

pd.DataFrame
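
The idea of building an output table column by column from a mapping specification can be sketched without pandas. The code below is an assumption-laden illustration, not the library's implementation: the table is a list of row dictionaries, each specification entry carries an optional callable under "mapper", and the function name materialize_rows and the column names in the usage note are hypothetical.

```python
def materialize_rows(rows, mapping_spec):
    """Build output rows from input rows according to a list-of-dicts spec.

    rows: list of dicts, one per input record.
    mapping_spec: list of dicts with 'source_attribute', 'target_attribute'
    and, optionally, a callable under 'mapper'.
    """
    output = [{} for _ in rows]
    for spec in mapping_spec:
        # Without a mapper the value is copied unchanged (identity mapping).
        transform = spec.get("mapper") or (lambda value: value)
        source = spec["source_attribute"]
        target = spec["target_attribute"]
        for in_row, out_row in zip(rows, output):
            out_row[target] = transform(in_row[source])
    return output
```

For example, a spec entry mapping "Age" to "days_to_birth" with the mapper lambda age: -age * 365.25 turns an input value of 2 into -730.5, while an entry without a mapper copies the column as-is.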

create_mapper(input: None | ValueMapper | pandas.DataFrame | List[ValueMatch] | Dict | ColumnMappingSpec | Callable[[pandas.Series], pandas.Series])

Tries to instantiate an appropriate ValueMapper object for the given input argument. Depending on the input type, it may create one of the following objects:

  • If input is None, it creates an IdentityValueMapper object.

  • If input is a ValueMapper, it returns the input object.

  • If input is a function (or lambda function), it creates a FunctionValueMapper object.

  • If input is a list of ValueMatch objects or tuples (<source_value>, <target_value>), it creates a DictionaryMapper object.

  • If input is a DataFrame with two columns (“source_value”, “target_value”), it creates a DictionaryMapper object.

  • If input is a dictionary containing a “source” and “target” key, it tries to create a ValueMapper object based on the specification given in “mapper” or “matches” keys.

Parameters:

input – The input argument to create a ValueMapper object from.

Returns:

An instance of a ValueMapper.

Return type:

ValueMapper
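
The dispatch on the input type listed above can be sketched with plain callables instead of ValueMapper objects. This is an illustration of the decision order only, not the library's code: the function name make_transform is hypothetical, the DataFrame and ColumnMappingSpec cases are left out, and returning None for values without a match is an assumption of this sketch.

```python
def make_transform(spec):
    """Return a callable value -> value for the given input."""
    if spec is None:
        # Corresponds to the identity case.
        return lambda value: value
    if callable(spec):
        # Corresponds to the function (or lambda) case.
        return spec
    if isinstance(spec, list):
        # A list of (source_value, target_value) tuples becomes a lookup table.
        lookup = dict(spec)
        return lambda value: lookup.get(value)
    if isinstance(spec, dict):
        # A dictionary delegates to its "mapper" or "matches" entry.
        if "mapper" in spec:
            return make_transform(spec["mapper"])
        if "matches" in spec:
            return make_transform(spec["matches"])
        return make_transform(None)
    raise ValueError(f"cannot build a transform from {type(spec).__name__}")
```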

MappingSpecLike

The MappingSpecLike is a type alias that specifies mappings between source and target attributes/columns. It must include the source and target attribute names and a value mapper object that transforms the values of the source attribute into the target.

The mapping specification can be (1) a DataFrame or (2) a list of dictionaries or DataFrames.

If it is a list of dictionaries, they must have:

  • source_attribute: The name of the source attribute/column.

  • target_attribute: The name of the target attribute/column.

  • mapper (optional): A ValueMapper instance or an object that can be used to create one using create_mapper(). Examples of valid objects are Python functions or lambda functions. If empty, an IdentityValueMapper is used by default.

  • matches (optional): Specifies the value mappings. It can be a DataFrame containing the matches (returned by match_values()), a list of ValueMatch objects, or a list of tuples (<source_value>, <target_value>).

Alternatively, the list can contain DataFrames. In this case, the DataFrames must contain not only the value mappings (as described in the matches key above) but also the source_attribute and target_attribute columns as DataFrame attributes. The DataFrames created by match_values() include this information by default.

If the mapping specification is a DataFrame, it must be compatible with the dictionaries above and contain source_attribute, target_attribute, and mapper or matches columns.

Example:

mapping_spec = [
  {
    "source_attribute": "source_column1",
    "target_attribute": "target_column1",
  },
  {
    "source_attribute": "source_column2",
    "target_attribute": "target_column2",
    "mapper": lambda age: -age * 365.25,
  },
  {
    "source_attribute": "source_column3",
    "target_attribute": "target_column3",
    "matches": [
      ("source_value1", "target_value1"),
      ("source_value2", "target_value2"),
    ]
  },
  {
    "source_attribute": "source_column",
    "target_attribute": "target_column",
    "matches": df_value_mapping_1
  },
  df_value_mapping_2, # a DataFrame returned by match_values()
]

alias of List[Dict | DataFrame] | DataFrame