bdikit.api (Module)
- match_schema(source: pandas.DataFrame, target: str | pandas.DataFrame = 'gdc', method: str | BaseSchemaMatcher = 'coma', method_args: Dict[str, Any] | None = None) -> pandas.DataFrame
Performs schema mapping between the source table and the given target schema. The target is either a DataFrame or a string representing a standard data vocabulary supported by the library. Currently, only the GDC (Genomic Data Commons) standard vocabulary is supported.
- Parameters:
source (pd.DataFrame) – The source table to be mapped.
target (Union[str, pd.DataFrame], optional) – The target table or standard data vocabulary. Defaults to “gdc”.
method (Union[str, BaseSchemaMatcher], optional) – The method used for mapping, given either as a method name or as a BaseSchemaMatcher instance. Defaults to “coma”.
method_args (Dict[str, Any], optional) – The additional arguments of the method for schema matching.
- Returns:
A DataFrame containing the mapping results with columns “source” and “target”.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the method is neither a string nor an instance of BaseSchemaMatcher.
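To make the shape of the result concrete, the following is a minimal sketch of name-based schema matching. It is not the library's “coma” method: it substitutes `difflib` string similarity as a stand-in, and the helper name `toy_match_schema` and the sample column names are hypothetical. Plain dictionaries stand in for the rows of the returned DataFrame so that the sketch runs without the library installed.

```python
from difflib import SequenceMatcher


def toy_match_schema(source_columns, target_columns):
    """Pair each source column with its most similar target column name.

    Hypothetical stand-in for a schema matcher: plain string similarity,
    not the COMA algorithm used by the library.
    """
    rows = []
    for src in source_columns:
        # Pick the target whose lowercased name is most similar to the source.
        best = max(
            target_columns,
            key=lambda tgt: SequenceMatcher(None, src.lower(), tgt.lower()).ratio(),
        )
        rows.append({"source": src, "target": best})
    return rows


# Each row mirrors the documented "source" and "target" result columns.
mapping = toy_match_schema(
    ["Gender", "Age_at_dx"],
    ["gender", "age_at_diagnosis", "race"],
)
```

With the real API, the equivalent call would pass a DataFrame as source and either “gdc” or another DataFrame as target.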
- top_matches(source: pandas.DataFrame, columns: List[str] | None = None, target: str | pandas.DataFrame = 'gdc', top_k: int = 10, method: str | TopkColumnMatcher = 'ct_learning', method_args: Dict[str, Any] | None = None) -> pandas.DataFrame
Returns the top-k matches between the source and target tables.
- Parameters:
source (pd.DataFrame) – The source table.
columns (Optional[List[str]], optional) – The list of columns to consider for matching. Defaults to None.
target (Union[str, pd.DataFrame], optional) – The target table or the name of the standard target table. Defaults to “gdc”.
top_k (int, optional) – The number of top matches to return. Defaults to 10.
method (Union[str, TopkColumnMatcher], optional) – The method used for matching, given either as a method name or as a TopkColumnMatcher instance. Defaults to “ct_learning”.
method_args (Dict[str, Any], optional) – The additional arguments of the method for top-k matching. Defaults to None.
- Returns:
A DataFrame containing the top-k matches between the source and target tables.
- Return type:
pd.DataFrame
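The idea of a top-k ranking can be sketched as follows. This is an illustration under stated assumptions, not the library's “ct_learning” method: similarity is computed with `difflib`, and the names `toy_top_matches` and `similarity` are hypothetical.

```python
import heapq
from difflib import SequenceMatcher


def toy_top_matches(source_column, target_columns, top_k=10):
    """Return up to top_k target columns ranked by name similarity.

    Hypothetical stand-in: string similarity replaces the learned
    matcher that the library uses by default.
    """
    scored = [
        {
            "source": source_column,
            "target": tgt,
            "similarity": SequenceMatcher(
                None, source_column.lower(), tgt.lower()
            ).ratio(),
        }
        for tgt in target_columns
    ]
    # nlargest returns the rows sorted from most to least similar.
    return heapq.nlargest(top_k, scored, key=lambda row: row["similarity"])


candidates = toy_top_matches(
    "tumor_grade", ["grade", "tumor_stage", "gender"], top_k=2
)
```

Returning several ranked candidates per column, rather than a single best match, leaves the final choice to the user.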
- match_values(source: pandas.DataFrame, target: str | pandas.DataFrame, column_mapping: Tuple[str, str] | pandas.DataFrame, method: str = 'tfidf', method_args: Dict[str, Any] | None = None) -> pandas.DataFrame | List[pandas.DataFrame]
Finds matches between column values from the source dataset and column values of the target domain (a pd.DataFrame or a standard dictionary such as ‘gdc’) using the method provided in method.
- Parameters:
source (pd.DataFrame) – The source dataset containing the columns to be matched.
target (Union[str, pd.DataFrame]) – The target domain to match the values to. It can be either a DataFrame or a standard vocabulary name.
column_mapping (Union[Tuple[str, str], pd.DataFrame]) –
A tuple or a DataFrame containing the mappings between source and target columns.
If a tuple is provided, it should contain two strings where the first is the source column and the second is the target column.
If a DataFrame is provided, it should contain ‘source’ and ‘target’ column names where each row specifies a column mapping.
method (str, optional) – The name of the method to use for value matching. Defaults to “tfidf”.
method_args (Dict[str, Any], optional) – The additional arguments of the method for value matching.
- Returns:
A list of DataFrame objects containing the results of value matching between the source and target values. If column_mapping is a tuple, a single DataFrame is returned instead of a list.
- Return type:
Union[pd.DataFrame, List[pd.DataFrame]]
- Raises:
ValueError – If the column_mapping DataFrame does not contain ‘source’ and ‘target’ columns.
ValueError – If the target is neither a DataFrame nor a standard vocabulary name.
ValueError – If the source column is not present in the source dataset.
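The two accepted forms of column_mapping, and the way the return shape follows from them, can be sketched as below. This is a simplified illustration, not the library's implementation: dictionaries of lists stand in for DataFrames, `difflib.get_close_matches` replaces the “tfidf” method, and `toy_match_values` is a hypothetical name.

```python
from difflib import get_close_matches


def toy_match_values(source, target, column_mapping):
    """Match unique source values to the closest target value.

    source, target: dicts of column name -> list of string values.
    column_mapping: a (source_col, target_col) tuple, or a list of
    {"source": ..., "target": ...} dicts standing in for a DataFrame.
    """
    single = isinstance(column_mapping, tuple)
    pairs = (
        [column_mapping]
        if single
        else [(m["source"], m["target"]) for m in column_mapping]
    )
    results = []
    for src_col, tgt_col in pairs:
        if src_col not in source:
            raise ValueError(f"Source column {src_col!r} is not in the source dataset.")
        target_values = sorted(set(target[tgt_col]))
        matches = []
        for value in sorted(set(source[src_col])):
            # cutoff=0.0 keeps the best candidate even when similarity is low.
            close = get_close_matches(value, target_values, n=1, cutoff=0.0)
            matches.append((value, close[0] if close else None))
        results.append({"source": src_col, "target": tgt_col, "matches": matches})
    # A tuple yields one result; any other mapping yields a list of results.
    return results[0] if single else results


source_table = {"sex": ["female", "mal", "female"]}
target_table = {"gender": ["female", "male", "unknown"]}
single_result = toy_match_values(source_table, target_table, ("sex", "gender"))
list_result = toy_match_values(
    source_table, target_table, [{"source": "sex", "target": "gender"}]
)
```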
- top_value_matches(source: pandas.DataFrame, target: str | pandas.DataFrame, column_mapping: Tuple[str, str] | pandas.DataFrame, top_k: int = 5, method: str = 'tfidf', method_args: Dict[str, Any] | None = None) -> List[pandas.DataFrame]
Finds top value matches between column values from the source dataset and column values of the target domain (a pd.DataFrame or a standard dictionary such as ‘gdc’) using the method provided in method.
- Parameters:
source (pd.DataFrame) – The source dataset containing the columns to be matched.
target (Union[str, pd.DataFrame]) – The target domain to match the values to. It can be either a DataFrame or a standard vocabulary name.
column_mapping (Union[Tuple[str, str], pd.DataFrame]) –
A tuple or a DataFrame containing the mappings between source and target columns.
If a tuple is provided, it should contain two strings where the first is the source column and the second is the target column.
If a DataFrame is provided, it should contain ‘source’ and ‘target’ column names where each row specifies a column mapping.
top_k (int, optional) – The number of top matches to return. Defaults to 5.
method (str, optional) – The name of the method to use for value matching. Defaults to “tfidf”.
method_args (Dict[str, Any], optional) – The additional arguments of the method for value matching.
- Returns:
A list of DataFrame objects containing the results of value matching between the source and target values.
- Return type:
List[pd.DataFrame]
- Raises:
ValueError – If the column_mapping DataFrame does not contain ‘source’ and ‘target’ columns.
ValueError – If the target is neither a DataFrame nor a standard vocabulary name.
ValueError – If the source column is not present in the source dataset.
- view_value_matches(matches: pandas.DataFrame | List[pandas.DataFrame], edit: bool = False)
Displays the value match results as a DataFrame.
- Parameters:
matches (Union[pd.DataFrame, List[pd.DataFrame]]) – The value match results obtained by the method match_values().
edit (bool) – Whether or not to edit the values within the DataFrame.
- preview_domain(dataset: str | pandas.DataFrame, column: str, limit: int | None = None) -> pandas.DataFrame
Preview the domain, i.e. set of unique values, column description and value description (if applicable) of the given column of the source or target dataset.
- Parameters:
dataset (Union[str, pd.DataFrame]) – The dataset or standard vocabulary name containing the column to preview. If a string is provided and it is equal to “gdc”, the domain will be retrieved from the GDC data. If a DataFrame is provided, the domain will be retrieved from the specified DataFrame.
column (str) – The column name to show the domain.
limit (int, optional) – The maximum number of unique values to include in the preview. Defaults to None.
- Returns:
A DataFrame containing the unique domain values (or a sample of them if the parameter limit was specified), column description and value description (if applicable).
- Return type:
pd.DataFrame
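As a rough illustration of what a domain preview computes, the sketch below collects the unique values of one column and optionally truncates them. It is an assumption-laden stand-in: `toy_preview_domain` is a hypothetical name, rows are plain dictionaries, the limit is applied by keeping the first values seen (the library may sample differently), and column and value descriptions are omitted.

```python
def toy_preview_domain(rows, column, limit=None):
    """Return the unique values of `column` in first-seen order.

    rows: list of dicts standing in for DataFrame rows.
    limit: maximum number of unique values to return (None for all).
    """
    # dict.fromkeys removes duplicates while preserving insertion order.
    unique_values = list(dict.fromkeys(row[column] for row in rows))
    return unique_values if limit is None else unique_values[:limit]


rows = [{"race": "white"}, {"race": "asian"}, {"race": "white"}]
full_domain = toy_preview_domain(rows, "race")
limited_domain = toy_preview_domain(rows, "race", limit=1)
```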
- merge_mappings(mappings: MappingSpecLike, user_mappings: MappingSpecLike | None = None) -> List
Creates a “data harmonization” plan based on the provided schema and/or value mappings. These mappings can either be computed by the library’s functions or provided by the user. If the user mappings are provided (using the user_mappings parameter), they will take precedence over the mappings provided in the first parameter.
- Parameters:
mappings (MappingSpecLike) – The value mappings used to create the data harmonization plan. It can be a DataFrame, a list of dictionaries or a list of DataFrames.
user_mappings (Optional[MappingSpecLike]) – The user mappings to be included in the update. It can be a DataFrame, a list of dictionaries or a list of DataFrames. Defaults to None.
- Returns:
The data harmonization plan that can be used as input to the materialize_mapping() function. Concretely, the harmonization plan is a list of dictionaries, where each dictionary contains the source column, target column, and mapper object that will be used to transform the input to the output data.
- Return type:
List
- Raises:
ValueError – If there are duplicate mappings for the same source and target columns.
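The precedence rule described above can be sketched with plain dictionaries. This is an illustrative reading of the documented behaviour, not the library's code: `toy_merge_mappings` is a hypothetical name, only the list-of-dictionaries form is handled, and the duplicate check is assumed to apply within each input list separately.

```python
def toy_merge_mappings(mappings, user_mappings=None):
    """Merge mapping specs keyed by (source, target); user specs win."""

    def index(specs):
        indexed = {}
        for spec in specs:
            key = (spec["source"], spec["target"])
            if key in indexed:
                # Assumption: duplicates inside one list are an error.
                raise ValueError(f"Duplicate mapping for columns {key}.")
            indexed[key] = spec
        return indexed

    merged = index(mappings)
    # Entries from user_mappings replace computed ones with the same key.
    merged.update(index(user_mappings or []))
    return list(merged.values())


computed = [
    {"source": "sex", "target": "gender", "matches": [("M", "male")]},
    {"source": "age", "target": "age_at_diagnosis"},
]
user = [{"source": "sex", "target": "gender", "matches": [("M", "Male")]}]
plan = toy_merge_mappings(computed, user)
```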
- materialize_mapping(input_table: pandas.DataFrame, mapping_spec: MappingSpecLike) -> pandas.DataFrame
Takes an input DataFrame and a target mapping specification and returns a new DataFrame created according to the given target mapping specification. The mapping specification is a list of dictionaries, where each dictionary defines one column in the output table and how it is created. It includes the names of the input (source) and output (target) columns and the value mapper used to transform the values of the input column into the target output column.
- Parameters:
input_table (pd.DataFrame) – The input (source) DataFrame.
mapping_spec (MappingSpecLike) – The target mapping specification. It can be a DataFrame, a list of dictionaries or a list of DataFrames.
- Returns:
A DataFrame, which is created according to the target mapping specifications.
- Return type:
pd.DataFrame
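How a mapping specification drives the construction of the output table can be sketched as follows. This is a simplified illustration rather than the library's implementation: a dictionary of lists stands in for a DataFrame, `toy_materialize_mapping` is a hypothetical name, mappers are assumed to be plain functions applied value by value, and a missing mapper is treated as the identity.

```python
def toy_materialize_mapping(input_table, mapping_spec):
    """Build an output table with one column per mapping spec entry.

    input_table: dict of column name -> list of values.
    mapping_spec: list of dicts with "source", "target" and an
    optional "mapper" callable.
    """
    output_table = {}
    for spec in mapping_spec:
        # Without a mapper, values are copied unchanged (identity mapping).
        mapper = spec.get("mapper") or (lambda value: value)
        output_table[spec["target"]] = [
            mapper(value) for value in input_table[spec["source"]]
        ]
    return output_table


input_table = {"id": ["p1", "p2"], "age": [1, 2]}
mapping_spec = [
    {"source": "id", "target": "submitter_id"},
    {"source": "age", "target": "days_to_birth", "mapper": lambda age: -age * 365.25},
]
harmonized = toy_materialize_mapping(input_table, mapping_spec)
```

Only columns named in the specification appear in the output, which matches the description that each dictionary defines one output column.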
- create_mapper(input: None | ValueMapper | pandas.DataFrame | ValueMatchingResult | List[ValueMatch] | Dict | ColumnMappingSpec | Callable[[pandas.Series], pandas.Series])
Tries to instantiate an appropriate ValueMapper object for the given input argument. Depending on the input type, it may create one of the following objects:
If input is None, it creates an IdentityValueMapper object.
If input is a ValueMapper, it returns the input object.
If input is a function (or lambda function), it creates a FunctionValueMapper object.
If input is a list of ValueMatch objects or tuples (<source_value>, <target_value>), it creates a DictionaryMapper object.
If input is a DataFrame with two columns (“source_value”, “target_value”), it creates a DictionaryMapper object.
If input is a dictionary containing “source” and “target” keys, it tries to create a ValueMapper object based on the specification given in the “mapper” or “matches” keys.
- Parameters:
input – The input argument to create a ValueMapper object from.
- Returns:
An instance of a ValueMapper.
- Return type:
ValueMapper
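The type-based dispatch described above can be sketched in a few lines. This is a reduced illustration, not the library's code: `toy_create_mapper` is a hypothetical name, it returns plain callables instead of ValueMapper objects, it covers only the None, function, and list-of-tuples cases, and mapping unlisted values to None is an assumption rather than documented behaviour.

```python
def toy_create_mapper(spec):
    """Return a per-value mapping function chosen by the type of `spec`."""
    if spec is None:
        # Analogous to an identity mapper: values pass through unchanged.
        return lambda value: value
    if callable(spec):
        # Analogous to a function mapper: the function is used directly.
        return spec
    if isinstance(spec, list):
        # Analogous to a dictionary mapper built from (source, target) pairs.
        lookup = dict(spec)
        # Assumption: values without an entry map to None.
        return lambda value: lookup.get(value)
    raise ValueError(f"Cannot create a mapper from {type(spec).__name__}.")


identity = toy_create_mapper(None)
upper = toy_create_mapper(str.upper)
gender = toy_create_mapper([("M", "male"), ("F", "female")])
```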
- MappingSpecLike
The MappingSpecLike is a type alias that specifies mappings between source and target columns. It must include the source and target column names and a value mapper object that transforms the values of the source column into the target.
The mapping specification can be (1) a DataFrame or (2) a list of dictionaries or DataFrames.
If it is a list of dictionaries, they must have:
source: The name of the source column.
target: The name of the target column.
mapper (optional): A ValueMapper instance or an object that can be used to create one using create_mapper(). Examples of valid objects are Python functions or lambda functions. If empty, an IdentityValueMapper is used by default.
matches (optional): Specifies the value mappings. It can be a DataFrame containing the matches (returned by match_values()), a list of ValueMatch objects, or a list of tuples (<source_value>, <target_value>).
Alternatively, the list can contain DataFrames. In this case, the DataFrames must contain not only the value mappings (as described in the matches key above) but also the source and target columns as DataFrame attributes. The DataFrames created by match_values() include this information by default.
If the mapping specification is a DataFrame, it must be compatible with the dictionaries above and contain source, target, and mapper or matches columns.
Example:
    mapping_spec = [
        {
            "source": "source_column1",
            "target": "target_column1",
        },
        {
            "source": "source_column2",
            "target": "target_column2",
            "mapper": lambda age: -age * 365.25,
        },
        {
            "source": "source_column3",
            "target": "target_column3",
            "matches": [
                ("source_value1", "target_value1"),
                ("source_value2", "target_value2"),
            ]
        },
        {
            "source": "source_column",
            "target": "target_column",
            "matches": df_value_mapping_1
        },
        df_value_mapping_2,  # a DataFrame returned by match_values()
    ]
alias of Union[List[Union[Dict, DataFrame]], DataFrame]