Value Matching Methods

This page provides an overview of all value matching methods available in the BDI-Kit library. Some methods reuse implementations from other libraries such as PolyFuzz (e.g., embedding and tfidf), while others are implemented specifically for bdikit (e.g., llm). To see how to use these methods, refer to the documentation of match_values() in the api module.

bdikit methods

| Method | Class | Description |
|---|---|---|
| llm | LLM | Leverages LLMs to identify and select the most accurate value matches. Supports multiple models, with gpt-4o-mini used as the default. |
| llm_numeric | LLMNumeric | Employs LLMs to perform numeric value transformations, such as converting ages from years to months. Supports multiple models, with gpt-4o-mini used as the default. |
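To make the phrase "numeric value transformation" concrete, the years-to-months case mentioned for llm_numeric can be pictured as an ordinary function applied to each source value. This is a hand-written illustration only: it does not call an LLM or bdikit, and the names `years_to_months`, `source_ages`, and `mapping` are hypothetical.

```python
def years_to_months(age_in_years):
    """Convert an age expressed in years to months (hypothetical helper)."""
    return age_in_years * 12


# Source values are in years; the target column is assumed to expect months.
source_ages = [1, 2.5, 40]

# Map each source value to its transformed counterpart.
mapping = {age: years_to_months(age) for age in source_ages}
```

The point of the sketch is the shape of the result: each source value is paired with a computed target value rather than with an existing value picked from the target column.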

Methods from other libraries

| Method | Class | Description |
|---|---|---|
| tfidf | TFIDF | Uses a character-based n-gram TF-IDF approach to approximate edit distance. Term Frequency-Inverse Document Frequency (TF-IDF) weighting captures the frequency and importance of n-gram patterns within strings, and the similarity between two strings is computed from their shared n-gram features. |
| edit_distance | EditDistance | Computes the edit distance between lists of strings using a customizable scorer that supports various distance and similarity metrics. |
| embedding | Embeddings | Matches values by the cosine similarity of their embeddings. By default, it uses the bert-base-multilingual-cased model to generate contextualized embeddings, which enables multilingual matching. |
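
The ideas behind the tfidf and edit_distance methods can be sketched in plain Python. The block below is a minimal illustration that uses only the standard library: it is not the PolyFuzz or bdikit implementation, and the function names, the trigram size, and the smoothed IDF formula are all assumptions made for the example. It builds character n-gram TF-IDF vectors, compares them with cosine similarity (the same measure the embedding method applies to dense vectors), and includes a plain Levenshtein distance for comparison.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build one sparse TF-IDF vector (a dict) per string over character n-grams."""
    grams = [Counter(char_ngrams(s, n)) for s in strings]
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(strings)
    vectors = []
    for counts in grams:
        # Smoothed IDF (an assumption) keeps weights positive for common n-grams.
        vectors.append({
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
            for g, tf in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_matches(source_values, target_values, n=3):
    """Pair each source value with its most similar target value by TF-IDF cosine."""
    vectors = tfidf_vectors(list(source_values) + list(target_values), n)
    src_vecs = vectors[:len(source_values)]
    tgt_vecs = vectors[len(source_values):]
    result = {}
    for value, vec in zip(source_values, src_vecs):
        scores = [cosine(vec, t) for t in tgt_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        result[value] = (target_values[best], scores[best])
    return result
```

For example, `best_matches(["adenocarcinoma"], ["Adenocarcinoma, NOS", "Normal"])` pairs the source value with "Adenocarcinoma, NOS", because the two strings share every trigram of the source. Shared n-grams tend to track small edit distances, which is why the n-gram approach serves as an approximation of edit distance.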