Value Matching Methods

This page provides an overview of all value matching methods available in the bdikit library. Some methods reuse implementations from other libraries such as PolyFuzz (e.g., embedding and tfidf), while others were implemented originally for bdikit (e.g., gpt). To see how to use these methods, refer to the documentation of match_values() in the api module.

bdikit methods

| Method | Class | Description |
| --- | --- | --- |
| gpt | GPTValueMatcher | Leverages a large language model (GPT-4) to identify and select the most accurate value matches. |

Methods from other libraries

| Method | Class | Description |
| --- | --- | --- |
| tfidf | TFIDFValueMatcher | Approximates edit distance with a character-based n-gram TF-IDF approach. Each string is represented by its n-grams, weighted by Term Frequency-Inverse Document Frequency (TF-IDF), and the similarity between two strings is quantified from their shared n-gram features. |
| edit_distance | EditDistanceValueMatcher | Computes the edit distance between lists of strings using a customizable scorer that supports various distance and similarity metrics. |
| embedding | EmbeddingValueMatcher | Compares values by the cosine similarity of their embeddings. By default, it uses the bert-base-multilingual-cased model to generate contextualized embeddings, which enables multilingual matching. |
| fasttext | FastTextValueMatcher | Uses the cosine similarity of FastText embeddings to compare and align values, capturing both semantic and subword-level similarities. |
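
To make the ideas behind the tfidf and edit_distance methods more concrete, the following is a minimal, standard-library-only sketch. It is not the bdikit or PolyFuzz implementation: the function names (char_ngrams, tfidf_vectors, cosine, levenshtein, edit_similarity, best_matches), the trigram size, the smoothed IDF formula, and the use of cosine similarity to compare TF-IDF vectors are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build one sparse TF-IDF vector (a dict) per string over character n-grams."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    # Smoothed IDF (an assumption of this sketch): n-grams that occur in
    # every string still keep a small positive weight.
    return [
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1) for g, tf in c.items()}
        for c in counts
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_matches(source_values, target_values, n=3):
    """Map each source value to (best target by TF-IDF cosine, cosine score, edit similarity)."""
    if not target_values:
        raise ValueError("target_values must not be empty")
    vectors = tfidf_vectors(list(source_values) + list(target_values), n)
    src_vecs = vectors[:len(source_values)]
    tgt_vecs = vectors[len(source_values):]
    result = {}
    for value, vec in zip(source_values, src_vecs):
        scores = [cosine(vec, t) for t in tgt_vecs]
        best = max(range(len(target_values)), key=scores.__getitem__)
        target = target_values[best]
        result[value] = (target, scores[best], edit_similarity(value.lower(), target.lower()))
    return result


if __name__ == "__main__":
    source = ["Adenocarcinom", "squamous cell carcinoma"]
    target = ["adenocarcinoma", "squamous cell carcinoma", "melanoma"]
    for value, (match, tfidf_score, edit_score) in best_matches(source, target).items():
        print(f"{value!r} -> {match!r} (tfidf={tfidf_score:.3f}, edit={edit_score:.3f})")
```

In this sketch the n-gram comparison costs time proportional to the number of n-grams per string, while the edit distance is quadratic in string length. That difference is one plausible reason to use an n-gram TF-IDF score as an approximation of edit distance on large value lists.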