| Title: | Robust Probabilistic Matching for German Company Names |
|---|---|
| Description: | A pipeline for matching messy company name strings against a clean dictionary (e.g., 'Orbis'). Implements a cascading strategy: Exact -> Fuzzy ('zoomerjoin') -> 'FTS5' ('SQLite') -> Rarity Weighted. References: Beniamino Green (2025) <https://beniamino.org/zoomerjoin/>; <https://www.sqlite.org/fts5.html>. |
| Authors: | Giulian Etingin-Frati [aut, cre] |
| Maintainer: | Giulian Etingin-Frati <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.3 |
| Built: | 2026-05-16 05:17:02 UTC |
| Source: | https://github.com/swediot/firmmatchr |
Runs a cascading matching pipeline: Exact -> Fuzzy (zoomerjoin) -> FTS5 -> Rarity. Queries matched in an earlier step are removed before the later steps run.
    match_companies(
      queries,
      dictionary,
      query_col = "company_name",
      dict_col = "company_name",
      unique_id_col = "query_id",
      dict_id_col = "orbis_id",
      threshold_jw = 0.8,
      threshold_zoomer = 0.4,
      threshold_rarity = 1,
      n_cores = 1
    )
| Argument | Description |
|---|---|
| queries | Data frame. Must contain columns specified in |
| dictionary | Data frame. Must contain columns specified in |
| query_col | String. Column name for company names in |
| dict_col | String. Column name for company names in |
| unique_id_col | String. ID column in |
| dict_id_col | String. ID column in |
| threshold_jw | Numeric (0-1). Minimum Jaro-Winkler similarity. Default 0.8. |
| threshold_zoomer | Numeric (0-1). Jaccard threshold for blocking. Default 0.4. |
| threshold_rarity | Numeric. Minimum score for rarity matching. Default 1.0. |
| n_cores | Integer. Number of cores (reserved for future parallel implementation). |
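To give a feel for what a Jaccard threshold such as `threshold_zoomer = 0.4` means, here is a minimal base R sketch that scores two names by the overlap of their character bigrams. The helpers `char_ngrams()` and `jaccard_sim()` are hypothetical and are not part of this package or of zoomerjoin; the bigram tokenization is an assumption, and the actual blocking step may tokenize and approximate the score differently.

```r
# Hypothetical helper (not part of the package): split a string into
# its distinct overlapping character n-grams.
char_ngrams <- function(s, n = 2) {
  s <- tolower(s)
  if (nchar(s) < n) return(s)
  starts <- seq_len(nchar(s) - n + 1)
  unique(substring(s, starts, starts + n - 1))
}

# Jaccard similarity: shared n-grams divided by all distinct n-grams.
jaccard_sim <- function(a, b, n = 2) {
  x <- char_ngrams(a, n)
  y <- char_ngrams(b, n)
  length(intersect(x, y)) / length(union(x, y))
}

# 6 shared bigrams out of 9 distinct ones
sim <- jaccard_sim("siemens", "siemens ag")
passes <- sim >= 0.4  # compare against an assumed threshold of 0.4
print(sim)
print(passes)
```

Under these assumptions, raising the threshold keeps fewer candidate pairs and lowering it keeps more, at the cost of more comparisons in the later steps.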
A data.table containing query_id, dict_id, and match_type.
    # Create sample query data
    queries <- data.frame(
      query_id = 1:3,
      company_name = c("BMW", "Siemens AG", "Deutsche Bank")
    )

    # Create sample dictionary
    dictionary <- data.frame(
      orbis_id = c("D001", "D002", "D003"),
      company_name = c("BMW AG", "Siemens Aktiengesellschaft", "Commerzbank AG")
    )

    # Match companies (uses multi-threaded Rust internals via zoomerjoin)
    results <- match_companies(
      queries = queries,
      dictionary = dictionary,
      query_col = "company_name",
      dict_col = "company_name",
      unique_id_col = "query_id",
      dict_id_col = "orbis_id"
    )

    print(results)
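The cascading idea, where each step only sees queries that earlier steps left unmatched, can be illustrated with a two-step sketch in base R. This is not the package's implementation: `cascade_sketch()` is a hypothetical function, the second step uses a simple prefix test as a stand-in for the fuzzy, FTS5, and rarity steps, and the `match_type` labels are illustrative only.

```r
# Hypothetical two-step cascade (illustration only, not the package code).
cascade_sketch <- function(queries, dictionary) {
  results <- list()
  remaining <- queries

  # Step 1: exact match on the name column.
  exact <- merge(remaining, dictionary, by = "company_name")
  if (nrow(exact) > 0) {
    results[[length(results) + 1]] <- data.frame(
      query_id = exact$query_id,
      dict_id = exact$orbis_id,
      match_type = "Exact",
      stringsAsFactors = FALSE
    )
    # Remove matched queries so later steps never see them.
    remaining <- remaining[!remaining$query_id %in% exact$query_id, ]
  }

  # Step 2: stand-in for the looser steps, a dictionary name that
  # starts with the query name. Only the first hit is kept.
  for (i in seq_len(nrow(remaining))) {
    hit <- which(startsWith(dictionary$company_name, remaining$company_name[i]))
    if (length(hit) > 0) {
      results[[length(results) + 1]] <- data.frame(
        query_id = remaining$query_id[i],
        dict_id = dictionary$orbis_id[hit[1]],
        match_type = "Prefix",
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, results)
}

q <- data.frame(
  query_id = 1:3,
  company_name = c("BMW AG", "Siemens", "Deutsche Bank"),
  stringsAsFactors = FALSE
)
d <- data.frame(
  orbis_id = c("D001", "D002", "D003"),
  company_name = c("BMW AG", "Siemens Aktiengesellschaft", "Commerzbank AG"),
  stringsAsFactors = FALSE
)
res <- cascade_sketch(q, d)
print(res)
```

In this toy data the first query is resolved by the exact step, the second only by the fallback step, and the third stays unmatched and is absent from the result.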
Standardizes company names by lowercasing, removing legal suffixes, translating characters to ASCII, and removing noise words.
    normalize_company_name(x)
| Argument | Description |
|---|---|
| x | A character vector of company names. |
A character vector of normalized names.
    # Normalize a single company name
    normalize_company_name("BMW AG")
    normalize_company_name("Siemens GmbH & Co. KG")

    # Normalize multiple names
    companies <- c("Deutsche Bank AG", "VW Group", "BASF SE")
    normalize_company_name(companies)
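For readers who want to see roughly what such a normalization involves, the following base R sketch applies the four steps named in the description: transliteration to ASCII, lowercasing, removal of legal suffixes, and removal of noise words. The function `sketch_normalize()` is hypothetical, and its character map, legal-form list, and noise-word list are assumptions chosen for illustration; the output of `normalize_company_name()` may differ.

```r
# Hypothetical re-implementation for illustration; the real function's
# word lists and rules are not documented here and may differ.
sketch_normalize <- function(x) {
  # Transliterate German umlauts and sharp s to ASCII (assumed mapping).
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
  to   <- c("ae", "oe", "ue", "ae", "oe", "ue", "ss")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x <- tolower(x)

  # Replace punctuation with spaces so that "co." becomes the token "co".
  x <- gsub("[[:punct:]]+", " ", x)

  # Drop whole-word legal forms and noise words (assumed lists).
  legal <- c("gmbh", "ag", "kg", "se", "co", "ohg", "ug")
  noise <- c("group", "holding")
  pattern <- paste0("\\b(", paste(c(legal, noise), collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)

  # Collapse repeated whitespace and trim the ends.
  trimws(gsub("\\s+", " ", x))
}

normalized <- sketch_normalize(
  c("BMW AG", "Siemens GmbH & Co. KG", "M\u00fcller Holding GmbH", "VW Group")
)
print(normalized)
```

Matching on whole words matters here: a plain substring removal of "ag" would also damage names that merely contain those letters.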
Sends doubtful matches (not "Perfect" or "Unmatched") to an LLM for verification. Supports resuming from interruptions via chunk files.
    validate_matches_llm(
      data,
      query_name_col,
      dict_name_col,
      output_dir = tempdir(),
      filename_stem = "match_validation",
      batch_size = 20,
      api_key = NULL,
      endpoint = NULL,
      deployment = NULL,
      engine = c("azure", "openai", "local")
    )
| Argument | Description |
|---|---|
| data | Data frame. Must contain the columns specified by |
| query_name_col | String. Column containing the user's query name (Employer). |
| dict_name_col | String. Column containing the dictionary match name (Registry). |
| output_dir | String. Directory to save temporary chunks and final results. Defaults to |
| filename_stem | String. Base name for output files. |
| batch_size | Integer. Number of rows to process before saving a chunk. |
| api_key | String. API Key. Defaults to |
| endpoint | String. API Endpoint. Defaults to |
| deployment | String. Deployment or model name. Defaults to |
| engine | String. Either |
A data frame with added LLM_decision and LLM_reason columns.
    ## Not run: 
    # Sample matched data
    matched_data <- data.frame(
      employer_name = c("BMW", "Siemens"),
      registry_name = c("BMW AG", "SAP SE"),
      dict_id = c("D001", "D002"),
      match_type = c("Fuzzy", "Fuzzy")
    )

    # Validate using LLM (requires Azure credentials)
    validated <- validate_matches_llm(
      data = matched_data,
      query_name_col = "employer_name",
      dict_name_col = "registry_name"
    )

    print(validated)
    ## End(Not run)
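Resuming from chunk files can be pictured with the sketch below: each batch is saved to disk as soon as it is processed, and a rerun reads finished batches back instead of recomputing them. Everything here is an assumption made for illustration, including the function `process_in_chunks()`, the chunk file naming, the `decision` column, and the stub verifier that stands in for the LLM call; the package's own file layout and logic may differ.

```r
# Hypothetical resumable batch loop (illustration only).
process_in_chunks <- function(data, verify_fn, output_dir,
                              stem = "sketch", batch_size = 20) {
  if (nrow(data) == 0) return(data)
  starts <- seq(1, nrow(data), by = batch_size)
  pieces <- vector("list", length(starts))
  for (k in seq_along(starts)) {
    chunk_file <- file.path(output_dir, sprintf("%s_chunk_%04d.rds", stem, k))
    if (file.exists(chunk_file)) {
      # Finished in an earlier run: reuse the saved result.
      pieces[[k]] <- readRDS(chunk_file)
      next
    }
    rows <- starts[k]:min(starts[k] + batch_size - 1, nrow(data))
    chunk <- data[rows, , drop = FALSE]
    chunk$decision <- verify_fn(chunk)
    saveRDS(chunk, chunk_file)  # checkpoint before moving on
    pieces[[k]] <- chunk
  }
  do.call(rbind, pieces)
}

# Stub verifier standing in for the LLM; it counts how often it is called.
calls <- 0
stub_verify <- function(chunk) {
  calls <<- calls + 1
  ifelse(startsWith(chunk$registry_name, chunk$employer_name), "yes", "no")
}

out_dir <- file.path(tempdir(), "chunk_sketch")
unlink(out_dir, recursive = TRUE)
dir.create(out_dir, showWarnings = FALSE)

matched <- data.frame(
  employer_name = c("BMW", "Siemens", "BASF"),
  registry_name = c("BMW AG", "SAP SE", "BASF SE"),
  stringsAsFactors = FALSE
)

first <- process_in_chunks(matched, stub_verify, out_dir, batch_size = 2)
second <- process_in_chunks(matched, stub_verify, out_dir, batch_size = 2)
print(second)
```

The second call finds both chunk files on disk, so the stub verifier is not invoked again. With a paid API behind the verifier, this is what keeps an interrupted run from repeating completed requests.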