基于TF-IDF和KNN的模糊字符串匹配优化
原创What & why Fuzzy String matching
模糊字符串匹配(Fuzzy string matching)是一种查找近似模式(而不是完全匹配)的技术。换句话说,模糊字符串匹配是一种搜索类型,即使用户拼错单词或仅输入部分单词进行搜索,也会找到匹配项。也称为近似字符串匹配(approximate string matching)。
语言是模棱两可的,指向同一事物的文本稍有不同,或者拼写错误。假设导航去机场,无论说“双流机场”还是“双流国际机场”,应该都指向“成都双流国际机场”这个官方正式名称。Expedia和Booking.com是国外两家再现旅行社,它们用自己方法命名其房间。例如,同一家酒店的同一种房间,Expedia 称为“Studio, 1 King Bed with Sofa bed, Corner”,而Booking.com描述为“Corner King Studio”。当我们能要比较OTA(Online Travel Agency)之间的房价,不同的描述会引起混乱。也就是说,如果要做一个价格比较程序,要解决的关键问题之一就是自动找出两个酒店房间是否是同一事物(标准间,豪华套房)。
Why not use FuzzyWuzzy?
当涉及模糊字符串匹配时通常采用FuzzyWuzzy。FuzzyWuzzy库基于Levenshtein距离方法,广泛用于计算字符串的相似度(距离)分数。但为什么不应该使用它呢?答案很简单:太慢了。原因是将每个记录与数据中的所有其他记录进行比较。随着数据大小的增加,执行模糊字符串匹配所需的时间将成倍增加。这种现象被称为二次时间复杂度。二次时间复杂度表示一种算法,其性能与输入数据的平方大小成正比
TF-IDF then KNN
TF-IDF的思想是,它将是数据的文档表示形式,而最匹配的候选对象的选择是使用KNN(K Nearest Neighbor)和余弦相似度而不是Levenshtein距离。基于个人理解,TF-IDF是一种word embedding技术,将文本条目映射到多维空间,而KNN使用基于KDTree或者BallTree的优化搜索树。
#Example RoomType
示例1是英文,基于 RoomType Kaggle数据 。数据如下。
fuzzy_tf_df 实现
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz, process
import re
import itertools
from typing import Union, List, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
import jieba
import pickle
import time
def preprocess_string(s):
s = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', s)
return s
def build_vectorizer( clean: pd.Series, analyzer: str = 'char', ngram_range: Tuple[int, int] = (1, 4), n_neighbors: int = 1, **kwargs ) -> Tuple:
vectorizer = TfidfVectorizer(analyzer = analyzer, ngram_range = ngram_range, **kwargs)
X = vectorizer.fit_transform(clean.values.astype('U'))
nbrs = NearestNeighbors(n_neighbors = n_neighbors, metric = 'cosine').fit(X)
return vectorizer, nbrs
def tfidf_nn(
messy,
clean,
n_neighbors = 1,
**kwargs
# Fit clean data and transform messy data
vectorizer, nbrs = build_vectorizer(clean, n_neighbors = n_neighbors, **kwargs)
input_vec = vectorizer.transform(messy)
# Determine best possible matches
distances, indices = nbrs.kneighbors(input_vec, n_neighbors = n_neighbors)
nearest_values = np.array(clean)[indices]
return nearest_values, distances
# String matching - match fuzzy
def find_matches_fuzzy(
match_candidates,
limit = 5
row_matches = process.extract(
row, dict(enumerate(match_candidates)),
scorer = fuzz.token_sort_ratio,
limit = limit
result = [(row, match[0], match[1]) for match in row_matches]
return result
# String matching - TF-IDF
def fuzzy_nn_match(
messy,
clean,
column,
n_neighbors = 100,
limit = 5, **kwargs):
nearest_values, _ = tfidf_nn(messy, clean, n_neighbors, **kwargs)
results = [find_matches_fuzzy(row, nearest_values[i], limit) for i, row in enumerate(messy)]
df = pd.DataFrame(itertools.chain.from_iterable(results),
columns = [column, col, 'Ratio']
return df
def fuzzy_tf_idf( df: pd.DataFrame, column: str, clean: pd.Series,
mapping_df: pd.DataFrame, col: str, analyzer: str = 'char', ngram_range: Tuple[int, int] = (1, 3) ) -> pd.Series:
clean = clean.drop_duplicates().reset_index(drop = True)
messy_prep = df[column].drop_duplicates().dropna().reset_index(drop = True).astype(str)
messy = messy_prep.apply(preprocess_string)
result = fuzzy_nn_match(messy = messy, clean = clean, column = column, col = col, n_neighbors = 1)
return result
# Run the fuzzy string matching algorithm
start = time.time()
df_result = (df.pipe(fuzzy_tf_idf, # Function and messy data
column = 'Expedia', # Messy column in data
clean = df['Booking.com'], # Master data (list)
mapping_df = df, # Master data
col = 'Result') # Can be customized
end = time.time()
print('Fuzzy string matching in {} seconds'.format(end - start))
df_result.head()
基于FuzzyWuzzy的实现
def stringMatching( df: pd.DataFrame, column: str, clean: pd.Series, mapping_df: pd.DataFrame, col: str ):
categoryClean = clean.drop_duplicates().reset_index(drop = True)
categoryMessy = df[column].drop_duplicates().dropna().reset_index(drop = True).astype(str)
categoryFuzzy = {}
ratioFuzzy = {}
for i in range(len(categoryMessy)):
resultFuzzy = process.extractOne(categoryMessy[i].lower(), categoryClean)
catFuzzy = {categoryMessy[i]:resultFuzzy[0]}
ratFuzzy = {categoryMessy[i]:resultFuzzy[1]}
categoryFuzzy.update(catFuzzy)
ratioFuzzy.update(ratFuzzy)
catCol = col
ratCol = 'Ratio'
df[catCol] = df[column]
df[ratCol] = df[column]
df[catCol] = df[catCol].map(categoryFuzzy)
df[ratCol] = df[ratCol].map(ratioFuzzy)
return df
# Run the fuzzy string matching algorithm
start = time.time()
df_result = (df.pipe(stringMatching, # Function and messy data
column = 'Expedia', # Messy column in data
clean = df['Booking.com'], # Master data (list)