
hjmb t1_isg9v1h wrote

Fuzzy matching will help with the typos, but in my experience the nicknames had to be handled by hand.

If your jurisdiction(s) have accessible company records, then you could match against those to determine which rows are official names. That solves half the problem: you then only need to match each remaining row to an accepted official name.
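A minimal sketch of that first pass, assuming the register is simply available as a list of official names (the function name and the normalization choices are illustrative, not from the thread):

```python
def split_official(rows, register):
    """Split rows into confirmed official names vs. everything else,
    by exact (case/whitespace-insensitive) match against a company register."""
    official_set = {name.strip().lower() for name in register}
    official, remaining = [], []
    for row in rows:
        if row.strip().lower() in official_set:
            official.append(row)
        else:
            remaining.append(row)
    return official, remaining
```

Anything landing in `remaining` is then a nickname, abbreviation, or typo to be matched against `official`.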

You could also modify Levenshtein distance so that dropping characters is free in an attempt to match full names with shorter names, but this will be computationally expensive.


Super-Martingale OP t1_isgacv9 wrote

In the past, I did fuzzy matching plus manual review for smaller lists of a few thousand strings, but for millions of rows that is infeasible. So we are wondering whether AI-based approaches can help.


hjmb t1_isgaxow wrote

I would be wary - AI approaches tend to give you plausible answers, not true answers. Also, it may be worth updating your post to make it clear that you're looking for AI solutions to the matching problem itself, rather than data-cleaning advice for a dataset that you are going to feed into a machine learning system (which is what I had inferred).


Super-Martingale OP t1_isgey5g wrote

There is definitely a tradeoff between accuracy and efficiency. We are not sure which approach would be better, so we want to keep the discussion broad.
