Submitted by Super-Martingale t3_y4w0sw in MachineLearning
We are trying to standardize a long list (millions of rows) of company name strings. The same company can show up in different rows because of abbreviations, nicknames, subsidiaries, business units, typos, etc., so we need a way to group rows that refer to the same company. Given the size of our data, is there a good way to do this standardization efficiently?
Below is an example in which all strings should be grouped as a single company:
JPMorgan Chase & Co.
JPMorgan Chase
JPM Chase
JPM
J.P. Morgan
The JPM Company
Global Technology at JPMorgan Chase
JPM Company
JPM Chase Bank
JPM CHASE
JP Morgan Chase
J.P. Morgan Asset Management
JPMorgan Chase Bank, N.A.
JPMorgan
JPMorganChase
J.P. Morgan Chase
JPMorgan Chase Bank
J.P. Morgan Private Bank
InstaMed, a J.P. Morgan company
J.P. Morgan Chase Bank, N.A.
JPMorgan Private Bank
JP Morgan Asset Management
Jpmorgan Chase Bank National Association
J.P. Morgan Retirement Plan Services
JPMorgan Retirement Plan Services
JPMorgan Chase & Company
JP Morgan Chase (formerly Washington Mutual)
Washington Mutual/JP Morgan Chase
J.P. Morgan Investment Bank
JPMorgan Chase (formerly WaMu)
JPMorgan Chase Commercial Banking
JP Morgan Chase & NSPCC
JP Morgan Chase / Bank One
JP Morgan & Company Real Estate Appraisers And Con
WaMu/JPMorgan Chase
JP Morgan & Chase Co. (Formerly Washington Mutual
Bank One (JP Morgan Chase)
alheimur_zh t1_ishb2cl wrote
https://github.com/dedupeio/dedupe
https://github.com/Living-with-machines/DeezyMatch
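For a feel of the general idea before reaching for either library, here is a rough stdlib-only sketch (it does not use dedupe's or DeezyMatch's APIs): normalize each name, bucket names by a cheap blocking key so you never compare all pairs across millions of rows, then fuzzy-match only within each bucket. The names `NOISE_TOKENS`, `normalize`, `block_key`, `group`, the 3-character key, and the 0.8 threshold are all assumptions made up for illustration. It will not catch bare abbreviations ("JPM") or acquired brands ("WaMu", "Bank One"); those likely need an alias table or a trained matcher like the tools linked above.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical list of tokens that carry no identity; extend for your data.
NOISE_TOKENS = {"the", "co", "company", "inc", "corp", "ltd", "llc", "na"}

def normalize(name):
    """Lowercase, drop dots/apostrophes, strip punctuation and noise tokens."""
    s = name.lower()
    s = re.sub(r"[.']", "", s)          # "J.P." -> "jp", "N.A." -> "na"
    s = re.sub(r"[^a-z0-9]+", " ", s)   # remaining punctuation -> spaces
    return " ".join(t for t in s.split() if t not in NOISE_TOKENS)

def block_key(norm):
    """Cheap blocking key: first 3 characters with spaces removed."""
    return norm.replace(" ", "")[:3]

def group(names, threshold=0.8):
    """Group names that fall in the same block and look similar enough."""
    blocks = defaultdict(list)
    for n in names:
        blocks[block_key(normalize(n))].append(n)

    groups = []
    for members in blocks.values():
        clusters = []  # (normalized representative, [original names])
        for n in members:
            norm = normalize(n)
            for rep, items in clusters:
                # Compare only against each cluster's first member.
                if SequenceMatcher(None, rep, norm).ratio() >= threshold:
                    items.append(n)
                    break
            else:
                clusters.append((norm, [n]))
        groups.extend(items for _, items in clusters)
    return groups

if __name__ == "__main__":
    sample = ["JPMorgan Chase & Co.", "JP Morgan Chase", "Acme Corp"]
    for g in group(sample):
        print(g)
```

Comparing each name only to a cluster's first member is a greedy shortcut, so results depend on input order; a real pipeline would probably score pairs within a block and cluster on those scores instead.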