Submitted by Super-Martingale t3_y4w0sw in MachineLearning

We are trying to standardize a long list (millions of rows) of company name strings. The same company can show up in different rows because of abbreviations, nicknames, subsidiaries, business units, typos, etc., so we need a way to group rows that refer to the same company. Given the size of our data, is there a good way to do this standardization efficiently?

Below is an example in which all strings should be grouped as a single company:

JPMorgan Chase & Co.
JPMorgan Chase
JPM Chase
JPM
J.P. Morgan
The JPM Company
Global Technology at JPMorgan Chase
JPM Company
JPM Chase Bank
JPM CHASE
JP Morgan Chase
J.P. Morgan Asset Management
JPMorgan Chase Bank, N.A.
JPMorgan
JPMorganChase
J.P. Morgan Chase
JPMorgan Chase Bank
J.P. Morgan Private Bank
InstaMed, a J.P. Morgan company
J.P. Morgan Chase Bank, N.A.
JPMorgan Private Bank
JP Morgan Asset Management
Jpmorgan Chase Bank National Association
J.P. Morgan Retirement Plan Services
JPMorgan Retirement Plan Services
JPMorgan Chase & Company
JP Morgan Chase (formerly Washington Mutual)
Washington Mutual/JP Morgan Chase
J.P. Morgan Investment Bank
JPMorgan Chase (formerly WaMu)
JPMorgan Chase Commercial Banking
JP Morgan Chase & NSPCC
JP Morgan Chase / Bank One
JP Morgan & Company Real Estate Appraisers And Con
WaMu/JPMorgan Chase
JP Morgan & Chase Co. (Formerly Washington Mutual
Bank One (JP Morgan Chase)

7

Comments

hjmb t1_isg9v1h wrote

Fuzzy matching will help with the typos, but in our experience the nickname mappings had to be crafted by hand.

If your jurisdiction(s) have accessible company records then you could match on those names to determine which rows are official names. This solves half your problem, as you then just need to match the remaining rows to an accepted official name.

You could also modify Levenshtein distance so that dropping characters is free in an attempt to match full names with shorter names, but this will be computationally expensive.
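A minimal sketch of that modified distance, assuming "dropping characters is free" means that deleting characters from the longer name costs nothing while insertions and substitutions still cost 1. The function name `abbreviation_distance` is made up for illustration.

```python
def abbreviation_distance(full: str, short: str) -> int:
    """Cost of turning `full` into `short` when deleting a character
    of `full` is free; insertions and substitutions cost 1 each."""
    full, short = full.lower(), short.lower()
    # prev[j] = cost of turning the prefix of `full` seen so far into short[:j].
    # With nothing seen yet, short[:j] needs j insertions.
    prev = list(range(len(short) + 1))
    for ch in full:
        # Reaching the empty string stays free: just delete `ch`.
        cur = [prev[0]]
        for j, target in enumerate(short, start=1):
            cur.append(min(
                prev[j],                       # delete `ch` from `full` (free)
                cur[j - 1] + 1,                # insert `target`
                prev[j - 1] + (ch != target),  # match (0) or substitute (1)
            ))
        prev = cur
    return prev[-1]


print(abbreviation_distance("JPMorgan Chase", "JPM Chase"))  # 0
print(abbreviation_distance("JPM", "JPMorgan"))              # 5
```

The measure is asymmetric, and each comparison costs O(len(full) * len(short)), so running it over all pairs of millions of rows is where the expense comes from. A distance of 0 only says the short name is a subsequence of the long one, so it would likely need to be combined with other signals.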

4

Super-Martingale OP t1_isgacv9 wrote

In the past, I did fuzzy matching plus a manual selection for smaller lists like a few thousand strings. But for millions of rows, this is just impossible. So we are wondering whether AI-based approaches can help.

2

hjmb t1_isgaxow wrote

I would be wary - AI approaches tend to give you plausible answers, not true answers. Also, it may be worth updating your post to make it clear that you're looking for AI solutions to your problem, rather than looking for data cleaning advice for a dataset that you are going to feed into a machine learning system (which is what I inferred).

1

Super-Martingale OP t1_isgey5g wrote

There is definitely a tradeoff between accuracy and efficiency. We are not sure which approach would be better, so we want to keep the discussion broad.

1

Null-value0 t1_ishp3pi wrote

This is a classic MDM (master data management) problem that comes up when building a CRM system. If it’s US only, you can start with D&B hierarchies, especially if the companies are public. For many small businesses or startups, D&B records don’t exist. There are tools like Informatica / TIBCO that play in this space.

4

Super-Martingale OP t1_ishqkx4 wrote

Yes, most companies in our list are in the US, but the majority of them are privately held companies. Do tools like Informatica / TIBCO charge tons of money?

1

CremeEmotional6561 t1_ishxulj wrote

The same problem exists with music artists. One solution is not to repeat the work that has already been done by others.

  1. Scrape https://www.chartsurfer.de/archiv/artist-a.html for a list of all artist names.

  2. Scrape https://www.discogs.com/de/artist/28795-Prince for a list of alias names for each artist.

  3. Just use simple Levenshtein distance for spelling errors and prompt the user if in doubt.
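Applied to company names, the three steps above could be sketched roughly as follows. The `ALIASES` table stands in for whatever alias list you manage to scrape or license, and its entries, the `resolve` function, and the two thresholds are all hypothetical choices for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Hypothetical alias table: lowercased alias -> canonical name.
ALIASES = {
    "jpmorgan chase & co.": "JPMorgan Chase & Co.",
    "jpmorgan chase": "JPMorgan Chase & Co.",
    "jp morgan chase": "JPMorgan Chase & Co.",
    "j.p. morgan": "JPMorgan Chase & Co.",
    "washington mutual": "Washington Mutual",
}


def resolve(name: str, accept: int = 1, review: int = 3):
    """Return (canonical_name_or_None, status) for one raw name."""
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key], "exact"
    # Linear scan over all aliases: fine for a sketch, too slow for millions.
    best = min(ALIASES, key=lambda alias: levenshtein(key, alias))
    dist = levenshtein(key, best)
    if dist <= accept:
        return ALIASES[best], "auto"      # likely a simple typo
    if dist <= review:
        return ALIASES[best], "ask_user"  # in doubt: send to a human
    return None, "unknown"


print(resolve("JPMorgan Chas"))   # ('JPMorgan Chase & Co.', 'auto')
print(resolve("JPMorgan Chsae"))  # ('JPMorgan Chase & Co.', 'ask_user')
```

At real scale the scan over every alias would probably have to be replaced by some form of blocking or indexing, so that each name is only compared against a small candidate set.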

3

tmarkovich t1_ishy19m wrote

It's worth noting that many of these are different companies, or subdivisions of JPM. Depending on the application, that matters a lot.

3

hellrail t1_isjxlxf wrote

You need to find a method to turn these names into feature vectors such that similar names cluster together naturally in feature space. Start with standard string similarities to get the feature vector. If that does not produce sufficiently unambiguous clusters, proceed to lemmatization methods, and if that is still not sufficient, try some pretrained models to generate the feature encoding.
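As a rough sketch of the first stage, here is one possible string-similarity feature (character trigram counts compared by cosine similarity) with a simple greedy clustering pass. The normalization, the 0.5 threshold, and all function names are assumptions for illustration, not a recommendation of specific settings.

```python
from collections import Counter
from math import sqrt


def trigram_vector(name: str) -> Counter:
    """Character trigram counts of the lowercased, alphanumeric-only name."""
    text = "".join(ch for ch in name.lower() if ch.isalnum())
    padded = f"  {text} "  # padding so that word boundaries become features
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def leader_clusters(names, threshold: float = 0.5):
    """Greedy one-pass clustering: each name joins the first cluster whose
    leader (first member) is similar enough, else it starts a new cluster."""
    clusters = []  # list of (leader_vector, member_names)
    for name in names:
        vec = trigram_vector(name)
        for leader_vec, members in clusters:
            if cosine(vec, leader_vec) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


names = ["JPMorgan Chase", "JP Morgan Chase", "J.P. Morgan Chase",
         "JPMorgan Chase Bank", "Goldman Sachs"]
for group in leader_clusters(names):
    print(group)
```

Short abbreviations such as JPM share few trigrams with the full name, so they may not land in the same cluster under this feature alone. That is where the later stages (lemmatization, pretrained encoders) or a separate alias table would have to take over.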

2

Alluvium t1_isj7tk3 wrote

So no. You can’t. This problem is solvable by having the correct data entered into your system.

You are lucky since they are all in the US, so mostly English, I assume…

But the reality is that because you have a mix of different businesses under the same name (like you have listed here) the only person who can untangle this is the owner of the company which is named…

There are some cool ML tips you can find to reduce the complexity of this problem - but you will not ever solve it at the output stage -

This must be solved at input.

1

visarga t1_isozvm1 wrote

<offtopic> Where do I get a large-ish list of company names? Also, product names. </>

1

manimino t1_ish0q75 wrote

TF-IDF on letters. Not a lot of companies will have capital J and capital P in the name.
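For anyone who wants to try this, a minimal pure-Python sketch of letter-level TF-IDF follows. It assumes case-sensitive letters as the terms and a smoothed IDF formula; the tiny corpus and all function names are made up for illustration.

```python
from collections import Counter
from math import log, sqrt


def letter_counts(name: str) -> Counter:
    # Case-sensitive on purpose, so "J" and "j" are different terms.
    return Counter(ch for ch in name if ch.isalpha())


def fit_idf(corpus):
    """Smoothed IDF per letter (an assumed formula): rarer letters weigh more."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for name in corpus:
        doc_freq.update(set(letter_counts(name)))
    return {ch: log((1 + n_docs) / (1 + df)) + 1 for ch, df in doc_freq.items()}


def tfidf_vector(name: str, idf: dict) -> dict:
    # Letters never seen in the corpus are ignored in this sketch.
    return {ch: tf * idf[ch] for ch, tf in letter_counts(name).items() if ch in idf}


def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(ch, 0.0) for ch, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


corpus = ["JPMorgan Chase", "JP Morgan", "Goldman Sachs", "Wells Fargo"]
idf = fit_idf(corpus)
jpmc = tfidf_vector("JPMorgan Chase", idf)
print(cosine(jpmc, tfidf_vector("JP Morgan", idf)))
print(cosine(jpmc, tfidf_vector("Wells Fargo", idf)))
```

Letter counts throw away character order, so any two names made of the same letters get identical vectors. Character n-grams longer than one letter keep some of that order at the cost of a larger vocabulary.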

0

gambs t1_isixp6u wrote

Gonna be honest, that sounds like a horrible idea

4