Submitted by hasiemasie t3_10qv7r7 in MachineLearning

I’m looking to do some aggregation on academic research and news articles to see what insights I can get from them. I’m using TextRazor to do named entity recognition on the documents, but I’m getting a lot of dirty labels with slightly different wording, for example "Tesla", "Tesla ltd", "Tesla Ltd". As a result, my aggregations contain a lot of duplicates.

The dataset consists of about 4M labels, so the solution has to be efficient to be viable. I was thinking of putting the labels through word2vec and then clustering them based on the word-embedding distances, but then the question becomes how many clusters to use.
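Roughly what I had in mind, as an untested sketch (the pretrained model, the vector averaging, and the distance threshold are all placeholder choices I haven't validated):

```python
import numpy as np
import gensim.downloader as api
from sklearn.cluster import AgglomerativeClustering

labels = ["Tesla", "Tesla ltd", "Tesla Ltd", "Apple", "Apple Inc"]

# pretrained word2vec vectors (placeholder model choice)
w2v = api.load("word2vec-google-news-300")

def embed(label):
    # average the vectors of the in-vocabulary words; zero vector if none match
    vecs = [w2v[w] for w in label.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.stack([embed(l) for l in labels])

# n_clusters=None plus a distance cutoff sidesteps choosing k up front
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.4, metric="cosine", linkage="average"
)
print(dict(zip(labels, clustering.fit_predict(X))))
```

Agglomerative clustering is quadratic in the number of points though, so at 4M labels it would only work on blocked subsets, which is part of my efficiency worry.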

I’ve also tried simple regex preprocessing to strip the company abbreviations, but there are other cases that can’t be solved that easily.
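For reference, the kind of preprocessing I mean (the suffix list is just illustrative):

```python
import re

# fold case and strip trailing company abbreviations (illustrative list)
SUFFIX = re.compile(r"[\s,]*\b(ltd|inc|corp|co|llc|plc|gmbh)\.?\s*$", re.IGNORECASE)

def normalize(label: str) -> str:
    return SUFFIX.sub("", label.strip()).casefold()

assert normalize("Tesla Ltd") == normalize("Tesla ltd.") == "tesla"
```

That collapses the "Tesla Ltd" family, but misspellings and genuinely different surface forms still slip through.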

1

Comments


wind_dude t1_j6sj0ix wrote

I solved a similar issue by building a knowledge graph. It took some manual curation and starting from a good base, but misspellings and alternate forms were flagged automatically by comparing vectors. The suggester runs as a batch over new entities after my ETL batch finishes.
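The suggestion step is roughly this shape (toy version: stdlib string similarity stands in for the vector comparison, and the canonical set is made up):

```python
from difflib import get_close_matches

# curated canonical entities from the knowledge graph (made-up sample)
canonical = ["Tesla", "Apple", "Microsoft"]

def suggest(entity, n=3, cutoff=0.7):
    # near-matches are queued for manual review, not auto-merged
    return get_close_matches(entity, canonical, n=n, cutoff=cutoff)

for new_entity in ["Telsa", "Tesla Ltd", "Nvidia"]:
    print(new_entity, "->", suggest(new_entity))
```

Anything with no close match (like "Nvidia" above) gets treated as a genuinely new entity.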

3

Blutorangensaft t1_j6uiygz wrote

Disclaimer: no help, more a request

Once you're done with this project, would you mind sharing your speed and accuracy? I'm kind of on the lookout for a good English NER model; the problem is that spaCy has some issues with casing.

1

sad_potato00 t1_j6w92uy wrote

So we had a similar problem, where building names were written in different ways (some abbreviated, some the full name, some the full name plus the building type). Something that worked for me was running the names through sentence-BERT and doing a cosine similarity; deciding on a cutoff value was easier than deciding how many clusters to use. Sadly, manual labelling and checking is still needed.
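A minimal version of what that looked like (the model name and the 0.85 cutoff are stand-ins; tune the cutoff on a small labelled sample):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
labels = ["Main Library", "Main Library Bldg", "main library building", "Physics Hall"]

# normalized embeddings make cosine similarity a plain dot product
emb = model.encode(labels, convert_to_tensor=True, normalize_embeddings=True)
sim = util.cos_sim(emb, emb)

CUTOFF = 0.85  # pairs above the cutoff become merge candidates for review
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        if sim[i][j].item() >= CUTOFF:
            print(f"{labels[i]!r} ~ {labels[j]!r} ({sim[i][j].item():.2f})")
```

The full pairwise matrix won't fit at millions of labels, so in practice you block on a cheap key first or use approximate nearest neighbours.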

1