Comments


afireohno t1_iwjy4d8 wrote

You could use a BK-tree.

3

Devinco001 OP t1_iwmdmgy wrote

Thanks! The BK tree is much faster. While researching BK trees, I found the SymSpell algorithm, which is faster still. So I'm going to use that for now, because of its high accuracy and speed.

2

goedel777 t1_iwjxp76 wrote

unique(incorrectly_spelled_words)

2

Devinco001 OP t1_iwjxuul wrote

Yes, I have done that. Even after dropping the duplicates, the count comes to 10M.

1

goedel777 t1_iwjxz5j wrote

Without seeing the code, it will be impossible to help here.

3

Devinco001 OP t1_iwmebpe wrote

Sure, but it's just a for loop over the words in the dictionary, using the Python library 'python-levenshtein' to calculate the distance between each dictionary word and the misspelled word.
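A sketch of that loop, with the library call replaced by a hand-rolled Levenshtein so the snippet is self-contained (python-levenshtein computes the same metric in C, just faster):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[-1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest(word: str, dictionary: list) -> str:
    # One full dictionary scan per misspelled word -- this O(dictionary)
    # cost per lookup is the bottleneck the thread is trying to remove.
    return min(dictionary, key=lambda d: levenshtein(word, d))
```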

For now, I am skipping Levenshtein in favour of a faster approximate approach, the SymSpell algorithm. It is highly accurate and much faster; it reduced computation time from 21 days to 13 hours.
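The speedup comes from SymSpell's symmetric-delete trick: precompute every dictionary term's delete variants once, then find candidates for a misspelling through its own deletes instead of scanning the whole dictionary. A toy sketch of the idea (not the real symspellpy API; a full implementation would then verify candidates with an edit-distance check and rank by word frequency):

```python
from itertools import combinations

def deletes(word, max_deletes=1):
    """All strings reachable from `word` by removing up to max_deletes chars."""
    out = {word}
    for k in range(1, max_deletes + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in keep))
    return out

def build_index(dictionary, max_deletes=1):
    # Built once, up front: delete-variant -> set of original terms.
    index = {}
    for term in dictionary:
        for d in deletes(term, max_deletes):
            index.setdefault(d, set()).add(term)
    return index

def lookup(word, index, max_deletes=1):
    """Candidate corrections sharing a delete variant with `word`."""
    candidates = set()
    for d in deletes(word, max_deletes):
        candidates |= index.get(d, set())
    return candidates
```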

0

threehofive t1_iwjy7i0 wrote

Can you run stuff in parallel? Are you using the cloud? Run concurrent processes somehow, or if you already are, go even wider. No reason you can't divide this into 10 separate 1-million-word processes running at the same time on different machines. Not sure if you are doing that yet or not.
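On a single machine, the fan-out idea can be sketched with the standard library: split the word list into chunks and correct each chunk in a separate process. Here `correct` is a hypothetical stand-in for whatever per-word spellcheck is already in place:

```python
from multiprocessing import Pool

def correct(word: str) -> str:
    # Placeholder: real code would do the dictionary lookup here.
    return word.lower()

def correct_all(words, processes=4, chunksize=1000):
    # On platforms that spawn (Windows/macOS), call this from under
    # an `if __name__ == "__main__":` guard.
    with Pool(processes) as pool:
        # imap streams results, keeping memory bounded for huge lists.
        return list(pool.imap(correct, words, chunksize=chunksize))
```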

2

Devinco001 OP t1_iwmez9k wrote

Thanks, I am using a different algorithm now, SymSpell, but I haven't used multiprocessing so far. A really good idea; it would speed things up several times.

1

sameersoi t1_iwjye93 wrote

First thing that popped into my mind was using a tree structure. A quick Google search (“suffix tree spellcheck”) led to the helpful article below, which proposes the use of BK-trees (a tree that essentially sorts child nodes by Levenshtein edit distance). Crucially, when doing a lookup, “Time Complexity will be O(L1*L2*log n), here L1 is the average length of word in our dictionary and L2 is the length of misspelled [word].”

https://www.geeksforgeeks.org/bk-tree-introduction-implementation/amp/
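A compact BK-tree along the lines of the linked article (a sketch, not the article's exact code): each child edge is labelled with its edit distance to the parent, and a lookup only descends into edges whose label lies within the tolerance of the query's distance to the current node; the triangle inequality guarantees no match is missed.

```python
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    def __init__(self, words=()):
        self.root = None          # (word, {distance: subtree})
        for w in words:
            self.add(w)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:            # already present
                return
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, word, tolerance):
        """All (distance, word) pairs within `tolerance` edits."""
        if self.root is None:
            return []
        out, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = levenshtein(word, w)
            if d <= tolerance:
                out.append((d, w))
            # Prune: only edges labelled within d +/- tolerance can
            # lead to a match (triangle inequality).
            for label, child in children.items():
                if d - tolerance <= label <= d + tolerance:
                    stack.append(child)
        return sorted(out)
```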

2

Devinco001 OP t1_iwmvton wrote

Yeah, I googled BK trees. The data structure is amazing and saves a lot of computation. While searching, I found another algorithm, SymSpell. This is even faster with high accuracy, but doesn't use Levenshtein, so I'm currently going to use that.

But BK trees are the go-to method when using pure Levenshtein and similar, more accurate, string distances. So I will be comparing the accuracy of both algorithms to choose the better one. Thanks!

2

[deleted] t1_iwk2vop wrote

[deleted]

1

Devinco001 OP t1_iwmv475 wrote

Yeah, I looked at some LMs on Hugging Face for filling the mask. They looked good, but required significant memory and computational resources to train.

That approach is the best, but due to resource constraints, I might have to fall back on simpler algorithms. Currently, I am going to use SymSpell. It is surprisingly accurate and fast.

But I will keep looking for a less resource-hungry LM, since lookup time is low and they capture context and grammar better. Combining Levenshtein with the model's output would increase accuracy several times over. Ultimately, I will be shifting to that, thanks.

1