Comments


afireohno t1_iwjy4d8 wrote

You could use a BK-tree.

3

Devinco001 OP t1_iwmdmgy wrote

Thanks! The BK tree is much faster. While researching BK trees, I found the SymSpell algorithm, which is faster still. So I'm going to use that for now, because of its high accuracy and speed.

2

goedel777 t1_iwjxp76 wrote

unique(incorrectly_spelled_words)

2

Devinco001 OP t1_iwjxuul wrote

Yes, I have done that. Even after dropping the duplicates, the count comes to 10M.

1

goedel777 t1_iwjxz5j wrote

Without seeing the code, it will be impossible to help here.

3

Devinco001 OP t1_iwmebpe wrote

Sure, but it's just a for loop over the words in the dictionary, using the Python library 'python-levenshtein' to calculate the distance between each dictionary word and the misspelled word.
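A sketch of that loop, with the library call replaced by a hand-rolled Levenshtein so the snippet is self-contained (python-levenshtein computes the same metric in C, just faster):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[-1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest(word: str, dictionary: list) -> str:
    # One full dictionary scan per misspelled word -- this O(dictionary)
    # cost per lookup is the bottleneck the thread is trying to remove.
    return min(dictionary, key=lambda d: levenshtein(word, d))
```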

For now, I am skipping Levenshtein in favour of a faster approximate approach, the SymSpell algorithm. It is highly accurate and much faster; it reduced computation time from 21 days to 13 hours.
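The speedup comes from SymSpell's symmetric-delete trick: precompute every dictionary term's delete variants once, then find candidates for a misspelling through its own deletes instead of scanning the whole dictionary. A toy sketch of the idea (not the real symspellpy API; a full implementation would then verify candidates with an edit-distance check and rank by word frequency):

```python
from itertools import combinations

def deletes(word, max_deletes=1):
    """All strings reachable from `word` by removing up to max_deletes chars."""
    out = {word}
    for k in range(1, max_deletes + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in keep))
    return out

def build_index(dictionary, max_deletes=1):
    # Built once, up front: delete-variant -> set of original terms.
    index = {}
    for term in dictionary:
        for d in deletes(term, max_deletes):
            index.setdefault(d, set()).add(term)
    return index

def lookup(word, index, max_deletes=1):
    """Candidate corrections sharing a delete variant with `word`."""
    candidates = set()
    for d in deletes(word, max_deletes):
        candidates |= index.get(d, set())
    return candidates
```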

0

threehofive t1_iwjy7i0 wrote

Can you run stuff in parallel? Are you using the cloud? Run concurrent processes somehow, or if you already are, go even wider. No reason you can't divide this into 10 separate 1-million-word processes running at the same time on different machines. Not sure if you are doing that yet or not.
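On a single machine, the fan-out idea can be sketched with the standard library: split the word list into chunks and correct each chunk in a separate process. Here `correct` is a hypothetical stand-in for whatever per-word spellcheck is already in place:

```python
from multiprocessing import Pool

def correct(word: str) -> str:
    # Placeholder: real code would do the dictionary lookup here.
    return word.lower()

def correct_all(words, processes=4, chunksize=1000):
    # On platforms that spawn (Windows/macOS), call this from under
    # an `if __name__ == "__main__":` guard.
    with Pool(processes) as pool:
        # imap streams results, keeping memory bounded for huge lists.
        return list(pool.imap(correct, words, chunksize=chunksize))
```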

2

Devinco001 OP t1_iwmez9k wrote

Thanks, I am using a different algorithm now, SymSpell, but I haven't used multiprocessing so far. A really good idea; it would speed things up several times.

1

sameersoi t1_iwjye93 wrote

First thing that popped into my mind was using a tree structure. A quick Google search (“suffix tree spellcheck”) led to the helpful article below, which proposes the use of BK-trees (a tree that essentially sorts child nodes by Levenshtein edit distance). Crucially, when doing a lookup, “Time Complexity will be O(L1*L2*log n), here L1 is the average length of word in our dictionary and L2 is the length of misspelled [word].”

https://www.geeksforgeeks.org/bk-tree-introduction-implementation/amp/
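A compact BK-tree along the lines of the linked article (a sketch, not the article's exact code): each child edge is labelled with its edit distance to the parent, and a lookup only descends into edges whose label lies within the tolerance of the query's distance to the current node; the triangle inequality guarantees no match is missed.

```python
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    def __init__(self, words=()):
        self.root = None          # (word, {distance: subtree})
        for w in words:
            self.add(w)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:            # already present
                return
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, word, tolerance):
        """All (distance, word) pairs within `tolerance` edits."""
        if self.root is None:
            return []
        out, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = levenshtein(word, w)
            if d <= tolerance:
                out.append((d, w))
            # Prune: only edges labelled within d +/- tolerance can
            # lead to a match (triangle inequality).
            for label, child in children.items():
                if d - tolerance <= label <= d + tolerance:
                    stack.append(child)
        return sorted(out)
```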

2

Devinco001 OP t1_iwmvton wrote

Yeah, I googled BK trees. The data structure is amazing and saves a lot of computation. While searching, I found another algorithm, SymSpell. This is even faster with high accuracy, but doesn't use Levenshtein, so I'm currently going to use that.

But BK trees are the go-to method when using pure Levenshtein and similar, more accurate, string distances. So I will be comparing the accuracy of both algorithms to choose the better one. Thanks!

2

[deleted] t1_iwk2vop wrote

[deleted]

1

Devinco001 OP t1_iwmv475 wrote

Yeah, I looked at some LMs on Hugging Face for filling the mask. They looked good, but required significant memory and computational resources to train.

That approach is the best, but due to resource constraints, I might have to fall back on simpler algorithms. Currently, I am going to use SymSpell. It is surprisingly accurate and fast.

But I will keep looking for a less resource-hungry LM, since lookup time is low and they capture context and grammar better. Combining Levenshtein with the model's output would increase accuracy several times over. Ultimately, I will be shifting to that, thanks.

1