Build a Dictionary with Corpus Unigrams and Bigrams
(Natural Language Processing Assignment)
TL;DR: We scraped a text corpus, computed its unigrams and bigrams, and used them to build a spellchecker UI.
Purpose
This is a group assignment by a team of five data engineering students.
Our task is to obtain a corpus (link to notebook), pre-process it (link to notebook), then use its unigrams and bigrams to build a spellchecker.
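As a rough illustration of the counting step, here is a minimal sketch in Python. The function names `tokenize` and `count_ngrams` and the tokenization rule (lowercase, alphabetic words only) are assumptions made for this example; the actual pre-processing lives in the notebooks linked above and may differ.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase everything and keep alphabetic runs,
    # allowing one apostrophe inside a word (e.g. "don't").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def count_ngrams(text):
    """Return (unigram counts, bigram counts) for a piece of text."""
    tokens = tokenize(text)
    unigrams = Counter(tokens)
    # Pair each token with its successor. This simple version also pairs
    # words across sentence boundaries.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


unigrams, bigrams = count_ngrams("The cat sat. The cat ran.")
print(unigrams.most_common(2))
print(bigrams[("the", "cat")])
```

Both counters can be dumped to JSON afterwards, although bigram keys would first need to be turned into strings, since JSON object keys cannot be tuples.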
The dictionary is supposed to detect non-word errors using the unigrams. If a word does not exist in the unigram list, it is a non-word error. Suggestions are then drawn from the unigram list, limited to words within an edit distance of 2 or less.
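The non-word check can be sketched roughly as follows. This assumes plain Levenshtein distance (insert, delete, substitute) and ranks ties by corpus frequency; both choices, along with the names `edit_distance` and `suggest` and the toy counts, are assumptions for illustration rather than the exact method used in the project.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word, unigrams, max_distance=2):
    """Return suggestions for a non-word, or [] if the word is known."""
    word = word.lower()
    if word in unigrams:
        return []  # found in the unigram list, so not a non-word error
    candidates = []
    for known, count in unigrams.items():
        distance = edit_distance(word, known)
        if distance <= max_distance:
            # Sort by distance first, then by higher frequency.
            candidates.append((distance, -count, known))
    return [known for _, _, known in sorted(candidates)]


# Hypothetical unigram counts, for demonstration only.
unigrams = {"spell": 10, "spelling": 4, "shell": 7, "checker": 3}
print(suggest("spel", unigrams))  # ['spell', 'shell']
```

Scanning the whole unigram list for every misspelled word is simple but linear in the vocabulary size, which may matter for a large corpus.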
Real-word errors occur when two words both exist in the unigram list but are not used together in the corpus. When one is detected, other two-word combinations (a.k.a. bigrams) are suggested along with their probabilities.
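One way to picture the real-word check is the sketch below. It assumes a maximum-likelihood bigram probability, count(w1, w2) / count(w1), and it keeps the first word fixed while proposing replacements for the second. Those choices, the function names, and the toy counts are assumptions for this example, not necessarily what the project does.

```python
def bigram_probability(prev_word, word, unigrams, bigrams):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    prev_count = unigrams.get(prev_word, 0)
    if prev_count == 0:
        return 0.0
    return bigrams.get((prev_word, word), 0) / prev_count


def suggest_bigrams(prev_word, word, unigrams, bigrams, top_n=3):
    """Return up to top_n (second_word, probability) pairs for an unseen pair."""
    if prev_word not in unigrams or word not in unigrams:
        return []  # a non-word error, which the unigram check handles
    if bigrams.get((prev_word, word), 0) > 0:
        return []  # the pair appears in the corpus, so nothing is flagged
    scored = [
        (second, bigram_probability(prev_word, second, unigrams, bigrams))
        for (first, second) in bigrams
        if first == prev_word
    ]
    # Highest probability first; alphabetical order breaks ties.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


# Hypothetical counts, for demonstration only.
unigrams = {"a": 6, "piece": 3, "peace": 3, "of": 4, "treaty": 2}
bigrams = {("a", "piece"): 3, ("piece", "of"): 3,
           ("peace", "treaty"): 2, ("a", "peace"): 1}
print(suggest_bigrams("piece", "treaty", unigrams, bigrams))  # [('of', 1.0)]
```

A fuller version might also try replacing the first word (here, "peace treaty"), for example by combining this with an edit-distance filter on the candidates.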
The corpus text was scraped from this website.
Originally, the UI was built using the Python eel library. I wasn't sure whether I could use a web app for the assignment. It was slow, so slow that I had to use MongoDB. But now, with JavaScript, I directly load the unigrams and bigrams stored in JSON files in the project repo using gitraw.
Special thanks to my groupmates, Nicole, Pui Chyi, Yuen Neng and Lucas!