IN - N-gram Matching
This method measures how similar two words are by counting the n-grams they share, and can be enhanced through the use of clustering techniques. Because a small spelling difference changes only a few of a word's n-grams, it is relatively robust to minor spelling errors. The steps are as follows:
- For each word, expand it to its set of n-grams (e.g., bigrams).
- Compute a word-word similarity matrix, using Dice's coefficient to score each word pair: S = 2C / (A + B), where A and B are the numbers of unique n-grams in the two words and C is the number of unique n-grams they share.
- Form a single-link clustering of words using the matrix.
- Identify the stem for each cluster of words that share the same prefix.
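
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names (bigrams, dice, single_link_clusters, stem_of) and the 0.6 similarity threshold are assumptions made for the example, single-link clustering is cut at that one fixed threshold (which amounts to taking connected components of the "similar enough" graph), and the stem is assumed to be the longest prefix common to every word in a cluster.

```python
from itertools import combinations
from os.path import commonprefix


def bigrams(word):
    """Return the set of unique adjacent character pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice's coefficient 2C / (A + B) over two sets of unique n-grams."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def single_link_clusters(words, threshold=0.6):
    """Group words so that any pair scoring >= threshold is linked.

    Cutting a single-link clustering at one threshold is the same as
    finding connected components, done here with union-find.
    The threshold value is an assumed, illustrative choice.
    """
    grams = {w: bigrams(w) for w in words}
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    # Pairwise scores stand in for the word-word similarity matrix.
    for u, v in combinations(words, 2):
        if dice(grams[u], grams[v]) >= threshold:
            parent[find(u)] = find(v)

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return list(clusters.values())


def stem_of(cluster):
    """Assumed stem rule: the longest prefix shared by all words."""
    return commonprefix(cluster)


if __name__ == "__main__":
    vocab = ["statistics", "statistic", "statistical", "computer", "computing"]
    for cluster in single_link_clusters(vocab):
        print(stem_of(cluster), cluster)
```

For example, "statistics" has 7 unique bigrams and "statistical" has 8, of which 6 are shared, so their Dice score is 2 * 6 / (7 + 8) = 0.8. A lower threshold merges more words into each cluster, while a higher one splits them apart, so the value would need tuning on real data.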