Using MATLAB
In linguistics, stemming is the process of reducing inflected words to their word stem, base, or root form. In this assignment, you are to write a simple word stemmer for English. The input is a string, text, that may contain punctuation or other non-alphabetical characters. Your program should stem the words in the text and return these words as a cell array.
Here are the steps your program should perform to derive and filter the word stems:
Convert any upper case letter to lower case.
Replace each character that is neither alphabetical nor a space with a space character. e.g., “My 1st NLP program!!!” should become: “my st nlp program ”
Extract the words from the string. e.g., “my st nlp program ” will result in the list: “my”, “st”, “nlp”, and “program”.
Strip the following suffixes from the words that have them: -ly, -ed, -ing, -es, -s. Each suffix should be considered once and in that order (first strip -ly, then strip -ed, then strip -ing, etc.). e.g., the word “excitedly” turns into “excit”; the word “feeding” turns into “feed”.
Remove any word from the list that is 2 characters or fewer.
Remove the following common words from the list: the, and, that, have, for, not
Note that the stemming strategies used in this program are over-simplistic and may not give sensible results.
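The suffix-stripping step is the easiest one to get wrong: a single alternation pattern such as (ly|ed|ing|es|s)$ removes at most one suffix per word, so “excitedly” would stop at “excited”. A minimal sketch of stripping the suffixes one at a time, in the listed order, assuming the words are already lower case and held in a cell array (the variable names words and suffixes are illustrative, not part of the assignment):

```matlab
% Illustrative sketch: strip each suffix once, in the order the assignment lists them.
words = {'excitedly', 'feeding', 'exhausts'};
suffixes = {'ly', 'ed', 'ing', 'es', 's'};
for s = 1:numel(suffixes)
    % '$' anchors the match to the end of each word;
    % regexprep applies the replacement to every element of the cell array.
    words = regexprep(words, [suffixes{s} '$'], '');
end
disp(words)
```

After the loop, words holds 'excit', 'feed', and 'exhaust'. Looping over the suffixes rather than over the words keeps the required order explicit.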
>> simplestemmer( 'Learning never exhausts the mind.' )
ans = { 'learn' 'never' 'exhaust' 'mind' }
>> simplestemmer( 'Simplicity is the ultimate sophistication.' )
ans = { 'simplicity' 'ultimate' 'sophistication' }
Expert Answer
% simplestemmer.m
function z1 = simplestemmer(x1)
% Return the filtered word stems of the text x1 as a cell array.
% Convert to lower case.
x1 = lower(x1);
% Replace every character that is neither a letter nor a space with a space.
x1(~((x1 >= 'a' & x1 <= 'z') | x1 == ' ')) = ' ';
% Split the text into words on whitespace.
y1 = strsplit(x1);
% Strip each suffix once, in the required order.
suffixes = {'ly', 'ed', 'ing', 'es', 's'};
for k = 1:numel(suffixes)
    y1 = regexprep(y1, [suffixes{k} '$'], '');
end
% Drop words of 2 characters or fewer (including empty strings left by the split).
y1 = y1(cellfun(@numel, y1) > 2);
% Drop the common words, matching whole words only.
z1 = y1(~ismember(y1, {'the', 'and', 'that', 'have', 'for', 'not'}));
end
% main.m
simplestemmer('Simplicity is the ultimate sophistication')