This is the test class code that I need help with:
@Test
public void testTask3example()
{
MarkovModel model = new MarkovModel(2,”aabcabaacaac”);
ModelMatcher match = new ModelMatcher(model,”aabbcaac”);
assertEquals(0,1); //TODO replace with test code
}
}
The details for this task and code are below:
Task 3 – Matching test strings to a model (ModelMatcher)
Link to task 1: https://www.chegg.com/homework-help/questions-and-answers/task-1-analysing-n-grams-sample-text-ngramanalyser-task-need-complete-ngramanalyser-class–q22251885
Link to task 2: https://www.chegg.com/homework-help/questions-and-answers/task-2-building-markov-model-text-sample-markovmodel-task-need-complete-markovmodel-class–q22255515
For this task, you will need to complete the ModelMatcher class, which determines which of two Markov models better matches a test string. Given two Markov models built from text samples taken from two different sources, we can use them to estimate which source it is more likely a test string was drawn from. Even using only zero-th order models constructed using English and Russian text, we should be able to tell, for instance, that the string
“Борщ безвкусен”
is more likely to be Russian than English.
We will computer a measure of fit of test strings against models, called the likelihood of a sequence under the model. The likelihood of a sequence under a kk-th order Markov model is calculated as follows:
For each symbol c in the sequence, compute the probability of observing c under the model, given its k-letter context p (assuming the sequence to “wrap around”, as described above), using the Laplace-smoothed estimate of probability we used for the MarkovModel class.
Compute the likelihood of the entire sequence as the product of the likelihoods of each character.
In mathematical notation:
Let ss be an input sequence of length nn, and let MM be a kk-th order Markov model. In order to calculate the likelihood of the sequence ss under the model MM, for each symbol cici in ss (where 1≤i≤n1≤i≤n), let pipi be the kk-length context of the symbol cici (assuming wraparound). The likelihood of the sequence ss under the model is
n∏i=1laplace(ci)∏i=1nlaplace(ci)
where laplacelaplace is the Laplace-smoothed probability of cici occurring (given its context) as described in the previous task.
The probability we obtain from this calculation may be very small – in fact, potentially so small as to be indistinguishable from zero when using Java’s built-in floating-point arithmetic. Therefore, we will calculate and express the likelihood using log probabilities, which do not suffer from this problem. (A weakness of log probabilities is that they cannot straightforwardly represent probabilities of zero, but our use of Laplace smoothing allows us to avoid this problem.) The product of two probabilities pp and qq can be calculated by adding their log probabilities, logplogp and logqlogq.
By way of example, suppose we have constructed a 2nd-order Markov model using the input string “aabcabaacaac”, as described in the example for Task 2. If we are then given a test string “aabbcaac”, we can compute its log likelihood as follows:
For each character in the test string, obtain its length-2 context (assuming wraparound). Note that the tri-gram “caa” occurs twice in the (wrapped) test string.
Context | Character | Frequency |
---|---|---|
“aa” | ‘b’ | 1 |
“ab” | ‘b’ | 1 |
“bb” | ‘c’ | 1 |
“bc” | ‘a’ | 1 |
“ca” | ‘a’ | 2 |
“aa” | ‘c’ | 1 |
“ac” | ‘a’ | 1 |
For each character in the test string, calculate its Laplace-smoothed probability of occurring (given its context), and then take the log of this.
Context | Character | Laplace estimate | log10log10 of laplace estimate |
---|---|---|---|
“ac” | ‘a’ | 2+12+32+12+3 | -0.2218 |
“aa” | ‘c’ | 2+13+32+13+3 | -0.3010 |
“bc” | ‘a’ | 1+11+31+11+3 | -0.3010 |
“aa” | ‘b’ | 1+13+31+13+3 | -0.4771 |
“bb” | ‘c’ | 0+10+30+10+3 | -0.4771 |
“ca” | ‘a'(2x) | 2+13+32+13+3 | -0.6021 |
“ab” | ‘b’ | 0+12+30+12+3 | -0.6990 |
Then
total log likelihood = sum of log probabilities = -3.0792, and
average log likelihood = sum of log probabilities = -0.3849
If we have two Markov models, and one test string, then we can select the best model as being the one that maximizes the likelihood of the string under the model – i.e. we choose the model with the higher average log likelihood. Notice that the example above lists ngrams from that most in agreement with the model (highest log likelihood) to the least likely. This helps us to explain why a particular test string matches a model or not.
You will need to write code in the ModelMatcher class that takes as input one Markov model and a “test string” for which we want to identify how well it matches the given model. Complete the ModelMatcher class as follows. You may wish to complete sub-task (c) before sub-task (b).
Fill in the @author and @version fields in the header comments with all authors’ names, student numbers and the date.
Complete the constructor, which calculates and sets the fields in the class. Getter methods for these fields are already complete.
Complete the totalLogLikelihood method, a private helper method which sums a map containing log likelihoods for strings, and the averageLogLikelihood helper method, which calculates the average.
Complete the toString method for the ModelMatcher class. This method should give a summary of the results. You may choose how to arrange the string, but it should include at least the name, averageLogLikelihood and a table of the loglikelihood probabilities as shown in the example above.
Ensure you complete Javadoc comments for each method and class including @param, @returns and @throws fields as required.
import java.util.HashMap;
import java.util.Collection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
public class ModelMatcher
{
/** log likelihoods for a teststring under a given model */
private HashMap<String,Double> logLikelihoodMap;
/** summary statistic for this setting */
private double averageLogLikelihood;
/**
* Constructor to initialise the fields for the log likelihood map for
* a test string and a given Markov model and
* the average log likelihood summary statistic
* @param MarkovModel model a given Markov model object
* @param String teststring
*/
public ModelMatcher(MarkovModel model, String testString) {
int modelOrder = model.getK();
this.logLikelihoodMap = new HashMap<String, Double>();
double laplaceEstimate;
double logLikelihood;
String sequence;
NgramAnalyser stringNgram = new NgramAnalyser(modelOrder + 1,testString);
Set<String> distinctNgrams = stringNgram.getDistinctNgrams();
for (String ngram : distinctNgrams) {
laplaceEstimate = model.laplaceEstimate(ngram);
//Use change of base formula to find log(10) likelihood
logLikelihood = Math.log(laplaceEstimate) / Math.log(10);
this.logLikelihoodMap.put(ngram, logLikelihood);
}
this.averageLogLikelihood =
this.averageLogLikelihood(this.logLikelihoodMap, stringNgram.getNgramCount());
}
/** Helper method that calculates the average log likelihood statistic
* given a HashMap of strings and their Laplace probabilities
* and the total number of ngrams in the model.
*
* @param logs map of ngram strings and their log likelihood
* @param ngramCount int number of ngrams in the original test string
* @return average log likelihood: the total of loglikelihoods
* divided by the ngramCount
*/
private double averageLogLikelihood(HashMap<String,Double> logs, int ngramCount) {
double logSum = this.totalLogLikelihood(logs);
return (logSum / ngramCount);
}
/** Helper method to calculate the total log likelihood statistic
* given a HashMap of strings and their Laplace probabilities
* and the total number of ngrams in the model.
*
* @param logs map of ngram strings and their log likelihood
* @return total log likelihood: the sum of loglikelihoods in logs
*/
private double totalLogLikelihood(HashMap<String,Double> logs) {
double logSum = 0;
for (Map.Entry<String, Double> entry : logs.entrySet()) {
logSum += entry.getValue();
}
return (logSum);
}
/**
* @return the average log likelihood statistic
*/
public double getAverageLogLikelihood() {
return averageLogLikelihood;
}
/**
* @return the log likelihood value for a given ngram from the input string
*/
public double getLogLikelihood(String ngram) {
return (logLikelihoodMap.get(ngram));
}
/**
* Make a String summarising the log likelihood map and its statistics
* @return Header lines containing the testString, the averageLogLikelihood
* and a sorted table of ngrams and their loglikelihoods
* The likelihood table should be ordered from highest to lowest likelihood
*/
public String toString() {
String returnString = “”;
returnString = returnString + Double.toString(this.averageLogLikelihood);
returnString = returnString + NgramAnalyser.printHashMap(this.logLikelihoodMap);
return returnString;
}
}
ProjectTest.java
import static org.junit.Assert.*;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import java.util.Set;
public class ProjectTest
{
/**
* Default constructor for test class ProjectTest
*/
public ProjectTest() {
}
/**
* Sets up the test fixture.
*
* Called before every test case method.
*/
@Before
public void setUp() {
}
/**
* Tears down the test fixture.
*
* Called after every test case method.
*/
@After
public void tearDown() {
}
//TODO add new test cases from here include brief documentation
/**
* Test if the number of lines in a string output from Ngram.toString()
* is valid (i.e equal to the size of the alphabet of that Ngram)
* Also ensures that the sort, splice and constructor functions work
* as required to produce the required comparison
*/
@Test(timeout=1000)
public void testSensibleToStringSize() {
String[] stringsToTest = {“Hello my friend”,
“be”,
“Have a nice day you filthy animal”,
“asdfghjkl$$sdfghj%%”,
“2”,
“adadadadaaaaa”,
” “};
Integer[] ngramSizesToTest = {1, 2, 3, 4, 5};
NgramAnalyser analysis;
String analysisString;
int i = ngramSizesToTest[0];
String s = stringsToTest[5];
if (i > s.length()) {
try {
analysis = new NgramAnalyser(i, s);
} catch (IllegalArgumentException e) {
assertEquals(0, 0);
}
} else {
analysis = new NgramAnalyser(i, s);
analysisString = analysis.toString();
//Number of lines is equal to the number of n’s plus 1
int numberofLines = analysisString.length() –
analysisString.replace(“n”, “”).length() + 1;
assert(numberofLines >= analysis.getAlphabetSize());
}
}
/**
* Tests various aspects of the getDistinctNgrams function
* inlcuding set length, …
*/
@Test(timeout=1000)
public void testGetDistinctNgrams() {
String[] stringsToTest = {
“123!@#123!@#”,
“adadadadadadadad”,
“cadadcdaadcdbed”,
“aaaaaa”,
“HOWWEYVUFXBINEF”
};
String stringToTest = stringsToTest[0];
int ngramSize = 2;
NgramAnalyser analysis = new NgramAnalyser(ngramSize, stringToTest);
Set<String> distinctNgrams = analysis.getDistinctNgrams();
int distinctNgramCount = analysis.getDistinctNgramCount();
int totalNgramCount = analysis.getNgramCount();
//Test that there are fewer or equal distinct Ngrams than total Ngrams
assert(distinctNgramCount <= totalNgramCount);
//Test that the alphabet size is smaller than
//or equal to the number of distinct NGrams
assert(analysis.getAlphabetSize() <= distinctNgramCount);
}
@Test(timeout=1000)
public void testLaplaceExample() {
assertEquals(0,1); //TODO replace with test code
}
@Test(timeout=1000)
public void testSimpleExample() {
assertEquals(0,1); //TODO replace with test code
}
@Test
public void testTask3example()
{
MarkovModel model = new MarkovModel(2,”aabcabaacaac”);
ModelMatcher match = new ModelMatcher(model,”aabbcaac”);
assertEquals(0,1); //TODO replace with test code
}
Expert Answer
import java.util.HashMap;
import java.util.Collection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;
public class ModelMatcher
{
/** log likelihoods for a teststring under a given model */
private HashMap<String,Double> logLikelihoodMap;
/** summary statistic for this setting */
private double averageLogLikelihood;
/**
* Constructor to initialise the fields for the log likelihood map for
* a test string and a given Markov model and
* the average log likelihood summary statistic
* @param MarkovModel model a given Markov model object
* @param String teststring the string to check compatability with the model
*/
public ModelMatcher(MarkovModel model, String testString)
{
logLikelihoodMap = new HashMap<String, Double>();
int kVal = model.getK();
String seq;
for (int curPos = 0; (curPos) < testString.length(); curPos++)
{
if (curPos + (kVal + 1) <= testString.length())
{
seq = testString.substring(curPos, (curPos + kVal + 1));
}
else
{
String fromEnd = testString.substring(curPos, (testString.length()));
String fromStart = testString.substring(0, ((kVal + 1) – fromEnd.length()));
seq = fromEnd + fromStart;
}
// seq should be of length k+1
// context should be the first k chars of seq
String context = seq.substring(0, kVal);
// impChar should be the last character of seq
String impChar = seq.substring(kVal, seq.length());
double loggedProb = Math.log10(model.laplaceEstimate(seq));
if (logLikelihoodMap.containsKey(seq))
{
logLikelihoodMap.put(seq, (logLikelihoodMap.get(seq) + loggedProb));
}
else
{
logLikelihoodMap.put(seq, loggedProb);
}
}
averageLogLikelihood = averageLogLikelihood(logLikelihoodMap, testString.length());
}
/** Helper method that calculates the average log likelihood statistic
* given a HashMap of strings and their Laplace probabilities
* and the total number of ngrams in the model.
*
* @param logs map of ngram strings and their log likelihood
* @param ngramCount int number of ngrams in the original test string
* @return average log likelihood: the total of loglikelihoods
* divided by the ngramCount
*/
private double averageLogLikelihood(HashMap<String,Double> logs, int ngramCount)
{
double totalLogs = 0.0;
for (String i : logs.keySet())
{
totalLogs += logs.get(i);
}
double avg = totalLogs/ngramCount;
return avg;
}
/** Helper method to calculate the total log likelihood statistic
* given a HashMap of strings and their Laplace probabilities
* and the total number of ngrams in the model.
*
* @param logs map of ngram strings and their log likelihood
* @return total log likelihood: the sum of loglikelihoods in logs
*/
private double totalLogLikelihood(HashMap<String,Double> logs)
{
double totalLogs = 0.0;
for (String i : logs.keySet())
{
totalLogs += logs.get(i);
}
return totalLogs;
}
/**
* @return the average log likelihood statistic
*/
public double getAverageLogLikelihood()
{
return averageLogLikelihood;
}
/**
* @param ngram String the ngram to find the log likelihood of.
* @return the log likelihood value for a given ngram from the input string
*/
public double getLogLikelihood(String ngram)
{
return (logLikelihoodMap.get(ngram));
}
/**
* Make a String summarising the log likelihood map and its statistics
* @return String of ngrams and their loglikeihood differences between the models
* The likelihood table should be ordered from highest to lowest likelihood
*/
public String toString()
{
SortedSet<String> keysArray = new TreeSet<String>(logLikelihoodMap.keySet());
String toRet = “”;
for (String key : keysArray)
{
double logLike = logLikelihoodMap.get(key);
String logLikeS = Double.toString(logLike);
String thisKey = (key + ” ” + logLikeS + “n”);
toRet += thisKey;
}
return toRet;
}
}
projectTest.java
import static org.junit.Assert.*;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
/**
* The test class ProjectTest for student test cases.
* Add all new test cases to this task.
*/
public class ProjectTest
{
/**
* Default constructor for test class ProjectTest
*/
public ProjectTest()
{
}
/**
* Sets up the test fixture.
*
* Called before every test case method.
*/
@Before
public void setUp()
{
}
/**
* Tears down the test fixture.
*
* Called after every test case method.
*/
@After
public void tearDown()
{
}
//TODO add new test cases from here include brief documentation
@Test(timeout=1000)
public void testSensibleToStringSize() {
NgramAnalyser analyser = new NgramAnalyser(“aabcabfaacfaaac”);
int minLines = analyser.getAlphabetSize() + 1;
//This next bit is ridiculously inefficient memory wise but I’m lazy so there.
String[] lines = analyser.toString().split(“rn|r|n”);
int linesLength = lines.length;
//System.out.println(linesLength);
//System.out.println(minLines);
assertTrue(linesLength >= minLines);
}
@Test(timeout=1000)
public void testGetDistinctNgrams() {
NgramAnalyser analyser = new NgramAnalyser(“aaa”);
NgramAnalyser analyser1 = new NgramAnalyser(3, “abc”);
NgramAnalyser analyser2 = new NgramAnalyser(3, “bcbabbcbc”);
NgramAnalyser analyser3 = new NgramAnalyser(2, “baba”);
assertEquals(analyser.getDistinctNgrams().size(),1);
assertEquals(analyser1.getDistinctNgrams().size(),3);
assertEquals(analyser2.getDistinctNgrams().size(),6);
assertEquals(analyser3.getDistinctNgrams().size(),2);
}
@Test(timeout=1000)
public void testLaplaceExample() {
MarkovModel mkMdl = new MarkovModel(2, “aabcabaacaac”);
double c = mkMdl.laplaceEstimate(“aac”);
double b = mkMdl.laplaceEstimate(“aab”);
double a = mkMdl.laplaceEstimate(“aaa”);
//System.out.println(c);
assertTrue(c >= 0.4999 && c <= 0.5001);
assertTrue(b >= 0.3332 && b <= 0.3334);
assertTrue(a >= 0.1666 && a <= 0.1668);
}
@Test(timeout=1000)
public void testSimpleExample() {
MarkovModel mkMdl = new MarkovModel(2, “aabcabaacaac”);
double b = mkMdl.simpleEstimate(“aab”);
//System.out.println(b);
assertTrue(b == (1.0/3.0));
}
@Test
public void testTask3example()
{
MarkovModel model = new MarkovModel(2,”aabcabaacaac”);
ModelMatcher match = new ModelMatcher(model,”aabbcaac”);
//Precision issues are amazing…
assertEquals(-0.3849, match.getAverageLogLikelihood(), 0.0001);
}
}