6.3 Applying Anomaly Detection Techniques

So far in this chapter, we have reduced the data into a more approachable data set represented by what we have identified to be the key variables. We have also seen what known bots look like and compared these to the rest of the data. To see if we can find any anomalous data points that may be explained by bots, we will now carry out both global and local anomaly detection techniques.

6.3.1 K-NN Anomaly Detection

Let's first apply K-NN anomaly detection to the standardized data, using the distance to the K = 44th nearest neighbour, as √1965 ≈ 44.
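As a concrete illustration, the following is a minimal sketch of how this K-NN anomaly score (the distance from each point to its K-th nearest neighbour) could be computed. The use of scikit-learn and the array name X holding the standardized data are assumptions for illustration, not details taken from this chapter.

import numpy as np
from sklearn.neighbors import NearestNeighbors

K = 44  # rule of thumb: roughly the square root of the 1965 data points

# X is assumed to hold the standardized data, shape (1965, n_features).
# Request K + 1 neighbours because each point's nearest neighbour is itself.
nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
distances, _ = nn.kneighbors(X)

# The K-NN anomaly score of a point is its distance to its K-th neighbour.
knn_scores = distances[:, K]

# Indices of the points with the largest scores, most anomalous first.
top3 = np.argsort(knn_scores)[-3:][::-1]
top50 = np.argsort(knn_scores)[-50:][::-1]

The indices in top3 would then correspond to the three points with the largest K-NN scores discussed below.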
Doing this, we find that the three data points which incur the largest K-NN scores are data points 345, 661505 and 727589. These are the three anomalous data points that were seen as a collective anomaly in the K-means clustering of the data. However, as stated previously, all of these data points have a high average time spent on a page per session, so it is very unlikely that these anomalous points are explained by bots. So instead, we look at the 50 data points with the highest K-NN scores, which are given by the red crosses in Figure 6.7.
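A figure of this kind could be produced with a scatter plot that overlays the 50 highest-scoring points on the rest of the data. The sketch below builds on the previous one (reusing X and top50) and assumes, purely for illustration, that the first two columns of X hold the average time and the number of page visits; neither the column order nor the plotting library is specified in this chapter.

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], s=8, c="grey", label="all sessions")
plt.scatter(X[top50, 0], X[top50, 1], marker="x", c="red",
            label="50 highest K-NN scores")
plt.xlabel("average time per page (standardized)")
plt.ylabel("page visits per session (standardized)")
plt.legend()
plt.show()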
This diagram illustrates how the K-NN anomaly detection technique is a global method, as each data point represented by a red cross does indeed seem to be isolated from the rest of the data. We can also use this figure to match the collective anomaly found in the K-means clustering with the three data points which are furthest right in Figure 6.7. If we investigate these 50 data points, it again turns out that all of them have a very high value for the average time (most being over 500 seconds) and are therefore unlikely to be bots. In fact, none of these 50 data points are identified as known bots, which indicates that they probably don't have the required statistics to be classed as bots.

Figure 6.8: The 50 data points which scored the highest K-NN score.

If we compare the 50 data points which scored the highest K-NN score to the points which were identified as outliers/noise by the DBSCAN algorithm, we find that 7 of the 20 data points do not appear in both lists. As we saw previously, global anomalies imply local anomalies, but local anomalies do not imply global anomalies, and this is what we have seen here. It is unlikely that we are going to find anomalous points that can be explained by bots among the global anomalies, since for a global anomaly to be identified it has to be far away from the rest of the data, which in the model we are using implies a very large average time. Therefore, to try and find some anomalous points that can be explained by bots, we will now turn to local anomaly detection.

6.3.2 LOF Anomaly Detection

Now let's calculate the LOF score for each of these 1965 data points. When calculating the LOF score we again need to specify the value of K, for which we will use K = 44, following the rule of thumb of taking the square root of the number of data points we consider. In Chapter 5 we saw that any data point with a LOF score higher than 2 lies in an area of low density and is therefore likely to be an anomalous point. If we identify all the data points whose LOF score is greater than 2, we find 181 data points that are considered to be possible anomalies. If we then inspect these 181 points, we find that in fact none of the identified anomalous points are known bots. If we compare these data points with the 20 data points identified as noise/outliers by the DBSCAN algorithm, we find that only 3 of the 20 didn't feature in both lists. Note that the three data points which didn't have a LOF score greater than 2 all featured among the 50 data points with the highest K-NN score, and hence are clearly global anomalies. The plot of this data is shown in Figure 6.9 (a), where the red points indicate the data points which have a LOF score greater than 2. If we compare this plot with Figure 6.8, we can spot many differences. Both methods pick out the data points which are considered to be clear global anomalies, but the LOF method picks out many more data points within the main bulk of the data. This main bulk of the data has an average time closer to zero and is hence much more likely to contain anomalous data which could be explained by bots.

Unfortunately, LOF anomaly detection also picks up global anomalies, as we can see in Figure 6.9 (a), and as we have seen so far in this chapter it is apparent that we are looking for local anomalies. Therefore, to compensate for this, we will now look for data points which give a LOF score greater than 10, as these points will almost certainly be local anomalies.
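As with the K-NN score, the LOF calculation and both thresholds can be sketched in a few lines. Again, the use of scikit-learn and the array name X are assumptions, and the DBSCAN label array mentioned in the final comment is assumed to come from the clustering carried out earlier in the chapter.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

K = 44  # same rule of thumb as before: roughly sqrt(1965)

lof = LocalOutlierFactor(n_neighbors=K)
lof.fit(X)

# scikit-learn stores the negated LOF, so negate it back to get the scores.
lof_scores = -lof.negative_outlier_factor_

suspicious = np.where(lof_scores > 2)[0]     # points in low-density regions
strong_local = np.where(lof_scores > 10)[0]  # almost certainly local anomalies

# Comparison with the DBSCAN noise points (label -1) from the earlier
# clustering, assuming its labels are available as dbscan_labels:
# overlap = np.intersect1d(np.where(dbscan_labels == -1)[0], suspicious)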
It turns out there are 6 data points with a LOF score greater than 10, with one data point having a LOF score of 22.58668, far larger than 2, the value at which a data point becomes suspicious. If we plot these 6 values, again marked in red and shown in Figure 6.9 (b), we see that they lie in the dense area of the data. These points have been identified as clear local anomalies in the data, all of which have a low average time per page visit, a relatively high number of page visits per session and a history score of around 0.5. They also lie in a very similar region of the graph to that in which the known bots were plotted in Figure 6.4. Earlier in this chapter, it became apparent that the anomalies that could be explained by bots were going to come from local anomalies. We have now identified 6 data points which are clear local anomalies and which also correspond to behaviour that we know is explained by bots. Hence, these 6 data points are very likely local anomalies which can be explained by bots.

6.4 Summary

In this chapter, we have condensed a very large data set into a much more approachable one by identifying the important variables. From analysing what known bots look like, we condensed the data further by identifying what we were looking for and excluding data which didn't have the required features. Discarding some of the data means that some bots will go unidentified, but we purposely focused on the bots which would have the biggest impact on the web page; the bots that are likely to be left undetected don't make a high number of page visits and hence wouldn't clog up the web page. When looking at the remaining data, we identified many global anomalies; however, it was apparent from the definition of a bot that these wouldn't have been caused by bots. When investigating local anomalies, we identified 6 data points which were much more likely to be anomalies that could be explained by bots. Without any verification, however, we can't be sure that these anomalous points were bots, which again illustrates the difficulty of the problem.