All the previous chapters have been in preparation for this final chapter, where we investigate a data set to try to identify bots. A bot [20] is a software application that automatically performs simple, structurally repetitive tasks over the internet at much higher rates than humanly possible.
Bots were initially introduced to perform important tasks on the internet, especially
in relation to operating search engines. However, around 66% of internet bots are now considered malicious [20]. These bots can do multiple things when visiting a
website, including scraping personal information or making the website unavailable to
users by generating large volumes of traffic against the target.
Due to the dangers these bots can create, it has become essential to identify them and stop them from visiting a webpage. When malicious bots first started to be used, they were relatively easy to identify due to their low levels of sophistication. However, bots are now much more advanced and often behave almost identically to humans [20], making them almost impossible to detect.
Hence in our analysis, we have to accept
that we are very unlikely to find all the bots within the data set, but detecting any will
be a big step in protecting the website from future attacks.
6.1 Initial Data Analysis
The data we will be looking at was provided by a local company, Clicksco, and contains information about the visits to a webpage on 2018-06-10. The data set contains 1,048,575 records, each of which holds 17 variables of information. The variables include the session ID (a unique identifier for the browsing session), the start and end time of each session, the device type the web page was visited on, and whether or not that profile is known to be a bot. The final goal is to isolate the key variables for identifying a bot and then use these to find the bots within the data set.
From the initial variables, we plot the history score, device type and end time type in Figure 6.1. We then use some of the initial variables to create three new variables: the average time spent on a web page, the number of web pages visited per session and the hour in which the session first started. The plots of these three variables are also shown in Figure 6.1.
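As a concrete illustration, the sketch below shows how the three derived variables could be computed with pandas. The file name and the column names (start_time, end_time, page_visits) are assumptions made for illustration, not the actual Clicksco schema.

```python
import pandas as pd

# Hypothetical file and column names; the real schema may differ.
df = pd.read_csv("clicksco_2018-06-10.csv",
                 parse_dates=["start_time", "end_time"])

# Total session length in seconds.
session_secs = (df["end_time"] - df["start_time"]).dt.total_seconds()

# Average time spent on a web page: session length divided by pages visited.
df["avg_time"] = session_secs / df["page_visits"]

# Hour (0-23) in which the session first started.
df["start_hour"] = df["start_time"].dt.hour
```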
Looking at Figure 6.1, we can immediately see that the plots of device type and end time type fail to add much information. End time type has only two responses, both with relatively similar frequencies, while device type is dominated by just two devices, desktop and smartphone; hence it too will struggle to add valuable information when trying to locate anomalous behaviour. Let us now look at the variable containing the hour in which the page was visited, shown in the start hour plot. The plot shows what we would expect for hits on a web page: low frequency in the early hours of the morning, a steady increase to a peak in the early afternoon, and then a steady decline into the night. This is nevertheless an interesting pattern, and we will investigate this variable further. Next consider the history score, a variable in the range 0 to 1, where 0 indicates behaviour unlike a bot and 1 indicates certain bot behaviour. There is a clear peak in the plot at 0.38, followed by numerous smaller peaks at higher scores. This could lead to some interesting insights with respect to anomalous behaviour, so we will continue investigating this variable as well.
From the definition of a bot, we are looking for data points that visit a lot of pages in a short space of time. The average time on a web page is therefore plotted against the number of page visits in a session in Figure 6.2. There is a clear shape to this graph, with the vast majority of points having very low values for both variables. However, the data points with very low values for average time but high values for number of pages visited are, by the definition, the ones we would expect to be potential bots.
Figure 6.2: Average time vs number of page visits for Clicksco data
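For reference, a plot like Figure 6.2 could be produced as follows, reusing the hypothetical column names from the earlier sketch.

```python
import matplotlib.pyplot as plt

# Each point is one browsing session; the bulk sits near the origin.
plt.scatter(df["avg_time"], df["page_visits"], s=2, alpha=0.3, color="black")
plt.xlabel("Average time on a web page (seconds)")
plt.ylabel("Number of page visits in session")
plt.show()
```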
If we next look at the variable declaring whether or not a data point is a bot, we can get a better understanding of what a bot might look like and use this to identify more of them. Figure 6.3 shows four plots of what are considered to be the key variables of the data set, restricted to the known bots. The plot of average time agrees with our definition of a bot, with most of the known bots having an average time very close to zero, although around 12 points have a time much greater than zero, demonstrating how known bots can evolve to mask themselves. The plot of the number of page visits shows that most known bots visit fewer than five web pages, which is unexpected given the definition. The third plot shows that all known bots have a history score of at least 0.38, with common scores being 0.38, 0.44 and 0.48. Interestingly, the maximum history score obtained by a bot was 0.63, well below the score of 1 reserved for certain bot-like behaviour. Finally, the fourth plot shows the hour in which the bot started visiting the webpages. From our initial analysis, we thought that this variable might be interesting and unlock potential bot-like behaviour, but the plot looks almost uniform and hence offers very little insight. Therefore, we will discard this fourth variable and focus on the other three.
Figure 6.3: Plots of known bots
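A minimal sketch of how the known-bot subset behind Figure 6.3 might be extracted and plotted; the columns is_bot and history_score are assumed names for the bot flag and the history score.

```python
# Restrict to the sessions flagged as known bots (assumed boolean column).
known_bots = df[df["is_bot"]]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(known_bots["avg_time"], bins=30)
axes[0, 0].set_title("Average time")
axes[0, 1].hist(known_bots["page_visits"], bins=30)
axes[0, 1].set_title("Number of page visits")
axes[1, 0].hist(known_bots["history_score"], bins=30)
axes[1, 0].set_title("History score")
axes[1, 1].hist(known_bots["start_hour"], bins=24)
axes[1, 1].set_title("Start hour")
plt.tight_layout()
plt.show()
```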
The data set we are considering is very large, and before we can apply some of the algorithms discussed earlier in the report, we will have to condense it whilst trying to lose as little information as possible. Since there is so much data with time equal to zero and page visits less than or equal to three, we will delete these records from the data set, simply because we cannot get much information from them. It is worth noting that some of these data points may in fact be bots, but because the frequency of their page visits is so low, even if they did visit the web page they would not be choking it up, and we can discard them without much worry. Also, we have just seen that all known bots have a history score of at least 0.38, so it seems sensible to reduce the data set further by removing any records with a history score below 0.38. A sketch of these two reduction steps follows.
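Under the same assumed column names, the two reduction steps could be expressed as:

```python
# Drop the uninformative bulk: zero average time and at most three page visits.
low_info = (df["avg_time"] == 0) & (df["page_visits"] <= 3)
reduced = df[~low_info]

# Keep only sessions scoring at least 0.38, since every known bot did.
reduced = reduced[reduced["history_score"] >= 0.38]
```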
If we now plot the three variables (average time spent on page, number of page visits and history score) for the reduced data set described above, which is the data we will consider from now on, we obtain the 3D scatter plot in Figure 6.4. The red points correspond to the known bots and the black points to the rest of the data, for which we have yet to decide whether or not they are bots. The known bots all have an average time close to zero, which is what we would expect, but their numbers of page visits are all relatively low. Also, note how the known bots seem to sit amongst the bulk of the data, and it could therefore be difficult to find similar anomalous behaviour.