All the previous chapters have been in preparation for this last chapter, where we

investigate a data set to try to identify bots. A bot [20] is a software application that

automatically performs simple and structurally repetitive tasks over the internet at much

higher rates than humanly possible.

Bots were initially introduced to perform important tasks on the internet, especially

in relation to operating search engines. However, now around 66% of internet bots

are considered to be malicious [20]. These bots can do multiple things when visiting a

website, including scraping personal information or making the website unavailable to

users by generating large volumes of traffic against the target.

Due to the dangers these

bots can create, it has become essential to identify them and stop

them from visiting a webpage. When malicious bots first came into use,

they were relatively easy to identify due to their low levels of sophistication. However,

bots are now much more advanced and often behave almost identically to a human

[20], making them almost impossible to detect.

Hence in our analysis, we have to accept

that we are very unlikely to find all the bots within the data set, but detecting any will

be a big step in protecting the website from future attacks.

6.1 Initial Data Analysis

The data we will be looking at was provided by a local company, Clicksco, and contains

information about the visits to a webpage on 2018-06-10. The data set contains

1,048,575 records, each of which contains 17 variables. The

variables include the session ID (a unique identifier for the browsing

session), the start and end time of each session, the device type the web page was visited on

and whether or not that profile is known to be a bot. The final goal is to isolate

the key variables for identifying a bot and then use these to find the bots within the

data set.

From the initial variables, we plot the history score, device type and end time type in

Figure 6.1. We then use some of the initial variables to create three new variables: average

time spent on a web page, number of web pages visited per session and the hour in which the session started. The plots of these three variables are also shown in Figure 6.1.
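
The three derived variables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`session_id`, `start`, `end`) and the timestamp format are hypothetical, and it assumes each raw record is one page view carrying its session's start and end time, which the text does not confirm.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def derive_session_features(rows):
    """Build per-session features from page-view records (hypothetical schema)."""
    # Number of page views per session = number of rows sharing a session ID.
    pages = Counter(row["session_id"] for row in rows)
    features = {}
    for row in rows:
        sid = row["session_id"]
        if sid in features:
            continue  # session already processed
        start = datetime.strptime(row["start"], TIME_FORMAT)
        end = datetime.strptime(row["end"], TIME_FORMAT)
        duration = (end - start).total_seconds()
        features[sid] = {
            "pages": pages[sid],
            "avg_time": duration / pages[sid],  # seconds per page
            "start_hour": start.hour,
        }
    return features


# Made-up records for illustration only; not taken from the Clicksco data.
rows = [
    {"session_id": "a", "start": "2018-06-10 09:15:00", "end": "2018-06-10 09:19:00"},
    {"session_id": "a", "start": "2018-06-10 09:15:00", "end": "2018-06-10 09:19:00"},
    {"session_id": "b", "start": "2018-06-10 14:00:00", "end": "2018-06-10 14:00:00"},
]
features = derive_session_features(rows)
```

Here session `a` spends 240 seconds over two pages, while session `b` has a single page view with zero duration.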

Looking at Figure 6.1, we can immediately see that the plots of device type and end

time type fail to add much information. End time type has only two responses,

which appear relatively similar to each other, whereas device type is dominated

by just two devices: desktop and smartphone. Hence, neither is likely to add any

valuable information when trying to locate anomalous behaviour. Let us now look at the

variable containing the hour in which the page was visited, shown

by the Start hour plot. The plot shows what you would expect for hits on a web page, with

low frequency in the early hours of the morning, increasing steadily until it peaks

in the early afternoon, followed by a steady decline into the night. Nevertheless, this is still an

interesting pattern and we will investigate this variable further. Next, consider the

history score, which takes values in the range 0 to 1, with 0 indicating behaviour unlike that of a bot and 1 indicating certain bot behaviour. We see that there is a clear peak

in the plot at 0.38 and numerous smaller peaks at higher scores. Hence,

this could also lead to some interesting insights with respect to anomalous behaviour,

so we will also carry on investigating this variable.

From the definition of a bot, we are looking for data points that visit a lot of pages

in a short space of time. Therefore, the average time on a web page is plotted against the

number of page visits in a session in Figure 6.2. There seems to be a clear

shape to this graph, with the vast majority of points having very low values for both

variables. However, data points that have very low values for average time but

high values for number of pages visited are, by the definition,

potential bots.

Figure 6.2: Average time vs number of page visits for Clicksco data
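
The region of Figure 6.2 singled out above (low average time, high number of page visits) can be expressed as a simple screening rule. The sketch below is illustrative only: the thresholds `max_avg_time` and `min_pages` and the field names are assumptions chosen for the example, not values used in the analysis.

```python
def flag_potential_bots(sessions, max_avg_time=1.0, min_pages=20):
    """Return IDs of sessions with low average time per page and many page visits.

    Both thresholds are hypothetical and would need tuning against real data.
    """
    return sorted(
        sid
        for sid, s in sessions.items()
        if s["avg_time"] <= max_avg_time and s["pages"] >= min_pages
    )


# Made-up sessions for illustration only.
sessions = {
    "a": {"avg_time": 45.0, "pages": 3},    # ordinary visit
    "b": {"avg_time": 0.2, "pages": 150},   # fast and frequent: flagged
    "c": {"avg_time": 0.0, "pages": 1},     # fast but only one page
    "d": {"avg_time": 30.0, "pages": 40},   # many pages but slow
}
flagged = flag_potential_bots(sessions)
```

Only session `b` satisfies both conditions, which matches the intuition that a bot visits many pages in a short space of time.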

If we next look at the variable declaring whether or not a data point is a known bot, we can

get a better understanding of what a bot might look like and use this to identify more

of them. Figure 6.3 shows four plots of what are considered to

be the key variables of the data set for the known bots. The plot of time agrees with our definition of a bot,

with most of the known bots having an average time very close to zero. However, there

are around 12 points which have a time much greater than zero, suggesting that

known bots can evolve to try to mask themselves. The plot of number of page visits

shows that most known bots visit fewer than five web pages, which is unexpected

given the definition. The third plot shows that all known bots have a history score of at least

0.38, with common scores being 0.38, 0.44 and 0.48. Interestingly, the maximum history score obtained by a known bot was 0.63, which is much less than the score of 1 assigned

to certain bot-like behaviour. Finally, the fourth plot shows the hour in which the bot

started visiting the webpages. From our initial analysis, we thought that this variable

might be interesting and reveal potential bot-like behaviour, but this plot looks

almost like a uniform distribution and hence will offer very little insight into the data.

Therefore, we will discard this fourth variable and focus on the other three.

Figure 6.3: Plots of known bots
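
The kind of summary read off Figure 6.3 can also be computed directly. The following is a minimal sketch, assuming each session is a dictionary with hypothetical fields `is_bot`, `history_score`, `avg_time` and `pages`; the cut-offs of 1 second and five pages are illustrative, and the toy values are not taken from the Clicksco data.

```python
from collections import Counter


def summarise_known_bots(sessions, near_zero=1.0, few_pages=5):
    """Summarise the known bots; returns None if there are none."""
    bots = [s for s in sessions if s["is_bot"]]
    if not bots:
        return None
    scores = [s["history_score"] for s in bots]
    return {
        "count": len(bots),
        "min_score": min(scores),
        "max_score": max(scores),
        "score_counts": Counter(scores),  # how often each score occurs
        "near_zero_time": sum(1 for s in bots if s["avg_time"] < near_zero),
        "few_pages": sum(1 for s in bots if s["pages"] < few_pages),
    }


# Made-up sessions for illustration only.
sessions = [
    {"is_bot": True, "history_score": 0.38, "avg_time": 0.0, "pages": 2},
    {"is_bot": True, "history_score": 0.44, "avg_time": 0.5, "pages": 4},
    {"is_bot": True, "history_score": 0.38, "avg_time": 12.0, "pages": 7},
    {"is_bot": False, "history_score": 0.10, "avg_time": 30.0, "pages": 3},
]
summary = summarise_known_bots(sessions)
```

A summary like this makes it easy to check statements such as the minimum and maximum history score among the known bots without reading them off a plot.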

The data set we are considering is very large, and before we can apply some of the

algorithms discussed earlier in the report, we have to condense the data

whilst losing as little information as possible. Since there is so much

data with time=0 and page visits less than or equal to 3, we will delete these points from the data

set, simply because we cannot get much information from them. It is worth

noting that some of these data points may in fact be bots, but because the frequency of

their page visits is so low, they would not be overloading the web page

and we can discard them without much worry. Also, we have just seen that all known

bots have a history score of at least 0.38, hence it seems sensible to reduce the

data set further by removing any data points with a history score less than 0.38.
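
The two reduction steps can be sketched as a single filter. The cut-offs 0.38 and 3 come from the text; the field names are hypothetical, and the sketch assumes that a point is treated as low-information only when both conditions hold together (time equal to zero and at most three page visits), which is one reading of the description above.

```python
def condense(sessions, min_history_score=0.38, max_pages_to_drop=3):
    """Drop low-information sessions and sessions with a low history score."""
    kept = []
    for s in sessions:
        # Assumed reading: both conditions must hold for a point to be low-information.
        low_information = s["avg_time"] == 0 and s["pages"] <= max_pages_to_drop
        low_score = s["history_score"] < min_history_score
        if not (low_information or low_score):
            kept.append(s)
    return kept


# Made-up sessions for illustration only.
sessions = [
    {"id": "A", "avg_time": 0.0, "pages": 2, "history_score": 0.50},   # dropped: zero time, few pages
    {"id": "B", "avg_time": 0.0, "pages": 10, "history_score": 0.44},  # kept
    {"id": "C", "avg_time": 5.0, "pages": 2, "history_score": 0.20},   # dropped: low history score
    {"id": "D", "avg_time": 5.0, "pages": 2, "history_score": 0.38},   # kept: score equals the cut-off
]
kept_ids = [s["id"] for s in condense(sessions)]
```

Keeping points whose score equals 0.38 matters here, since 0.38 is reported as a common score among the known bots.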

If we now plot the three variables (average time spent on page, number of page visits

and history score) for all the data in which the history score is at least 0.38 and the time is not equal to zero with more than three page visits (this is the data we will consider

from now on) using a 3D scatter plot, we obtain Figure 6.4. The red data

points correspond to the known bots and the black points correspond to the rest of the

data, for which we are yet to decide whether they are bots or not. The known bots all have an

average time close to zero, which is what we would expect, but their numbers of page visits

are all relatively low. Also, note how the known bots seem to lie amongst the bulk of

the data, so it could be difficult to find similar anomalous

points that can be explained by bots.
