Data analysis report of Australian road transport crash fatalities from 2010-2018

## Prepared by

## Akash Johnson

### University of the Sunshine Coast

## ATMC, Melbourne

## June 2019

### Referencing style: Harvard

TABLE OF CONTENTS

1 Introduction
1.1 Authorisation and Purpose
1.2 Limitations
1.3 Scope
1.4 Methodology
2 Data setup
3 Exploratory Data analysis
3.1 One variable analysis
3.1.1 One variable analysis 1
3.1.2 One variable analysis 2
3.1.3 One variable analysis 3
3.2 Two-variable analysis
3.2.1 Two-variable analysis 1
3.2.2 Two-variable analysis 2
4 Advanced analysis
4.1 Clustering
4.1.1 Brief explanation of k-means and clustering
4.1.2 Clustering Analysis
4.2 Linear regression
4.2.1 Brief definition of linear regression
4.2.2 Linear Regression 1
4.2.3 Linear Regression 2
5 Conclusion
6 Reflection

1 Introduction

1.1 Authorisation and Purpose

This data analysis report has been prepared to analyse Australian road transport crash fatalities from 2010 to 2018. No specific goals were set in advance; instead, any finding that appears interesting or significant is critically analysed to assist other researchers, business representatives, and government agencies.

1.2 Limitations

Only data from 2010 to 2018 will be analysed. BITRE_ARDD_Fatalities_Mar_2019_II.csv, from the Australian Road Deaths Database, is the only data set that is processed and analysed.

1.3 Scope

Data from the Australian Road Deaths Database will be set up and pre-processed. Two one-variable analyses will be performed, followed by two two-variable analyses. A graph will be provided with each analysis. Clustering and k-means will be briefly explained, and a clustering analysis of selected years will be performed. Linear regression will be briefly explained and two linear regression analyses will be carried out, with both models plotted.

1.4 Methodology

The data for this report come from a CSV file published by the Australian Road Deaths Database.

2 Data setup

Before BITRE_ARDD_Fatalities_Mar_2019_II.csv is loaded into R, the working directory is set with the setwd() function, which takes the path to the desired folder.

setwd("E:/Semester 2/Introduction to data science/Assignment")

With the working directory set, BITRE_ARDD_Fatalities_Mar_2019_II.csv can be placed in that folder. Because the raw data is a comma-separated values (.csv) file rather than a plain text file, the built-in read.csv() function is used to read it into the variable data. The first four rows of the file are not required, so they are skipped with skip = 4, and header = TRUE takes the column names from the first row that remains.

data <- read.csv("BITRE_ARDD_Fatalities_Mar_2019_II.csv", header = TRUE, skip = 4)
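
How skip and header interact can be sketched on a small inline CSV rather than the real file. The preamble lines and the values below are invented for illustration only and are not taken from the BITRE file; the names csv_text and demo are likewise hypothetical.

```r
# Hypothetical miniature of a CSV file that has four preamble lines
# before the header row (all contents invented for illustration).
csv_text <- paste(
  "Preamble line 1",
  "Preamble line 2",
  "Preamble line 3",
  "Preamble line 4",
  "State,Year,Age",
  "NSW,2012,34",
  "VIC,2015,-9",
  sep = "\n"
)

# skip = 4 discards the preamble; header = TRUE then takes the column
# names from the first line that remains.
demo <- read.csv(text = csv_text, header = TRUE, skip = 4)

print(names(demo))
print(nrow(demo))
```

Reading from text = instead of a file name keeps the sketch self-contained; with the real data the file name would be passed as the first argument.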

The cluster library is loaded to visualize the clusters.

library(cluster)

As the report is limited to the period 2010 to 2018, the data is filtered to those years and stored in a variable named filtered.

filtered <- data[data$Year >= 2010 & data$Year <= 2018, ]

As the report requires only the columns Crash ID, State, Year, Speed Limit, Age, Gender and Road User, these columns are selected and stored in a new variable named col.

col <- filtered[, c("State", "Year", "Speed.Limit", "Age", "Gender", "Road.User")]

Unknown values in the data are coded as -9. To remove them, each column that may contain -9 is filtered in turn, and the result is saved in a new variable named main.

main <- col[col$Speed.Limit != -9, ]

main <- main[main$Age != -9, ]

main <- main[main$Gender != -9, ]

main <- main[main$Road.User != -9, ]
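
The column-by-column filtering can also be written as a single step. The following is a minimal sketch on an invented toy data frame, not the report's data; the names toy, check_cols, known and clean are hypothetical.

```r
# Toy data frame (values invented); -9 marks an unknown value.
toy <- data.frame(
  Speed.Limit = c(60, -9, 100, 80),
  Age         = c(25, 40, -9, 61),
  Gender      = c("Male", "Female", "Male", "-9"),
  stringsAsFactors = FALSE
)

# Compare every checked column with -9 at once, then keep the rows
# in which no column matched.
check_cols <- c("Speed.Limit", "Age", "Gender")
known <- rowSums(toy[check_cols] == -9) == 0

clean <- toy[known, ]
print(clean)
```

Filtering one column at a time, as in the report, gives the same rows; the single-step form only avoids repeating the assignment.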

The head() function is used to view the first six rows of the filtered data frame.

head(main)

Only the data for 2012 is used for the exploratory data analysis.

df <- main[main$Year == 2012, ]

3 Exploratory Data analysis

3.1 One variable analysis

3.1.1 One variable analysis 1

#Draw boxplot with a variable

boxplot(df$Age)

## Analysis:

The boxplot summarises the ages of people who died in road crashes in 2012. A boxplot is preferred because it clearly shows the range of ages as well as the lower and upper quartiles and any outlying values. There are no outlying values for 2012. The oldest person killed was aged between 90 and 100 and the youngest was aged 0. The first quartile is 25, the second quartile (the median age of those killed) is 40, and the third quartile is 60.
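
The quartiles and outliers read off a boxplot can also be obtained numerically. This is a hedged sketch on invented ages, not the 2012 data; the variable names ages and stats are hypothetical.

```r
# Invented ages, used only to demonstrate the calls.
ages <- c(0, 18, 22, 25, 31, 40, 47, 58, 60, 72, 95)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
print(fivenum(ages))

# boxplot.stats() returns, in $out, the values a boxplot would draw as
# outliers (points more than 1.5 times the box length beyond the hinges).
stats <- boxplot.stats(ages)
print(stats$out)
```

Applied to df$Age, the same two calls would give the exact values behind the plot instead of estimates read from the axis.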

3.1.2 One variable analysis 2

#Draw histogram with a variable

hist(df$Age)

## Analysis:

The frequency of road deaths at different ages in 2012 is best viewed through a histogram. From the graph, it is clear that the highest frequency of road deaths, above 250, was in the 20 to 30 age group, while the lowest frequency was in the 90 to 100 age group. Most other age groups had a frequency between 75 and 150.
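
The bin counts behind a histogram can be inspected directly, which avoids estimating them from the bar heights. The sketch below uses invented ages and assumes 10-year bins; the names h, bins and counts are hypothetical.

```r
# Invented ages, used only to demonstrate the calls.
ages <- c(3, 19, 21, 24, 27, 29, 35, 44, 52, 67, 93)

# plot = FALSE returns the bin edges and counts without drawing.
h <- hist(ages, breaks = seq(0, 100, by = 10), plot = FALSE)

# Label each bin by its edges and attach the counts.
bins <- paste(head(h$breaks, -1), tail(h$breaks, -1), sep = "-")
counts <- setNames(h$counts, bins)

print(counts)
print(names(counts)[which.max(counts)])  # label of the fullest bin
```

By default hist() closes each bin on the right, so an age of exactly 30 would be counted in the 20-30 bin.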

3.2 Two-variable analysis

3.2.1 Two-variable analysis 1

#Draw a scatter plot with two variables (Speed.Limit ~ Gender)

plot(Speed.Limit ~ Gender, data = main)

## Analysis:

Plotting two variables against each other makes trends between them more prominent. The variables analysed are Speed Limit and Gender of people who died in road crashes. The graph has Speed Limit, the dependent variable, on the y-axis and Gender, the predictor, on the x-axis. Both variables cover the years 2010 to 2018. Reviewing the graph, most fatal crashes, for both males and females, happened at speed limits of 60 to 100. The mean value for females is 100 and for males is 80. There are no outliers, as Speed.Limit takes only a fixed set of values.
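
When the predictor is a factor, R's plot() draws one box per category for a formula such as Speed.Limit ~ Gender, and the thick line inside each box marks the median, not the mean. As a hedged sketch on invented values (the names toy, group_median and group_mean are hypothetical), both statistics can be computed per group:

```r
# Toy data (values invented for illustration).
toy <- data.frame(
  Speed.Limit = c(60, 100, 80, 100, 50, 110),
  Gender      = factor(c("Male", "Female", "Male", "Female", "Male", "Female"))
)

# Median and mean speed limit within each gender.
group_median <- tapply(toy$Speed.Limit, toy$Gender, median)
group_mean   <- tapply(toy$Speed.Limit, toy$Gender, mean)

print(group_median)
print(group_mean)
```

Comparing the two outputs shows whether a value read from the plot is closer to the group median or the group mean.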

Relationship between Speed Limit and Gender (2010 to 2018)

3.2.2 Two-variable analysis 2

#Draw a scatter plot with two variables (Speed.Limit ~ Road.User)

plot(Speed.Limit ~ Road.User, data = main)

## Analysis:

A scatter plot is preferred when looking for trends between two variables. The variables analysed are Speed Limit and Road User. Speed Limit is the dependent variable and is on the y-axis; the x-axis shows the different road users, the predictor. Drivers and passengers died mostly at higher speed limits (almost 100). Pedal cyclists and pedestrians had the most fatal crashes at a speed limit of 60. Motorcycle riders, motorcycle pillion passengers and other road users had the highest frequency of fatal crashes at a speed limit of 80. The fewest fatal crashes occurred at speed limits of 20 to 40.

Relationship between Speed Limit and Road.Users

4 Advanced analysis

4.1 Clustering

4.1.1 Brief explanation of k-means and clustering

K-means clustering is a prototype-based clustering technique that is applied to objects in a continuous n-dimensional space. First, K initial centroids are chosen, where K is the number of clusters desired. Each point is then assigned to the closest centroid, and the collection of points assigned to a centroid forms a cluster. The centroid of each cluster is then updated based on the points assigned to it, and the assignment and update steps are repeated until the centroids no longer change (Pang-Ning Tan, 2006).
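
The assignment of points to centroids can be illustrated with R's built-in kmeans() function. This is a minimal sketch on two invented, well-separated groups of points; the seed, the data and the names pts and fit are assumptions made for illustration and are not part of the report's analysis.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Two invented groups of 20 points each in two dimensions.
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# K = 2 clusters; nstart = 5 tries five random sets of initial
# centroids and keeps the best result.
fit <- kmeans(pts, centers = 2, nstart = 5)

print(fit$centers)         # final centroid of each cluster
print(table(fit$cluster))  # number of points assigned to each cluster
```

Because the initial centroids are chosen at random, fixing a seed or using several starts helps make the result repeatable.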

4.1.2 Clustering Analysis

cluster <- main[main$Year == 2017, ]

cluster <- cluster[, c("Speed.Limit", "Age")]

scaled <- scale(cluster)

fit <- kmeans(scaled, 10)

fit

wssplot <- function(data, nc = 15, seed = 1234) {
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- kmeans(data, centers = i)$tot.withinss
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
}

wssplot(scaled)

## Analysis:

From the graph, the within-groups sum of squares changes little between 6 and 15 clusters, where the points on the curve lie close together.

clusplot(scaled, fit$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0, main = "cluster of age and speed limit in 2017-2018", xlab = "Age", ylab = "Speed Limit")

## Analysis:

The graph shows the clusters of Speed.Limit and Age for 2017-2018. It indicates that the largest number of fatal crashes occurred at the highest speed limits among people of around average age.

4.2 Linear regression

4.2.1 Brief definition of linear regression

Linear regression models the relationship between two variables. The main equation for linear regression is y = mx + b, where m is the slope and b is the y-intercept. The line is fitted as closely as possible to the data points (Stanton, 2013). The regression line will not pass exactly through every point, so each prediction has some error; the best-fitting line is the one that minimizes the sum of the squared errors.
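
The slope m and intercept b that minimize the sum of squared errors have a closed form, which can be checked against lm(). The following is a hedged sketch on invented data that roughly follows y = 2x + 1; the variable names are hypothetical.

```r
# Invented data, roughly following y = 2x + 1.
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 9.0, 10.8)

# Closed-form least-squares estimates of the slope (m) and intercept (b).
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - m * mean(x)

# lm() fits the same line by minimizing the sum of squared errors.
fit <- lm(y ~ x)

print(c(intercept = b, slope = m))
print(coef(fit))
```

The two printed results should agree up to floating-point rounding, since lm() solves the same least-squares problem.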

The plot() function is used to draw all the possible pairwise scatter plots of the selected columns.

plot(main)

From these scatter plots, the pairs that come closest to an increasing positive trend, and so allow a straight line to be fitted, are selected.

4.2.2 Linear Regression 1

#Draw a linear regression scatter plot

linear1 <- lm(Speed.Limit ~ Age, data = main)

plot(Speed.Limit ~ Age, data = main)

abline(linear1, col = "red")

## Analysis:

In the scatter plot displayed below, the y-axis represents Speed.Limit, the dependent variable, and the x-axis represents the predictor, Age. The variables analysed are Speed Limit and Age for the years 2010 to 2018. The starting assumption is that there is no relationship between the two. When graphed, a relationship becomes apparent, as people in higher age groups tend to have the lowest Speed Limit. A line of best fit is generated and displayed on the graph, and it slopes downward over the succeeding age groups. While the regression line could be used as a loose basis to model the relationship, in practice it could lead to predictions that deviate widely from the underlying data.

Linear Regression analysis ( Speed.Limit ~ Age )

4.2.3 Linear Regression 2

#Draw a linear regression scatter plot

linear2 <- lm(Speed.Limit ~ Year, data = main)

plot(Speed.Limit ~ Year, data = main)

abline(linear2, col = "red")

In the scatter plot displayed below, the y-axis represents Speed.Limit, the dependent variable, and the x-axis represents the predictor, Year. The starting assumption is that Speed.Limit does not depend on Year. When the line of best fit is displayed, it is a straight line almost parallel to the x-axis, which supports that assumption. This regression line could be used to predict the Speed Limit over the years of the selected time period.

Linear Regression analysis ( Speed.Limit ~ Year )

5 Conclusion

In this data analysis report, various trends and areas of note have been identified. The one-variable analysis with a boxplot identified the median age of people killed in road crashes in 2012, and the histogram showed how those deaths were distributed across age groups. The two-variable analyses found some loose trends suggesting that speed limit affects the country's road crash deaths.

6 Reflection

When all the possible scatter plots were drawn, none showed a clearly increasing straight-line trend. The two linear regression graphs were therefore plotted using the most suitable pairs of numerical variables.

## REFERENCE LIST

Pang-Ning Tan, M. S. (2006). Cluster Analysis: Basic Concepts and Algorithms. In M. S. Pang-Ning Tan, Introduction to Data Mining (pp. 496-515). USA: Pearson Education.

Stanton, J. (2013). Line Up, Please. In J. Stanton, Introduction to Data Science (pp. 162-171). New York.