Video Surveillance for Violence Detection Using Deep Learning

Manan Sharma1, Rishabh Baghel1
1Indian Institute of Information Technology Guwahati, India
{manansharma858, baghelrishabha}@gmail.com

Abstract. To detect violence through surveillance cameras, we propose a neural architecture that can recognize violent behaviour and thereby serve as a measure to prevent escalation. The architecture uses convolutional neural networks to supply the Long Short-Term Memory cells with richer features. We feed short-term differences of video frames to add robustness against occlusions and other discrepancies. The convolutional layers extract concentrated spatio-temporal features from the frames, which suits the sequential nature of the videos fed into the LSTMs.
The model couples a pre-trained convolutional neural network with a convolutional LSTM layer. It takes raw videos as input, converts them into frames and outputs a binary violence or non-violence label. We pre-process the video frames with cropping, dark-edge removal and other data augmentation techniques to strip unnecessary detail from the data. To evaluate the performance of the proposed method, three standard public datasets are used, with classification accuracy as the evaluation metric.
Keywords. Violence detection, Residual networks (ResNets), Convolutional Long Short-Term Memory cells (ConvLSTM), Deep learning.

1 Introduction

With road rage and violence in public places on the rise, it has become necessary to monitor such activities and raise alerts before they escalate into something more severe. Detecting actions in a video is harder than it may appear: one has to infer what is happening from frames of data, which is not merely an image recognition problem but one of action inference, and hence requires reasoning. Deep learning is increasingly used to solve such computer vision problems [13], and cameras are now capable of matching, and even surpassing, human reasoning for these tasks. Deep learning algorithms [12] [2] remove the need for handcrafted features and provide an end-to-end model for the task. Videos are, of course, sequences of images. While most state-of-the-art image classification systems use convolutional layers in one form or another, sequential data is frequently processed by Long Short-Term Memory (LSTM) networks, as shown in [8]. A combination of these two building blocks is therefore expected to perform well on a video classification task.

One such combination carries the self-descriptive name ConvLSTM. A standard LSTM uses matrix multiplication to weigh the input and the previous state inside the different gates; in a ConvLSTM, these operations are replaced by convolutions [10]. We build a deep neural network with a ResNet50 block, a ConvLSTM block, and a fully connected block. We show that the change in an action, obtained by feeding the difference of consecutive frames, is more informative than the state of the action itself. We also show how videos, as sequential data, can be fed into recurrent networks (here a ConvLSTM), as described in [14], and how long-range dependencies between actions help in detecting which type of action is being performed (here, violence). To demonstrate the effectiveness of the model we use three datasets: KTH [6], Violent Flows [7] and the Hockey Fight dataset [6].

ConvLSTM is a variant of the LSTM (Long Short-Term Memory) cell that contains a convolution operation inside the cell. The model is a special kind of RNN, capable of learning long-term dependencies. A ConvLSTM replaces the matrix multiplication at each gate of the LSTM cell with a convolution; by doing so it captures the underlying spatial structure of multi-dimensional data, as shown in [1]. This combines the pattern recognition of ConvNets with the memory properties of pure LSTM networks, so the ConvLSTM architecture is expected to find patterns in sequences of images.

2 Network Architecture

The network contains a CNN block and a ConvLSTM block. For the CNN block we use a pre-trained ResNet50 architecture. The video frames are fed as differences of adjacent frames of the original video, so 20 sampled frames become 10 difference frames. These 10 frames are fed sequentially to the ResNet50, which is pre-trained on the ImageNet database [5]; the resulting 3-D feature maps are then fed into the ConvLSTM. The ConvLSTM uses 256 filters of size 3×3 with stride 1, and each hidden state consists of 256 feature maps. Before the input frames are fed in, they are randomly cropped, flipped horizontally and vertically, and normalized to zero mean and unit variance. The networks are run for 6000 iterations.
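For illustration, a minimal NumPy/OpenCV sketch of this input pipeline could look as follows. The helper name `load_difference_clip` is hypothetical, and the sketch omits the random crop and flip augmentation described above; it is not the exact implementation used in our experiments.

```python
import cv2
import numpy as np

def load_difference_clip(video_path, num_frames=20, size=256):
    """Sample `num_frames` frames from a video and return the differences of
    adjacent frame pairs, resized and normalized to zero mean / unit variance."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole video.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (size, size)).astype(np.float32)
        frames.append(frame)
    cap.release()

    frames = np.stack(frames)                               # (20, 256, 256, 3)
    diffs = frames[1::2] - frames[0::2]                     # 10 adjacent-pair differences
    diffs = (diffs - diffs.mean()) / (diffs.std() + 1e-8)   # zero mean, unit variance
    return diffs                                            # (10, 256, 256, 3)
```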
To make the final prediction, the output of the ConvLSTM is batch-normalized and fed into fully connected layers of sizes 1000, 256, 10 and 1, the last being a single neuron because the prediction is binary. The non-linearity between fully connected layers is ReLU, and the final neuron uses a sigmoid activation with binary cross-entropy as the loss function, optimized with RMSprop [3]. Differences of frames are fed in so as to capture the changes in an action rather than the action itself; the technique is an adaptation of the optical-flow images used for action recognition by Simonyan and Zisserman [15]. These changes are passed to the ResNet50, which extracts features and feeds them into the ConvLSTM, which in turn learns how the current change depends on previous actions.

Fig. 1 Architecture of the model

3 Experiments

To determine the effectiveness of the proposed method in classifying violent videos, three standard public datasets are used and classification accuracy is measured.

3.1 Experimental Settings

The whole network is implemented in Keras with TensorFlow as the back-end and is trained with a gradient-based optimization algorithm, namely RMSprop. Since nearby frames contain overlapping information, processing them can involve redundant computation; to limit this, the frames extracted from each video are resized to 256×256 during training. We evaluate several CNN architectures and compare the results obtained with each. Furthermore, we use dynamic learning rate adjustment, a reduced sequence length, and a single output neuron with a sigmoid activation. The network is trained on an NVIDIA GTX 1080Ti GPU, which limits us to 2 samples per batch (batch size 2) with 20 frames per sequence. Since little is known about the data in advance, it is difficult to hand-pick initial weights that work well; we therefore draw the initial weights from a Gaussian-based scheme, the Xavier initialization, as described in [4]. Table 1 summarizes the choices made in our implementation.

Table 1 Hyper-parameters of the implementation

Parameter                 Methods used
CNN architecture          ResNet50, InceptionV3, VGG19
Learning rate reduction   Dynamic
Loss                      Binary cross-entropy
Batch size                2
Sequence length           10 or 20
Evaluation                Simple split

3.2 Tuning of Hyper-parameters

Due to limited resources we use a simple split: 20% of the data is held out for testing (20% of which is also used for validation) and 80% is used for training. We evaluate the different hyper-parameters of the network on the different datasets, using only 20 epochs with early stopping at a patience of 5, as in the final optimal network training. Each hyper-parameter is evaluated separately and the best value is kept for subsequent evaluations. Table 2 lists the hyper-parameter settings evaluated in each case.

Table 2 Tuning of hyper-parameters

Parameter                  Case 1        Case 2        Case 3
Type of CNN architecture   ResNet50      InceptionV3   VGG19
Learning rate              1e-4          1e-3          1e-2
Augmentation used          True          False         True
Number of frames           20            30            20
Dropout                    0             0.5           0
Type of training           CNN retrain   CNN static    CNN retrain
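Pulling together the architecture of Section 2 and the settings in Tables 1 and 2, a minimal Keras sketch of the model could look like the following. The global average pooling step used to flatten the ConvLSTM output and the "same" padding are our assumptions, and the function name `build_model` is illustrative; this is not the exact code used.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import (TimeDistributed, ConvLSTM2D, BatchNormalization,
                                     GlobalAveragePooling2D, Dense)

SEQ_LEN, SIZE = 10, 256  # 10 difference frames of 256x256 RGB

def build_model():
    frames = Input(shape=(SEQ_LEN, SIZE, SIZE, 3))

    # Pre-trained ResNet50 feature extractor applied to every frame in the sequence.
    backbone = ResNet50(weights="imagenet", include_top=False,
                        input_shape=(SIZE, SIZE, 3))
    features = TimeDistributed(backbone)(frames)          # (batch, 10, 8, 8, 2048)

    # ConvLSTM with 256 filters of size 3x3 and stride 1, as in Section 2.
    x = ConvLSTM2D(filters=256, kernel_size=(3, 3), strides=1,
                   padding="same")(features)
    x = BatchNormalization()(x)
    x = GlobalAveragePooling2D()(x)

    # Fully connected head 1000 -> 256 -> 10 -> 1 with ReLU and a sigmoid output.
    # Keras Dense layers use Glorot (Xavier) initialization by default.
    x = Dense(1000, activation="relu")(x)
    x = Dense(256, activation="relu")(x)
    x = Dense(10, activation="relu")(x)
    out = Dense(1, activation="sigmoid")(x)

    model = Model(frames, out)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```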
3.3 Datasets

Based on our analysis, the most challenging datasets for violence detection in the literature are listed in Table 3. They represent different types of violence seen in city, street and indoor environments.

Table 3 Datasets for violence detection

KTH: 6 action classes (walking, jogging, running, boxing, hand waving and hand clapping); 600 videos at 160×120 resolution; black-and-white videos recorded with a static camera against indoor and outdoor backgrounds, performed by 24 persons over 4 scenes and 6 actions.

Hockey Fight Dataset: actions occurring in an ice hockey rink; 1000 videos (500 violent and 500 non-violent) at 720×576 resolution; non-crowded violence videos.

Violent Flows: violent actions in crowded places; 200 videos (100 violent and 100 non-violent) at 320×240 resolution; a database of real-world footage of crowd violence.

3.4 Data Pre-processing

As preparation for the model input, a few steps were taken. Because of limited computational power, each video was first sampled into a fixed number of frames before being given as input to the model. A combination of augmentation methods was used for all datasets, and for some datasets dark edges were removed from the frames, as shown in Fig. 3. As described in Section 2, the input to the model is the subtraction of adjacent frames; this is done to capture the spatial movement in the input videos rather than the raw pixels of each frame. Fig. 2 shows an example of the difference of adjacent frames in which one hockey player pushes another.

Fig. 2 Difference between frames
Fig. 3 Dark-edge removal

Data augmentation is applied with the following transformations to enrich our dataset.

Image cropping: the edges of the images are removed before feeding them into the network to make the pattern in the images more concrete, as shown in Fig. 4.

Image transpose: as a complement to the cropping step, a transpose is applied during the fit-generator process, as shown in Fig. 5.

Fig. 4 Random cropping of the images
Fig. 5 Transpose of an image

4 Results

As described in Section 3.2, the hyper-parameter tuning process allows us to find the settings that perform best in the network. The test accuracy for each hyper-parameter value is shown in Fig. 6. Among the three CNN architectures, ResNet50 gives the best result with an accuracy of 89.9%; InceptionV3 performs almost as well with 88.6%, while VGG19 does not perform well, reaching only 79.3%. Augmentation increases accuracy by 4.53%, and shortening the sequence length improves accuracy by a further 2%. As expected, the static CNN configuration, in which the CNN weights are not trained, gives a very poor result of 58.9% accuracy.

Fig. 6 Hyper-parameter tuning test accuracy scores

The results presented in Figs. 7, 8 and 9 are line charts in which training accuracy is shown in blue, test accuracy in grey, validation accuracy in yellow and training loss in orange, plotted against the number of epochs. As mentioned earlier, all experiments run for up to 50 epochs, with early stopping applied in every case.
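The dynamic learning rate reduction and early stopping referred to above can be expressed with standard Keras callbacks. The sketch below shows one plausible configuration: the monitored quantity and the reduction patience are our assumptions, while the halving factor matches the learning rate drops reported for Figs. 7 and 8 and the patience of 5 matches the early stopping used in training.

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# Halve the learning rate when the validation loss stops improving;
# training starts at 1e-4 (e.g. 1e-4 -> 5e-5 -> 2.5e-5 as in Figs. 7 and 8).
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                              patience=3, min_lr=1e-6, verbose=1)

# Stop training after 5 epochs without improvement ("early stopping of 5").
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True, verbose=1)

# model.fit(train_sequences, train_labels,
#           validation_data=(val_sequences, val_labels),
#           batch_size=2, epochs=50,
#           callbacks=[reduce_lr, early_stop])
```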
Fig. 7 depicts the results for the Hockey dataset. The learning rate was reduced four times, starting at 1e-4 and ending at 5e-5 in the final epoch; training stopped at epoch 34 due to early stopping, reaching 87.7% test accuracy in the last epoch, with 89.3% noted as the best accuracy over all epochs.

Fig. 7 Hockey dataset results

Similarly, the results for the Violent Flows dataset are shown in Fig. 8. Here the learning rate was reduced twice, starting at 1e-4 in the first epoch and reaching 2.5e-5 in the last. The test accuracy at the last epoch is 86.5%, and the best accuracy over all epochs is 92.4%.

Fig. 8 Violent Flows dataset results

The results for the KTH dataset are presented in Fig. 9. The learning rate is reduced once, at epoch 33, and 100% accuracy is achieved on the training, validation and test sets.

Fig. 9 KTH dataset results

5 Discussion

To evaluate our architecture we ran through several CNN backbones but concentrated our results on three major ones, namely ResNet50, InceptionV3 and VGG19, and additionally examined many hyper-parameter combinations. Among the three architectures, VGG19 gave the worst results, which can be explained by the fact that ResNet50 and InceptionV3 already achieve better classification accuracy on the ImageNet dataset. An interesting case is the better accuracy of ResNet50 compared with InceptionV3 in the violence detection architecture, even though InceptionV3 is better at ImageNet classification [9] by about 1.5% accuracy. A possible reason is the depth of the two architectures: ResNet50, with 168 layers, is deeper than InceptionV3 with 159 layers, and depth may help in identifying violent activities. Among the hyper-parameters, a dynamic learning rate gave better results. We obtained better results when initializing training with a learning rate of 0.0001 rather than 0.001; the higher learning rate gave poor generalization, presumably because larger learning rates cause extreme changes during the learning process. We started with a learning rate of 0.0001 and slowly increased it, which gave better accuracy, the motivation being faster learning of the optimal weights. Also, due to the domain-specific nature of the datasets, dropout did not help much.

Due to the lack of labeled datasets, data augmentation techniques are used to increase the number of samples and make the model generalize better. We used three datasets: the KTH, Violent Flows and Hockey datasets. On the KTH dataset the learning curve did not flatten until the last step; it was the easiest to converge on compared with the other datasets and achieved 100% accuracy. On the Violent Flows dataset we could achieve only 80% accuracy; because a large part of the crowd [11] in these videos does not participate in the violent activity, we divided the videos into small pieces and used a bagging method to make the final classification. On the Hockey dataset we achieved 87.5% accuracy, and its learning curve encountered four reduction points.
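The bagging step mentioned for the Violent Flows dataset can be realized by aggregating clip-level predictions into a single video-level decision. The sketch below is one plausible realization under our own assumptions (the aggregation rule, the threshold and the helper name `predict_video` are illustrative and not taken from the reported implementation).

```python
import numpy as np

def predict_video(model, clips, threshold=0.5):
    """Bagging-style aggregation: split a long video into short clips,
    score each clip with the trained model, and vote on the final label.
    `clips` has shape (num_clips, 10, 256, 256, 3)."""
    clip_scores = model.predict(clips).ravel()           # one sigmoid score per clip
    votes = (clip_scores >= threshold).astype(int)       # per-clip violence decisions
    return int(votes.mean() >= 0.5), clip_scores         # majority vote + raw scores
```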
References

1. F. D. De Souza, G. C. Chavez, E. A. do Valle Jr, and A. d. A. Araujo (2010) Violence detection in video using spatio-temporal features. In Conference on Graphics, Patterns and Images (SIBGRAPI).
2. P. Bilinski and F. Bremond (2016) Human violence recognition and detection in surveillance videos. In AVSS.
3. A. Datta, M. Shah, and N. D. V. Lobo (2002) Person-on-person violence detection in video data. In ICPR.
4. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In CVPR.
5. T. Giannakopoulos, A. Pikrakis, and S. Theodoridis (2007) A multi-class audio classification method with respect to violent content in movies using Bayesian networks. In IEEE Workshop on Multimedia Signal Processing (MMSP).
6. I. S. Gracia, O. D. Suarez, G. B. Garcia, and T.-K. Kim (2015) Fast fight detection. PLoS ONE, 10(4):e0120448.
7. T. Hassner, Y. Itcher, and O. Kliper-Gross (2012) Violent flows: Real-time detection of violent crowd behavior. In CVPR Workshops.
8. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation, 9(8):1735-1780.
9. A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS.
10. J. R. Medel and A. Savakis (2016) Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv preprint arXiv:1612.00390.
11. S. Mohammadi, H. Kiani, A. Perina, and V. Murino (2015) Violence detection in crowded scenes using substantial derivative. In AVSS.
12. E. B. Nievas, O. D. Suarez, G. B. Garcia, and R. Sukthankar (2011) Violence detection in video using computer vision techniques. In International Conference on Computer Analysis of Images and Patterns. Springer.
13. P. Rota, N. Conci, N. Sebe, and J. M. Rehg (2015) Real-life violent social interaction detection. In ICIP.
14. I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS.
15. K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NIPS.