A SEMINAR REPORT ON
SHORT TEXT CLASSIFICATION OF BBC NEWS

A SEMINAR REPORT SUBMITTED TO
SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE

In partial fulfillment for the Degree of
Master of Technology (COMPUTER ENGINEERING)

BY
DIVYA KISHOR PUNWANTWAR
Exam No: 218M0065

VISHWAKARMA INSTITUTE OF INFORMATION TECHNOLOGY, PUNE
(An Autonomous Institute affiliated to Savitribai Phule Pune University)
COMPUTER ENGINEERING DEPARTMENT
April-May 2019

DEPARTMENT OF COMPUTER ENGINEERING
Vishwakarma Institute of Information Technology
(An Autonomous Institute affiliated to Savitribai Phule Pune University)

CERTIFICATE

This is to certify that the Seminar-I report entitled

SHORT TEXT CLASSIFICATION OF BBC NEWS

submitted by

Divya Kishor Punwantwar
Exam No: 218M0065

is a bonafide work carried out by them under the supervision of Prof.
Kirti H. Wanjale, and it is submitted towards the partial fulfillment of the requirement of Savitribai Phule Pune University, Pune for the award of the degree of Master of Technology (Computer Engineering).

Prof. Kirti H. Wanjale
Internal Guide
Department of Computer Engineering
VIIT, Pune

Dr. S. R. Sakhare
Head of Department
Department of Computer Engineering
VIIT, Pune

Seal/Stamp of the College

Dr.
B. S. Karkare
Director, VIIT, Pune

Place: Pune
Date:

Abstract

A short text differs substantially from a traditional long text document. Its shortness and conciseness hinder the application of conventional machine learning and data mining algorithms to short text classification. Following traditional artificial intelligence methods, short text classification can be divided into three steps: pre-processing, feature selection, and classifier comparison. In feature selection, we examined the performance and robustness of TF-IDF weighting, and we deliberately chose Naive Bayes as the classification technique. We then compared and analysed the classifiers horizontally against each other and vertically against the feature selections.

With the rapid growth in the number of short texts, how to classify short text automatically and effectively in the information domain is a problem that needs to be solved. Based on the characteristics of short text, we propose Naive Bayes, a classification algorithm based on improving currently integrated classifiers. The traditional Naive Bayes classifier is used as the base classifier to train the classification models. Compared with several individual classifiers, the Naive Bayes method gives excellent results on a variety of classification evaluation measures. On this basis, the BBC news dataset is classified using the Naive Bayes algorithm. Many people read BBC news, but each reader has different interests, such as technology, sports, business, politics, and entertainment.

Acknowledgement

It is a matter of great pleasure for me to submit this seminar report on "SHORT TEXT CLASSIFICATION OF BBC NEWS" as a part of the curriculum for the Master of Technology (Computer Engineering) of Savitribai Phule Pune University. I am thankful to my guide Prof.
Kirti H. Wanjale of the Computer Engineering Department, for constant encouragement and able guidance. I am also thankful to Dr. B. S. Karkare, Principal of VIIT Pune, and Dr. S. R. Sakhare, Head of the Computer Department, for their valuable support. I take this opportunity to express my deep sense of gratitude towards those who have helped me in various ways in preparing my seminar. Last but not least, I am thankful to my parents, who encouraged and inspired me with their blessings.

Divya Kishor Punwantwar

Contents

Certificate
Abstract
Acknowledgement
1 Introduction
    1.1 Background
    1.2 Motivation and Social Impact
    1.3 Objectives and Outcomes
    1.4 Mathematical model of problem solved
2 Literature Survey
    2.1 Existing Techniques
        2.1.1 Technique 1: An Improved Information Retrieval Approach to Short Text Classification
        2.1.2 Technique 2: Short Text Classification Improved by Learning Multi-Granularity Topics
3 Implementation
    3.1 Flow of Work
    3.2 Data collection and Data sets
    3.3 Software Requirements
    3.4 Results obtained
4 Results and Discussion
    4.1 Discussion on Result Obtained
    4.2 Comparison of Results (with other researchers)
5 Conclusion and Future Work
A Annexure
    A.1 Source Code and Screenshots
    A.2 Plagiarism report

List of Figures

Figure 1.1.1: Short Text Classification
Figure 1.4.1: Confusion Matrix
Figure 1.4.2: Accuracy
Figure 1.4.3: Precision and Recall
Figure 1.4.4: F1-Score

Chapter 1 Introduction

1.1 Background

Online social media and news sites have recently emerged as media for information sharing and communication. Blogging, status updates, social networking, watching the news, and video sharing are some of the ways in which people try to achieve this. Popular online social media such as Facebook, Orkut, and Twitter, and news sites such as BBC News, CNN, and Fox News, allow users to post or view short messages on their home pages.
These are often referred to as micro-blogging sites, and the message is called a status update. News updates from the BBC are more commonly grouped as news under different categories of data. News is often related to some event or based on a topic of interest such as business, technology, entertainment, personal thoughts, and opinions. News can contain text, emoticons, links, or a combination of these. News has recently gained a lot of importance due to its ability to disseminate information rapidly.

Figure 1.1.1: Short Text Classification

1.2 Motivation and Social Impact

The BBC news dataset is classified to make it easier for users to understand. Classifying data into categories manually is easy only when the dataset is very small; it is not easy when the dataset contains a large number of items. Because classifying a large dataset by hand is clumsy and tricky, algorithms are used for short text classification. This work proposes to classify BBC news, which has multiple classes and labels.

1.3 Objectives and Outcomes

The short text classification task consists of learning models for a given set of classes and applying these models to new, unseen documents for class assignment. It is mainly a supervised classification task, where a training set consisting of documents with already assigned classes is provided, and a testing set is used to evaluate the models. Short text classification is shown in Figure 1.1.1, including the pre-processing steps, which consist of document representation and space reduction/feature extraction, and the learning/evaluation procedure, such as Naive Bayes. Great relevance has deservedly been given to learning procedures in short text classification. However, there must be a pre-processing stage before the learning process.
Pre-processing alters the input space used to represent the documents that are ultimately included in the training and testing sets. Machine learning algorithms use these sets to learn classifiers, which are then evaluated.

1.4 Mathematical model of problem solved

Confusion Matrix:

Figure 1.4.1: Confusion Matrix

The four cells of the confusion matrix are described here with the usual binary example of predicting whether a patient has a disease.

• True Positive (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
• True Negative (TN): We predicted no, and they don't have the disease.
• False Positive (FP): We predicted yes, but they don't actually have the disease. This is also known as a "Type I error."
• False Negative (FN): We predicted no, but they actually do have the disease. This is also known as a "Type II error."

Accuracy:

Figure 1.4.2: Accuracy

Accuracy measures how close the predicted values are to the actual values. For a classifier, it is the fraction of all predictions that are correct: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision and Recall:

Figure 1.4.3: Precision and Recall

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations: Precision = TP / (TP + FP). Recall is the ratio of correctly predicted positive observations to all observations in the actual class (yes): Recall = TP / (TP + FN).

F1-Score:

Figure 1.4.4: F1-Score

The F1 Score is the harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). This score therefore takes both false positives and false negatives into account.

Chapter 2 Literature Survey

2.1 Existing Techniques

2.1.1 Technique 1: An Improved Information Retrieval Approach to Short Text Classification

Twitter acts as an important medium of information sharing and communication. Because tweets are limited to 140 characters, they do not provide sufficient word occurrences, and classification methods that use traditional approaches such as Bag-of-Words have limitations. The proposed system used an intuitive approach to determine the class labels with a set of features.
The system is able to classify incoming tweets into three generic categories: News, Movies, and Sports. These categories are diverse and cover most of the topics that people usually tweet about. Experimental results show that the proposed technique outperforms the existing models in terms of accuracy, precision, recall, and support.

2.1.2 Technique 2: Short Text Classification Improved by Learning Multi-Granularity Topics

Understanding the rapidly growing volume of short text is essential. A short text differs from traditional documents in its sparsity and shortness, which hinders the application of conventional text mining and machine learning algorithms. Two major approaches have been exploited to enrich the representation of short text. One approach is to fetch contextual information for a short text in order to directly add more text. The other is to derive latent topics from an existing large corpus, which are used as features to enrich the representation of short text. The latter approach is elegant and efficient in most cases. However, topics of a single granularity are usually not sufficient to set up effective feature spaces. This work moves forward along this direction by proposing a method to leverage topics at multiple granularities, which can model short text more precisely.

Chapter 3 Implementation

3.1 Flow of Work

STEP 1: The features extracted for the classes are stored in files.
STEP 2: The BBC news item to be classified and the feature sets are fed into the system.
STEP 3: The BBC news is then disambiguated. Disambiguation involves tokenizing the news, making the tokens case-less, removing stop words, lemmatizing the tokens using WordNet, stemming the tokens, and finally Part-of-Speech (POS) tagging the stemmed tokens.
STEP 4: A loop executes on each word in the BBC news. A POS-tagged word is selected and all senses of that word are found.
STEP 5: If a sense that was found is not a noun or a verb, it is ignored and the loop skips to the next sense.
STEP 6: Loop over all other words in the same news item and find their senses.
STEP 7: The definitions of all the senses are then derived from WordNet.
STEP 8: The senses of the current word are then compared with the senses of the remaining words. An overall score is computed and the maximum score is kept for further processing.
STEP 9: The senses which give these maximum scores are then returned.
STEP 10: Steps 4 to 9 are also executed on the feature sets.
STEP 11: The senses of the feature sets and of the words of the news are then compared.
STEP 12: The feature set which gives the maximum similarity with the BBC news item is considered the correct feature set. The class of that feature set is then extracted and the news is assigned to that class.

3.2 Data collection and Data sets

There is one class named BBC, which contains the folders Entertainment, Technology, Business, Politics, and Sports. Each folder contains the news files of its category, and each file holds the text of a BBC news item.

3.3 Software Requirements

1) Language Used: Python

Python is a high-level, interpreted, general-purpose programming language. It was created by Guido van Rossum and released in 1991. It is used for web development (server-side), software development, mathematics, system scripting, etc.

What can Python do?
Python can be used on a server to create web applications. It can be used alongside software to create workflows. Python can connect to database systems, and it can also read and modify files. Python can be used to handle big data and perform complex mathematics. Python can be used for rapid prototyping or for production-ready software development.

Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc.). It has a simple syntax which is similar to the English language.
Its syntax also allows developers to write programs with fewer lines than some other programming languages. Python runs on an interpreter system, which means that code can be executed as soon as it is written, so prototyping can be very quick. Python can be treated in a procedural way, a functional way, or an object-oriented way.

2) Platform: Jupyter Notebook

Jupyter Notebook is an open-source web application that allows users to create and share documents containing equations, visualizations, live code, and narrative text. Its uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, etc.

The notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results. The Jupyter notebook has two components:

A web application: a browser-based tool for interactive authoring of documents which combine explanatory text, mathematics, computations, and their rich media output.

Notebook documents: a representation of all content visible in the web application, including inputs and outputs of the computations, explanatory text, images, mathematics, and rich media representations of objects.

3.4 Results obtained

• Labels and their counts:
• Testing data:
• Tfidf Vectorizer:
• Result as accuracy using the Naive Bayes algorithm, with precision, recall, f1-score, and support

Chapter 4 Results and Discussion

4.1 Discussion on Result Obtained

• News is harder to classify than a larger corpus of text. Here we classify news efficiently based on some attributes.
• Because of this classification, it is easy to find news related to a given topic.
• The difficulty of classifying news is primarily because there are few word occurrences, which makes it hard to capture the semantics of such messages.
• Hence, traditional approaches do not perform as well as expected when applied to classifying news. The method used here to classify news is a supervised method, as it requires labelled news as its source of training data.
• Using the Naive Bayes algorithm, an accuracy of about 96% is obtained.

4.2 Comparison of Results (with other researchers)

Existing short text classification work is on Twitter tweets. It uses a class with categories such as movies, news, and sports, applies the Naive Bayes algorithm for short text classification, and obtains an accuracy of 60%. In comparison, our short text classification of BBC news obtains close to 96% accuracy using the same Naive Bayes algorithm. Our dataset has the categories business, sport, politics, entertainment, and technology, each of which contains text files.

Chapter 5 Conclusion and Future Work

For short text classification, we use a class, BBC news, which contains several attributes (categories); each attribute contains text files related to it. The BBC news is classified into these attributes using the Naive Bayes algorithm, and the accuracy, support, precision, recall, and f1-score are calculated. Compared with existing short text classification, we obtain the highest accuracy, close to 96%. The future scope is to classify this BBC news using other classification algorithms, calculate their precision, recall, support, and f1-score, and try to obtain the highest accuracy.
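The approach summarised in this report (TF-IDF weighting, followed by a Naive Bayes classifier, evaluated with accuracy, precision, recall, and f1-score) can be sketched in plain Python as shown below. This is only an illustrative sketch and not the code used to obtain the reported results. The toy sentences, the whitespace tokenizer, the smoothed IDF formula, the smoothing constant alpha, and all function names (fit_idf, tfidf, train_naive_bayes, predict, per_class_metrics) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency of every term in the training set."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """TF-IDF weights of one document; terms unseen in training are ignored."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    return {t: n * idf[t] for t, n in counts.items()}


def train_naive_bayes(docs, labels, idf, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    class_counts = Counter(labels)
    weights = defaultdict(Counter)  # weights[label][term] = summed TF-IDF weight
    for doc, label in zip(docs, labels):
        weights[label].update(tfidf(doc, idf))
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(weights[c].values()) + alpha * len(idf)
        log_likelihood[c] = {t: math.log((weights[c][t] + alpha) / total) for t in idf}
    return log_prior, log_likelihood


def predict(doc, idf, model):
    """Return the class with the highest posterior log-probability."""
    log_prior, log_likelihood = model
    features = tfidf(doc, idf)
    scores = {
        c: log_prior[c] + sum(w * log_likelihood[c][t] for t, w in features.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)


def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data standing in for the BBC category folders.
train_docs = [
    "stock market shares profit bank",
    "bank profit growth economy",
    "football match goal team win",
    "team win cup final match",
    "software computer internet mobile phone",
    "mobile phone software update",
]
train_labels = ["business", "business", "sport", "sport", "tech", "tech"]
test_docs = ["bank shares growth", "cup final goal", "internet software update"]
test_labels = ["business", "sport", "tech"]

idf = fit_idf(train_docs)
model = train_naive_bayes(train_docs, train_labels, idf)
predictions = [predict(doc, idf, model) for doc in test_docs]
accuracy = sum(t == p for t, p in zip(test_labels, predictions)) / len(test_labels)

print("accuracy:", accuracy)
for label in sorted(set(train_labels)):
    print(label, per_class_metrics(test_labels, predictions, label))
```

On real data the same steps would normally be carried out with a library implementation rather than hand-written functions; the sketch is kept dependency-free so that each step (weighting, training, prediction, evaluation) is visible.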
Appendix A Code

A.1 Source Code and Screenshots

A.2 Plagiarism report