Analysis of Text Classification Algorithms: A Review

Classification of data has become an important research area. Text classification is the process of classifying documents into predefined categories based on their content; that is, the automated assignment of natural language texts to predefined categories. Text classification is a primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, such as answering questions, producing summaries or extracting data. In this paper we study various classification algorithms. Classification is the process of dividing data into groups that can act either dependently or independently. Our main aim is to compare classification algorithms such as k-NN, Naïve Bayes, Decision Tree, Random Forest and Support Vector Machine (SVM) using RapidMiner and to find out which algorithm is most suitable for users.


INTRODUCTION
Text mining, or knowledge discovery from text, is a sub-process of data mining that is widely used to discover hidden patterns and significant information in large amounts of unstructured written material. Text mining is a rapidly growing field of computer science, alongside big data and artificial intelligence. Text mining and data mining are similar, except that data mining works on structured data while text mining works on semi-structured and unstructured data. Data mining is responsible for the extraction of implicit, unknown and potentially useful data, and text mining is responsible for explicitly stated data in the given text [1]. Today's world can be described as a digital world, as we increasingly depend on data in digital or electronic form. This is environmentally friendly because we use far less paper. However, this dependency produces a very large amount of data. Even a small human activity produces electronic data. For example, when a person buys a ticket online, that person's details are stored in a database.
Today approximately 80% of electronic data is in the form of text. This huge volume of data is not only unclassified and unstructured (or semi-structured) but also mixes useful data, useless data, scientific data, business-specific data, and so on. According to a survey, 33% of companies work with very high volumes of data, i.e. approximately 500 TB or more. In this scenario, text mining is used to extract interesting and previously hidden patterns in the data. Broadly, there are five steps involved in text data mining. For this, text mining uses techniques from different fields such as machine learning, visualization, case-based reasoning, text analysis, database technology, statistics, knowledge management, natural language processing and information retrieval [2].

TEXT PRE-PROCESSING
Pre-processing itself is made up of a sequence of steps. The first step in text pre-processing is morphological analysis, which is divided into three subcategories: tokenization, filtering and stemming [3].
A. TOKENIZATION: Text mining requires the words and the word endings of a document. Finding the words and separating them is known as tokenization.
B. FILTERING: The next step is to filter the important and relevant words from the list of words produced by tokenization. This is also called stop-word removal.
C. STEMMING: The third step is stemming, which reduces word variants to their root form. Stemming increases the recall and precision of information retrieval in text mining. The main idea is to improve recall by handling word endings automatically, reducing words to their roots at the time of indexing and searching. Stemming is usually done by removing any attached suffixes and prefixes (affixes) from index terms before the term is assigned to the index.
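The three pre-processing steps can be illustrated with a minimal Python sketch. The stop-word list, the suffix list and all function names below are assumptions made for illustration only; in particular, the stemmer is a crude suffix stripper and not an established algorithm such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Hypothetical suffix list for a crude stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (filtering)."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization, filtering and stemming in sequence."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]
```

For instance, `preprocess("The students are classifying documents in the library")` yields `["student", "classify", "document", "library"]` under these assumed lists.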

CLASSIFICATION
Classification is a supervised learning technique that assigns a document to a category according to its content. Text classification is widely used in libraries. Text classification, or document categorization, has several applications, such as call center routing, automatic metadata extraction, word sense disambiguation, e-mail forwarding and spam detection, organizing and maintaining large catalogues of Web resources, and news article categorization. Many machine learning techniques have been used to automatically evolve rules that assign a particular document to a particular category [1]. Text classification (or text categorization) is the assignment of natural language documents to predefined categories according to their content. It divides a set of input documents into two or more classes, where each document can belong to one or multiple classes. The huge growth of information flows, and especially the explosive growth of the Internet, has promoted the growth of automated text classification [4].

CLASSIFICATION METHODS

Decision Trees
Decision tree methods rebuild the manual categorization of the training documents by constructing well-defined true/false queries in the form of a tree structure, where the nodes represent questions and the leaves represent the corresponding categories of documents. Once the tree has been created, a new document can easily be categorized by placing it at the root node and letting it run through the query structure until it reaches a leaf. The main advantage of decision trees is that the output tree is easy to interpret, even for people who are not familiar with the details of the model [5].
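The way a new document runs through the query structure can be sketched in Python as follows. The tree here is built by hand rather than learned from training documents, and the query words and category names are hypothetical; the sketch only illustrates the traversal from root to leaf.

```python
# A hand-built tree of true/false queries. Each inner node asks whether a
# word occurs in the document; each leaf is a category. The words and
# categories are hypothetical, chosen only for illustration.
TREE = {
    "word": "goal",
    "yes": {"leaf": "sports"},
    "no": {
        "word": "election",
        "yes": {"leaf": "politics"},
        "no": {"leaf": "other"},
    },
}


def classify(document, node=TREE):
    """Walk from the root to a leaf, answering one query per node."""
    words = set(document.lower().split())
    while "leaf" not in node:
        node = node["yes"] if node["word"] in words else node["no"]
    return node["leaf"]
```

Because every decision is an explicit question about the document, the path taken to a leaf can be read off directly, which is the interpretability advantage noted above.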

k-Nearest Neighbor
Categorization is usually performed by comparing the category frequencies of the k nearest documents (neighbors). The closeness of documents is evaluated by measuring the angle between the two feature vectors or by calculating the Euclidean distance between them. In the latter case the feature vectors have to be normalized to length 1 to take into account that the size of the documents (and thus the norm of the feature vectors) may differ. A clear advantage of the k-nearest neighbor method is its simplicity.
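A minimal Python sketch of this procedure is given below, assuming documents are already represented as numeric feature vectors. It measures closeness by the cosine of the angle between vectors and takes a majority vote among the k nearest neighbors; the function names and the default k = 3 are illustrative assumptions.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a feature vector to length 1 (zero vectors are returned as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    """Cosine of the angle between two feature vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def knn_classify(query, training, k=3):
    """training is a list of (feature_vector, category) pairs."""
    # Rank training documents from most to least similar to the query.
    ranked = sorted(training, key=lambda pair: cosine(query, pair[0]), reverse=True)
    # Count the categories of the k nearest neighbors and return the most frequent.
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]
```

For vectors normalized to length 1, ranking by Euclidean distance selects the same neighbors as ranking by cosine, because the squared distance equals 2 minus twice the cosine.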

Bayesian Approaches
There are two groups of Bayesian approaches in document categorization: naïve [6] and non-naïve Bayesian approaches. The naïve part of the former is the assumption of word independence, meaning that word order is irrelevant and that the presence of one word does not affect the presence or absence of another. A disadvantage commonly attributed to Bayesian approaches [7] is that, in their basic form, they process only binary feature vectors.
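The independence assumption can be made concrete with a small Python sketch of a naïve Bayes classifier over binary (word present or absent) features. The function names, the representation of documents as sets of words, and the use of add-one (Laplace) smoothing are assumptions introduced here for illustration.

```python
import math


def nb_train(docs, labels, vocab):
    """Estimate class priors and per-class word presence probabilities.

    docs: list of sets of words; labels: parallel list of categories.
    Add-one (Laplace) smoothing is an assumption added here to avoid log(0).
    """
    model = {}
    for c in set(labels):
        members = [d for d, y in zip(docs, labels) if y == c]
        prior = len(members) / len(docs)
        p_word = {w: (sum(w in d for d in members) + 1) / (len(members) + 2)
                  for w in vocab}
        model[c] = (prior, p_word)
    return model


def nb_predict(model, doc):
    """Return the category with the highest log posterior for a set of words."""
    def score(c):
        prior, p_word = model[c]
        total = math.log(prior)
        for w, p in p_word.items():
            # Independence assumption: each word contributes its own factor,
            # regardless of word order or of which other words are present.
            total += math.log(p if w in doc else 1.0 - p)
        return total
    return max(model, key=score)
```

Each vocabulary word contributes one factor per category, for presence or for absence, which is exactly the binary feature vector representation discussed above.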

Neural Networks
Neural networks consist of many individual processing units, called neurons, connected by weighted links that allow neurons to activate other neurons. Different neural network approaches have been applied to document categorization problems. While some of them use the simplest form of neural network, known as the perceptron, which consists only of an input and an output layer, others build more sophisticated networks with a hidden layer between the two. An advantage of neural networks is that they handle noisy or contradictory data very well. Their high flexibility, however, comes at the cost of very high computing requirements. Another disadvantage is that neural networks are extremely difficult for an average user to understand [4].
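The simplest case, a network with only an input and an output layer, can be sketched in Python as a single-output perceptron trained with the classical error-correction rule. The function names, the learning rate and the number of epochs are illustrative assumptions; feature vectors are assumed to be numeric and targets to be 0 or 1.

```python
def train_perceptron(samples, epochs=10, lr=1.0):
    """samples: list of (feature_vector, target) pairs with target in {0, 1}."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, target in samples:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            output = 1 if activation > 0 else 0
            error = target - output
            # Adjust each link weight in proportion to its input and the error.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def perceptron_predict(weights, bias, x):
    """Fire (return 1) when the weighted sum of the inputs exceeds zero."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
```

The learned weights carry no direct meaning for a reader, which hints at why even this small model is harder to interpret than a decision tree.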

Vector-based Methods
There are two types of vector-based methods: the centroid algorithm and support vector machines. The centroid algorithm is one of the simplest categorization methods. During the learning stage, only the average feature vector of each category is calculated and set as the centroid vector for that category. A new document is easily categorized by finding the centroid vector closest to its feature vector. The method is inappropriate, however, if the number of categories is very large. Support vector machines (SVM) need, in addition to positive training documents, a certain number of negative training documents that are untypical for the category considered. An advantage of SVM [8] is its superior runtime behavior during the categorization of new documents, because only one dot product per new document has to be computed. A disadvantage is that a document could be assigned to several categories, because the similarity is typically calculated individually for each category.

MATLAB
MATLAB is a high-performance, interactive language and environment for technical computing. It integrates computation, visualization, graphics, processing and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation and graphical form. Typical uses include mathematical and matrix computation; algorithm development; data acquisition; modeling, image processing, data processing, simulation, and prototyping; data analysis, exploration, and visualization; scientific and engineering drawing and graphics; and application development, including graphical user interface building. MATLAB is an interactive programming tool whose basic data element is an array (matrix) that does not require dimensioning. This allows you to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar, non-interactive language such as C or FORTRAN. The name MATLAB stands for matrix laboratory. MATLAB was originally written to provide easy access to matrix software developed by the LINPACK and EISPACK projects. Today, MATLAB engines incorporate the LAPACK libraries, embedding the state of the art in software for matrix computation.
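Returning to the centroid algorithm described under Vector-based Methods, its learning and categorization stages can be sketched in Python as follows. The function names are hypothetical, and the use of Euclidean distance as the measure of closeness is an assumption, since the description does not fix a particular metric.

```python
import math


def train_centroids(training):
    """training: list of (feature_vector, category) pairs.

    Returns a dict mapping each category to its average feature vector.
    """
    sums, counts = {}, {}
    for vec, cat in training:
        acc = sums.setdefault(cat, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: [x / counts[cat] for x in acc] for cat, acc in sums.items()}


def classify_centroid(centroids, vec):
    """Assign vec to the category whose centroid is nearest (Euclidean, assumed)."""
    return min(centroids, key=lambda cat: math.dist(vec, centroids[cat]))
```

Categorizing a new document requires one distance computation per category, so the cost of classification grows with the number of categories.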

Figure 1: MATLAB Command Window
MATLAB has evolved over a period of years with input from many users. In university research environments, it is the standard instructional tool for introductory and advanced courses in mathematics, engineering, and medical science. In the engineering industry, MATLAB is the tool of choice for high-productivity research, development, and analysis. MATLAB features a family of add-on, application-specific solutions called toolboxes. Very important to most licensed users of MATLAB, toolboxes allow you to learn and apply specialized computing technology. Toolboxes are comprehensive collections of MATLAB functions (M-files) and MEX files that extend the MATLAB environment to solve particular classes of technical computing problems.

EXPECTED OUTCOMES
The proposed text mining algorithm is a replacement for the conventional text mining approach. The conventional approach is a mature way of using the correlations of features in the text for mining. Only when a large-scale database of texts is available in the dataset can the proposed scheme exploit the correlations of external text and significantly reduce the false rate on text data.