Deployment of ID3 decision tree algorithm for placement prediction

This paper details the ID3 classification algorithm. Very simply, ID3 builds a decision tree from a fixed set of examples. The resulting tree is used to classify future samples. Each decision node is an attribute test, with each branch (to another decision tree) being a possible value of the attribute. ID3 uses information gain to help it decide which attribute goes into a decision node. The main aim of this paper is to identify relevant attributes based on quantitative and qualitative aspects of a student's profile, such as CGPA, academic performance, and technical and communication skills, and to design a model which can predict the placement of a student. For this purpose, the ID3 classification technique based on decision trees has been used.


INTRODUCTION
Classification is the process of mapping data into predefined groups or classes. It is also called supervised learning because the classes are determined before the data is examined. Formally, given a database D = {t1, t2, ..., tn} of tuples and a set of classes C = {C1, C2, ..., Cm}, classification assigns each tuple ti in D to one class Cj in C. For example, in pattern recognition an input pattern is classified into one of several classes based on similarity. Likewise, a bank officer who has the authority to approve loans has to analyze customer behavior to decide whether granting a loan is risky or safe; that decision is a classification.

Published in: International Journal of Trend in Scientific Research and Development (IJTSRD), International Open Access Journal, ISSN No: 2456-6470, Volume 2, Issue 3, Mar-Apr 2018, www.ijtsrd.com. Authors: Kirandeep, Prof. Neena Madan, .Tech (CSE), G.N.D.U, Regional Campus, Jalandhar, Punjab, India.

Predicting tumor cells as benign or malignant
Helpful in the field of medical science for predicting whether the tumor cells are malignant or not.

Classifying credit card transactions as legitimate or not
To check whether the transactions are legal or not.

Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
For classification of proteins on the basis of their ...

Linear model: y = mx + b. This can be equated to partitioning the data into two classes. If we attempt to fit data that is not linear to the linear model, the result will be a poor model of the data.
Bayesian classification: By analyzing each independent attribute, a conditional probability is determined. Consider a data value xi. The probability that a related tuple ti is in class Cj is described by P(Cj|xi). Given P(xi), P(Cj) and P(xi|Cj), Bayes' theorem allows us to estimate the probability P(Cj|xi) and, from it, P(Cj|ti). According to the theorem,

$$P(C_j \mid x_i) = \frac{P(x_i \mid C_j)\, P(C_j)}{P(x_i)}$$

B. Distance based algorithms: Each tuple is assigned to the class to which it is most similar.

Algo:
Input:
    c1, c2, ..., cm    // centers for each class
    t                  // input tuple
Output:
    c                  // class to which t is assigned

dist = inf;
for i = 1 to m do
    if dist(ci, t) < dist then
        c = i;
        dist = dist(ci, t);

C. Decision tree based algorithms: A two-step process that includes
1) Construction of the tree, where each internal node is labeled with an attribute.
2) Labeling of each leaf node with a class.
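The distance-based assignment (algorithm B above) can be sketched in Python as follows. This is only an illustrative sketch: the function name `nearest_center`, the choice of Euclidean distance, and the sample centers are assumptions made here, not details given in the paper. Note that the sketch returns a 0-based index, whereas the pseudocode counts classes from 1.

```python
import math


def nearest_center(centers, t):
    """Return the index of the class center closest to tuple t.

    Assumption: similarity is measured by Euclidean distance.
    """
    best_class = None
    best_dist = math.inf
    for i, center in enumerate(centers):
        d = math.dist(center, t)
        if d < best_dist:
            best_class, best_dist = i, d
    return best_class


# Hypothetical class centers in a two-dimensional feature space.
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(nearest_center(centers, (4.0, 6.0)))  # prints 1
```

Keeping the running minimum in a separate variable (`best_dist`) avoids the name clash between the distance function and the stored distance that appears in the pseudocode.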

THE ID3 ALGORITHM
ID3 is a technique for building a decision tree based on information theory; it attempts to minimize the number of comparisons.
The ID3 algorithm begins with the original set as the root node. On each iteration, it goes through every unused attribute of the set and calculates the entropy (and hence the information gain) of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set is then split by the selected attribute (e.g. age is less than 50, age is between 50 and 100, age is greater than 100) to produce subsets of the data. The algorithm continues to recurse on each subset, considering only attributes never selected before.
1. Calculate the entropy of every attribute using the data set.
2. Split the set into subsets using the attribute for which entropy is minimum (equivalently, information gain is maximum).
3. Make a decision tree node containing that attribute.
4. Recurse on subsets using remaining attributes.
ID3 is based on the Concept Learning System (CLS) algorithm. The basic CLS algorithm over a set of training instances C is:
Step 1: If all instances in C are positive, then create a YES node and halt. If all instances in C are negative, create a NO node and halt. Otherwise, select a feature F with values v1, ..., vn and create a decision node.
Step 2: Partition the training instances in C into subsets C1, C2, ..., Cn according to the values of F.
Step 3: Apply the algorithm recursively to each of the sets Ci.
ID3 searches through the attributes of the training instances and extracts the attribute that best separates the given examples. If the attribute perfectly classifies the training set, then ID3 stops; otherwise it recursively operates on the n partitioned subsets (where n is the number of possible values of the attribute) to get their "best" attribute. The algorithm uses a greedy search, that is, it picks the best attribute and never looks back to reconsider earlier choices.
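The recursive procedure described in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the helper names (`entropy`, `split`, `remainder`, `id3`), the nested-dictionary tree representation, the majority-vote fallback when attributes run out, and the six-row student data set are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split(rows, attr):
    """Group rows by their value of attribute attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups


def remainder(rows, attr, target):
    """Weighted entropy of the subsets produced by splitting on attr."""
    return sum(len(g) / len(rows) * entropy([r[target] for r in g])
               for g in split(rows, attr).values())


def id3(rows, attributes, target):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # pure subset: create a leaf
        return labels[0]
    if not attributes:                 # assumption: fall back to majority class
        return Counter(labels).most_common(1)[0][0]
    # Minimum weighted entropy is equivalent to maximum information gain.
    best = min(attributes, key=lambda a: remainder(rows, a, target))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3(group, rest, target)
                   for value, group in split(rows, best).items()}}


# Hypothetical training data, for illustration only.
rows = [
    {"programming": "good", "communication": "good", "placed": "yes"},
    {"programming": "good", "communication": "poor", "placed": "yes"},
    {"programming": "good", "communication": "poor", "placed": "yes"},
    {"programming": "poor", "communication": "good", "placed": "yes"},
    {"programming": "poor", "communication": "poor", "placed": "no"},
    {"programming": "poor", "communication": "poor", "placed": "no"},
]
tree = id3(rows, ["programming", "communication"], "placed")
print(tree)
```

On this toy data the sketch splits first on `programming` and then, for the "poor" branch only, on `communication`. Because the search is greedy, the first split is never revisited.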

a) Data Description
The sample data used by ID3 has certain requirements, which are:
- Attribute-value description: the same attributes must describe each example and have a fixed number of values.
- Predefined classes: an example's attributes must already be defined, that is, they are not learned by ID3.
- Discrete classes: classes must be sharply delineated. Continuous classes broken up into vague categories, such as a metal being "hard, quite hard, flexible, soft, quite soft", are suspect.

b) Attribute Selection
How does ID3 decide which attribute is the best? A statistical property called information gain is used. Gain measures how well a given attribute separates training examples into the target classes. The attribute with the highest information gain (the one most useful for classification) is selected. In order to define gain, we first borrow an idea from information theory called entropy.

Entropy: A measure of how homogeneous a sample is; a completely homogeneous sample has entropy 0. If the examples in a sample S fall into c classes, the entropy of S relative to this c-wise classification is defined as

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_c \log_2 p_c$$

where p_i is the proportion of S belonging to class i. The logarithm is base 2 because entropy is a measure of the expected encoding length measured in bits.

For example, if the training data has 7 instances, with 3 positive and 4 negative instances, the entropy is calculated as

$$\mathrm{Entropy}([3+, 4-]) = -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} \approx 0.985$$

Thus, the more uniform the probability distribution, the greater its entropy. If the entropy of a two-class training set is close to one, its examples are spread evenly across the classes, and it is hence considered a good training set.
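As a quick check of the entropy definition, the 3-positive, 4-negative example can be evaluated numerically. The short Python sketch below is illustrative only; the function name `entropy` and its list-of-counts interface are assumptions made here. With base-2 logarithms the example evaluates to about 0.985.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as per-class counts."""
    total = sum(counts)
    # Classes with zero examples contribute nothing (0 * log 0 is taken as 0).
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


print(round(entropy([3, 4]), 3))  # prints 0.985
print(entropy([5, 5]))            # prints 1.0, the maximum for two classes
```

A perfectly balanced two-class sample reaches the maximum of 1 bit, while a sample containing a single class has entropy 0.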
Information Gain: The decision tree is built in a top-down fashion. ID3 chooses the splitting attribute with the highest gain in information, where gain is defined as the difference between how much information is needed to classify an example before the split and how much is needed after the split. This is calculated as the difference between the entropy of the original dataset and the weighted sum of the entropies of each of the subdivided datasets. The aim is to find the feature that best splits the target class into the purest possible child nodes, ideally pure nodes containing only one class. This measure of purity is called information. It represents the expected amount of information that would be needed to specify how a new instance should be classified. The formula used for this purpose is:

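$$G(D, S) = H(D) - \sum_{i} P(D_i)\, H(D_i)$$

where H denotes entropy, D is the dataset, S is the splitting attribute, D_i are the subsets of D produced by splitting on S, and P(D_i) = |D_i| / |D| is the fraction of D that falls into D_i.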
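The gain formula can be illustrated with a small Python sketch. The function names and the six-example data set below are hypothetical and serve only to show the arithmetic; they are not taken from the paper's data.

```python
import math
from collections import Counter


def entropy(labels):
    """H(D): entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """G(D, S) = H(D) - sum_i P(D_i) * H(D_i) for one attribute S.

    values[k] is the value of attribute S for example k,
    labels[k] is the class of example k.
    """
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical data: six students, two attributes, one class label.
placed        = ["yes", "yes", "yes", "yes", "no", "no"]
programming   = ["good", "good", "good", "poor", "poor", "poor"]
communication = ["good", "poor", "poor", "good", "poor", "poor"]

print(round(information_gain(programming, placed), 3))    # prints 0.459
print(round(information_gain(communication, placed), 3))  # prints 0.252
```

On this toy data `programming` has the larger gain, so a greedy learner such as ID3 would split on it first.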
Reasons to choose ID3
1. Understandable prediction rules are created from the training data.
2. It builds a short tree quickly.
3. It only needs to test enough attributes until all data is classified.

III. IMPLEMENTATION
Campus placement is a process in which companies visit colleges and identify talented and qualified students before they graduate. A combination of attributes determines whether a student is placed or not. Quantitative aspects such as undergraduate CGPA, together with qualitative aspects such as communication and programming skills, form the backbone of a student's placement, as each recruiting company wants to hire students who have sound technical knowledge and the ability to communicate effectively. Other factors, such as internships, backlogs and future studies, add value only when these prior requirements are met. The attributes and their possible values are explained below. The root node chosen here is Programming Skills. Further classification is done by calculating the entropy and information gain for each attribute.

Attributes
Entropy Gain