Soft Computing Techniques Based Image Classification using Support Vector Machine Performance

In this paper we compare different kernels that have been developed for support vector machine based time series classification. Despite the strong performance of the Support Vector Machine (SVM) on many practical classification problems, the algorithm is not directly applicable to multi-dimensional trajectories of different lengths. Training SVMs with indefinite kernels has recently attracted attention in the machine learning community. This is partly because many similarity functions that arise in practice are not symmetric positive semidefinite. In this paper, we extend the Gaussian RBF kernel to the Gaussian elastic metric kernel. The extension takes two forms, one based on the time warp edit distance and one based on the edit distance with real penalty. Experimental results on 17 time series data sets show that, in terms of classification accuracy, SVM with the Gaussian elastic metric kernel is much superior to other kernels and to state-of-the-art similarity measure methods. We use the indefinite similarity function or distance directly without any transformation, and hence training and test examples are always treated consistently. Finally, the Gaussian elastic metric kernel achieves the highest accuracy among all methods that train SVM with kernels, i.e., positive semidefinite (PSD) and non-PSD, with statistically significant evidence, while also retaining sparsity of the support vector set.


INTRODUCTION
Kernel algorithms are motivated by two observations. First, linearity is rather special, and outside mathematics no model of a real system is truly linear. Second, detecting linear relations has been the focus of much research in statistics, soft computing and machine vision for decades, and the resulting algorithms are well understood, well developed and efficient. Naturally, one wants the best of both worlds. So, if a problem is non-linear, instead of trying to fit a non-linear model, one can map the problem from the input space to a new (higher-dimensional) space, called the feature space, by applying a non-linear transformation with suitably chosen basis functions, and then use a linear model in the feature space. This is known as the "kernel trick". The linear model in the feature space corresponds to a non-linear model in the input space. This approach can be used in both classification and regression problems. The choice of kernel function is crucial for the success of all kernel algorithms because the kernel encodes the prior knowledge that is available about a task. Accordingly, there is no free lunch in kernel choice.
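As a minimal sketch of the kernel trick described above, the following Python snippet compares an inner product computed explicitly in a feature space with the same quantity computed by a kernel in the input space. The degree-2 polynomial kernel and the helper names `phi`, `dot` and `poly2_kernel` are illustrative assumptions and are not taken from the paper.

```python
import math

def phi(x):
    """Explicit feature map of the homogeneous degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Ordinary inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Kernel evaluated in the input space: K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # inner product in the 3-D feature space
implicit = poly2_kernel(x, y)    # same value, without ever building phi
print(explicit, implicit)
```

Both values agree up to floating-point rounding, which is the sense in which a linear model in the feature space can be trained using only kernel evaluations.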
According to Martin Sewell (2007), the term kernel derives from a word that can be traced back to c. 1000 and originally meant a seed (contained within a fruit) or the softer (usually edible) part contained within the hard shell of a nut or stone-fruit. The former meaning is now obsolete. The term was first used in mathematics when it was defined for integral equations in which the kernel is known and the other function(s) unknown, but it now has several meanings in mathematics. The machine learning term kernel trick was first used in 1998.

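Mercer's Theorem:
A symmetric function $K(x, y)$ is a kernel iff for any finite sample $S = \{x_1, \dots, x_m\}$ the kernel matrix for $S$ is positive semi-definite.
One direction of the theorem is easy: if $K$ is a kernel with feature map $\phi$, and $\mathbf{K}$ is the kernel matrix with $\mathbf{K}_{ij} = K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, then for any vector $v$
$$v^T \mathbf{K} v = \sum_{i,j} v_i v_j \, \phi(x_i) \cdot \phi(x_j) = \Big\| \sum_i v_i \phi(x_i) \Big\|^2 \ge 0.$$
For the other direction we will prove a weaker result.

Theorem:
Consider a finite input space $X = \{x_1, \dots, x_m\}$ and the kernel matrix $\mathbf{K}$ over the entire space. If $\mathbf{K}$ is positive semidefinite then $K(\cdot, \cdot)$ is a kernel function.
Proof: Since $\mathbf{K}$ is symmetric and positive semidefinite, we can write $\mathbf{K} = V \Lambda V^T$, where the columns of $V$ are orthonormal eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_m)$ with $\lambda_l \ge 0$.
Define a feature mapping into an m-dimensional space where the lth component of the feature expansion for example $x_i$ is $\phi_l(x_i) = \sqrt{\lambda_l} \, V_{il}$.

The inner product is
$$\phi(x_i) \cdot \phi(x_j) = \sum_{l=1}^{m} \lambda_l V_{il} V_{jl}.$$
We want to show that $\mathbf{K}_{ij} = \phi(x_i) \cdot \phi(x_j)$. Consider entry $(i, j)$ of the matrix $V \Lambda V^T$. We have the following identities, where the last one proves the result:
$$\mathbf{K}_{ij} = [V \Lambda V^T]_{ij} = \sum_{l=1}^{m} V_{il} \lambda_l V_{jl} = \phi(x_i) \cdot \phi(x_j).$$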
Note that Mercer's theorem allows us to work with a kernel function without knowing which feature map it corresponds to or its relevance to the learning problem. This has often been used in practical applications.
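The finite-sample condition can be checked numerically. The sketch below, which assumes NumPy is available, builds the kernel matrix for a Gaussian RBF kernel and for a sigmoid kernel on three arbitrary 1-D points, inspects the eigenvalues, and rebuilds the RBF matrix from the feature map used in the proof. The sample points, the hyper-parameter values and the function names are illustrative assumptions.

```python
import numpy as np

def gram_matrix(kernel, points):
    """Kernel (Gram) matrix with entries K[i, j] = kernel(points[i], points[j])."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])

def grbf(x, y, sigma=1.0):
    """Gaussian RBF kernel."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def sigmoid(x, y, a=1.0, r=-1.0):
    """Sigmoid kernel tanh(a * <x, y> + r); not PSD for these hyper-parameters."""
    return float(np.tanh(a * np.dot(x, y) + r))

points = [np.array([1.0]), np.array([2.0]), np.array([3.5])]
K_rbf = gram_matrix(grbf, points)
K_sig = gram_matrix(sigmoid, points)

eig_rbf, V = np.linalg.eigh(K_rbf)      # eigenvalues in ascending order
eig_sig = np.linalg.eigvalsh(K_sig)

# Feature map from the proof: phi_l(x_i) = sqrt(lambda_l) * V[i, l]
Phi = V * np.sqrt(np.clip(eig_rbf, 0.0, None))
reconstructed = Phi @ Phi.T             # should reproduce K_rbf

print("smallest RBF eigenvalue:", eig_rbf.min())
print("smallest sigmoid eigenvalue:", eig_sig.min())
```

The sigmoid matrix has a zero diagonal entry next to a non-zero off-diagonal entry, so it has a negative eigenvalue and fails the Mercer condition on this sample.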
In real-life applications, however, many similarity functions exist that are either indefinite or for which the Mercer condition is difficult to verify. For example, one can incorporate the longest common subsequence in defining distance between genetic sequences, use BLAST similarity score between protein sequences, use set operations such as union/intersection in defining similarity between transactions, use human-judged similarities between concepts and words, use the symmetrized Kullback-Leibler divergence between probability distributions, use dynamic time warping for time series, or use the refraction distance and shape matching distance in computer vision [1,2,3,4]. Extending SVM to indefinite kernels will greatly expand its applicability. Recent work on training SVM with indefinite kernels has generally fallen into three categories: positive semidefinite (PSD) kernel approximation, non-convex optimization (NCO) and learning in Krein spaces (LKS). In the first approach, the kernel matrix of training examples is altered so that it becomes PSD. The motivation behind such an approach is to assume that negative eigenvalues are caused by noise [5,6]. The latter approach was introduced by Luss and d'Aspremont in 2007, with improvements in training time reported subsequently [7,8,9]. All the kernel approximation methods above guarantee that the optimization problem remains convex during training. During testing, however, the original indefinite kernel function is used.
Hence, training and test examples are treated inconsistently. In addition, such methods are only useful when the similarity matrix is approximable by a PSD matrix.
For other similarity functions such as the sigmoid kernel that can occasionally yield a negative semidefinite matrix for certain values of its hyper-parameters, the kernel approximation approach cannot be utilized.
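To make the PSD approximation idea concrete, the sketch below applies one common spectrum modification, clipping negative eigenvalues to zero, to a small indefinite similarity matrix. It assumes NumPy; the matrix values and the name `clip_spectrum` are hypothetical, and the cited methods may modify the spectrum differently.

```python
import numpy as np

def clip_spectrum(K):
    """Return a PSD approximation of a symmetric matrix by zeroing its negative eigenvalues."""
    w, V = np.linalg.eigh(K)
    return (V * np.clip(w, 0.0, None)) @ V.T

# Hypothetical symmetric similarity matrix; the zero diagonal entry next to
# a non-zero off-diagonal entry makes it indefinite.
K = np.array([[0.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])

K_psd = clip_spectrum(K)
print("smallest eigenvalue before:", np.linalg.eigvalsh(K).min())
print("smallest eigenvalue after: ", np.linalg.eigvalsh(K_psd).min())
```

Because `K_psd` differs from `K`, a classifier trained on the modified matrix but evaluated with the original similarity function sees training and test examples through different similarities, which is the inconsistency noted above.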
In the second approach, non-convex optimization methods are used. SMO type decomposition might be used in finding a local minimum with indefinite similarity functions [10].
Haasdonk interprets this as a method of minimizing the distance between reduced convex hulls in a pseudo-Euclidean space [4]. However, because such an approach can terminate at a local minimum, it does not guarantee learning [1]. Similar to the previous approach, this method only works well if the similarity matrix is nearly PSD.
The third approach that has been proposed in the literature is to extend SVM into the Krein spaces, in which a reproducing kernel is decomposed into the sum of one positive semidefinite kernel and one negative semidefinite kernel [11,12]. Instead of minimizing regularized risk, the objective function is now stabilized. One fairly recent algorithm that has been proposed to solve the stabilization problem is called Eigen-decomposition SVM (ESVM) [12]. While this algorithm has been shown to outperform all previous methods, its primary drawback is that it does not produce sparse solutions, hence the entire list of training examples is often needed during prediction.
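The decomposition underlying the Krein space view can be illustrated on a finite kernel matrix. The following sketch, assuming NumPy and a hypothetical 3-by-3 indefinite matrix, splits a symmetric matrix into a PSD part `K_plus` and a PSD part `K_minus` such that `K = K_plus - K_minus`, i.e. the sum of a positive semidefinite and a negative semidefinite matrix. It illustrates only the decomposition, not the ESVM algorithm itself.

```python
import numpy as np

def krein_decomposition(K):
    """Split a symmetric matrix into PSD parts so that K = K_plus - K_minus."""
    w, V = np.linalg.eigh(K)
    K_plus = (V * np.clip(w, 0.0, None)) @ V.T    # keeps the positive eigenvalues
    K_minus = (V * np.clip(-w, 0.0, None)) @ V.T  # keeps the magnitudes of the negative ones
    return K_plus, K_minus

# Hypothetical indefinite similarity matrix (zero trace, non-zero entries).
K = np.array([[0.0, -1.0, -4.0],
              [-1.0, 0.0, -3.0],
              [-4.0, -3.0, 0.0]])

K_plus, K_minus = krein_decomposition(K)
print(np.allclose(K_plus - K_minus, K))
```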
The main contribution of this paper is to establish both theoretically and experimentally that the 1-norm SVM [13], which was proposed more than 10 years ago, is a better solution for extending SVM to indefinite kernels. More specifically, 1-norm SVM can be interpreted as a structural risk minimization method that seeks a decision boundary with large similarity margin in the original space. It uses a linear programming formulation that remains convex even if the kernel matrix is indefinite, and hence can always be solved quite efficiently. It uses the indefinite similarity function (or distance) directly without any transformation, and, hence, it always treats both training and test examples consistently. In addition, it achieves the highest accuracy among all the methods that train SVM with indefinite kernels, with statistically significant evidence, while also retaining sparsity of the support vector set. In the literature, 1-norm SVM is often used as an embedded feature selection method, where learning and feature selection are performed simultaneously [14,13,15,17,16,18]. It was studied in [13], where it was argued that 1-norm SVM has an advantage over standard 2-norm SVM when there are redundant noise features. To the knowledge of the authors, the advantage of using 1-norm SVM in handling indefinite kernels has never been established in the literature.
@ IJTSRD | Unique Paper ID - IJTSRD23437 | Volume - 3 | Issue - 3 | Mar-Apr 2019 | Page: 1647
As a state-of-the-art classifier, the support vector machine (SVM) has also been examined and applied for time series classification in two modes. On one hand, combined with various feature extraction approaches, SVM can be adopted as a plug-in method in addressing time series classification problems. On the other hand, by designing appropriate kernel functions, SVM can also be applied directly to the original time series data. Because of the time axis distortion problem, classical kernel functions, such as Gaussian RBF and polynomial, generally are not suitable for SVM-based time series classification. Motivated by the success of the dynamic time warping (DTW) distance, it has been suggested to use elastic measures to construct appropriate kernels. The Gaussian DTW (GDTW) kernel was then proposed for SVM-based time series classification [19,20].
Counter-examples, however, have subsequently been reported showing that the GDTW kernel usually cannot outperform the GRBF kernel in the SVM framework. Lei and Sun [21] proved that the GDTW kernel is not a positive definite symmetric (PDS) kernel acceptable by SVM. Experimental results [21,22] also showed that SVM with the GDTW kernel cannot outperform either SVM with the GRBF kernel or the nearest neighbor classifier with DTW distance. The poor performance of the GDTW kernel may be attributed to the fact that DTW is non-metric. Motivated by recent progress in elastic measures, Zhang et al. proposed a new class of elastic kernels that extends the GRBF kernel [23].
Kernels offer several advantages, and some kernel types are used in this paper for classification [24]. The kernel defines a similarity measure between two data points and thus allows one to incorporate prior knowledge of the problem domain. Most importantly, the kernel contains all of the information about the relative positions of the inputs in the feature space, and the actual learning algorithm is based only on the kernel function and can thus be carried out without explicit use of the feature space. The training data only enter the algorithm through their entries in the kernel matrix (a Gram matrix), and never through their individual attributes. Because one never explicitly has to evaluate the feature map in the high-dimensional feature space, the kernel function represents a computational shortcut. The number of operations required is not necessarily proportional to the number of features.
Support vector machines are among the most prevalent classification algorithms. They are grounded in statistical learning theory, which uses the Vapnik-Chervonenkis dimension to establish the generalization ability of this family of classifiers [25,26].
However, SVM has its limitations, which motivated the development of numerous variants, including the Distance Weighted Discrimination algorithm to deal with the data piling phenomenon observed in high dimensions [27] and second-order cone programming techniques for handling uncertain or missing values, assuming availability of second-order moments of the data [28]. One fundamental limiting factor in SVM is the need for positive semidefinite kernels.

Methods
In standard two-class classification problems, we are given a set of training data $(x_1, y_1), \dots, (x_n, y_n)$, where the input $x_i \in \mathbb{R}^p$ and the output $y_i \in \{1, -1\}$ is binary. We wish to find a classification rule from the training data, so that when given a new input $x$ we can assign a class $y$ from $\{1, -1\}$ to it.
To handle this problem, we consider the 1-norm support vector machine:
$$\min_{\beta_0, \beta} \; \sum_{i=1}^{n} \Big[ 1 - y_i \Big( \beta_0 + \sum_{j=1}^{q} \beta_j h_j(x_i) \Big) \Big]_+ \quad \text{subject to} \quad \|\beta\|_1 = |\beta_1| + \dots + |\beta_q| \le s,$$
where $D = \{h_1(x), \dots, h_q(x)\}$ is a dictionary of basis functions, and $s$ is a tuning parameter. The solution is denoted as $\hat{\beta}_0(s)$ and $\hat{\beta}(s)$; the fitted model is
$$\hat{f}(x) = \hat{\beta}_0 + \sum_{j=1}^{q} \hat{\beta}_j h_j(x).$$
The classification rule is given by $\mathrm{sign}[\hat{f}(x)]$. The 1-norm SVM has been successfully used in classification. We argue in this paper that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. To get a good fitted model that performs well on future data, we also need to select an appropriate tuning parameter $s$. In practice, people usually pre-specify a finite set of values for $s$ that covers a wide range, then either use a separate validation data set or use cross-validation to select a value for $s$ that gives the best performance among the given set.
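Because both the hinge loss and the 1-norm constraint are piecewise linear, the 1-norm SVM can be written as a linear program. The sketch below is one possible formulation, assuming NumPy and SciPy's `linprog` with the HiGHS solver: the coefficients are split as `beta = beta_plus - beta_minus` and slack variables `xi` carry the hinge loss. The function names and the toy data (one informative feature, one noise feature) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l1_svm(H, y, s):
    """Minimize the total hinge loss subject to ||beta||_1 <= s, as a linear program.

    H is the n-by-q matrix of basis-function values h_j(x_i); y holds labels in {-1, +1}.
    """
    n, q = H.shape
    # Variable layout: [beta0, beta_plus (q), beta_minus (q), xi (n)]
    c = np.concatenate([np.zeros(1 + 2 * q), np.ones(n)])
    yH = y[:, None] * H
    # y_i * (beta0 + H_i @ beta) + xi_i >= 1, rewritten in "<=" form
    margin_rows = np.hstack([-y[:, None], -yH, yH, -np.eye(n)])
    # sum(beta_plus + beta_minus) <= s, which bounds the 1-norm of beta
    norm_row = np.concatenate([[0.0], np.ones(2 * q), np.zeros(n)])[None, :]
    A_ub = np.vstack([margin_rows, norm_row])
    b_ub = np.concatenate([-np.ones(n), [s]])
    bounds = [(None, None)] + [(0, None)] * (2 * q + n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    beta0 = res.x[0]
    beta = res.x[1:1 + q] - res.x[1 + q:1 + 2 * q]
    return beta0, beta

def predict(H, beta0, beta):
    """Classification rule sign[f(x)] applied to each row of H."""
    return np.sign(beta0 + H @ beta)

# Toy data: the first column separates the classes, the second is noise.
H = np.array([[2.0, 0.3], [1.5, -0.2], [1.0, 0.1],
              [-1.0, 0.2], [-1.5, -0.1], [-2.0, -0.3]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

beta0, beta = fit_l1_svm(H, y, s=1.0)
print(beta0, beta, predict(H, beta0, beta))
```

Choosing the dictionary as similarities to the training examples, i.e. `H[i, j] = s(x_i, x_j)`, gives a kernelized variant; nothing in the linear program requires that matrix to be PSD.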

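Large similarity margins
Given a similarity function $s(x_i, x_j)$ between examples $x_i$ and $x_j$, we can define similarity between an example $x_t$ and a class $y$ to be a weighted sum of similarities with all of its examples. In other words, we may write:
$$s(x_t, y) = \sum_{i : y_i = y} \lambda_i \, s(x_t, x_i) \qquad (4)$$
to denote class similarity between $x_t$ and a class $y \in \{+1, -1\}$.
Here, the weight $\lambda_i$ represents importance of the example $x_i$ to its class $y_i$. In addition, we can introduce an offset $b$ that quantifies prior preference. Such offset plays a role that is similar to the prior in Bayesian methods, the activation threshold in neural networks, and the offset in SVM. Thus, we consider classification using the rule:
$$\hat{y}_t = \mathrm{sign}\{ s(x_t, +1) - s(x_t, -1) + b \},$$
which is identical in form to the classification rule of the 1-norm SVM given above, with basis functions $h_i(x) = s(x, x_i)$, coefficients $\beta_i = y_i \lambda_i$ and $\beta_0 = b$. Moreover, we define the similarity margin $m_t$ for example $x_t$ in the usual sense:
$$m_t = y_t \, [ s(x_t, +1) - s(x_t, -1) + b ].$$
Maximizing the minimum similarity margin can be formulated as a linear program (LP). First, we write:
$$\max_{M, \lambda, b} \; M$$
subject to
$$y_t \, [ s(x_t, +1) - s(x_t, -1) + b ] \ge M \quad \text{for all training examples } x_t.$$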
However, the decision rule above does not change when we multiply the weights $\lambda$ by any fixed positive constant, including constants that are arbitrarily large. This is because the decision rule only looks into the sign of its argument. In particular, we can always rescale the weights $\lambda$ to be arbitrarily large, for which $M \to \infty$. This degree of freedom implies that we need to maximize the ratio $M / \|\lambda\|$ instead of maximizing $M$ in absolute terms. Here, any norm suffices but the 1-norm is preferred because it produces sparse solutions and because it gives better accuracy in practice.
Since our objective is to maximize the ratio $M / \|\lambda\|_1$, we can fix $M = 1$ and minimize $\|\lambda\|_1$. In addition, to avoid over-fitting outliers or noisy samples and to be able to handle the case of non-separable classes, soft-margin constraints are needed as well. Hence, 1-norm SVM can be interpreted as a method of finding a decision boundary with a large similarity margin in the original space. Such interpretation holds regardless of whether or not the similarity function is PSD. Thus, we expect 1-norm SVM to work well even for indefinite kernels.
Similar to the original SVM, one can interpret 1-norm SVM as a method of striking a balance between estimation bias and variance.
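The class-similarity rule and the scaling argument of this section can be sketched in plain Python. The example below uses a hypothetical indefinite similarity, the negative absolute difference between scalars, together with hand-picked weights; the function names, data and weights are assumptions made only for illustration.

```python
def class_similarity(x_t, label, examples, labels, weights, sim):
    """s(x_t, y): weighted sum of similarities between x_t and the examples of class y."""
    return sum(w * sim(x_t, x_i)
               for x_i, y_i, w in zip(examples, labels, weights) if y_i == label)

def decision_value(x_t, examples, labels, weights, b, sim):
    """s(x_t, +1) - s(x_t, -1) + b."""
    return (class_similarity(x_t, +1, examples, labels, weights, sim)
            - class_similarity(x_t, -1, examples, labels, weights, sim) + b)

def classify(x_t, examples, labels, weights, b, sim):
    return 1 if decision_value(x_t, examples, labels, weights, b, sim) >= 0 else -1

def min_margin(examples, labels, weights, b, sim):
    """Smallest similarity margin y_t * decision_value(x_t) over the training set."""
    return min(y_t * decision_value(x_t, examples, labels, weights, b, sim)
               for x_t, y_t in zip(examples, labels))

def neg_abs(a, c):
    """Hypothetical indefinite similarity: its Gram matrix has zero trace."""
    return -abs(a - c)

examples = [0.0, 1.0, 4.0, 5.0]
labels = [+1, +1, -1, -1]
weights = [0.5, 0.5, 0.5, 0.5]
scaled = [10.0 * w for w in weights]

print(classify(0.5, examples, labels, weights, 0.0, neg_abs))
print(classify(4.5, examples, labels, weights, 0.0, neg_abs))
print(min_margin(examples, labels, weights, 0.0, neg_abs))
print(min_margin(examples, labels, scaled, 0.0, neg_abs))
```

Scaling the weights leaves every predicted label unchanged while inflating the margin, which is why the margin has to be normalized by a norm of the weights.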

Gaussian Elastic Metric Kernel (GEMK)
Before the definition of GEMK, we first introduce the GRBF kernel, one of the most common kernel functions used in the SVM classifier. Given two time series $x$ and $y$ with the same length $n$, the GRBF kernel is defined as
$$K_{\mathrm{GRBF}}(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right),$$
where $\sigma$ is the standard deviation.
The GRBF kernel is a PDS kernel. It can be regarded as an embedding of the Euclidean distance in the form of a Gaussian function. The GRBF kernel requires that the time series have the same length, and it cannot handle the problem of time axis distortion. If the lengths of two time series differ, resampling is usually required to normalize them to the same length before further processing. Thus SVM with the GRBF kernel (GRBF-SVM) is usually not suitable for time series classification. Motivated by the effectiveness of elastic measures in handling time axis distortion, it is interesting to embed an elastic distance into SVM-based time series classification. Generally, there are two kinds of elastic distance. One is the non-metric elastic distance measure, e.g. DTW, and the other is the elastic metric, which is an elastic distance satisfying the triangle inequality. Recently, DTW, a state-of-the-art elastic distance, has been used to construct the GDTW kernel [19,20]. Subsequent studies, however, show that SVM with the GDTW kernel cannot consistently outperform either GRBF-SVM or 1NN-DTW.
We hypothesize that the poor performance of the GDTW kernel may be attributed to the fact that DTW is non-metric, and suggest extending the GRBF kernel using elastic metrics. Thus, we propose a novel class of kernel functions, Gaussian elastic metric kernel (GEMK) functions.
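The difference between a non-metric elastic distance and an elastic metric can be seen on a small example. The sketch below implements DTW and the edit distance with real penalty (ERP) by dynamic programming and embeds a distance in a Gaussian function. The kernel form exp(-d^2 / (2 sigma^2)) is assumed here by analogy with the GRBF kernel, and the gap value, function names and toy series are illustrative assumptions.

```python
import math

def dtw(x, y):
    """Dynamic time warping distance with absolute-difference local cost."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def erp(x, y, gap=0.0):
    """Edit distance with real penalty: unmatched points are compared to a constant gap value."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + abs(x[i - 1] - gap)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + abs(y[j - 1] - gap)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + abs(x[i - 1] - y[j - 1]),
                          D[i - 1][j] + abs(x[i - 1] - gap),
                          D[i][j - 1] + abs(y[j - 1] - gap))
    return D[n][m]

def gaussian_elastic_kernel(x, y, distance, sigma=1.0):
    """Gaussian embedding of an elastic distance (assumed form, by analogy with GRBF)."""
    d = distance(x, y)
    return math.exp(-d * d / (2.0 * sigma ** 2))

a, b, c = [1.0, 1.0, 1.0], [1.0, 2.0], [2.0, 2.0, 2.0]
print("DTW:", dtw(a, c), "vs", dtw(a, b) + dtw(b, c))   # triangle inequality fails
print("ERP:", erp(a, c), "vs", erp(a, b) + erp(b, c))   # triangle inequality holds
print("GERP kernel value:", gaussian_elastic_kernel(a, b, erp))
```

Both distances accept series of different lengths, so no resampling is needed before the kernel is evaluated. Each call fills an n-by-m table, which is the source of the quadratic cost of elastic kernels compared with the linear cost of the Euclidean distance.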

Experiments and Results
In this section, we present experimental results of applying different SVMs to image classification problems, and determine their efficiency in handling indefinite similarity functions. As shown in Figure 1, when the similarity function is PSD, the performance of Gaussian TWED SVM is comparable to that of SVM. We used several datasets [1,29,30,31,32,33,34,35] for measuring the performance. When running statistical significance tests, we find no statistically significant evidence that one method outperforms the other at the 96.45% confidence level. The 1-norm SVM method achieves the highest predictive accuracy among all methods that learn with indefinite kernels other than GTWED SVM, while also retaining sparsity of the support vector set. Using the error rate as the performance indicator, we compare the classification performance of Gaussian elastic metric kernel SVM with other similarity measure methods, including the nearest neighbor classifier with Euclidean distance (1NN-ED), the nearest neighbor classifier with DTW (1NN-DTW), the nearest neighbor classifier with ODTW (1NN-ODTW), the nearest neighbor classifier with ERP (1NN-ERP) and the nearest neighbor classifier with OTWED (1NN-OTWED). Table I lists the classification error rates of these methods on each data set. In our experiments, GRBF-SVM takes the least time among all the above kernel methods, because the complexity of the Euclidean distance in the GRBF kernel is $O(n)$, while in GDTW, GERP and GTWED, the complexity of DTW, ERP and TWED is $O(n^2)$. Besides, the numbers of support vectors of GERP-SVM and GTWED-SVM, which are comparable to that of GDTW-SVM, are both larger than that of GRBF-SVM. Thus GERP-SVM, GTWED-SVM and GDTW-SVM also take more time than GRBF-SVM [23]. In addition, the Gaussian elastic metric kernel methods in the figure achieve the highest accuracy among all methods that train SVM with kernels, with statistically significant evidence, while also retaining sparsity of the support vector set.
The 1-norm penalty is not differentiable at zero. This important singularity property ensures that the 1-norm SVM is able to delete many noise features by estimating their coefficients as exactly zero.