Machine learning basics (part 12): Support vector machines
Linearly separable data (classes)
Which hyperplane in Fig. 6.1 would be a better choice? No doubt the full line is better, because it leaves more "room" on either side, so that data in both classes can move a bit more freely, with less risk of causing an error. Such a hyperplane can be trusted more when it is faced with unknown test data. This touches an important issue in classifier design: the generalization performance of the classifier. This refers to the capability of the classifier, designed using the training set, to operate satisfactorily on data outside this set.
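The maximum-margin idea can be sketched in code. Below is a minimal, illustrative example (the data is synthetic, not from Fig. 6.1) that fits a linear SVM with a very large penalty parameter C, which approximates the hard-margin formulation on separable data, and reports the width of the separation band, 2/||w||:

```python
# Sketch: hard-margin linear SVM on synthetic separable 2-D data.
# A very large C approximates the hard-margin (separable) formulation.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))    # class +1
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))  # class -1
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # width of the separation band
print("margin width:", margin)
print("training accuracy:", clf.score(X, y))
```

The hyperplane that maximizes this band width is exactly the "full-line" choice in the figure: it leaves the most room for unseen data to vary without crossing the decision boundary.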
Nonseparable classes
In the situation where the classes are not separable, the previous setup is no longer valid. Fig. 6.5 illustrates this. Any attempt to draw a hyperplane will never end up with a class-separation band that has no data points inside it, as was the case for linearly separable classes. The training vectors now belong to one of the following three categories:
- Vectors that fall outside the band and are correctly classified. These vectors comply with the constraints of (6.2).
- Vectors that fall inside the band and are correctly classified.
- Vectors that are misclassified, i.e., that fall on the wrong side of the hyperplane.
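These three categories can be read off directly from the margin value y_i f(x_i) of each training vector: at least 1 means outside the band and correct, between 0 and 1 means inside the band but correct, and negative means misclassified. A minimal sketch with a soft-margin linear SVM on illustrative overlapping data:

```python
# Sketch: soft-margin SVM on overlapping synthetic data; sort the
# training vectors into the three categories via y_i * f(x_i).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=[1, 1], scale=1.0, size=(50, 2))    # class +1
X_neg = rng.normal(loc=[-1, -1], scale=1.0, size=(50, 2))  # class -1
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
margin_vals = y * clf.decision_function(X)

outside = int(np.sum(margin_vals >= 1))                        # outside the band, correct
inside = int(np.sum((margin_vals >= 0) & (margin_vals < 1)))   # inside the band, correct
misclassified = int(np.sum(margin_vals < 0))                   # wrong side of the hyperplane
print(outside, inside, misclassified)
```

The three counts always partition the training set, which mirrors the three categories above.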
Multiclass classification
Thus far, classification with support vector machines has been considered for two classes only. There are several ways of extending two-class procedures to the multiclass case; we mention only the two simplest alternatives. One is one-versus-all (OVA, also called one-versus-rest), in which C binary classifiers are constructed, where C is the number of classes. The other is one-versus-one (OVO), in which C(C-1)/2 binary classifiers are built, one for each pair of classes; the final decision is then made by majority voting over the pairwise classifiers.
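Both schemes can be tried directly with scikit-learn's meta-estimators, which wrap any binary classifier. A short sketch on a standard 3-class dataset, counting how many binary SVMs each scheme builds:

```python
# Sketch: one-versus-all (OVA) vs one-versus-one (OVO) multiclass SVMs
# on the 3-class Iris dataset, using scikit-learn meta-estimators.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
C = len(set(y))  # number of classes, here 3

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # C classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # C*(C-1)/2 classifiers

print("OVA binary classifiers:", len(ova.estimators_))
print("OVO binary classifiers:", len(ovo.estimators_))
```

Note that `SVC` on its own already handles multiclass input by applying the OVO strategy internally; the explicit wrappers above just make the construction of the binary subproblems visible.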