Opened 14 years ago

Closed 13 years ago

## #542 closed defect (fixed)

# behavior and documentation of SVM::predict are not consistent

Reported by: | Peter | Owned by: | Peter
---|---|---|---
Priority: | major | Milestone: | yat 0.5.5
Component: | classifier | Version: | 0.5.3
Keywords: | | Cc: |

### Description (last modified by )

The documentation says that the prediction output is the geometric distance from the decision hyperplane: (w*x + bias)/|w|.

However, checking the code I realized that (w*x + bias)/|w|^{2} is returned, because of how the private variable margin_ is calculated.

It is not obvious to me whether we should change the documentation or the implementation. Originally we used w*x + bias, which is the standard SVM output. However, that did not work so well in the context of Ensembles: SVMs for which the training did not work well tend to have a very large |w|, which implies that the average vote is dominated by these poor SVMs with large |w|. We therefore chose to penalize SVMs with large |w|. The question is whether we should penalize so that the prediction output corresponds to the distance from the hyperplane to the data point, or penalize the poor SVMs even harder with a denominator of |w|^{2}.
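The three candidate outputs can be sketched in a few lines of plain Python. The weight vector, bias, and sample below are made-up values for illustration only, not anything taken from yat:

```python
from math import sqrt

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(w):
    return sqrt(dot(w, w))

# Hypothetical trained parameters (illustration only).
w = [3.0, 4.0]      # |w| = 5
bias = -1.0
x = [2.0, 2.0]      # sample to classify

raw = dot(w, x) + bias          # w*x + bias:          standard SVM output
geometric = raw / norm(w)       # (w*x + bias)/|w|:    documented behavior
penalized = raw / norm(w) ** 2  # (w*x + bias)/|w|^2:  current implementation
```

All three agree in sign, so classification is unaffected; only the magnitude of the vote in an ensemble differs.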

### Change History (3)

### comment:1 Changed 14 years ago by

Description: | modified (diff)
---|---

### comment:2 Changed 13 years ago by

Component: | documentation → classifier
---|---
Milestone: | yat 0.x+ → yat 0.5.5
Status: | new → assigned

First of all, remember that during training |w|^{2} + C |e|^{2} is minimized. To make a decision here we need to look at a couple of cases.

**High-dimensional data**

For high-dimensional data, data are always linearly separable and C is often set to infinity. In this case the margin is an indicator of the quality of the classifier, because it measures the distance (projected on the decision hyperplane normal) between the two classes. A large |w| implies a narrow margin between the two classes and it might be a good idea to down-weight the machine in an ensemble aggregation.
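The link between |w| and margin width can be made explicit: with the canonical scaling w*x + bias = ±1 on the support vectors, the distance between the two class boundaries is 2/|w|. A minimal sketch with hypothetical weight vectors (not yat code):

```python
from math import sqrt

def margin_width(w):
    # Distance between the hyperplanes w*x + bias = +1 and w*x + bias = -1
    # is 2/|w| under the canonical hard-margin scaling.
    return 2.0 / sqrt(sum(wi * wi for wi in w))

good = [0.5, 0.0]   # small |w| -> wide margin  (width 4.0)
poor = [10.0, 0.0]  # large |w| -> narrow margin (width 0.2)

assert margin_width(good) > margin_width(poor)
```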

**Finite C**

When dealing with low-dimensional data, the data are rarely linearly separable, and even when they are, it might be preferable to use a finite C to avoid over-fitting. A finite C brings the second term in the objective function (see above) into play: training now combines maximizing the margin with minimizing the error. The margin is no longer a hard margin, since we allow errors, and it is not obvious that the soft margin is a good measure of the quality of the machine. As an extreme example, consider the case when there is no signal at all in the data. The analytical solution is then w=0, which implies margin=infinity, yet the machine is useless: its accuracy is no better than a coin flip, both in training and in prediction.

I therefore suggest that predict should behave as documented rather than as currently implemented.
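The ensemble argument from the ticket description can also be made concrete with a toy sketch (hypothetical weight vectors and votes, not yat code): averaging raw outputs w*x + bias lets a poor machine with a large |w| dominate the vote, whereas dividing by |w| weights the machines equally.

```python
from math import sqrt

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(w):
    return sqrt(dot(w, w))

x = [1.0, 1.0]

# Hypothetical two-machine ensemble: a good machine with small |w|
# voting positive, and a poor machine with large |w| voting negative.
machines = [([1.0, 0.0], 0.0),     # good: raw output +1
            ([-100.0, 0.0], 0.0)]  # poor: raw output -100

raw_votes = [dot(w, x) + b for w, b in machines]
geo_votes = [(dot(w, x) + b) / norm(w) for w, b in machines]

raw_average = sum(raw_votes) / len(raw_votes)  # dominated by the poor machine
geo_average = sum(geo_votes) / len(geo_votes)  # machines weighted equally
```

Dividing by |w|^{2} instead would tilt the average the other way, toward the machines with the smallest |w|, which is the trade-off the ticket is about.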

### comment:3 Changed 13 years ago by

Resolution: | → fixed
---|---
Status: | assigned → closed

