Opened 14 years ago

Closed 13 years ago

#182 closed request (fixed)

Implement Nearest Neighbour Classifier

Reported by: Peter Owned by: Markus Ringnér
Priority: major Milestone: yat 0.4
Component: classifier Version: trunk
Keywords: Cc:

Description

A sample is classified to the class with the most samples among its nearest neighbors.

Possible parameters are: the number of neighbors, the Distance (to determine which samples are neighbors), and the method used to calculate the output.

For the latter, the default is for each class to simply count how many of the nearest neighbors belong to that specific class. This will (except in the two-class case with odd k) lead to problematic ties in the output. A way to avoid ties is to build the distance into the output calculation. For example, in an NNI manner one could weight each vote with the reciprocal distance, so I guess it makes sense to add a functor to the class telling how this weighting should be done (default w=1.0).
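A minimal sketch of that idea, assuming nothing about yat's actual interface (all names below are hypothetical): the output for a class is the sum of the weights w(d) over the k nearest neighbors belonging to that class, with the default functor returning 1.0 so that it reduces to plain counting.

    #include <cstddef>
    #include <map>
    #include <utility>
    #include <vector>

    // Hypothetical default weight functor: every neighbor votes with weight 1.0.
    struct UniformWeight {
      double operator()(double /* distance */) const { return 1.0; }
    };

    // Sketch of the output calculation: for each class, sum the (possibly
    // distance-dependent) weights of the k nearest neighbors in that class.
    // `neighbours` holds (distance, class label) pairs sorted by increasing distance.
    template <class WeightFunctor>
    std::map<int, double>
    knn_output(const std::vector<std::pair<double, int> >& neighbours,
               std::size_t k, WeightFunctor weight)
    {
      std::map<int, double> output;
      for (std::size_t i = 0; i < k && i < neighbours.size(); ++i)
        output[neighbours[i].second] += weight(neighbours[i].first);
      return output;
    }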

Change History (11)

comment:1 Changed 14 years ago by Markus Ringnér

Milestone: later → 0.4
Status: new → assigned

comment:2 Changed 13 years ago by Jari Häkkinen

Summary: implement Nearest Neigbor Classifier → Implement Nearest Neighbour Classifier

comment:3 Changed 13 years ago by Markus Ringnér

Regarding the output of the KNN, I think we should support the common variants for weighting the votes of the k neighbours:

  • unweighted
  • weighted with the reciprocal distance
  • weighted with the reciprocal rank, i.e. the nearest neighbour gets weight 1/1, next closest gets 1/2, and so on.

I guess supporting weighting votes with the reciprocal ranks complicates Peter's idea of using a functor for selecting among these choices.
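One way around that complication, sketched below under the assumption that the functor gets to see both the distance and the rank of a neighbour (nothing here reflects yat's actual interface), is to give all three variants the same two-argument signature:

    // Hypothetical weight functors; each receives the distance to a neighbour
    // and its rank (1 = nearest), so all three variants share one signature.
    struct Unweighted {
      double operator()(double /* distance */, unsigned /* rank */) const
      { return 1.0; }
    };

    struct ReciprocalDistance {
      double operator()(double distance, unsigned /* rank */) const
      { return 1.0 / distance; }
    };

    struct ReciprocalRank {
      double operator()(double /* distance */, unsigned rank) const
      { return 1.0 / rank; }
    };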

comment:4 Changed 13 years ago by Markus Ringnér

Regarding how to decide the winning class: perhaps this should not be implemented within KNN? As currently implemented, both NCC and KNN (NBC/SVM/..?) return a value for each class (a distance: the smaller the better). Decisions about winning classes are not implemented and are left to the user.

Should we provide a general utility for this that could be used by all classes? Design ideas? Either way, there has to be support for outputting a non-truncated value for each class, since it should be possible to construct ensembles with different voting schemes: unweighted ensemble members, ensemble members weighted with a value for each class, etc., similar to the KNN weighting alternatives suggested above.
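For illustration, such a utility could be little more than an argmax over the per-class outputs. A sketch, assuming the outputs are kept in a std::map and a flag distinguishes vote-like from distance-like outputs (the name and signature are hypothetical, not part of yat):

    #include <cassert>
    #include <map>

    // Hypothetical utility: pick the winning class from per-class outputs.
    // larger_is_better distinguishes vote-like outputs (e.g. KNN vote sums)
    // from distance-like outputs (e.g. NCC), where smaller is better.
    inline int winning_class(const std::map<int, double>& output,
                             bool larger_is_better)
    {
      assert(!output.empty());
      std::map<int, double>::const_iterator best = output.begin();
      for (std::map<int, double>::const_iterator i = output.begin();
           i != output.end(); ++i)
        if (larger_is_better ? (i->second > best->second)
                             : (i->second < best->second))
          best = i;
      return best->first;
    }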

comment:5 Changed 13 years ago by Markus Ringnér

Status: assigned → new

comment:6 Changed 13 years ago by Markus Ringnér

Status: new → assigned

comment:7 Changed 13 years ago by Markus Ringnér

Should we normalize the outputs in some way?

With uniform weights, we could normalize each vote with 1/k so the output for a class corresponds to the fraction of nearest neighbors in the class.

With reciprocal rank weights a similar thing could be done: normalize each vote with 1/sum_i(1/i), where the sum runs from i=1 to k.

The advantage of normalized outputs is that they are more comparable across different values of k.

However, with reciprocal distance it is not so clear. We do not want to normalize with the sum of the weights, as one should be able to see that samples with small distances are better classified than other samples. One could normalize with 1/k as in the uniform weight case.

Should we normalize or not? If we normalize, how? Any opinions?

Also, how should reciprocal distance weighting handle the case when a distance is 0?
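For concreteness, the normalization constants discussed above amount to 1/k for uniform weights and 1/(1/1 + 1/2 + ... + 1/k) for reciprocal rank weights. A small sketch (not part of yat):

    // Uniform weights: dividing by k makes the output the fraction of the k
    // nearest neighbors that belong to the class.
    inline double uniform_normalization(unsigned k) { return 1.0 / k; }

    // Reciprocal rank weights: divide by the largest possible vote sum,
    // the harmonic number H_k = 1/1 + 1/2 + ... + 1/k.
    inline double reciprocal_rank_normalization(unsigned k)
    {
      double h_k = 0.0;
      for (unsigned i = 1; i <= k; ++i)
        h_k += 1.0 / i;
      return 1.0 / h_k;
    }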

comment:8 Changed 13 years ago by Peter

With uniform weights a normalization makes sense. For the two others I'm not sure. For reciprocal rank, one could do it the way Markus suggests. For reciprocal distance I can't see a sensible way to do it. I would vote for no normalization in the non-uniform cases, but it is not very important...

Here we are talking about normalization to make KNNs with different k comparable. I would expect the output to somehow converge as k grows.

comment:9 Changed 13 years ago by Peter

Regarding the zero distance case, in WeNNI we add a small number to the distance to avoid this problem. That is one solution. In theory I think zero distance should imply infinite output?
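The two alternatives mentioned so far could look roughly like this (sketch only; epsilon is a hypothetical regularization constant, and neither function is taken from WeNNI or yat):

    #include <limits>

    // WeNNI-style regularization: add a small number to the distance so a
    // zero distance does not blow up the weight.
    inline double reciprocal_distance_weight(double distance, double epsilon)
    { return 1.0 / (distance + epsilon); }

    // Alternative: let a zero distance imply an infinite weight/output.
    inline double reciprocal_distance_weight(double distance)
    {
      return distance > 0.0 ? 1.0 / distance
                            : std::numeric_limits<double>::infinity();
    }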

comment:10 Changed 13 years ago by Markus Ringnér

In [1112] I changed uniform weights to be without normalization, as I think this is what people expect from KNN. I also added support for the other two weighting methods, both without normalization. So now all weighting methods are unnormalized, which I think makes the most sense.

The implementation of weighting with reciprocal distance returns inf as the output if the distance is zero. I think this is the best choice. A user selecting the class with the largest output will do fine with this choice, and a user who wants to regularize infinities can do so outside of KNN.

comment:11 Changed 13 years ago by Markus Ringnér

Resolution: fixed
Status: assigned → closed

This ticket has been fixed in KNN.h with [1112].

Leftovers:

  • How to truncate outputs to a winning class is related to the discussion in ticket:270 and does not belong here. Perhaps it should be its own ticket but I think it can stay in ticket:270 until addressing this issue gets momentum.
  • If there is a tie among the distances, the tied neighbors end up in some order. This has two consequences, which matter if the tied neighbors belong to different classes. (i) If the tied neighbors have ranks k and k+1, only one of them will be used in the predictions. (ii) If the tied neighbors have ranks j and j+1 with j+1 <= k, then their votes will differ under e.g. reciprocal rank weighting of neighbor votes. I think we can live with this.