Opened 14 years ago

Closed 13 years ago

#259 closed defect (fixed)

KNN/NBC/NCC should support both MatrixLookup and MatrixLookupWeighted

Reported by: Peter Owned by: Markus Ringnér
Priority: major Milestone: yat 0.4
Component: classifier Version: trunk
Keywords: Cc:

Description (last modified by Markus Ringnér)

Matrix-based classifiers such as KNN, NBC and NCC should support both MatrixLookup? and MatrixLookupWeighted?. In particular, the two functions make_classifier and predict should work for both types of lookups and they should throw a run-time error if use of some other kind of lookup is attempted.

If both training and test data are unweighted a faster implementation not depending on weights at all should exist.

Depends on ticket:287 and will benefit from ticket:267

Change History (28)

comment:1 Changed 14 years ago by Markus Ringnér

This was only because I wanted a decision on ticket:250. Should NCC and KNN be templates as is my current opinion, or should we pursue the adapter path. To exemplify, I implemented KNN as a template and NCC with an adapter. I did not want to do too much work in both directions: so now both of them only works for MatrixLookupWeighted?. Once we get a decision on ticket:250 both classifiers will work for both weighted and unweighted.

comment:2 Changed 14 years ago by Peter

Fair enough

comment:3 Changed 14 years ago by Markus Ringnér

Status: newassigned

comment:4 Changed 14 years ago by Markus Ringnér

Description: modified (diff)
Summary: Does NCC::predict only accepts MatrixLookupWeighted?KNN/NBC/NCC should support both MatrixLookup and MatrixLookupWeighted

comment:5 Changed 14 years ago by Markus Ringnér

In [948] and [931] some steps have been taken in this direction.

Left-overs:

1) All combinations of unweighted and weighted with training and test are not supported yet.

2) In NBC: predict only MatrixLookupWeighted? is supported.

Future:

To simplify code needed to resolve this ticket, I think iterators for matrices will be really useful (see ticket:263).

comment:6 Changed 14 years ago by Peter

In NBC::predict all kinds of DataLookup2D are supported. A dynamic_cast is done to check if it is a MatrixLookupWeight?, if so weighted statistics is used else unweighted. This means that if someone passes a KernelLookup? unweighted statistics will be used and the prediction is calculated as though it were a data matrix (and not a kernel).

I think that is fine, or should we throw if a KernelLookup? is passed? I prefer not to.

comment:7 in reply to:  6 Changed 14 years ago by Markus Ringnér

Replying to peter:

In NBC::predict all kinds of DataLookup2D are supported. A dynamic_cast is done to check if it is a MatrixLookupWeight?, if so weighted statistics is used else unweighted.

But the if-statement in the inner loop to check if things should be weighted or unweighted will result in the unweighted implementation being slower due to the support for weights, or?

Also, in NBC::train, weighted Averagers with all weights set to 1.0 are used when the training data is unweighted.

Should we ignore these two concerns and accept the NBC solution as an acceptable resolution to the specification that we should support unweighted classifiers for speed? With acceptance, I can remove duplicate code in NCC/KNN.

This means that if someone passes a KernelLookup? unweighted statistics will be used >and the prediction is calculated as though it were a data matrix (and not a kernel).

I think that is fine, or should we throw if a KernelLookup? is passed? I prefer not to.

I have memories (perhaps incorrect) of you arguing that the predict functions should complain if lookups of incorrect types are passed, which is why I added this to NCC and KNN. Either way is fine with me, so should I change in NCC and KNN to comply with preferring not to throw in this case?

comment:8 Changed 14 years ago by Peter

Since we support weighted data, I have to check whether data is weighted or not, and a check always costs time. If you are clever perhaps you can move the cost to compiler time.

Anyway, one could of course move the check outside the loop, and thereby gain some speed, but this would imply we need two identical loops in if else blocks, and I think that is stupid.

You are right. I am using AveragerWeighted? also for unweighted indata. I should probably change that .

I can't remember saying that, but fortunately I don't remember everything I say. I don't have strong opinion about this. Should we throw or just behave strange?

comment:9 in reply to:  8 Changed 14 years ago by Markus Ringnér

Replying to peter:

Since we support weighted data, I have to check whether data is weighted or not, and a check always costs time. If you are clever perhaps you can move the cost to compiler time.

No, one check is what I had in mind and is fine given the present structure.

If we ever get around this at compile-time, I think it should be through iterators in the 2D-lookups. If we get such iterators to work nicely then all the selection of weighted and unweighted could hopefully be done by the algorithms.

Anyway, one could of course move the check outside the loop, and thereby gain some speed, but this would imply we need two identical loops in if else blocks, and I think that is stupid.

That is what I wanted feedback on. I duplicated a lot of loops in NCC and KNN to make the unweighted case perform as close as possible in speed to what it would do in a world where NCC/KNN did not support weights. I thought such performace was a design specification. But I can easily live with your solution. On the other hand if unweighted does not result in much better performance than weighted, things in classifier could be much simpler by first converting unweighted things to weighted and only implement one path through functions.

You are right. I am using AveragerWeighted? also for unweighted indata. I should probably change that .

OK.

I can't remember saying that, but fortunately I don't remember everything I say. I don't have strong opinion about this. Should we throw or just behave strange?

Since it only requires one additional else at the end that never should be reached I think we can afford to throw. But if we decide not to it is fine with me, my main interest is to get all the classifiers similar.

comment:10 Changed 14 years ago by Peter

I have fixed the issues regarding NBC in [959], [960], and [961].

comment:11 Changed 13 years ago by Markus Ringnér

I have fixed the issues regarding NCC in [1007] as follows:

Weighted test data:
If training data is weighted, everything is weighted (=>centroids have weight 0 for missing data and weight 1 for data [we have previously agreed that once weighted averages are calculated the averages have weight 1]).
If training data is unweighted, weighted calculations are used with centroids having weight 1 everywhere.

Unweighted test data:
If training data is unweighted, everything is unweighted.
If training data is weighted, everyhing is still unweighted and this means that if there are missing values in the centroids then there will be NaNs in the final prediction result.

The last of the 4 cases can be a concern to someone. Reading Peter's code in NBC I think this is the way it was solved in NBC. I think this is fine but should be documented when the SupervisedClassifier structure finally is documented. What do you think, has NCC been completed for this ticket?

comment:12 Changed 13 years ago by Markus Ringnér

Regarding KNN.

If the NCC (and NBC?) solution above is ok (see [1007]), KNN should be solved in the same way. In KNN the training data is stored as a MatrixLookup or MatrixLookupWeighted.

This means if the test data is weighted, the training data should be weighted in the calculations. At which level are potential conversions of unweighted training data best done? And vice-versa: converting weighted training data to unweighted in the calculations if test data is unweighted. To do these things, I think it would be useful to have functionality to do:

classifier::!DataLookup1D(ml,i,false)
and
classifier::!DataLookupWeighted1D(ml,i,false)

such that both of these work regardless of whether ml is a MatrixLookup or a MatrixLookupWeighted (filling in unity weights and stripping away weights, respectively). Does this make sense or is there a much simpler solution to all of this?

comment:13 in reply to:  11 Changed 13 years ago by Peter

Replying to markus:

I have fixed the issues regarding NCC in [1007] as follows:

Weighted test data:
If training data is weighted, everything is weighted (=>centroids have weight 0 for missing data and weight 1 for data [we have previously agreed that once weighted averages are calculated the averages have weight 1]).
If training data is unweighted, weighted calculations are used with centroids having weight 1 everywhere.

Unweighted test data:
If training data is unweighted, everything is unweighted.
If training data is weighted, everyhing is still unweighted and this means that if there are missing values in the centroids then there will be NaNs in the final prediction result.

The last of the 4 cases can be a concern to someone. Reading Peter's code in NBC I think this is the way it was solved in NBC. I think this is fine but should be documented when the SupervisedClassifier structure finally is documented. What do you think, has NCC been completed for this ticket?

The difference between NCC and NBC is that is NBC you're not only estimating first moment (centroid) but also second moment (variance). The question here is what will the estimations turn out to be if you have no data to base estimations on. In NBC you actually need two data point because estimating variance from one data point is pretty meaningless. In NBC mean and variance are set to Nan to flag this case. However, later on in prediction there is a check and if a feature is a Nan the feature is ignored (for that class). See line 160 for weighted case

 // taking care of NaN and missing training features
 if (mlw->weight(i, label) && !std::isnan(sigma2_(i, label))) {

and line 178 for unweighted case

 // Ignoring missing features
 if (!std::isnan(sigma2_(i, label)))

comment:14 in reply to:  12 Changed 13 years ago by Peter

Replying to markus:

Regarding KNN.

If the NCC (and NBC?) solution above is ok (see [1007]), KNN should be solved in the same way. In KNN the training data is stored as a MatrixLookup or MatrixLookupWeighted.

This means if the test data is weighted, the training data should be weighted in the calculations. At which level are potential conversions of unweighted training data best done? And vice-versa: converting weighted training data to unweighted in the calculations if test data is unweighted. To do these things, I think it would be useful to have functionality to do:

classifier::!DataLookup1D(ml,i,false)
and
classifier::!DataLookupWeighted1D(ml,i,false)

such that both of these work regardless of whether ml is a MatrixLookup or a MatrixLookupWeighted (filling in unity weights and stripping away weights, respectively). Does this make sense or is there a much simpler solution to all of this?

Essentially you wish a classifier::!DataLookupWeighted1D(ml,i,false)

I don't think that is trivial because !DataLookupWeighted1D internally stores a MatrixLookupWeighted, a bool flagging for row/column vector, and size_t telling which row/column. This implies one would need to be able to construct a MatrixLookupWeighted from a MatrixLookup. I think this is wrong way to go. Instead we should solve this using iterators, which is suggested elsewhere. If we get a well working iterators on matrix class, !DataLookupWeighted1D will be deprecated...

comment:15 Changed 13 years ago by Jari Häkkinen

Description: modified (diff)

comment:16 Changed 13 years ago by Markus Ringnér

Description: modified (diff)

comment:17 Changed 13 years ago by Markus Ringnér

If row/column iterators get implemented according to ticket:267 and Supervised classifier gets restructured according to ticket:287 such that predict will turn into two functions predict(MatrixLookup) and predict(MatrixLookupWeighted) for NCC/KNN/NBC, there are still some issues to discuss. I am using NCC as an example.

Train
This is straighforward with two different functions one for unweighted and one for weighted training data. The resulting centroids are stored in a matrix and by definition the centroids are unweighted (the weights have been used to calculate averages). However in the weighted case the centroids may have missing values which have to be taken care of when predicting.

Predict

predict(MatrixLookupWeighted)

This case is straightforward since weighted calculations have to be used anyway, a MatrixLookupWeighted can be constructed from the centroids matrix regardless of whether the centroids have missing values or not.

predict(MatrixLookup)

This is the case for which it has been difficult to get a single implementation. If the centroids contain no missing values they can be used directly or through a MatrixLookup (depending on whether matrix itself has row/column iterators) and a completely unweighted implementation is used. If the centroids have missing values, a weighted calculation has to be used similar to the first predict above (but this function can not be called unless unity weights are added to the test data). This means two essentially identical implementations will exist in this function making for a total of three very similar implementations instead of two (weighted/unweighted). So how does one turn this into a single implementation? I do not think one can hide it in the call to distance since the iterator types would have to be known at compile-time. Perhaps this is more clear in KNN where the training data is stored as the base class of MatrixLookup and MatrixLookupWeighted (both types of training data should be supported), then a call to such as distance(training_data.col_begin(i),training_data.col_end(i),test_data.col_begin(j)) will not work since the template argument to the first iterator type to distance cannot be resolved based on the dynamic type of the training data (non-existent at compile time). If this worked one could even get away with a single predict function and hide all the weight/unweight decisions in distance.

So, any suggestions for a design to address this issue? Templates do not seem to be the way. I guess we do not want to make the classifiers templatized on both training and test data types as this should be more flexible than being decided at compile-time. Making the member function templates rather than a template class does not work either since we can not templatize virtual functions. If we remove the base class SupervisedClassifier when we implement ticket:287 perhaps things open up a bit for a template solution. But for KNN the training data type would have to be decided when the KNN is constructed, whereas the predict functions could turn into a single templatized function.

comment:18 Changed 13 years ago by Peter

In context of ticket:287 I am investigating what happens if we remove base class DataLookup2D. If decide to do that we cannot store data in KNN as DataLookup2D* but must store the data as two pointers MatrixLookup* and MatrixLookup2D of which one is NULL. What are the consequenses for that here?

comment:19 in reply to:  17 ; Changed 13 years ago by Peter

Replying to markus:

predict(MatrixLookup)

This is the case for which it has been difficult to get a single implementation. If the centroids contain no missing values they can be used directly or through a MatrixLookup (depending on whether matrix itself has row/column iterators) and a completely unweighted implementation is used. If the centroids have missing values, a weighted calculation has to be used similar to the first predict above (but this function can not be called unless unity weights are added to the test data).

What is the problem with adding unity weights? Remember that you can construct a MatrixLookupWeighted from a MatrixLookup? and it is very cheap (the weights are stored in a 1x1 Matrix.

comment:20 in reply to:  18 Changed 13 years ago by Markus Ringnér

Replying to peter:

In context of ticket:287 I am investigating what happens if we remove base class DataLookup2D. If decide to do that we cannot store data in KNN as DataLookup2D* but must store the data as two pointers MatrixLookup* and MatrixLookup2D of which one is NULL. What are the consequenses for that here?

I think that will work fine. Only a very minor constant overhead in predict (but this may be compensated by the removal of dynamic_casts). We still need to support converting a MatrixLookup into a MatrixLookupWeighted with unity weights though.

comment:21 in reply to:  19 Changed 13 years ago by Markus Ringnér

Replying to peter:

If the centroids have missing values, a weighted calculation has to be used similar to the first predict above (but this function can not be called unless unity weights are added to the test data).

What is the problem with adding unity weights? Remember that you can construct a MatrixLookupWeighted from a MatrixLookup? and it is very cheap (the weights are stored in a 1x1 Matrix.

Well now I remember, we decided this was ok at some earlier time point. My concern was not the construction of the MatrixLookupWeighted but the overhead of multiplying every calculated value with one, only because there are some missing values in the centroids. But since our requirement is that fast with minimum overhead only should apply to unweighted throughout (both training and test) I guess I agree with you on adding the unity weights. Afterall, our design is similar throughout, we do not have things such as AveragerPairOneWeighted, but AveragerPairWeighted that multiples with 1.0 in some places when one item is unweighted.

comment:22 Changed 13 years ago by Markus Ringnér

I could not figure out which function(s) to use to convert a MatrixLookup? to a MatrixLookupWeighted? with unity weights. Help?

comment:23 Changed 13 years ago by Peter

My mistake. I mixed up with the utility::Matrix constructor.

We need this functionality. Should we do it as a function in MatrixLookup? or as a constructor in MatrixLookupWeighted?. In the latter case we need to make underlying utility::Matrix visible to MatrixLookupWeighted?. Either by friendship or a function in MatrixLookup? returning it.

I think I prefer friendship solution because why should we let everybody see private part of MatrixLookup?...

comment:24 Changed 13 years ago by Markus Ringnér

I think:

  • The version with a MatrixLookupWeighted constructor seems most appealing.
  • Friendship is better than an otherwise not wanted public function.

comment:25 Changed 13 years ago by Peter

Ok, I can create that constructor.

comment:26 Changed 13 years ago by Peter

ticket:318 was marked as related.

comment:27 Changed 13 years ago by Markus Ringnér

KNN fixed in [1107]

comment:28 Changed 13 years ago by Markus Ringnér

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.