#335 closed defect (fixed)
training on empty classes in NCC/KNN/NBC
Reported by: | Peter | Owned by: | Peter |
---|---|---|---|
Priority: | minor | Milestone: | yat 0.4 |
Component: | classifier | Version: | trunk |
Keywords: | | Cc: | |
Description
Unweighted training in NCC assumes that every class is represented in data. Typically this is true, but not always. There is a comment in Target::nof_classes saying
\note there might be empty classes, i.e., classes with zero samples. This may happen when using Target(const std::vector<std::string>& labels, const Target&) or Target(const Target& org, const std::vector<size_t>& vec).
This implies that you can have a target such as {0,0,0,2,2,2}, in which case nof_classes() will return 3 but the centroid for class 1 is nonsense. Therefore I think we should check whether the Averager is empty and, if so, flag that centroid with a NaN. This is exactly how it is done in the weighted case, where there is a check whether w_sum() equals zero.
Change History (20)
comment:1 Changed 16 years ago by
Description: | modified (diff) |
comment:2 follow-up: 3 Changed 16 years ago by
comment:3 Changed 16 years ago by
Replying to peter:
- If there are no training samples for a class, clearly we can't predict and we should return NaN.
I agree.
- If there is not sufficient data for a feature in a class. In NCC it could be that there is no non-zero weight for a feature in a class, so we can't calculate the average for that feature. Should this feature be treated as having zero weight, or should we return NaN for this class (there is not sufficient data to give a prediction)?
I think we should produce a prediction and not NaN for this class. However, if we instead return NaN, I guess we are telling users to first select a set of features that are well represented in their training data before training a classifier. This last alternative is OK with me too, but I think I prefer the first.
comment:4 Changed 16 years ago by
Ok, I was unclear. The items above are not competing alternatives. Instead, consider them as two independent yes/no questions helping us to specify the classifiers.
I'll try to clarify 2): I see the training data as a matrix, one matrix for each class, in which each row is a feature and each column is a sample. In NCC you train by taking the average of each row and thereby getting a centroid. Item 2 above asks how we handle the case when a row is completely missing. Either we try to predict anyway, e.g. in NCC by calculating the distance using a weighted Distance (to handle the missing value), or we simply say that there is not sufficient data to make a prediction and return a NaN for this particular class.
Item 1 can be seen as a subset of item 2: how to handle the case when all rows are NaN, either because we have no columns or because the data only contains missing values. In any case, I think the only reasonable thing to return here is a NaN.
Question 2 is a more open question. I think I have suggested that when a centroid has missing values we predict as well as we can. I can go with that. I realize now while writing that there is a third case: what happens if the centroid and the test vector have no overlapping pair of non-zero weights?
comment:5 Changed 16 years ago by
Ok, I was apparently even more unclear. Regarding your items 1 and 2, I definitely agree that we should return NaN for item 1. For your item 2, you suggest two different solutions, and those two are competing alternatives. I tried to say that I preferred the first alternative but could accept the second. So when I was talking about the first and last alternative, I was referring to your two alternatives for item 2 and not to items 1 and 2. Now you mention a third alternative for item 2, which I think should produce a NaN similarly to item 1, regardless of what we do with the other item 2 cases.
comment:6 Changed 16 years ago by
Ok, that is fine with me. I have to look into NBC so that it behaves as we want.
Regarding my last note, the case when there is no overlap in weights, say:
test_data = {123, 1, 232, 322}, test_w = {0, 0, 0, 0.98}, centroid_ = {1.23, 1.12, 1.11, NaN}
In this case the distance will be - well, it depends on which Distance functor you use. With PearsonDistance you get 1.0 because AveragerPairWeighted::correlation returns 0.0 for this case. If you instead use EuclideanDistance you will get NaN, because in AveragerPairWeighted::msd there is no special check for when sum_w is zero, so it returns 0.0/0.0.
I start to think that the Averager classes are a bit weird and dangerous, returning zero when they are empty. I mean, sure, we can have all these checks outside the class, such as
double d = NaN; if (averager.sum_w()) d = averager.some_function();
or using a one-liner syntax. But isn't it more logical that the Averager classes return NaN, and then if you want the zero behavior you have to check for it yourself:
double d = 0; if (averager.sum_w()) d = averager.some_function();
I'm happy to hear your thoughts. I guess you think I'm a progressive youngster again who wants to change, change, change... but I'm happy to listen to your experience.
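The NaN-on-empty convention argued for here can be sketched as follows; the class below is illustrative only, not yat's AveragerWeighted:

```cpp
#include <cmath>
#include <limits>

// Illustrative sketch (not the yat API): a weighted averager that returns
// NaN when it is empty (sum of weights is zero). A caller who wants the
// "empty means zero" behavior has to check sum_w() explicitly.
class WeightedAverager {
public:
    void add(double x, double w) { sum_wx_ += w * x; sum_w_ += w; }
    double sum_w() const { return sum_w_; }
    double mean() const {
        return sum_w_ > 0.0 ? sum_wx_ / sum_w_
                            : std::numeric_limits<double>::quiet_NaN();
    }
private:
    double sum_wx_ = 0.0;
    double sum_w_ = 0.0;
};
```

With this convention the zero behavior is opt-in at the call site, e.g. `double d = 0; if (a.sum_w()) d = a.mean();`, instead of every caller having to remember that an empty averager silently yields 0.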
comment:7 Changed 16 years ago by
I suggested returning NaN in AveragerWeighted when sum_w=0 in an e-mail a week ago, but got voted down by Peter and Jari (well, OK, Jari's response was a bit more confusing but ended up as a vote for returning 0). However, Peter's motivation was that AveragerWeighted should follow Averager, and Averager has a lot of "if (cant-calculate) return 0" code. So, if someone is willing to fix it to NaN consistently throughout these classes, I definitely vote for returning NaN.
comment:8 Changed 16 years ago by
More info regarding this NaN or 0 issue: I looked into the WeNNI implementation. It uses its own distance calculation (it was implemented before all these Distance and Averager classes). If there is no overlap in non-zero weights, 0/0 is returned. So it behaves like EuclideanDistance, which is good because it is a EuclideanDistance (but scaled).
comment:9 Changed 16 years ago by
I'm willing to do that change. It is a quick fix and will speed up Averager classes (slightly).
However, I wanna hear Jari's comment first.
comment:10 Changed 16 years ago by
Owner: | changed from Markus Ringnér to Jari Häkkinen |
comment:11 Changed 16 years ago by
I have two comments.
Make the new implementation consistent: NaN or 0.
And remember, refactoring is good ... and it is bad ... and it is hard to know when to stop. The reason for having a marketing department is that someone will say stop!
comment:12 Changed 16 years ago by
refs #335 - Changed the Averager classes so that they consistently return NaN when the Averager is empty or when the estimation otherwise ends up with things like zero-by-zero division. Previously zero was returned from some functions and NaN from others. I did not change anything in NCC.
comment:13 Changed 16 years ago by
Owner: | changed from Jari Häkkinen to Markus Ringnér |
comment:14 Changed 16 years ago by
Status: | new → assigned |
Summary: | training on empty classes in NCC → training on empty classes in NCC/KNN/NBC |
comment:15 follow-up: 16 Changed 16 years ago by
In [1142] (with an additional test in [1143]) things have been addressed as follows:
For NCC:
- If a class has no training samples all samples get prediction NaN for that class.
- If a test sample and a centroid have no overlap in variables with non-zero weights, the sample gets prediction NaN for that class, but the sample may get non-NaN predictions for other classes and the class may get non-NaN predictions for other samples.
With these revisions I view this ticket as fixed for NCC.
For KNN:
- If a class has no training samples all samples get prediction NaN for that class.
- If a test sample and a training sample (tr1) have no overlap in variables with non-zero weights, it is not clear to me how this should be addressed. If tr1 is not among the k nearest neighbors, I guess it should not affect things. If tr1 is among the k nearest neighbors, it means that there are not k neighbors with distance != NaN. Should only the neighbors with non-NaN distances vote? If so, this test sample will get fewer votes than k (and possibly fewer votes than other test samples). I guess things could be rescaled to k, but this seems like complicating things for an exceptional case. Comments?
- For KNN I have changed it so that k gets lowered to the number of training samples if k was larger than this. Alternatively this could be yat_asserted. Comments?
With these revisions the original concern with this ticket is resolved, but additional concerns raised in the ticket comments are not.
Left-overs
- I have not fixed anything related to this for NBC.
comment:16 Changed 16 years ago by
Replying to markus:
In [1142] (with an additional test in [1143]) things have been addressed as follows:
For NCC:
- If a class has no training samples all samples get prediction NaN for that class.
- If a test sample and a centroid have no overlap in variables with non-zero weights, the sample gets prediction NaN for that class, but the sample may get non-NaN predictions for other classes and the class may get non-NaN predictions for other samples.
With these revisions I view this ticket as fixed for NCC.
For KNN:
- If a class has no training samples all samples get prediction NaN for that class.
- If a test sample and a training sample (tr1) have no overlap in variables with non-zero weights, it is not clear to me how this should be addressed. If tr1 is not among the k nearest neighbors, I guess it should not affect things. If tr1 is among the k nearest neighbors, it means that there are not k neighbors with distance != NaN. Should only the neighbors with non-NaN distances vote? If so, this test sample will get fewer votes than k (and possibly fewer votes than other test samples). I guess things could be rescaled to k, but this seems like complicating things for an exceptional case. Comments?
When there is no overlap in non-zero weights, I think Distance should return NaN (refs #262). So the question is what happens when sorting the neighbors. If std is used, sorting with NaN is undefined; with gsl it could mean an eternal sort, right? The reasonable thing would be to interpret NaN as Inf in this context, to avoid nearest neighbors having distance NaN. If a nearest neighbor still has distance NaN, I guess the votes from these samples could be ignored. Rescaling makes no sense, in particular not when more advanced voting systems are used.
- For KNN I have changed so k gets lowered to the number of training samples if k was larger than this. Alternatively this could be yat_asserted. Comments?
I think that is fine.
With these revisions the original concern with this ticket is resolved, but additional concerns raised in the ticket comments are not.
Left-overs
- I have not fixed anything related to this for NBC.
I'll look into NBC. Give me this ticket when you think you are done with NCC and KNN.
comment:17 Changed 16 years ago by
Owner: | changed from Markus Ringnér to Peter |
Status: | assigned → new |
In [1156] fixed for KNN as follows. If a distance is NaN it is set to infinity, and nearest neighbors with distance infinity do not vote. This means that if a test sample and a training sample have no variables with non-zero weights in common, then the training sample does not vote for this test sample. The only disadvantage of this solution is that true infinity distances are not separable from NaN-turned-infinity distances, but the distances are not accessible from outside KNN anyway.
Now only NBC remains to be fixed.
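The rule from [1156] can be sketched roughly as below; the function name and the index-vector interface are illustrative, not yat's KNN implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>
#include <cstddef>

// Illustrative sketch (not yat's KNN): map NaN distances to infinity before
// sorting (NaN breaks strict weak ordering), lower k to the number of
// training samples if needed, and let only finite-distance neighbors vote.
std::vector<std::size_t> voting_neighbors(std::vector<double> dist, std::size_t k)
{
    const double inf = std::numeric_limits<double>::infinity();
    for (double& d : dist)
        if (std::isnan(d)) d = inf;  // NaN-turned-infinity

    std::vector<std::size_t> idx(dist.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;

    k = std::min(k, dist.size());    // lower k if there are too few samples
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return dist[a] < dist[b]; });

    std::vector<std::size_t> voters;
    for (std::size_t i = 0; i < k; ++i)
        if (dist[idx[i]] < inf)      // infinite distance: no vote
            voters.push_back(idx[i]);
    return voters;
}
```

As noted above, this cannot distinguish a true infinite distance from a NaN-turned-infinity one, but since the distances never leave KNN that is acceptable.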
comment:18 Changed 16 years ago by
Status: | new → assigned |
comment:19 Changed 16 years ago by
Resolution: | → fixed |
Status: | assigned → closed |
(in [1184]) Fixed NBC predict. Implementation of weighted test data was changed.
In the case when there is not sufficient data to estimate all features in a specific class, prediction results in NaN for that class. This is different from what we decided for NCC (see discussion above). I tried to solve it similarly to how it was done in NCC (and KNN), but could not find a sensible way to do it. I'll sketch the reason below:
The prediction output is a posterior probability that a test sample with data x belongs to class c: P(c|x) = P(c) * P(x|c) / P(x), where the rhs terms are the prior, the model, and the probability of having a sample with data x. Typically the last term can be ignored because it is constant with respect to c. The first term, the prior, I have chosen to be the same for all classes. One could perhaps argue that the prior should reflect the proportions of the training set. Anyway, the point is that if we cannot estimate all parameters, say we cannot estimate the mean and variance for feature i, then we should remove that feature from the equation above. In the prior this is no problem, as features are not involved. In the second term it is trivial to remove it (or technically, integrate over it). The problem occurs in the last term: if a feature is missing, P(x) is no longer constant over classes.
Therefore, I chose to return NaN in this case. The question is whether the other classifiers should conform to this behavior. I don't think so. I think different classifiers can behave differently as long as it is documented.
If you agree, I think we can close this ticket.
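As a sketch of this argument under a Gaussian naive Bayes model (the function below is illustrative and not yat's NBC): the log-likelihood log P(x|c) is a sum over features, so a single feature whose mean or variance could not be estimated makes the whole class score undefined, hence the NaN:

```cpp
#include <cmath>
#include <limits>
#include <vector>
#include <cstddef>

// Illustrative sketch (not yat's NBC): Gaussian log-likelihood log P(x|c)
// for one class. If any feature's mean or variance is missing (NaN) or
// degenerate, there is no sound way to just drop that feature (P(x) would
// no longer be constant across classes), so the class score becomes NaN.
double log_likelihood(const std::vector<double>& x,
                      const std::vector<double>& mean,
                      const std::vector<double>& var)
{
    double ll = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (std::isnan(mean[i]) || std::isnan(var[i]) || var[i] <= 0.0)
            return std::numeric_limits<double>::quiet_NaN();  // insufficient data
        const double d = x[i] - mean[i];
        ll += -0.5 * std::log(2.0 * M_PI * var[i]) - d * d / (2.0 * var[i]);
    }
    return ll;
}
```

With a flat prior, comparing these per-class log-likelihoods is equivalent to comparing posteriors, and a class with a NaN score simply yields a NaN prediction, which is the behavior chosen in [1184].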
comment:20 Changed 16 years ago by
I agree. I am currently writing documentation for NCC and KNN (ticket:75), realized that KNN and NCC behave somewhat differently, and came to the same conclusion: this should be documented rather than changed in the implementation.
This seems to be a problem in NBC too.
I think we need some general decision on what is needed for a classifier to return a prediction. This is related to the discussion in ticket:259.
Let's define it for NCC and then other classes can follow: