#535 closed defect (fixed)
qQuantileNormalization does not work with missing values represened with NaNs and weight 0
Reported by: | Jari Häkkinen | Owned by: | Jari Häkkinen |
---|---|---|---|
Priority: | critical | Milestone: | yat 0.5.3 |
Component: | normalizer | Version: | 0.5.2 |
Keywords: | Cc: |
Description (last modified by )
related to ticket:536
The reason for this that the weighted normalize function in qQuantileNormalize class does not perform a proper sorting with NaN values, the
std::vector<size_t> sorted_index(last-first); utility::sort_index(utility::data_iterator(first), utility::data_iterator(last), sorted_index);
call fails to sort the vector. The following printout in the function
traits2.data(result+si) = traits1.data(first+si) - correction; std::cout << "2: " << traits2.data(result+si) << " = 1: " << traits1.data(first+si) <<" - c: " << correction << std::endl;
should produce an ascending order of traits1.data(first+si) which is note the case:
2: 1000.98 = 1: 1000.97 - c: -0.00846884 2: 1001.01 = 1: 1001 - c: -0.00846884 2: 1001.15 = 1: 1001.14 - c: -0.00846884 2: nan = 1: nan - c: -0.00846884 2: 9326.19 = 1: 9326.18 - c: -0.00846884 2: 1001.22 = 1: 1001.21 - c: -0.00846884 2: nan = 1: nan - c: -0.00846884 2: 17830.4 = 1: 17830.4 - c: -0.00846884 2: 1928.96 = 1: 1928.95 - c: -0.00846884 2: 3787.78 = 1: 3787.77 - c: -0.00846884 2: 3537.43 = 1: 3537.42 - c: -0.00846884 2: nan = 1: nan - c: -0.00846884 2: 5378.82 = 1: 5378.81 - c: -0.00846884 ...
Change History (21)
comment:1 follow-up: 3 Changed 14 years ago by
comment:2 follow-up: 4 Changed 14 years ago by
Similar bugs occur elsewhere.
- VectorMutable::sort uses
std::sort
withless<double>
.
- Percentiler has the exact same problem.
- And we need to check in the Classifiers too.
I do not think we should fix these bugs within 0.5.x. Should we implement the work-around for the original bug?
comment:3 follow-up: 5 Changed 14 years ago by
Replying to peter:
OK, finally I understand the problem.
I suggest two different solutions:
- For the trunk I think we can create a functor that makes the sorting OK. It can for example treat NaN as inf, which would solve the problem.
- We need to fix this for 0.5.x too. Can we replace NaN with inf and then afterwards replace inf with NaN. That is of course not good in case data contains inf, so perhaps we should remember which values we modified in order to avoid extra re-modifications.
If we expect future versions to keep NaN as NaN without messing with Infs after normalization we should restore NaN already in 0.5.3. Since the book keeping of NaN is needed anyway one can simply replace NaN with 0 (or really any number since it is 0 weight anyway ... and yes Inf will do also).
In qQN.cc there is an example of a work around.
comment:4 follow-up: 6 Changed 14 years ago by
Replying to peter:
I do not think we should fix these bugs within 0.5.x. Should we implement the work-around for the original bug?
I can live without the work around in 0.5.x but then we should release a 0.6 ASAP targeting this bug. The work around for the original bug is straight forward, but do we want to do the same everywhere?
comment:5 Changed 14 years ago by
Replying to jari:
If we expect future versions to keep NaN as NaN without messing with Infs after normalization we should restore NaN already in 0.5.3. Since the book keeping of NaN is needed anyway one can simply replace NaN with 0 (or really any number since it is 0 weight anyway ... and yes Inf will do also).
Yes, it doesn't matter so much which number we choose, but we should be consistent for all fixes and the fix in 0.5.3 should give the same behavior as the fix in trunk. Therefore, I think inf
is a natural choice.
comment:6 Changed 14 years ago by
Replying to jari:
Replying to peter:
I do not think we should fix these bugs within 0.5.x. Should we implement the work-around for the original bug?
I can live without the work around in 0.5.x but then we should release a 0.6 ASAP targeting this bug. The work around for the original bug is straight forward, but do we want to do the same everywhere?
OK, let's implement the workarounds in 0.5.3 everywhere.
comment:7 Changed 14 years ago by
Description: | modified (diff) |
---|
created a ticket:536 for the trunk implementation
comment:8 Changed 14 years ago by
A better solution (without changing the API) is to declare the functor locally in the function in question. I will try that solution for the VectorMutable::sort
.
comment:9 Changed 14 years ago by
comment:10 Changed 14 years ago by
(In [1926]) Ignoring zero weights in Percentiler so algorithm works also when {value=nan, weight=0}
comment:11 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
(In [1930]) avoid copying in MatrixWeighted?(istream), i.e., avoid use of intermediate Matrix. closes #535
comment:12 Changed 14 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
incorrect message in r1930
comment:14 Changed 14 years ago by
Owner: | changed from Peter to Jari Häkkinen |
---|---|
Status: | reopened → new |
comment:15 Changed 14 years ago by
Status: | new → assigned |
---|
comment:16 Changed 14 years ago by
comment:17 Changed 14 years ago by
(In [1946]) Addresses #535. Another fix for the poor implentation in changeset:1943, now the test actually test what was intended.
comment:18 Changed 14 years ago by
comment:19 Changed 14 years ago by
comment:20 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Resolved since changeset:1948
OK, finally I understand the problem.
I suggest two different solutions: