Opened 12 years ago

Closed 12 years ago

Last modified 4 years ago

#535 closed defect (fixed)

qQuantileNormalization does not work with missing values represened with NaNs and weight 0

Reported by: Jari Häkkinen Owned by: Jari Häkkinen
Priority: critical Milestone: yat 0.5.3
Component: normalizer Version: 0.5.2
Keywords: Cc:

Description (last modified by Peter)

related to ticket:536

The reason for this that the weighted normalize function in qQuantileNormalize class does not perform a proper sorting with NaN values, the

std::vector<size_t> sorted_index(last-first);
utility::sort_index(utility::data_iterator(first),
         utility::data_iterator(last), sorted_index);

call fails to sort the vector. The following printout in the function

traits2.data(result+si) = traits1.data(first+si) - correction;
std::cout << "2: " << traits2.data(result+si) << " = 1: " << 
traits1.data(first+si) <<" - c: " << correction << std::endl;

should produce an ascending order of traits1.data(first+si) which is note the case:

2: 1000.98 = 1: 1000.97 - c: -0.00846884
2: 1001.01 = 1: 1001 - c: -0.00846884
2: 1001.15 = 1: 1001.14 - c: -0.00846884
2: nan = 1: nan - c: -0.00846884
2: 9326.19 = 1: 9326.18 - c: -0.00846884
2: 1001.22 = 1: 1001.21 - c: -0.00846884
2: nan = 1: nan - c: -0.00846884
2: 17830.4 = 1: 17830.4 - c: -0.00846884
2: 1928.96 = 1: 1928.95 - c: -0.00846884
2: 3787.78 = 1: 3787.77 - c: -0.00846884
2: 3537.43 = 1: 3537.42 - c: -0.00846884
2: nan = 1: nan - c: -0.00846884
2: 5378.82 = 1: 5378.81 - c: -0.00846884
...

Change History (21)

comment:1 Changed 12 years ago by Peter

OK, finally I understand the problem.

I suggest two different solutions:

  1. For the trunk I think we can create a functor that makes the sorting OK. It can for example treat NaN as inf, which would solve the problem.
  1. We need to fix this for 0.5.x too. Can we replace NaN with inf and then afterwards replace inf with NaN. That is of course not good in case data contains inf, so perhaps we should remember which values we modified in order to avoid extra re-modifications.

comment:2 Changed 12 years ago by Peter

Similar bugs occur elsewhere.

  • VectorMutable::sort uses std::sort with less<double>.
  • Percentiler has the exact same problem.
  • And we need to check in the Classifiers too.

I do not think we should fix these bugs within 0.5.x. Should we implement the work-around for the original bug?

comment:3 in reply to:  1 ; Changed 12 years ago by Jari Häkkinen

Replying to peter:

OK, finally I understand the problem.

I suggest two different solutions:

  1. For the trunk I think we can create a functor that makes the sorting OK. It can for example treat NaN as inf, which would solve the problem.
  1. We need to fix this for 0.5.x too. Can we replace NaN with inf and then afterwards replace inf with NaN. That is of course not good in case data contains inf, so perhaps we should remember which values we modified in order to avoid extra re-modifications.

If we expect future versions to keep NaN as NaN without messing with Infs after normalization we should restore NaN already in 0.5.3. Since the book keeping of NaN is needed anyway one can simply replace NaN with 0 (or really any number since it is 0 weight anyway ... and yes Inf will do also).

In qQN.cc there is an example of a work around.

comment:4 in reply to:  2 ; Changed 12 years ago by Jari Häkkinen

Replying to peter:

I do not think we should fix these bugs within 0.5.x. Should we implement the work-around for the original bug?

I can live without the work around in 0.5.x but then we should release a 0.6 ASAP targeting this bug. The work around for the original bug is straight forward, but do we want to do the same everywhere?

comment:5 in reply to:  3 Changed 12 years ago by Peter

Replying to jari:

If we expect future versions to keep NaN as NaN without messing with Infs after normalization we should restore NaN already in 0.5.3. Since the book keeping of NaN is needed anyway one can simply replace NaN with 0 (or really any number since it is 0 weight anyway ... and yes Inf will do also).

Yes, it doesn't matter so much which number we choose, but we should be consistent for all fixes and the fix in 0.5.3 should give the same behavior as the fix in trunk. Therefore, I think inf is a natural choice.

comment:6 in reply to:  4 Changed 12 years ago by Peter

Replying to jari:

Replying to peter:

I do not think we should fix these bugs within 0.5.x. Should we implement the work-around for the original bug?

I can live without the work around in 0.5.x but then we should release a 0.6 ASAP targeting this bug. The work around for the original bug is straight forward, but do we want to do the same everywhere?

OK, let's implement the workarounds in 0.5.3 everywhere.

comment:7 Changed 12 years ago by Peter

Description: modified (diff)

created a ticket:536 for the trunk implementation

comment:8 Changed 12 years ago by Peter

A better solution (without changing the API) is to declare the functor locally in the function in question. I will try that solution for the VectorMutable::sort.

comment:9 Changed 12 years ago by Peter

(In [1924]) Fixing NaN treatment in BectorMutable::sort (refs #535). The plan with using a local functor did not work out well

comment:10 Changed 12 years ago by Peter

(In [1926]) Ignoring zero weights in Percentiler so algorithm works also when {value=nan, weight=0}

comment:11 Changed 12 years ago by Peter

Resolution: fixed
Status: newclosed

(In [1930]) avoid copying in MatrixWeighted?(istream), i.e., avoid use of intermediate Matrix. closes #535

comment:12 Changed 12 years ago by Peter

Resolution: fixed
Status: closedreopened

incorrect message in r1930

comment:13 Changed 12 years ago by Jari Häkkinen

(In [1943]) Addresses #535. Added a test that fails on nan

comment:14 Changed 12 years ago by Jari Häkkinen

Owner: changed from Peter to Jari Häkkinen
Status: reopenednew

comment:15 Changed 12 years ago by Jari Häkkinen

Status: newassigned

comment:16 Changed 12 years ago by Jari Häkkinen

(In [1945]) Addresses #535. Changeset:1943 contained defect test for the issue ... it fails but due to other reasons than what we try to catch, this changeset fixes the test for nan in source.

comment:17 Changed 12 years ago by Jari Häkkinen

(In [1946]) Addresses #535. Another fix for the poor implentation in changeset:1943, now the test actually test what was intended.

comment:18 Changed 12 years ago by Jari Häkkinen

(In [1947]) Addresses #535. NaNs? are replaced with Inf in sorting in qQuantileNormalizer.

comment:19 Changed 12 years ago by Jari Häkkinen

(In [1948]) Addresses #535. No need to replace NaNs? in the unweighted case. Peter, thanks for pointing this out for me.

comment:20 Changed 12 years ago by Jari Häkkinen

Resolution: fixed
Status: assignedclosed

Resolved since changeset:1948

comment:21 Changed 4 years ago by Peter

ticket #875 was marked as related

Note: See TracTickets for help on using tickets.