Opened 12 years ago

Closed 12 years ago

## #478 closed discussion (fixed)

# Generalize qQuantileNormalizer to work also on weighted data

| Reported by: | Peter | Owned by: | Peter |
|---|---|---|---|
| Priority: | minor | Milestone: | yat 0.5 |
| Component: | normalizer | Version: | trunk |
| Keywords: | | Cc: | |

### Description

related to ticket:401 (QuantileNormalizer)

How to define the algorithm when having weights around?

### Change History (39)

### comment:1 follow-up: 3 Changed 12 years ago by

### comment:2 Changed 12 years ago by

Step 2 should say partition *sorted* data! Obviously, the data is sorted before partitioning.

### comment:3 follow-up: 4 Changed 12 years ago by

Replying to peter:

I haven't thought this through very carefully, but I think it might be sufficient to generalize the qQuantileNormalizer::Partioner to also work on a weighted range. This implies that only steps 2)-4) need to be generalized. Steps 5) and 6) can be performed as before.

A weighted Partioner should be straightforward to implement.

I agree, but how should the sorting be done? What is the rank of a poor high-intensity value compared with a good high-intensity value? Maybe you suggest not using weights in the sorting?

### comment:4 follow-up: 5 Changed 12 years ago by

Replying to jari:

Replying to peter:

I haven't thought this through very carefully, but I think it might be sufficient to generalize the qQuantileNormalizer::Partioner to also work on a weighted range. This implies that only steps 2)-4) need to be generalized. Steps 5) and 6) can be performed as before.

A weighted Partioner should be straightforward to implement.

I agree, but how should the sorting be done? What is the rank of a poor high-intensity value compared with a good high-intensity value? Maybe you suggest not using weights in the sorting?

Yes, I thought the sort should ignore the weight. I noticed that there is no operator< in DataWeight. Perhaps one needs to take special care of the w=0 data points, but I don't think so.
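A sort that ignores the weight, as suggested here, could look like the following sketch. The pair layout merely stands in for DataWeight, and `sort_by_data` is a hypothetical name, not yat's API:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical stand-in for yat's DataWeight: (data, weight).
typedef std::pair<double, double> DataWeightPair;

// Sort a weighted range by data value only; the weight plays no role in the
// ordering, so w=0 points keep their natural rank among the data values.
void sort_by_data(std::vector<DataWeightPair>& v)
{
  std::sort(v.begin(), v.end(),
            [](const DataWeightPair& a, const DataWeightPair& b)
            { return a.first < b.first; });
}
```

With this comparator a poor (low-weight) high-intensity value still ranks above a good high-intensity value of lower intensity, which is exactly the open question raised above.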

### comment:5 Changed 12 years ago by

Replying to peter:

Noticed that there is no operator< in DataWeight.

There is actually an operator<, but it is declared outside the class, hm.

### comment:6 follow-up: 7 Changed 12 years ago by

Is this something we want in 0.5, i.e., do you (jari) need it for the baseplugin or won't you support missing values?

### comment:7 Changed 12 years ago by

Replying to peter:

Is this something we want in 0.5, i.e., do you (jari) need it for the baseplugin or won't you support missing values?

For the first release of the baseplugin there will be support for missing values, but only with binary weights. To accomplish this I only need qQuantiles with a proper q-number, and I'll have different numbers of data values in the parts depending on the number of missing values in each assay.

But I agree, it is better if we get the weighted feature into 0.5.

### comment:8 Changed 12 years ago by

| Milestone: | yat 0.x+ → yat 0.5 |
|---|---|
| Status: | new → assigned |

### comment:10 Changed 12 years ago by

### comment:11 follow-ups: 16 18 Changed 12 years ago by

There is a problem! The unweighted version assumes that the index of the target and the index of the source are the same. This is obviously not true in the weighted case, so the algorithm needs to be revised.

Is this a problem also in the unweighted case when sizes of target and source are not equal?

### comment:12 Changed 12 years ago by

### comment:13 Changed 12 years ago by

| Summary: | Generalize qQuantileNormalizer to work also on MatrixWeighted → Generalize qQuantileNormalizer to work also on weighted data |
|---|---|

### comment:14 Changed 12 years ago by

### comment:15 Changed 12 years ago by

### comment:16 follow-ups: 17 19 Changed 12 years ago by

Replying to peter:

There is a problem! The unweighted version assumes that the index of the target and the index of the source are the same. This is obviously not true in the weighted case, so the algorithm needs to be revised.

Is this a problem also in the unweighted case when sizes of target and source are not equal?

I understand the implementation as follows: the partitioner sorts the data and divides the N data points into Q buckets such that each bucket contains approximately N/Q data points. For each bucket the average data value and the average index are calculated, called averages and index, respectively.

We create one partitioner for the target and one for the source. As both are of size Q we can take the difference between averages: diff_averages = source_averages - target_averages. This vector is then the input for the cSpline, together with target_index.

My concern is that if N_target != N_source, the vectors target_index and source_index are typically not equal. Therefore, it is not obvious that it makes sense to interpolate {target_index; diff_averages}, because it is not obvious that target_index is the most relevant choice.

Some solutions I can think of:

- Let it be as it is. (always an option)

- Require that N_target equals N_source, which will invalidate this discussion.

- Skip the calculation of index and just use {0, 1, 2, ...} instead. If N is a multiple of Q this will give identical results.

- Rather than taking the difference target - source and then feeding the diff into cSpline, calculate the cSpline of the target and of the source and take the difference: cSpline(target) - cSpline(source). The two cSplines would then be based on different indexes. Would this change the behavior when N_source = N_target? I'm not very familiar with cSpline and therefore cannot answer that question, but clearly for linear interpolation it would not change the result. Anyhow, the change would likely not be large; the problem is that we diverge from the original algorithm described in the literature.

Thoughts?
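The bucketing described above can be sketched as follows. This is an illustrative reconstruction, not yat's code: the cut rule `end = (i+1)*N/Q` and all names are assumptions.

```cpp
#include <cstddef>
#include <vector>

// One bucket of the partitioner: mean data value and mean index.
struct Bucket { double average; double index; };

// Divide N sorted data points into Q buckets of roughly N/Q points each and
// record, per bucket, the average value and the average index.
std::vector<Bucket> partition(const std::vector<double>& sorted, std::size_t Q)
{
  std::vector<Bucket> buckets;
  const std::size_t N = sorted.size();
  std::size_t start = 0;
  for (std::size_t i = 0; i < Q; ++i) {
    // assumed cut rule; bucket sizes then differ by at most one
    std::size_t end = (i + 1) * N / Q;
    double sum = 0.0;
    for (std::size_t r = start; r < end; ++r)
      sum += sorted[r];
    Bucket b;
    b.average = sum / (end - start);
    b.index = 0.5 * (start + end - 1.0);  // mean of the indexes start..end-1
    buckets.push_back(b);
    start = end;
  }
  return buckets;
}
```

For N=10, Q=4 this cut rule yields bucket sizes 2, 3, 2, 3, i.e. the split `0 1 | 2 3 4 | 5 6 | 7 8 9` mentioned later in this thread.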

### comment:17 Changed 12 years ago by

Replying to peter:

Thoughts?

I forgot the weighted context. In general the *index* can vary more when having arbitrary weights. Option 2) (requiring N_target = N_source) does not solve the weighted case, because equally many data points do not guarantee the same weight distribution.

### comment:18 Changed 12 years ago by

Replying to peter:

There is a problem! The unweighted version assumes that the index of the target and the index of the source are the same. This is obviously not true in the weighted case, so the algorithm needs to be revised.

Is this a problem also in the unweighted case when sizes of target and source are not equal?

Yes. For my data it has not been a serious problem since the sizes are large and the difference in size is relatively small. There is a problem though, and we must address it.

### comment:19 Changed 12 years ago by

Replying to peter:

Replying to peter:

Is this a problem also in the unweighted case when sizes of target and source are not equal?

I understand the implementation as follows: the partitioner sorts the data and divides the N data points into Q buckets such that each bucket contains approximately N/Q data points. For each bucket the average data value and the average index are calculated, called averages and index, respectively.

We create one partitioner for the target and one for the source. As both are of size Q we can take the difference between averages: diff_averages = source_averages - target_averages. This vector is then the input for the cSpline, together with target_index.

My concern is that if N_target != N_source, the vectors target_index and source_index are typically not equal. Therefore, it is not obvious that it makes sense to interpolate {target_index; diff_averages}, because it is not obvious that target_index is the most relevant choice.

I agree. If N_target > N_source there will be a discontinuity at the higher indexes, at the border between the cspline interpolation and the linear extrapolation. Was that understandable? The effect is that the cspline-interpolated result at the last cspline point will not be equal to the first linear extrapolation point (at the high-index end).

The other case, N_target < N_source, is also problematic and will probably throw an exception if the size difference is too large.

Some solutions I can think of:

- Let it be as it is. (always an option)

No.

- Require that N_target equals N_source, which will invalidate this discussion.

No.

- Skip the calculation of index and just use {0, 1, 2, ...} instead. If N is a multiple of Q this will give identical results.

How will this help? I assume that the {0, 1, 2, ...} series is for the partitioner index? But I think this can be extended to a solution ... more in the next post (which will not be submitted immediately, sorry).

- Rather than taking the difference target - source and then feeding the diff into cSpline, calculate the cSpline of the target and of the source and take the difference: cSpline(target) - cSpline(source). The two cSplines would then be based on different indexes. Would this change the behavior when N_source = N_target? I'm not very familiar with cSpline and therefore cannot answer that question, but clearly for linear interpolation it would not change the result. Anyhow, the change would likely not be large; the problem is that we diverge from the original algorithm described in the literature.

Well, the algorithm is not too well defined in the literature. The above may be a possible solution, but I have another coming up in a later post.

cspline is a third-degree polynomial fit to a series of data where the resulting interpolation function must yield exactly the values that were used as input. I.e., if one input data pair was f(10)=9, then the interpolating function must return 9 for the argument 10. Between the defining points the fit is used. Based on this explanation of cspline, I do not think the current qQN implementation and your suggestion above will generally be the same for N_source=N_target.
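The pass-through property described here holds for any interpolating (as opposed to smoothing) scheme. A minimal sketch using piecewise-linear interpolation, chosen for brevity over an actual cubic spline; the names are illustrative, not yat's cSpline API:

```cpp
#include <cstddef>
#include <vector>

// Evaluate a piecewise-linear interpolant through the pairs (x[i], y[i]).
// Like a cubic spline, it reproduces its input exactly: f(10)=9 as input
// implies evaluate(10) == 9. Assumes x is sorted and arg is within range.
double lerp_eval(const std::vector<double>& x, const std::vector<double>& y,
                 double arg)
{
  for (std::size_t i = 1; i < x.size(); ++i)
    if (arg <= x[i]) {
      double t = (arg - x[i - 1]) / (x[i] - x[i - 1]);
      return (1.0 - t) * y[i - 1] + t * y[i];
    }
  return y.back();
}
```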

### comment:20 follow-up: 22 Changed 12 years ago by

I think we can resolve the issue of different-sized target and source by rescaling the range of the source to the target. This should also work for the weighted case.

The loop in the cspline interpolation should be changed to

```cpp
double step=N_target/N_source; // careful ... integer division
double this_cspline_argument=start;
for (size_t row=start; row<=end; ++row) {
  size_t srow=sorted_index[row];
  result[srow] = first[srow] - cspline.evaluate(this_cspline_argument);
  this_cspline_argument+=step;
}
```

In this way a smaller N_source will be stretched to fit a larger range, and a larger N_source will be shrunk to a smaller range. Of course, N_source and N_target need to be adjusted to actually be the sizes of the interpolation ranges.
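The `// careful ... integer division` remark in the loop above is worth spelling out: if N_target and N_source are integer types (say size_t), the quotient truncates, and a source larger than the target would give step 0. A sketch of the intended step computation (the names are illustrative):

```cpp
#include <cstddef>

// Intended stretch/shrink factor: cast before dividing so a 10-point source
// mapped onto a 1000-point target gets step 100.0, and the reverse direction
// gets 0.01 rather than the truncated 0.
double step_factor(std::size_t n_target, std::size_t n_source)
{
  return static_cast<double>(n_target) / n_source;
}
```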

### comment:21 Changed 12 years ago by

I'm not really following here.

But I just realized that the vector index has a very different role from how I pictured the algorithm. I need to re-think here, because index scales with N in a way I did not anticipate.

### comment:22 follow-up: 24 Changed 12 years ago by

Replying to jari:

In this way a smaller N_source will be stretched to fit a larger range, and a larger N_source will be shrunk to a smaller range. Of course, N_source and N_target need to be adjusted to actually be the sizes of the interpolation ranges.

OK, the problem you fix here is different from the one I pointed out; the problem you fix is much more severe. Essentially what you are doing is rescaling the index in Partitioner to range between, say, 0 and 1. I would prefer that Partitioner took care of that scaling, but it doesn't matter.

Note that my suggestions above were not intended to solve the problem you are talking about, but a different problem/question mark (I'll come back to that in a bit).

### comment:23 follow-up: 25 Changed 12 years ago by

It's difficult to explain the problem here, so let's look at this patch first.

```diff
Index: test/normalization_test.cc
===================================================================
--- test/normalization_test.cc	(revision 1769)
+++ test/normalization_test.cc	(working copy)
@@ -126,6 +126,28 @@
   using namespace normalizer;
   suite.err() << "Testing qQuantileNormalizer\n";
+
+  std::vector<double> t;
+  while (t.size()<1000)
+    t.push_back(t.size());
+  std::cout << "target:" << std::endl;
+  qQuantileNormalizer pj_qqn(t.begin(), t.end(), 4);
+
+  std::vector<double> s;
+  while (s.size()<10)
+    s.push_back(s.size()*10.0);
+
+  std::cout << "\nsource:" << std::endl;
+  pj_qqn(s.begin(), s.end(), s.begin());
+
+  for (size_t i=0; i<s.size(); ++i)
+    std::cout << s[i] << std::endl;
+
+  exit(0);
+
+  std::vector<double> source;
+
   std::string data(test::filename("data/normalization_test.data"));
   if (utility::FileUtil(data.c_str()).permissions("r")) {
     suite.add(false);
Index: yat/normalizer/qQuantileNormalizer.cc
===================================================================
--- yat/normalizer/qQuantileNormalizer.cc	(revision 1769)
+++ yat/normalizer/qQuantileNormalizer.cc	(working copy)
@@ -58,9 +58,11 @@
     for (unsigned int r=start; r<end; ++r)
       av.add(sortedvec(r));
     average_(i) = av.mean();
-    index_(i) = 0.5*(end+start-1);
+    index_(i) = 0.5*(end+start-1.0)/sortedvec.size();
     start=end;
   }
+  std::cout << "index: " << index_ << std::endl;
+  std::cout << "average: " << average_ << std::endl;
 }
@@ -97,7 +99,7 @@
       throw std::runtime_error(ss.str());
     }
     average_(i) = av.mean();
-    index_(i) = sum_w + 0.5*av.sum_w();
+    index_(i) = (sum_w + 0.5*av.sum_w())/total_w;
     sum_w += av.sum_w();
   }
 }
```

The changes in `yat/normalizer/qQuantileNormalizer.cc` are just intended to fix the rescaling problem; I also added some debug output.

In the test we create two *linear datasets*, that is, the value depends on the index in a linear fashion (a*x+b). However, the constant a is different for source and target, and I'd expect the normalization to find that difference and *normalize* it.

That is not the case, but instead I get this output:

```
target:
index: 0.1245 0.3745 0.6245 0.8745
average: 124.5 374.5 624.5 874.5

source:
index: 0.05 0.3 0.55 0.8
average: 5 30 55 80
119.5
804.5
814.5
824.5
834.5
844.5
854.5
864.5
874.5
884.5
```

The problem is how we are treating the *index*. Even after my rescaling, the *index* vectors are not the same, which implies that taking the difference between the average vectors doesn't really make sense.

It's like if we compare the weather in Kulladal and Rockville, and we take the temperature in Rockville Feb 4 and temperature in Kulladal Apr 1. The difference in temperature has no meaning (at least not to me). See what I'm saying?

### comment:24 follow-up: 26 Changed 12 years ago by

Replying to peter:

OK, the problem you fix here is different from the one I pointed out; the problem you fix is much more severe. Essentially what you are doing is rescaling the index in Partitioner to range between, say, 0 and 1. I would prefer that Partitioner took care of that scaling, but it doesn't matter.

The partitioner can take care of the scaling. If you know how you'd like to do it, please go ahead and fix it.

### comment:25 follow-up: 29 Changed 12 years ago by

Replying to peter:

```
target:
index: 0.1245 0.3745 0.6245 0.8745
average: 124.5 374.5 624.5 874.5

source:
index: 0.05 0.3 0.55 0.8
average: 5 30 55 80
```

I am not really following you; I have to think about it. However, shouldn't the index be the same at the ends?

### comment:26 Changed 12 years ago by

Replying to jari:

Replying to peter:

OK, the problem you fix here is different from the one I pointed out; the problem you fix is much more severe. Essentially what you are doing is rescaling the index in Partitioner to range between, say, 0 and 1. I would prefer that Partitioner took care of that scaling, but it doesn't matter.

The partitioner can take care of the scaling. If you know how you'd like to do it, please go ahead and fix it.

Hm, changing the Partitioner alone was not sufficient; you need an adjustment in the normalization too, so your suggestion is probably preferable. Because of this, the normalization output above is bogus.

### comment:28 Changed 12 years ago by

I re-ran the test:

```
target = {0, 1, ..., 999}
source = {0, 10, ..., 90}
```

giving the following result:

```
target averages: 124.5 374.5 624.5 874.5
source averages: 5 30 55 80
diff averages: -119.5 -344.5 -569.5 -794.5
target index: 124.5 374.5 624.5 874.5
target index normalized: 1.245 3.745 6.245 8.745
source index: 0.5 3 5.5 8

source_org  source_normalized
0           119.5
10          129.5
20          207.45
30          307.45
40          407.45
50          507.45
60          607.45
70          707.45
80          807.45
90          884.5
```

Looks good except the end points.

### comment:29 Changed 12 years ago by

Replying to jari:

Replying to peter:

```
target:
index: 0.1245 0.3745 0.6245 0.8745
average: 124.5 374.5 624.5 874.5

source:
index: 0.05 0.3 0.55 0.8
average: 5 30 55 80
```

I am not really following you; I have to think about it. However, shouldn't the index be the same at the ends?

Well, the *quantum effect* works different depending on sample size.

For example, 0:9 will be split as `0 1 | 2 3 4 | 5 6 | 7 8 9`, so the right-hand end index will be (7+8+9)/3 / 10 = 0.8, whereas e.g. 0:3 would be split into `0 | 1 | 2 | 3` and the right-hand end index would be 3 / 4 = 0.75.
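The arithmetic above can be checked mechanically. Assuming the same cut rule as in the split examples (the last bucket starts at (Q-1)*N/Q; the function name is illustrative):

```cpp
#include <cstddef>

// Normalized mean index of the last of Q buckets over indexes 0..N-1,
// illustrating how the "quantum effect" depends on the sample size N.
double last_bucket_index(std::size_t N, std::size_t Q)
{
  std::size_t start = (Q - 1) * N / Q;  // first index of the last bucket
  return 0.5 * (start + N - 1.0) / N;   // mean of start..N-1, divided by N
}
```

For N=10, Q=4 the last bucket is {7, 8, 9} and the normalized index is 0.8; for N=4, Q=4 it is the single point {3} and the index is 0.75, matching the figures above.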

### comment:30 Changed 12 years ago by

### comment:31 Changed 12 years ago by

### comment:33 Changed 12 years ago by

(In [1778]) refs #478. Fixed so the tests for weighted qQN are now OK. To align the weighted and unweighted versions, I needed to change the definition of index. The weighted index is more naturally defined as the sum of weights for values less than x, plus half of the weight for x itself; the half-of-its-own-weight part makes it quite symmetric. Anyway, for unity weights it becomes, e.g., for a vector of size 4: 0.5, 1.5, 2.5, 3.5. Obviously this doesn't change anything in the behavior, since we are just adding 0.5 to all indexes.

I also decided to re-scale index inside the Partitioner, so the indexes are now within the range (0, 1).
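The index definition from [1778] can be sketched as follows, under the assumption that the range is already sorted by data value (the function name is illustrative, not yat's API):

```cpp
#include <cstddef>
#include <vector>

// index_i = (sum of weights before i + w_i/2) / total weight, i.e. the sum
// of weights with value less than x plus half of x's own weight, rescaled
// into the range (0, 1).
std::vector<double> weighted_index(const std::vector<double>& w)
{
  double total = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i)
    total += w[i];
  std::vector<double> idx;
  double cum = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i) {
    idx.push_back((cum + 0.5 * w[i]) / total);
    cum += w[i];
  }
  return idx;
}
```

For unity weights and size 4 this gives 0.125, 0.375, 0.625, 0.875, i.e. the 0.5, 1.5, 2.5, 3.5 sequence mentioned above divided by the total weight 4.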

The weighted version of normalize had to be implemented separately and could not use the unweighted one, because the weights come into play in this step as well.

I want to test with a weighted range as target too, and when that works I think we can close this ticket.

### comment:34 Changed 12 years ago by

### comment:35 Changed 12 years ago by

### comment:39 Changed 12 years ago by

| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |

