Opened 12 years ago

Closed 12 years ago

#478 closed discussion (fixed)

Generalize qQuantileNormalizer to work also on weighted data

Reported by: Peter
Owned by: Peter
Priority: minor
Milestone: yat 0.5
Component: normalizer
Version: trunk
Keywords:
Cc:

Description

related to ticket:401 (QuantileNormalizer)

How should the algorithm be defined when there are weights involved?

Change History (39)

comment:1 Changed 12 years ago by Peter

The unweighted version is described as follows (a sketch of the partitioning steps follows the list):

  1. Data is not assumed to be sorted.
  2. Partition the target data in N parts. N must be 3 or larger because of requirements from the underlying cspline fit.
  3. Calculate the arithmetic mean for each part; the mean is assigned to the midpoint of each part.
  4. Do the same for the data to be transformed (called source here).
  5. For each part, calculate the difference between the target and the source. Now we have N differences d_i with associated ranks (the midpoint of each part).
  6. Create a cubic spline fit to this difference vector d. The resulting curve is used to recalculate all column values.
    1. Use the cubic spline fit for values within the cubic spline fit range [midpoint of 1st part, midpoint of last part].
    2. For data outside the cubic spline fit range use linear extrapolation, i.e., a constant shift: d_first for points below the fit range and d_last for points above the fit range.
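
For concreteness, a minimal sketch of steps 1)-3) in plain C++ (names are illustrative only; the real code lives in qQuantileNormalizer::Partitioner):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch: sort, partition into q parts, and store for each part the
// arithmetic mean and the midpoint index. Assumes data.size() >= q.
void partition(std::vector<double> data, std::size_t q,
               std::vector<double>& average, std::vector<double>& index)
{
  std::sort(data.begin(), data.end());
  average.resize(q);
  index.resize(q);
  std::size_t start = 0;
  for (std::size_t i = 0; i < q; ++i) {
    std::size_t end = (i + 1) * data.size() / q;   // part boundary
    average[i] = std::accumulate(data.begin() + start,
                                 data.begin() + end, 0.0) / (end - start);
    index[i] = 0.5 * (start + end - 1.0);          // midpoint of the part
    start = end;
  }
}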

I haven't thought this through very carefully, but I think it might be sufficient to generalize the qQuantileNormalizer::Partitioner to also work on a weighted range. This implies that only steps 2)-4) need to be generalized. Steps 5) and 6) can be performed as before.

A weighted Partitioner should be straightforward to implement.
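
As a hypothetical sketch, the weighted generalization of step 3) could use a weighted mean per part, sum(w*x)/sum(w):

#include <cstddef>
#include <vector>

// Sketch: weighted mean of one part [start, end) of sorted data.
// Names are illustrative only.
double weighted_mean(const std::vector<double>& x,
                     const std::vector<double>& w,
                     std::size_t start, std::size_t end)
{
  double wx = 0.0;
  double sw = 0.0;
  for (std::size_t i = start; i < end; ++i) {
    wx += w[i] * x[i];
    sw += w[i];
  }
  return wx / sw;  // assumes sw > 0, i.e. the part carries some weight
}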

comment:2 Changed 12 years ago by Jari Häkkinen

Step 2 should say partition sorted data! Obviously data is sorted before partitioning.

comment:3 in reply to: 1; Changed 12 years ago by Jari Häkkinen

Replying to peter:

I haven't thought this through very carefully, but I think it might be sufficient to generalize the qQuantileNormalizer::Partitioner to also work on a weighted range. This implies that only steps 2)-4) need to be generalized. Steps 5) and 6) can be performed as before.

A weighted Partitioner should be straightforward to implement.

I agree, but how should the sorting be done? What is the rank of a poor high-intensity value compared with a good high-intensity value? Maybe you suggest not using weights in the sorting?

comment:4 in reply to: 3; Changed 12 years ago by Peter

Replying to jari:

Replying to peter:

I haven't thought this through very carefully, but I think it might be sufficient to generalize the qQuantileNormalizer::Partitioner to also work on a weighted range. This implies that only steps 2)-4) need to be generalized. Steps 5) and 6) can be performed as before.

A weighted Partitioner should be straightforward to implement.

I agree, but how should the sorting be done? What is the rank of a poor high-intensity value compared with a good high-intensity value? Maybe you suggest not using weights in the sorting?

Yes, I thought the sort should ignore the weight. I noticed that there is no operator< in DataWeight. Perhaps one needs to take special care of the w=0 data points, but I don't think so.
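
A sketch of what such a comparison could look like (the actual DataWeight layout in yat may differ):

// Sketch: sort on the data value only, ignoring the weight.
struct DataWeight {
  double data;
  double weight;
};

inline bool operator<(const DataWeight& lhs, const DataWeight& rhs)
{
  return lhs.data < rhs.data;  // weight intentionally ignored
}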

comment:5 in reply to: 4; Changed 12 years ago by Peter

Replying to peter:

Noticed that there is no operator< in DataWeight.

There is actually an operator< but it is declared outside the class, hm.

comment:6 Changed 12 years ago by Peter

Is this something we want in 0.5, i.e., do you (jari) need it for the baseplugin or won't you support missing values?

comment:7 in reply to: 6; Changed 12 years ago by Jari Häkkinen

Replying to peter:

Is this something we want in 0.5, i.e., do you (jari) need it for the baseplugin or won't you support missing values?

For the first release of the baseplugin there will be support for missing values, but only with binary weights. To accomplish this I only need qQuantiles with a proper q-number, and I'll have different amounts of data values in the parts depending on the number of missing values in each assay.

But I agree, it is better if we get the weighted feature into 0.5.

comment:8 Changed 12 years ago by Peter

Milestone: yat 0.x → yat 0.5
Status: new → assigned

comment:9 Changed 12 years ago by Peter

(In [1737]) addresses #478 - preparing for weighted version

comment:10 Changed 12 years ago by Peter

(In [1738]) added support for creation of a qQuantileNormalizer from a weighted range. Addresses #478

comment:11 Changed 12 years ago by Peter

There is a problem! The unweighted version assumes that the index of the target and the index of the source are the same. This is obviously not true in the weighted case, so the algorithm needs to be revised.

Is this a problem also in the unweighted case when the sizes of target and source are not equal?

comment:12 Changed 12 years ago by Peter

(In [1740]) addresses #478 - implemented weighted normalization. Need to add tests, especially to test that the templates compile.

comment:13 Changed 12 years ago by Peter

Summary: Generalize qQuantileNormalizer to work also on MatrixWeighted → Generalize qQuantileNormalizer to work also on weighted data

comment:14 Changed 12 years ago by Peter

(In [1750]) added a compilation test for qQN::operator() with weighted iterators. refs #478

comment:15 Changed 12 years ago by Peter

(In [1751]) refs #478 - updated docs and simplified call to data_iterator and weight_iterator.

comment:16 in reply to: 11; Changed 12 years ago by Peter

Replying to peter:

There is a problem! The unweighted version assumes that the index of the target and the index of the source are the same. This is obviously not true in the weighted case, so the algorithm needs to be revised.

Is this a problem also in the unweighted case when the sizes of target and source are not equal?

I understand the implementation as follows: the partitioner sorts the data and divides the N data points into Q buckets such that each bucket contains approximately N/Q data points. For each bucket the average data value and the average index are calculated, called averages and index, respectively.

We create one partitioner for the target and one for the source. As both are of size Q we can take the difference between the averages: diff_averages = source_averages - target_averages. This vector is then the input for the cSpline... together with target_index.

My concern is that if N_target != N_source, the vectors target_index and source_index are typically not equal. Therefore, it is not obvious that it makes sense to interpolate {target_index; diff_averages}, because it is not obvious that target_index is the most relevant choice.
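
A small numeric illustration of the concern, using the part midpoints as computed by the sketch in comment 1 (the sizes match the test further down: N_target=1000, N_source=10, Q=4):

#include <cstddef>
#include <iostream>
#include <vector>

// Sketch: print the midpoint index of each of the 4 parts for two
// differently sized ranges; the two index vectors clearly differ.
int main()
{
  for (std::size_t n : std::vector<std::size_t>{1000, 10}) {
    std::size_t start = 0;
    for (std::size_t i = 0; i < 4; ++i) {
      std::size_t end = (i + 1) * n / 4;
      std::cout << 0.5 * (start + end - 1.0) << ' ';
      start = end;
    }
    std::cout << '\n';  // 124.5 374.5 624.5 874.5  and  0.5 3 5.5 8
  }
}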

Some solutions I can think of:

  1. Let it be as it is. (always an option)
  2. Require that N_target equals N_source, which will invalidate this discussion.
  3. Skip the calculation of index and just use {0, 1, 2, ...} instead. If N is a multiple of Q this will give an identical result.
  4. Rather than taking the difference target - source and then feeding the diff to cSpline, calculate a cSpline of the target and one of the source and take the difference: cSpline(target) - cSpline(source). Then the two cSplines would be based on different indices. Would this change the behavior when N_source = N_target? I'm not very familiar with cSpline and therefore cannot answer that question, but clearly for linear interpolation it would not change the result. Anyhow, the change would likely not be large, but the problem is that we diverge from the original algorithm described in the literature.

Thoughts?

comment:17 in reply to: 16; Changed 12 years ago by Peter

Replying to peter:

Thoughts?

I forgot the weighted context. In general the index can vary more when there are arbitrary weights. Option 2) does not solve the weighted case because equally many data points does not guarantee the same weight distribution.

comment:18 in reply to: 11; Changed 12 years ago by Jari Häkkinen

Replying to peter:

There is a problem! The unweighted version assumes that the index of the target and the index of the source are the same. This is obviously not true in the weighted case, so the algorithm needs to be revised.

Is this a problem also in the unweighted case when the sizes of target and source are not equal?

Yes. For my data it has not been a serious problem since the sizes are large and the difference in size is relatively small. There is a problem though, and we must address it.

comment:19 in reply to: 16; Changed 12 years ago by Jari Häkkinen

Replying to peter:

Replying to peter:

There is a problem! The unweighted version assumes that the index of the target and the index of the source are the same. This is obviously not true in the weighted case, so the algorithm needs to be revised.

Is this a problem also in the unweighted case when the sizes of target and source are not equal?

I understand the implementation as follows: the partitioner sorts the data and divides the N data points into Q buckets such that each bucket contains approximately N/Q data points. For each bucket the average data value and the average index are calculated, called averages and index, respectively.

We create one partitioner for the target and one for the source. As both are of size Q we can take the difference between the averages: diff_averages = source_averages - target_averages. This vector is then the input for the cSpline... together with target_index.

My concern is that if N_target != N_source, the vectors target_index and source_index are typically not equal. Therefore, it is not obvious that it makes sense to interpolate {target_index; diff_averages}, because it is not obvious that target_index is the most relevant choice.

I agree. If N_target > N_source there will be a discontinuity at the higher indices, at the border between the cspline interpolation and the linear extrapolation. Was that understandable? The effect is that the cspline-interpolated result at the last cspline point will not be equal to the first linear extrapolation point (at the high index end).

The other case, N_target < N_source, is also problematic and will probably throw an exception if the size difference is too large.

Some solutions I can think of:

  1. Let it be as it is. (always an option)

No.

  2. Require that N_target equals N_source, which will invalidate this discussion.

No.

  3. Skip the calculation of index and just use {0, 1, 2, ...} instead. If N is a multiple of Q this will give an identical result.

How will this help? I assume that the {0, 1, 2, ...} series is for the partitioner index? But I think this can be extended to a solution ... more in the next post (that will not be submitted immediately, sorry).

  4. Rather than taking the difference target - source and then feeding the diff to cSpline, calculate a cSpline of the target and one of the source and take the difference: cSpline(target) - cSpline(source). Then the two cSplines would be based on different indices. Would this change the behavior when N_source = N_target? I'm not very familiar with cSpline and therefore cannot answer that question, but clearly for linear interpolation it would not change the result. Anyhow, the change would likely not be large, but the problem is that we diverge from the original algorithm described in the literature.

Well, the algorithm is not too well defined in the literature. The above may be a possible solution, but I have another coming up in a later post.

cspline is a piecewise 3rd-degree polynomial fit to a series of data where the resulting interpolation function must yield exactly the value that was used as input. I.e., if one data pair input was f(10)=9, then the interpolating function must return 9 for an argument of 10. Between the defining points the fit is used. Based on this explanation of cspline, I do not think the current qQN implementation and your suggestion above will generally be the same for N_source=N_target.
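
This defining property is easy to verify with, e.g., GSL's cspline routines (an illustration of the property only, not of yat's own interface):

#include <cassert>
#include <cmath>
#include <gsl/gsl_spline.h>

// Sketch: a cubic spline through the knots (x_i, y_i) reproduces y_i
// exactly at each knot; between knots the fitted polynomial is used.
int main()
{
  const double x[] = {0.0, 5.0, 10.0, 15.0};
  const double y[] = {1.0, 4.0, 9.0, 2.0};
  gsl_interp_accel* acc = gsl_interp_accel_alloc();
  gsl_spline* spline = gsl_spline_alloc(gsl_interp_cspline, 4);
  gsl_spline_init(spline, x, y, 4);
  assert(std::abs(gsl_spline_eval(spline, 10.0, acc) - 9.0) < 1e-12);
  gsl_spline_free(spline);
  gsl_interp_accel_free(acc);
  return 0;
}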

comment:20 Changed 12 years ago by Jari Häkkinen

I think we can resolve the issue of different sized target and source with rescaling the range of the source to the target. This should also work for the weighted case.

The loop in the cspline interpolation should be changed to

// cast to double to avoid integer division
double step = static_cast<double>(N_target) / N_source;
double this_cspline_argument = start;
for (size_t row = start; row <= end; ++row) {
  size_t srow = sorted_index[row];
  // shift each value by the interpolated difference at its rescaled rank
  result[srow] = first[srow] - cspline.evaluate(this_cspline_argument);
  this_cspline_argument += step;
}

In this way a smaller N_source will be stretched to fit a larger range, and a larger N_source will be shrunk to a smaller range. Of course, N_source and N_target need to be adjusted to actually be the sizes of the interpolation ranges.

comment:21 Changed 12 years ago by Peter

I'm not really following here.

But I just realized that the vector index has a very different role from how I pictured the algorithm. I need to re-think here, because index scales with N in a way I did not anticipate.

comment:22 in reply to: 20; Changed 12 years ago by Peter

Replying to jari:

In this way a smaller N_source will be stretched to fit a larger range, and a larger N_source will be shrunk to a smaller range. Of course, N_source and N_target need to be adjusted to actually be the sizes of the interpolation ranges.

OK, the problem you fix here is different from the one I pointed out. The problem you fix is much more severe. Essentially, what you are doing is rescaling the index in Partitioner to range between, say, 0 and 1. I would prefer that Partitioner took care of that scaling, but it doesn't matter.

Note that my suggestions above were not intended to solve the problem you are talking about, but a different problem/question mark (I'll come back to that in a bit).

comment:23 Changed 12 years ago by Peter

It's difficult to explain the problem here, so let's look at this patch first.

Index: test/normalization_test.cc
===================================================================
--- test/normalization_test.cc	(revision 1769)
+++ test/normalization_test.cc	(working copy)
@@ -126,6 +126,28 @@
 	using namespace normalizer;
 
 	suite.err() << "Testing qQuantileNormalizer\n";
+
+	std::vector<double> t;
+	while (t.size()<1000)
+		t.push_back(t.size());
+	std::cout << "target:" << std::endl;
+	qQuantileNormalizer pj_qqn(t.begin(), t.end(), 4);  
+
+	std::vector<double> s;
+	while (s.size()<10)
+		s.push_back(s.size()*10.0);
+
+	std::cout << "\nsource:" << std::endl;
+	pj_qqn(s.begin(), s.end(), s.begin());
+
+	for (size_t i=0; i<s.size(); ++i)
+		std::cout << s[i] << std::endl;
+
+	exit(0);
+
+	std::vector<double> source;
+
+
 	std::string data(test::filename("data/normalization_test.data"));
 	if (utility::FileUtil(data.c_str()).permissions("r")) {
 		suite.add(false);
Index: yat/normalizer/qQuantileNormalizer.cc
===================================================================
--- yat/normalizer/qQuantileNormalizer.cc	(revision 1769)
+++ yat/normalizer/qQuantileNormalizer.cc	(working copy)
@@ -58,9 +58,11 @@
 			for (unsigned int r=start; r<end; ++r)
 				av.add(sortedvec(r));
 			average_(i) = av.mean();
-			index_(i)   = 0.5*(end+start-1);
+			index_(i)   = 0.5*(end+start-1.0)/sortedvec.size();
 			start=end;
 		}
+		std::cout << "index: " << index_ << std::endl;
+		std::cout << "average: " << average_ << std::endl;
 	}
 
 
@@ -97,7 +99,7 @@
 				throw std::runtime_error(ss.str());
 			}
 			average_(i) = av.mean();
-			index_(i)   = sum_w + 0.5*av.sum_w();
+			index_(i)   = (sum_w + 0.5*av.sum_w())/total_w;
 			sum_w += av.sum_w();
 		}
 	}

The changes in yat/normalizer/qQuantileNormalizer.cc are just intended to fix the rescaling problem; I also added some debug output.

In the test we create two linear datasets, i.e., the value depends on the index in a linear fashion (a*x+b). However, the constant a is different for source and target, and I'd expect the normalization to find that difference and normalize it.

That is not the case; instead I get this output:

target:
index: 0.1245 0.3745 0.6245 0.8745
average: 124.5 374.5 624.5 874.5

source:
index: 0.05 0.3 0.55 0.8
average: 5 30 55 80
119.5
804.5
814.5
824.5
834.5
844.5
854.5
864.5
874.5
884.5

The problem is how we are treating the index. Even after my rescaling, the index vectors are not the same, which implies that taking the difference between the average vectors doesn't really make sense.

It's as if we compare the weather in Kulladal and Rockville by taking the temperature in Rockville on Feb 4 and the temperature in Kulladal on Apr 1. The difference in temperature has no meaning (at least not to me). See what I'm saying?

comment:24 in reply to: 22; Changed 12 years ago by Jari Häkkinen

Replying to peter:

OK, the problem you fix here is different from the one I pointed out. The problem you fix is much more severe. Essentially, what you are doing is rescaling the index in Partitioner to range between, say, 0 and 1. I would prefer that Partitioner took care of that scaling, but it doesn't matter.

The partitioner can take care of the scaling. If you know how you'd like to do it, please go ahead and fix it.

comment:25 in reply to: 23; Changed 12 years ago by Jari Häkkinen

Replying to peter:

target:
index: 0.1245 0.3745 0.6245 0.8745
average: 124.5 374.5 624.5 874.5

source:
index: 0.05 0.3 0.55 0.8
average: 5 30 55 80

I am not really following you; I have to think about it. However, shouldn't the index be the same at the ends?

comment:26 in reply to: 24; Changed 12 years ago by Peter

Replying to jari:

Replying to peter:

OK, the problem you fix here is different from the one I pointed out. The problem you fix is much more severe. Essentially, what you are doing is rescaling the index in Partitioner to range between, say, 0 and 1. I would prefer that Partitioner took care of that scaling, but it doesn't matter.

The partitioner can take care of the scaling. If you know how you'd like to do it, please go ahead and fix it.

Hm, changing the Partitioner alone was not sufficient; adjustments are needed in the normalization too, so your suggestion is probably preferable. Because of this, the normalization output above is bogus.

comment:27 Changed 12 years ago by Peter

(In [1771]) rescaling index - refs #478

comment:28 Changed 12 years ago by Peter

I re-ran the test:

target = {0, 1, ..., 999}
source = {0, 10, ..., 90}

giving the following result:

target averages: 124.5 374.5 624.5 874.5
source averages: 5 30 55 80
diff averages: -119.5 -344.5 -569.5 -794.5
target index: 124.5 374.5 624.5 874.5
target index normalized: 1.245 3.745 6.245 8.745
source index: 0.5 3 5.5 8

source_org source_normalized
0 119.5
10 129.5
20 207.45
30 307.45
40 407.45
50 507.45
60 607.45
70 707.45
80 807.45
90 884.5

Looks good except at the end points.

comment:29 in reply to: 25; Changed 12 years ago by Peter

Replying to jari:

Replying to peter:

target:
index: 0.1245 0.3745 0.6245 0.8745
average: 124.5 374.5 624.5 874.5

source:
index: 0.05 0.3 0.55 0.8
average: 5 30 55 80

I am not really following you; I have to think about it. However, shouldn't the index be the same at the ends?

Well, the quantization effect works differently depending on sample size.

For example, 0:9 will be split as 0 1 | 2 3 4 | 5 6 | 7 8 9, so the right-hand end index will be (7+8+9)/3 / 10 = 0.8, whereas e.g. 0:3 would be split into 0 | 1 | 2 | 3 and the right-hand end index would be 3 / 4 = 0.75.
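
A sketch reproducing those two numbers with the same integer part boundaries (illustrative code, not the library's):

#include <cstddef>
#include <iostream>

// Sketch: normalized midpoint index of the last of q parts. The
// average index of the last part equals its midpoint.
double last_normalized_index(std::size_t n, std::size_t q)
{
  std::size_t start = (q - 1) * n / q;   // first element of the last part
  return 0.5 * (start + n - 1.0) / n;    // midpoint index, rescaled by n
}

int main()
{
  std::cout << last_normalized_index(10, 4) << '\n';  // 0.8
  std::cout << last_normalized_index(4, 4) << '\n';   // 0.75
}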

comment:30 Changed 12 years ago by Peter

(In [1773]) adding a test to check that qQN gives the same result with a weighted source as with an unweighted source. The test failed, which implies the weighted and unweighted methods are not in sync. refs #478

comment:31 Changed 12 years ago by Peter

(In [1774]) refs #478 - fixed bugs in test code... the test still fails though.

comment:32 Changed 12 years ago by Peter

(In [1775]) refs #478 - added another test

comment:33 Changed 12 years ago by Peter

(In [1778]) refs #478. Fixed so the tests for weighted qQN are now OK. To align weighted and unweighted, I needed to change the definition of index. The weighted index is more natural as the sum of weights with value less than x plus half of the weight for x. The half-of-its-own-weight part makes it quite symmetric. Anyway, for unity weights it becomes, e.g., for a vector of size 4: 0.5, 1.5, 2.5, 3.5. Obviously this doesn't change anything in the behavior since we are just adding 0.5 to all indices.

Also, I decided to rescale index inside the Partitioner, so the indices are now within the range (0, 1).

The weighted version of normalize had to be implemented separately and could not reuse the unweighted one, because the weights come into play in this step as well.
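
A sketch of this index definition for sorted data (illustrative only; in the patch above the same quantity is computed per part):

#include <cstddef>
#include <vector>

// Sketch: for sorted data, the index of point i is the sum of weights
// before it plus half its own weight, rescaled with the total weight
// so that all indices fall in (0, 1).
std::vector<double> weighted_index(const std::vector<double>& w)
{
  double total = 0.0;
  for (double wi : w) total += wi;
  std::vector<double> index(w.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i) {
    index[i] = (sum + 0.5 * w[i]) / total;
    sum += w[i];
  }
  return index;
}
// Unity weights, size 4: unscaled ranks 0.5 1.5 2.5 3.5, i.e.
// 0.125 0.375 0.625 0.875 after rescaling.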

I want to test with a weighted range as target too, and when that works I think we can close this ticket.

comment:34 Changed 12 years ago by Peter

(In [1784]) refs #478. Added test with weighted target, and made 'missing values' to be nan in tests.

comment:35 Changed 12 years ago by Peter

(In [1801]) changed template arguments to better describe requirements on passed iterators - refs #478 and #425

comment:36 Changed 12 years ago by Peter

(In [1803]) fixed typo - refs #478

comment:37 Changed 12 years ago by Peter

(In [1810]) working on documentation. refs #478

comment:38 Changed 12 years ago by Peter

(In [1812]) working on documentation. refs #478

comment:39 Changed 12 years ago by Peter

Resolution: fixed
Status: assigned → closed

(In [1814]) Finalized dox for qQuantileNormalizer. Closes #478
