Opened 12 years ago

Closed 12 years ago

#425 closed enhancement (fixed)

Implement qQuantile normalizer

Reported by: Jari Häkkinen Owned by: Jari Häkkinen
Priority: major Milestone: yat 0.5
Component: normalizer Version: trunk
Keywords: Cc:

Description (last modified by Peter)

The idea is to normalize each column in a matrix to match a target columm (or maybe this should be a target vector).

Matching is done like this for each column:

  1. Partition into quantiles, q_i=(i-0,5)/N, i=1, 2, ..., N
  2. Calculate the arithmetic mean for each quantile
  3. Do the same for the target
  4. Calculate the difference between of target and column means. Now we have N differences d_i.
  5. Create a cubic spline fit to this difference vector d. The resulting curve is used to recalculate all column values.
    1. For values in range [q_i, q_N] we use the cubic spline fit.
    2. For values outside the range, <q_i of >q_N a linear extrapolation is used

Linear interpolation simply means a translation here (at least for now).

This method is inspired by the work of Workman et al. "A new non-linear normalization method for reducing variability in DNA microarray experiments", Genome Biol. 2002; 3(9): PMID: 12225587 article at Genome Biology

Change History (33)

comment:1 Changed 12 years ago by Jari Häkkinen

Status: newassigned

comment:2 Changed 12 years ago by Peter

Is this normalizer working on a matrix or on a vector? For example QuantileNormaizer? is working on a matrix, whereas a median centralizer would be working on a vector because it centralizes each column independently. In practice that means that you typically end up looping over the columns.

In that latter case an idea would be to have a class that does the looping. So we could have a class SomeName<T> which does the looping over the columns and the class T defines how to do the actual normalization (on a vector... well a range).

comment:3 Changed 12 years ago by Peter

(In [1447]) Changed the interface of QuantileNormalizer?. The interface now has one matrix& and one const&. The former is object for normalization, and the latter is the base for calculating the distributions (Quantiles). The reason to change the interface is to allow normalization based on a nother matrix, e.g., normalizing a test data matrix based on the training data matrix. Also it is more inline with the Centralizer added earlier today. refs #425

comment:4 Changed 12 years ago by Peter

(In [1518]) changed implementation of a couple of normalizer so that operator() takes two matrices of which one is an input marix and one is a result matrix, i.e., the result does only depend on the input matrix and the result is only used as a receiver - I think this behaviour is more intuitive, although it is not as general as the previous one. We should probably document somewhere how we expect a 2Dnormalizer to behave (refs #425).

comment:5 Changed 12 years ago by Jari Häkkinen

Description: modified (diff)
Summary: Implement qsplines normalizerImplement cubic spline normalizer

comment:6 Changed 12 years ago by Jari Häkkinen

Description: modified (diff)

comment:7 Changed 12 years ago by Peter

Regarding the Geometric Mean, I think we should have a functor in line with statistics::Percentiler and statistics::Average.

comment:8 in reply to:  7 ; Changed 12 years ago by Jari Häkkinen

Replying to peter:

Regarding the Geometric Mean, I think we should have a functor in line with statistics::Percentiler and statistics::Average.

I was thinking to log the values and use the usual average, any thoughts?

comment:9 Changed 12 years ago by Jari Häkkinen

Description: modified (diff)

comment:10 Changed 12 years ago by Jari Häkkinen

(In [1568]) Addresses #425.

comment:11 Changed 12 years ago by Jari Häkkinen

(In [1571]) Addresses #425.

comment:12 Changed 12 years ago by Jari Häkkinen

(In [1572]) Addresses #425.

comment:13 in reply to:  8 Changed 12 years ago by Peter

Replying to jari:

Replying to peter:

Regarding the Geometric Mean, I think we should have a functor in line with statistics::Percentiler and statistics::Average.

I was thinking to log the values and use the usual average, any thoughts?

That's a matter of speed, I guess. The log transform and then the transform back (exp) takes some time. OTOH in that way we can avoid a lot of multiplications (additions instead), and more importantly we can avoid the terrible Nth root. So I agree, going via the arithmetic Average is the way to go. The Geometric Average is not that uncommon though, so it is probably a good idea to have that functionality. It's up to you if you want to add that class now, or if we do it later...

comment:14 Changed 12 years ago by Jari Häkkinen

(In [1643]) Addresses #466 and #425. Added cubic spline interpolation (not complete interface to GSL interpolation stuff).

comment:15 Changed 12 years ago by Jari Häkkinen

Summary: Implement cubic spline normalizerImplement qQuantile normalizer

The normalizer will be named qQuantileNormalizer. The cubic spline part is simply curve fitting.

comment:16 Changed 12 years ago by Jari Häkkinen

(In [1708]) Addresses #425

comment:17 Changed 12 years ago by Jari Häkkinen

(In [1709]) Addresses #425. Changed partition function.

comment:18 Changed 12 years ago by Jari Häkkinen

(In [1712]) Addresses #425. Added linear extrapolation at ends. Added documentation. Cleaneup of code.

comment:19 Changed 12 years ago by Jari Häkkinen

(In [1713]) Addresses #425. Added static_cast to get rid of compiler warnings. Fixed index problems in Partitioner.

comment:20 Changed 12 years ago by Peter

I have a question on how this class is supposed to be used. My concern is the target in constructor. I understand putting it there, makes the class general.

However, the typical use, how it is described in the paper, is that the target vector is calculated as the average column vector in the matrix that is to be normalized. Would it be possible to add that as a feature? In that case, if there is no target given it could be calculated from the matrix at operator() time. This also implies that the class supports the case when data is not available at construction time.

comment:21 Changed 12 years ago by Jari Häkkinen

(In [1716]) Addresses #425. Made class Partitioner private.

comment:22 Changed 12 years ago by Jari Häkkinen

Description: modified (diff)

comment:23 Changed 12 years ago by Peter

Description: modified (diff)

changing geometric to arithmetic

comment:24 in reply to:  20 ; Changed 12 years ago by Jari Häkkinen

Replying to peter:

I have a question on how this class is supposed to be used. My concern is the target in constructor. I understand putting it there, makes the class general.

However, the typical use, how it is described in the paper, is that the target vector is calculated as the average column vector in the matrix that is to be normalized. Would it be possible to add that as a feature? In that case, if there is no target given it could be calculated from the matrix at operator() time. This also implies that the class supports the case when data is not available at construction time.

The reason for using target was to make it general. The workman paper uses the average whereas Illumina BeadStudio uses the first column in the matrix to be normalized.

So the problem is that you'd like to use a specific algorithm if no target was given whereas another user would like to see something else as default. Maybe we should scrap the operator() and instead have member functions that uses different targets

normalizeAverageColumn(matrix,matrix);
normalizeTargetColumn((size_t,matrix,matrix);
normalizeTarget(vector,matrix,matrix);

and change the constructor(target,u_int) to constructor(u_int #quantiles). You still need to know the number of quantiles at construction time.

comment:25 in reply to:  24 Changed 12 years ago by Peter

Replying to jari:

Replying to peter:

I have a question on how this class is supposed to be used. My concern is the target in constructor. I understand putting it there, makes the class general.

However, the typical use, how it is described in the paper, is that the target vector is calculated as the average column vector in the matrix that is to be normalized. Would it be possible to add that as a feature? In that case, if there is no target given it could be calculated from the matrix at operator() time. This also implies that the class supports the case when data is not available at construction time.

The reason for using target was to make it general.

I understand that reason.

The workman paper uses the average whereas Illumina BeadStudio uses the first column in the matrix to be normalized.

and your example makes even clearer although I don't understand why BeadStudio breaks the symmetry. But that discussion belongs elsewhere.

So the problem is that you'd like to use a specific algorithm if no target was given whereas another user would like to see something else as default. Maybe we should scrap the operator() and instead have member functions that uses different targets

normalizeAverageColumn(matrix,matrix);
normalizeTargetColumn((size_t,matrix,matrix);
normalizeTarget(vector,matrix,matrix);

and change the constructor(target,u_int) to constructor(u_int #quantiles). You still need to know the number of quantiles at construction time.

I think I prefer keeping the interface uniform across different normalization classes. Instead of providing those functions you descibe, I would suggest that we could provide wrapper classes that essentially do the same thing. However, I don't think we need to add them now. We can wait and see what we think after using 0.5.

In conclusion, let it be as it is because it can easily be extended both inside or outside the library.

comment:26 Changed 12 years ago by Jari Häkkinen

(In [1718]) Addresses #425. Improved docs. Fixed a bug in the normalization.

comment:27 Changed 12 years ago by Peter

make check fails and I suspect it is qQuantileNormalizer being the cause.

On Tiger, I have

$ ./test/normalization_test -v
testing normalizations ... 
Testing Centralizer
Testing ColumnNormalizer
Testing m(0,0)
Testing m(0,1)
Testing m(1,0)
Testing m(1,1)
Testing qQuantileNormalizer
testing number of parts (Q) boundary conditions
test that result can be stored in the source matrix
Testing that iterative normalization
terminate called after throwing an instance of 'theplu::yat::utility::GSL_error'
  what():  GSL_error:  GSL error code 1:evaluate 
Abort trap

comment:28 Changed 12 years ago by Jari Häkkinen

(In [1731]) Addresses #425. 'make check' passes now, but there is still a problem with the partitioner.

comment:29 Changed 12 years ago by Jari Häkkinen

(In [1735]) Addresses #425. Taking care of some special cases. No iterative test is need.

comment:30 Changed 12 years ago by Peter

(In [1736]) Changing the interface to work on ranges rather than Matrix. This allows usage within ColumnNormalizer? and RowNormalizer?. refs #425.

comment:31 Changed 12 years ago by Peter

(In [1755]) simplify code by using std::ceil - refs #425

comment:32 Changed 12 years ago by Peter

(In [1801]) changed template arguments to better describe requirements on passed iterators - refs #478 and #425

comment:33 Changed 12 years ago by Peter

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.