Opened 12 years ago

Closed 12 years ago

## #425 closed enhancement (fixed)

# Implement qQuantile normalizer

Reported by: | Jari Häkkinen | Owned by: | Jari Häkkinen |
---|---|---|---|

Priority: | major | Milestone: | yat 0.5 |

Component: | normalizer | Version: | trunk |

Keywords: | Cc: |

### Description (last modified by )

The idea is to normalize each column in a matrix to match a target columm (or maybe this should be a target vector).

Matching is done like this for each column:

- Partition into quantiles, q_i=(i-0,5)/N, i=1, 2, ..., N
- Calculate the arithmetic mean for each quantile
- Do the same for the target
- Calculate the difference between of target and column means. Now we have N differences d_i.
- Create a cubic spline fit to this difference vector d. The resulting curve is used to recalculate all column values.
- For values in range [q_i, q_N] we use the cubic spline fit.
- For values outside the range, <q_i of >q_N a linear extrapolation is used

Linear interpolation simply means a translation here (at least for now).

This method is inspired by the work of Workman *et al.* "A new non-linear normalization method for reducing variability in DNA microarray experiments", Genome Biol. 2002; 3(9):
PMID: 12225587 article at Genome Biology

### Change History (33)

### comment:1 Changed 12 years ago by

Status: | new → assigned |
---|

### comment:2 Changed 12 years ago by

### comment:3 Changed 12 years ago by

(In [1447]) Changed the interface of QuantileNormalizer?. The interface now has one matrix& and one const&. The former is object for normalization, and the latter is the base for calculating the distributions (Quantiles). The reason to change the interface is to allow normalization based on a nother matrix, e.g., normalizing a test data matrix based on the training data matrix. Also it is more inline with the Centralizer added earlier today. refs #425

### comment:4 Changed 12 years ago by

(In [1518]) changed implementation of a couple of normalizer so that operator() takes two matrices of which one is an input marix and one is a result matrix, i.e., the result does only depend on the input matrix and the result is only used as a receiver - I think this behaviour is more intuitive, although it is not as general as the previous one. We should probably document somewhere how we expect a 2Dnormalizer to behave (refs #425).

### comment:5 Changed 12 years ago by

Description: | modified (diff) |
---|---|

Summary: | Implement qsplines normalizer → Implement cubic spline normalizer |

### comment:6 Changed 12 years ago by

Description: | modified (diff) |
---|

### comment:7 follow-up: 8 Changed 12 years ago by

Regarding the Geometric Mean, I think we should have a functor in line with statistics::Percentiler and statistics::Average.

### comment:8 follow-up: 13 Changed 12 years ago by

Replying to peter:

Regarding the Geometric Mean, I think we should have a functor in line with statistics::Percentiler and statistics::Average.

I was thinking to log the values and use the usual average, any thoughts?

### comment:9 Changed 12 years ago by

Description: | modified (diff) |
---|

### comment:13 Changed 12 years ago by

Replying to jari:

Replying to peter:

Regarding the Geometric Mean, I think we should have a functor in line with statistics::Percentiler and statistics::Average.

I was thinking to log the values and use the usual average, any thoughts?

That's a matter of speed, I guess. The `log`

transform and then the transform back (`exp`

) takes some time. OTOH in that way we can avoid a lot of multiplications (additions instead), and more importantly we can avoid the terrible *Nth root*. So I agree, going via the arithmetic Average is the way to go. The Geometric Average is not that uncommon though, so it is probably a good idea to have that functionality. It's up to you if you want to add that class now, or if we do it later...

### comment:14 Changed 12 years ago by

### comment:15 Changed 12 years ago by

Summary: | Implement cubic spline normalizer → Implement qQuantile normalizer |
---|

The normalizer will be named qQuantileNormalizer. The cubic spline part is simply curve fitting.

### comment:18 Changed 12 years ago by

### comment:19 Changed 12 years ago by

### comment:20 follow-up: 24 Changed 12 years ago by

I have a question on how this class is supposed to be used. My concern is the *target* in constructor. I understand putting it there, makes the class general.

However, the typical use, how it is described in the paper, is that the target vector is calculated as the average column vector in the matrix that is to be normalized. Would it be possible to add that as a feature? In that case, if there is no target given it could be calculated from the matrix at operator() time. This also implies that the class supports the case when data is not available at construction time.

### comment:22 Changed 12 years ago by

Description: | modified (diff) |
---|

### comment:24 follow-up: 25 Changed 12 years ago by

Replying to peter:

I have a question on how this class is supposed to be used. My concern is the

targetin constructor. I understand putting it there, makes the class general.However, the typical use, how it is described in the paper, is that the target vector is calculated as the average column vector in the matrix that is to be normalized. Would it be possible to add that as a feature? In that case, if there is no target given it could be calculated from the matrix at operator() time. This also implies that the class supports the case when data is not available at construction time.

The reason for using *target* was to make it general. The workman paper uses the average whereas Illumina BeadStudio uses the first column in the matrix to be normalized.

So the problem is that you'd like to use a specific algorithm if no target was given whereas another user would like to see something else as default. Maybe we should scrap the operator() and instead have member functions that uses different targets

normalizeAverageColumn(matrix,matrix); normalizeTargetColumn((size_t,matrix,matrix); normalizeTarget(vector,matrix,matrix);

and change the constructor(target,u_int) to constructor(u_int #quantiles). You still need to know the number of quantiles at construction time.

### comment:25 Changed 12 years ago by

Replying to jari:

Replying to peter:

I have a question on how this class is supposed to be used. My concern is the

targetin constructor. I understand putting it there, makes the class general.However, the typical use, how it is described in the paper, is that the target vector is calculated as the average column vector in the matrix that is to be normalized. Would it be possible to add that as a feature? In that case, if there is no target given it could be calculated from the matrix at operator() time. This also implies that the class supports the case when data is not available at construction time.

The reason for using

targetwas to make it general.

I understand that reason.

The workman paper uses the average whereas Illumina BeadStudio uses the first column in the matrix to be normalized.

and your example makes even clearer although I don't understand why BeadStudio breaks the symmetry. But that discussion belongs elsewhere.

So the problem is that you'd like to use a specific algorithm if no target was given whereas another user would like to see something else as default. Maybe we should scrap the operator() and instead have member functions that uses different targets

normalizeAverageColumn(matrix,matrix); normalizeTargetColumn((size_t,matrix,matrix); normalizeTarget(vector,matrix,matrix);and change the constructor(target,u_int) to constructor(u_int #quantiles). You still need to know the number of quantiles at construction time.

I think I prefer keeping the interface uniform across different normalization classes. Instead of providing those functions you descibe, I would suggest that we could provide wrapper classes that essentially do the same thing. However, I don't think we need to add them now. We can wait and see what we think after using 0.5.

In conclusion, let it be as it is because it can easily be extended both inside or outside the library.

### comment:26 Changed 12 years ago by

### comment:27 Changed 12 years ago by

`make check`

fails and I suspect it is qQuantileNormalizer being the cause.

On Tiger, I have

$ ./test/normalization_test -v testing normalizations ... Testing Centralizer Testing ColumnNormalizer Testing m(0,0) Testing m(0,1) Testing m(1,0) Testing m(1,1) Testing qQuantileNormalizer testing number of parts (Q) boundary conditions test that result can be stored in the source matrix Testing that iterative normalization terminate called after throwing an instance of 'theplu::yat::utility::GSL_error' what(): GSL_error: GSL error code 1:evaluate Abort trap

### comment:28 Changed 12 years ago by

### comment:29 Changed 12 years ago by

### comment:30 Changed 12 years ago by

(In [1736]) Changing the interface to work on ranges rather than Matrix. This allows usage within ColumnNormalizer? and RowNormalizer?. refs #425.

### comment:32 Changed 12 years ago by

### comment:33 Changed 12 years ago by

Resolution: | → fixed |
---|---|

Status: | assigned → closed |

**Note:**See TracTickets for help on using tickets.

Is this normalizer working on a matrix or on a vector? For example QuantileNormaizer? is working on a matrix, whereas a median centralizer would be working on a vector because it centralizes each column independently. In practice that means that you typically end up looping over the columns.

In that latter case an idea would be to have a class that does the looping. So we could have a class

`SomeName<T>`

which does the looping over the columns and the class T defines how to do the actual normalization (on a vector... well a range).