Opened 17 years ago
Last modified 7 days ago
#216 accepted enhancement
PCA is lacking functionality
Reported by: | Jari Häkkinen | Owned by: | Peter |
---|---|---|---|
Priority: | major | Milestone: | yat 0.22 |
Component: | utility | Version: | trunk |
Keywords: | Cc: |
Description (last modified by )
Currently, PCA only performs a decomposition if the number of rows is not smaller than the number of columns.
PCA should also work for this case and choose a suitable algorithm automatically.
In addition, a lot of useful functionality is missing.
Change History (23)
comment:1 Changed 16 years ago by
comment:2 Changed 16 years ago by
I am not sure I perfectly understand what you mean, but I think I disagree.
I think the PCA should only work on X'X or XX'. Mixing these two cases will make the interface confusing, and I think it is easier to let the user transpose the matrix if he wants the transposed problem.
comment:3 Changed 16 years ago by
I think that since all the information regarding X'X is contained in the decomposition of XX' and its projections, the interface should reflect this.
If a user does PCA of XX', he should get access to both U and V (through the components and the normalized projections). My concern was that now, to get V for the original problem, one either has to use the projection function on the original data matrix and then normalize the projections, or transpose the problem and call PCA again (if both N>M and M>N had been supported, as suggested by this ticket originally). This way of getting V seems suboptimal since (i) computationally, it is already calculated in the original decomposition, and (ii) pedagogically, the interface would benefit from reflecting that a component in one direction of the problem corresponds to a projection in the other direction of the problem.
So what is missing is access to the normalized projections (V). I think adding this function would make the interface more appealing and useful rather than confusing. I also think we could avoid confusion for users if we document such functions so that they reflect both the PCA and the SVD terminology.
Of course a user should always be able to decide whether the problem is X'X or XX' and nothing prevents him from doing the transpose first (if the PCA code, as I think it should, supports both N>M and M>N).
comment:4 follow-up: 5 Changed 16 years ago by
I would agree if the PCA diagonalized XX'.
However, it does not. In the current design, there is a built-in centralization, meaning we are diagonalizing the centralized data (X-M)(X-M)', where M are the centroids. Consequently, the other problem is (X-Q)'(X-Q), where Q are the centroids in the other direction. In general there is no connection between the centroids M and Q, and therefore I think your arguments above are not valid, Markus. If you wanna run a PCA both in sample space and in feature space, you have to run the PCA twice.
I agree that the current interface is too small. I think Jari made it small on purpose because there were flaws and he didn't know what should be included.
From a user perspective (although I'm not really a user here) I think an interface should include:
- the principal components
- the corresponding eigenvalues (perhaps normalized)
- transformation from original coordinates to principal component coordinates
- The inverse transformation of above
Implementation-wise, I am not sure what this talk about N>M is about. But if SVD(X') is more efficient than SVD(X), then of course PCA should make the appropriate choice.
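As a rough sketch only (names invented here, not the current yat API, and assuming the usual convention that rows are variables and columns are data points), the four items above could map to something like:

class PCA {
public:
  explicit PCA(const Matrix& data);       // rows = variables, columns = data points
  const Matrix& eigenvectors(void) const; // principal components, one per column
  const Vector& eigenvalues(void) const;  // corresponding (possibly normalized) eigenvalues
  Vector project(const Vector& x) const;      // original coordinates -> PC coordinates
  Vector reconstruct(const Vector& c) const;  // PC coordinates -> original coordinates
  // Matrix and Vector are placeholders for the library's matrix/vector types.
};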
comment:5 follow-up: 6 Changed 16 years ago by
Replying to peter:
I would agree if the PCA diagonalized XX'.
However, it does not. In the current design, there is a built-in centralization, meaning we are diagonalizing the centralized data (X-M)(X-M)', where M are the centroids. Consequently, the other problem is (X-Q)'(X-Q), where Q are the centroids in the other direction. In general there is no connection between the centroids M and Q, and therefore I think your arguments above are not valid, Markus. If you wanna run a PCA both in sample space and in feature space, you have to run the PCA twice.
I agree and I was apparently not very clear. My idea was that most often a user wants the eigenvectors to both (X-M)(X-M)' and (X-M)'(X-M) [eigengenes and eigenarrays in the microarray field, but with a lot of other names in the PCA literature], and therefore I think the interface should support access to both. Users interested in (X-Q)'(X-Q) should not be supported through the interface but have to transpose their matrix first, as Peter is saying.
The normalization could be addressed in many ways. Looking at PCA in statistical packages such as R it is not necessarily the covariance matrix that is diagonalized, instead the user is left with a variety of choices for the normalization. In some microarray papers where one wants to look at genes and samples simultaneously the data is iteratively normalized aiming at having both rows and columns centralized such that the differences between M and Q discussed by Peter above are gone.
Others leave the normalization outside of PCA, but that is not good here: since we expect to be able to use our PCA on independent test samples, I think the normalization would have to be stored within the PCA class. So should we support different normalizations?
I agree that the current interface is too small. I think Jari made it small on purpose because there were flaws and he didn't know what should be included.
From a user perspective (although I'm not really a user here) I think an interface should include:
- the principal components
- the corresponding eigenvalues (perhaps normalized)
- transformation from original coordinates to principal component coordinates
- The inverse transformation of above
So in addition, I would still like to have access to the eigenvectors in the other direction (U and V).
Also, users will often want not only projections onto the components but also correlations with them. One could imagine other metrics, but these two at least have nice interpretations. Should metric choices be supported?
Implementation-wise, I am not sure what this talk about N>M is about. But if SVD(X') is more efficient than SVD(X), then of course PCA should make the appropriate choice.
The SVD implementation only handles the case where the number of rows is not smaller than the number of columns. So I think that if the opposite case is sent in, SVD should be applied internally to X' rather than X, and the internal structure should keep track of what is what so that a user never sees this.
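A minimal sketch of that bookkeeping, using GSL's gsl_linalg_SV_decomp directly for illustration (it requires rows >= columns); the actual class would presumably go through yat's own SVD wrapper:

#include <cstddef>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>

// Decompose X (any shape). On return, *U_out and *V_out hold the
// eigenvectors of XX' and X'X respectively, and *S_out the singular values.
// The caller owns the returned GSL objects.
void svd_any_shape(const gsl_matrix* X, gsl_matrix** U_out,
                   gsl_matrix** V_out, gsl_vector** S_out)
{
  const bool transposed = X->size1 < X->size2;             // fewer rows than columns?
  const std::size_t m = transposed ? X->size2 : X->size1;  // "tall" dimensions
  const std::size_t n = transposed ? X->size1 : X->size2;

  gsl_matrix* A = gsl_matrix_alloc(m, n);                  // overwritten by U
  if (transposed)
    gsl_matrix_transpose_memcpy(A, X);                     // A = X'
  else
    gsl_matrix_memcpy(A, X);                               // A = X

  gsl_matrix* V = gsl_matrix_alloc(n, n);
  gsl_vector* S = gsl_vector_alloc(n);
  gsl_vector* work = gsl_vector_alloc(n);
  gsl_linalg_SV_decomp(A, V, S, work);                     // A = U W V'
  gsl_vector_free(work);

  // If we decomposed X', the roles of U and V are swapped for X.
  *U_out = transposed ? V : A;
  *V_out = transposed ? A : V;
  *S_out = S;
}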
comment:6 follow-up: 7 Changed 16 years ago by
Replying to markus:
Replying to peter:
I would agree if the PCA diagonalized XX'.
However, it does not. In the current design, there is a built-in centralization, meaning we are diagonalizing the centralized data (X-M)(X-M)', where M are the centroids. Consequently, the other problem is (X-Q)'(X-Q), where Q are the centroids in the other direction. In general there is no connection between the centroids M and Q, and therefore I think your arguments above are not valid, Markus. If you wanna run a PCA both in sample space and in feature space, you have to run the PCA twice.
I agree and I was apparently not very clear. My idea was that most often a user wants the eigenvectors to both (X-M)(X-M)' and (X-M)'(X-M) [eigengenes and eigenarrays in the microarray field, but with a lot of other names in the PCA literature], and therefore I think the interface should support access to both. Users interested in (X-Q)'(X-Q) should not be supported through the interface but have to transpose their matrix first, as Peter is saying.
A question: say that (X-M)(X-M)' is the covariance matrix of the genes, what is then (X-M)'(X-M)?
An eigenvector of (X-M)(X-M)' has a clear meaning, but I don't understand what an eigenvector of (X-M)'(X-M) means.
Also I think it is confusing if PCA(X).eigengene(0) is not the same as PCA(X').eigensample(0) (if you excuse my notation).
The normalization could be addressed in many ways. Looking at PCA in statistical packages such as R it is not necessarily the covariance matrix that is diagonalized, instead the user is left with a variety of choices for the normalization. In some microarray papers where one wants to look at genes and samples simultaneously the data is iteratively normalized aiming at having both rows and columns centralized such that the differences between M and Q discussed by Peter above are gone.
Here at POB people talk about Euclidean PCA and Pearson PCA. In the latter, the correlation matrix is diagonalized rather than the covariance matrix, or, if you prefer normalization language, the data are z-normalized.
Others leave the normalization outside of PCA, but that is not good here: since we expect to be able to use our PCA on independent test samples, I think the normalization would have to be stored within the PCA class. So should we support different normalizations?
I don't think we should support different normalizations (at least not for milestone:0.4).
I agree that the current interface is too small. I think Jari made it small on purpose because there were flaws and he didn't know what should be included.
From a user perspective (although I'm not really a user here) I think an interface should include:
- the principal components
- the corresponding eigenvalues (perhaps normalized)
- transformation from original coordinates to principal component coordinates
- The inverse transformation of above
So in addition, I would still like to have access to the eigenvectors in the other direction (U and V).
Also, users will often want not only projections onto the components but also correlations with them. One could imagine other metrics, but these two at least have nice interpretations. Should metric choices be supported?
Well, I think calculating correlations is supported elsewhere, i.e., I don't want it here.
Implementation-wise, I am not sure what this talk about N>M is about. But if SVD(X') is more efficient than SVD(X), then of course PCA should make the appropriate choice.
The SVD implementation only handles the case where the number of rows is not smaller than the number of columns. So I think that if the opposite case is sent in, SVD should be applied internally to X' rather than X, and the internal structure should keep track of what is what so that a user never sees this.
comment:7 follow-up: 8 Changed 16 years ago by
Replying to peter:
An eigenvector of (X-M)(X-M)' has a clear meaning, but I don't understand what an eigenvector of (X-M)'(X-M) means.
Eigenvectors of (X-M)'(X-M) are projections of (X-M) onto eigenvectors of (X-M)(X-M)', and they have a clear meaning if you want to decompose a matrix into dominant variable- and sample-like patterns simultaneously.
Suppose our matrix is row-centered: Y=X-M. From the SVD of Y we get U and V, which are eigenvectors to YY' and Y'Y. The terminology eigengene/variable and eigenarray/sample is used for this case, where the same matrix is decomposed, and not for PCA(X) and PCA(X').
Also I think it is confusing if PCA(X).eigengene(0) is not the same as PCA(X').eigensample(0) (if you excuse my notation).
Perhaps, but that is how people have defined it, since you want eigenarrays and eigengenes to reflect the same matrix and not the covariance matrix of the genes and the samples, respectively. Typically people row-center their matrix, in which case the eigenarrays are eigenvectors to the covariance matrix of the genes, and the eigengenes (= normalized PCA projections) are also eigenvectors, but not to the covariance matrix of the samples.
I think of it more as if PCA(X) gives both eigengenes and eigenarrays and so does PCA(X'). Different normalizations give different eigengenes and eigenarrays. PCA(X) and PCA(X') normalize differently, so they give different eigengenes and eigensamples.
Either way, I think most of this is semantics that could/should be avoided in the PCA interface. What I do want is access to V (=normalized projections), because it would be slow to first project data onto U (accessible through the interface) and then normalize the projections when this useful information is already calculated internally.
I don't think we should support different normalizations (at least not for milestone:0.4).
OK, but we should keep it in mind, see more below.
Well, I think calculating correlations is supported elsewhere, i.e., I don't want it here.
I do not think this is independent from the normalization issue. Often one wants to calculate correlations to components. But in the present structure this is not feasible since if a user comes along with a test sample it has to be centralized using the centroid from the training data, which is internal to the PCA class.
So the current interface hinders a lot of common PCA applications.
comment:8 follow-up: 9 Changed 16 years ago by
Replying to markus:
Replying to peter:
Also I think it is confusing if PCA(X).eigengene(0) is not the same as PCA(X').eigensample(0) (if you excuse my notation).
Perhaps, but that is how people have defined it since you want eigenarrays and eigengenes to reflect the same matrix and not the co-variance matrix of the genes and the samples, respectively. Typically people row-center their matrix in which case the eigenarrays are eigenvectors to the covariance matrix of the genes and the eigengenes (=normalized PCA projections) are eigenvectors but not to the covariance matrix of the samples.
Well, I see PCA as working on the data matrix X, and I try to keep a geometrical interpretation of it. I think of PCA as a maximization problem, and to me it makes no sense to maximize the mean squared sum around zero unless you have a reason why zero is special. The natural choice is to maximize around the center of mass. Centralizing in the other direction seems odd to me. It implies that, after normalization, each data point lies on the hyperplane sum(x)=0. This is a projection, which is twisting the geometrical interpretation. This becomes clear if you think about a 2D case: project down to y=-x and then perform SVD, and the first PC will be, that's right, y=-x.
However, this is just my view of the analysis, and if you say that a typical user of PCA is using that way, I think we could support it.
I think of it more as if PCA(X) gives both eigengenes and eigenarrays and so does PCA(X'). Different normalizations give different eigengenes and eigenarrays. PCA(X) and PCA(X') normalize differently, so they give different eigengenes and eigensamples.
Either way, I think most of this is semantics that could/should be avoided in the PCA interface. What I do want is access to V (=normalized projections), because it would be slow to first project data onto U (accessible through the interface) and then normalize the projections when this useful information is already calculated internally.
I don't think we should support different normalizations (at least not for milestone:0.4).
OK, but we should keep it in mind, see more below.
My mind is not large enough, so I opened a ticket:302 for this.
Well, I think calculating correlations is supported elsewhere, i.e., I don't want it here.
I do not think this is independent from the normalization issue. Often one wants to calculate correlations to components. But in the present structure this is not feasible since if a user comes along with a test sample it has to be centralized using the centroid from the training data, which is internal to the PCA class.
Well, we could have a function centralizing a sample. With respect to ticket:302 we should not call the function centralize but something more generic.
So the current interface hinders a lot of common PCA applications.
Yeah, I can see that. And again, the reason the interface is so small is that Jari and I found flaws very close to the 0.3 release, and we decided it's better to keep the interface small than to have a lot of member functions that need to be changed anyway. The most serious flaw is the assumption that M>N.
comment:9 Changed 16 years ago by
Replying to peter:
Well, I see PCA as working on the data matrix X, and I try to keep a geometrical interpretation of it. I think of PCA as a maximization problem, and to me it makes no sense to maximize the mean squared sum around zero unless you have a reason why zero is special. The natural choice is to maximize around the center of mass. Centralizing in the other direction seems odd to me. It implies that, after normalization, each data point lies on the hyperplane sum(x)=0. This is a projection, which is twisting the geometrical interpretation. This becomes clear if you think about a 2D case: project down to y=-x and then perform SVD, and the first PC will be, that's right, y=-x.
I do not think one should see the normalized projected data as weirdly centered principal components, but as something different. First you perform PCA in one direction, say viewing the columns as (N) data instances and the rows as (M) variables. In the PCA it is, as you say, natural to maximize around the center of mass, and in this case the rows are centered. The principal components will now be vectors of length M, and they will be eigenvectors to the covariance matrix of the variables ("eigeninstances"). Suppose one is interested in the variable-like pattern across data instances associated with the same variation in the data as a principal component. This pattern is then given by the normalized projection of the data onto the principal component (an "eigenvariable"). This pattern is in general not an eigenvector to the covariance matrix of the data instances. But rather than being viewed as a weird principal component to an incorrect covariance matrix of the data instances, it should be viewed as the pattern across data instances associated with the direction (linear combination of variables) found by the PCA. Decomposing matrices such that, for example, rows and columns can be visualized in the same plot is a general problem. A decomposition X=RC, where R contains row information and C column information, is not unique. SVD is an appealing choice.
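In SVD notation this is a one-line statement (just restating the argument above, with Y the row-centered data and u_k, w_k, v_k the k-th singular triplet of Y):

Y = U W V^{T}, \qquad u_k^{T} Y = w_k\, v_k^{T} \quad\Longrightarrow\quad \frac{u_k^{T} Y}{\lVert u_k^{T} Y \rVert} = v_k^{T},

so the normalized projection onto the k-th principal component is exactly the k-th column of V.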
comment:10 follow-up: 18 Changed 16 years ago by
Description: | modified (diff) |
---|---|
Summary: | PCA does only a naive decomposition → PCA is lacking functionality |
This is an attempt to summarize the discussion here into a list of items that a PCA class could have in addition to the current functionality. This is intended as a starting point.
- PCA should allow different normalizations. This has been split into ticket:302.
- PCA should work seamlessly regardless of whether the number of variables or the number of data points is largest. As in the current code, columns should correspond to data points and rows to variables for the data points. If a user wants to swap this, the user has to transpose the matrix prior to using PCA, i.e. there should be no functions for the transposed problem.
- PCA should provide normalized projections of the original data. This means providing variable-like patterns associated with the same variation in the data as the principal components, which are data-point-like patterns.
- For additional data, PCA should support both projections onto the principal components and correlations with the principal components.
- PCA should support reconstructing data in the original variable space from projected data.
- Functionality corresponding to setting eigenvalues to zero should be supported, such that, for example, data can be reconstructed in the original variable space but filtered for variation associated with some components (for example, components that correlate with some artifact in the data).
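As a sketch of how the items above could surface in the class, in addition to accessors for the components, eigenvalues and mean vector (all names are invented here, not the actual yat API):

#include <cstddef>
#include <vector>

class PCA {
public:
  // item 3: V, the variable-like patterns (normalized projections)
  const Matrix& normalized_projections(void) const;
  // item 4: a new sample in PC coordinates, and its correlation with
  // each component (centering with the stored centroid handled internally)
  Vector projection(const Vector& sample) const;
  Vector correlation(const Vector& sample) const;
  // item 5: back from PC coordinates to the original variable space
  Vector reconstruct(const Vector& pc_coordinates) const;
  // item 6: remove the variation along selected components and return
  // the data in original coordinates
  Matrix filter(const Matrix& data,
                const std::vector<std::size_t>& components_to_remove) const;
  // Matrix and Vector are placeholders for the library's matrix/vector types.
};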
comment:11 Changed 16 years ago by
Milestone: | 0.4 → 0.5 |
---|
comment:12 Changed 15 years ago by
Milestone: | yat 0.5 → yat 0.x+ |
---|
comment:13 Changed 13 years ago by
comment:14 Changed 6 weeks ago by
Milestone: | yat 0.x+ → yat 0.22 |
---|
I was bitten by this limitation (point 2 above). My use case was that I had 104 samples in a 6D space and wanted to project the data onto the first two PCs. It's time to fix that limitation. Not sure if the other points above are still lacking.
comment:15 Changed 3 weeks ago by
Owner: | changed from Jari Häkkinen to Peter |
---|---|
Status: | new → accepted |
comment:18 follow-up: 21 Changed 2 weeks ago by
Replying to Markus Ringnér:
This is an attempt to summarize the discussion here into a list of items that a PCA class could have in addition to the current functionality. This is intended as a starting point.
- PCA should allow different normalizations. This has been split into ticket:302.
- PCA should work seamlessly regardless of whether the number of variables or the number of data points is largest. As in the current code, columns should correspond to data points and rows to variables for the data points. If a user wants to swap this, the user has to transpose the matrix prior to using PCA, i.e. there should be no functions for the transposed problem.
- PCA should provide normalized projections of the original data. This means providing variable-like patterns associated with the same variation in the data as the principal components, which are data-point-like patterns.
- For additional data, PCA should support both projections onto the principal components and correlations with the principal components.
- PCA should support reconstructing data in the original variable space from projected data.
- Functionality corresponding to setting eigenvalues to zero should be supported, such that, for example, data can be reconstructed in the original variable space but filtered for variation associated with some components (for example, components that correlate with some artifact in the data).
1) has been lifted out to ticket:302
2) works now
3) I don't understand this point (not surprisingly after reading the discussion above)
4) Perhaps. The class should definitely provide access to the mean vector. Then users can do whatever they like having both the mean vector and the eigenvectors. I think we already provide the projection on the PCs, and I'm not sure what correlation with a PC means (it must involve centralization to make sense). See the sketch after this list.
5) I don't like this. Projecting data sometimes means losing information, which makes the reverse operation impossible (unlosing information).
6) TODO
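For point 4, a sketch of what a user could do with the mean vector and an eigenvector, using plain std::vector here instead of the yat types:

#include <cstddef>
#include <vector>

// Projection of a test sample onto one principal component, after
// centering with the mean vector stored by the PCA.
double pc_score(const std::vector<double>& sample,
                const std::vector<double>& mean,
                const std::vector<double>& pc)
{
  double score = 0.0;
  for (std::size_t i = 0; i < sample.size(); ++i)
    score += (sample[i] - mean[i]) * pc[i];
  return score;
}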
comment:21 Changed 11 days ago by
Replying to Peter:
5) I don't like this. Projecting data sometimes means losing information, which makes the reverse operation impossible (unlosing information).
I take back this. One does lose information when there are fewer data points than dimensions, or more exactly when the space spanned by the PCs is not the entire space. In that case it makes sense that the reconstructed data is the closest point in the subspace spanned by the PCs, i.e., the projection onto that hyperplane. The function name projection in the API is a bit funny, as the function translates from the original coordinates to the coordinates defined by the PCs. It's not necessarily a projection in the mathematical sense.
I can see the use case, as Markus mentions above, that one wants to filter out, say, PC2 because it's associated with a systematic error (e.g. a batch effect), together with all the smaller PCs, and only keep PC1, PC3 and PC4. So I think we should add a function that projects onto the hyperplane spanned by these (user-defined) PCs and returns something in original coordinates. I don't think we should call it projection, though, because that name is already taken and it would be confusing. Perhaps we can call it filter?
Markus's request above is a sub-operation of that. He wants a function that takes you from PC coordinates to original coordinates. Naively it's just the eigenvectors multiplied by the vector of coordinates in PC space, in which case I would be reluctant, but one also has to take care of the normalisation/centralisation in reverse, which is tricky enough to warrant a function for it. Markus calls this reconstruct, some people call it recast; I'm not really sure what the best function name would be.
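A sketch of the filter idea under those assumptions (orthonormal PCs and a stored centroid; plain std::vector instead of the yat types, and the names are invented):

#include <cstddef>
#include <set>
#include <vector>

typedef std::vector<double> Vec;

// x: a data point in original coordinates, pcs[k]: the k-th (unit length)
// principal component, mean: the centroid stored by the PCA,
// remove: indices of components whose variation should be filtered out.
Vec filter(const Vec& x, const std::vector<Vec>& pcs, const Vec& mean,
           const std::set<std::size_t>& remove)
{
  Vec result(mean);                            // start from the centroid
  for (std::size_t k = 0; k < pcs.size(); ++k) {
    if (remove.count(k))
      continue;                                // drop filtered components
    double c = 0.0;                            // coordinate along pcs[k]
    for (std::size_t i = 0; i < x.size(); ++i)
      c += (x[i] - mean[i]) * pcs[k][i];
    for (std::size_t i = 0; i < result.size(); ++i)
      result[i] += c * pcs[k][i];              // add back the kept component
  }
  return result;
}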
comment:22 Changed 10 days ago by
While we are visiting this class, why not make the centralisation optional and, related to that, add a scaling (to unit norm). It could be added to the constructors, such as
PCA(const Matrix&, bool centralize=true, bool scale=false)
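As an illustration of what the two flags could mean (row-wise centering and scaling to unit norm before the SVD), assuming a Matrix with rows(), columns() and operator()(row, column); the real interface may differ:

#include <cmath>
#include <cstddef>

// Sketch only: optional row-wise centering and scaling applied in place
// before the decomposition.
void preprocess(Matrix& X, bool centralize, bool scale)
{
  for (std::size_t i = 0; i < X.rows(); ++i) {
    if (centralize) {
      double mean = 0.0;
      for (std::size_t j = 0; j < X.columns(); ++j)
        mean += X(i, j);
      mean /= X.columns();
      for (std::size_t j = 0; j < X.columns(); ++j)
        X(i, j) -= mean;
    }
    if (scale) {
      double norm = 0.0;
      for (std::size_t j = 0; j < X.columns(); ++j)
        norm += X(i, j) * X(i, j);
      norm = std::sqrt(norm);
      if (norm > 0.0)
        for (std::size_t j = 0; j < X.columns(); ++j)
          X(i, j) /= norm;
    }
  }
}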
SVD generates the eigenvectors for both problems simultaneously in U and V. Suppose we are interested in the XX' problem. With X=UWV' and D=WW we get:
XX'=UWV'VWU'=UDU'
so in this case the principal components are in U and the projected data is:
U'X=U'UWV'=WV'
The opposite problem is
X'X=VWU'UWV'=VDV'
So the eigenvectors to the opposite problem, V, are given by the (normalized) projections of the original problem. Hence the opposite problem is already solved by the projection of the original problem.
Therefore, I think there should be support for the opposite problem, not by having the choice of two different problems to solve, but through an interface that supports both directions simultaneously without calling them the problem and the opposite problem. This would include the ability to return both types of eigenvectors, to have projections for both N-like and M-like objects, and to allow for different normalizations (e.g. centering columns instead of rows so that users can choose whether X'X or XX' is proportional to a covariance matrix). Internally, PCA would have to keep track of what U and V map to for the two cases N>M and M>N, so users can send in a matrix regardless of whether N or M is largest. All that would need to be specified is whether rows/columns are variables/data points; then all public functions could be named accordingly, and a user would have access to everything (through U and V) without calling two different "solve" functions.
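A sketch of that bookkeeping (invented names): the class stores U, V and whether the input was transposed internally, and the public accessors hide this from the user:

class PCA {
public:
  // patterns indexed by variables (the principal components)
  const Matrix& eigenvectors(void) const
  { return transposed_ ? V_ : U_; }
  // patterns indexed by data points (the normalized projections)
  const Matrix& normalized_projections(void) const
  { return transposed_ ? U_ : V_; }
private:
  Matrix U_;
  Matrix V_;
  bool transposed_;  // true if the SVD was performed on the transposed input
};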