Opened 13 years ago

Last modified 13 years ago

## #346 new enhancement

# Generalize NNI to utilize different metrics — at Initial Version

| Reported by: | Peter | Owned by: | Jari Häkkinen |
|---|---|---|---|
| Priority: | major | Milestone: | yat 0.x+ |
| Component: | utility | Version: | trunk |
| Keywords: | | Cc: | |

### Description

It has been reported that it might be beneficial to use a correlation-based distance rather than a Euclidean one (see e.g. Brock et al.).

It seems trivial to simply add a Distance class and thereby achieve the generalization. However, one had better be careful here, because changing the metric should be reflected not only in the calculation of the distance between rows (i.e. genes) but also in the imputation equation (see eq. 10 in the WeNNI paper).

In equation 10, the imputation value is simply a weighted average of the values of the nearest neighbors. This is motivated by the assumption that nearest neighbors should also be close in the sample we are imputing. With a positive semi-definite distance measure such as PearsonDistance, this is suboptimal, because two vectors can have a very small distance even though their element values are very different. The distance can actually be zero even for non-identical vectors. In other words, the distance is translation and scale invariant. Say, for instance, that we have the following small matrix (mv marks the missing value):

`0 12 0 12 mv`

`4 8 4 8 4`

Obviously, imputing the missing value as 4 is not optimal here: the vectors have zero distance prior to imputation, and after imputation the distance would no longer be zero. Rather, we would like to set the value to 0. How can that be achieved? The key is in the invariances mentioned above. Recall that a correlation-based distance is equivalent to a z-score normalization followed by Euclidean distance. It is therefore tempting to z-score each row and then use Euclidean distance to impute values.

However, that approach has a disadvantage, because a correlation calculation is based solely on pairs of data present in both vectors. The average and variance used in the correlation calculation would therefore differ from those used in the z-score normalization, which would yield unwanted behavior such as that mentioned above. A better approach is to perform the z-score normalization based on the same data that is used in the calculation of the distance:

`y' = (y-m)/s`

The missing value can then be imputed from the neighbor: `y' = x'`

and the z-score is reversed to get back the original average and scale `y = s*y'+m`

(Well, technically the average and scale will differ slightly once the imputed value is included, but almost...)
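The three steps above can be checked on the small example matrix. The following is only a sketch in plain Python, not yat code: the helper names `pairwise_stats` and `pearson_distance` are hypothetical, missing values are represented by `None`, and the population standard deviation is assumed (the choice of normalization cancels in the imputed value).

```python
import math

def pairwise_stats(x, y):
    """Mean and standard deviation of x and y over positions present in both."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    m_x = sum(a for a, _ in pairs) / n
    m_y = sum(b for _, b in pairs) / n
    s_x = math.sqrt(sum((a - m_x) ** 2 for a, _ in pairs) / n)
    s_y = math.sqrt(sum((b - m_y) ** 2 for _, b in pairs) / n)
    return pairs, (m_x, s_x), (m_y, s_y)

def pearson_distance(x, y):
    """1 - Pearson correlation, using only pairs present in both vectors."""
    pairs, (m_x, s_x), (m_y, s_y) = pairwise_stats(x, y)
    cov = sum((a - m_x) * (b - m_y) for a, b in pairs) / len(pairs)
    return 1.0 - cov / (s_x * s_y)

y = [0, 12, 0, 12, None]   # row with the missing value (mv)
x = [4, 8, 4, 8, 4]        # its nearest neighbor

_, (m_x, s_x), (m_y, s_y) = pairwise_stats(x, y)

x_z = (x[4] - m_x) / s_x   # z-score of the neighbor at the missing position
y_z = x_z                  # impute in z-score space: y' = x'
y_imputed = s_y * y_z + m_y  # reverse the z-score: y = s*y' + m

print("distance before imputation:", pearson_distance(x, y))
print("imputed value:", y_imputed)
print("distance after imputing 4:", pearson_distance(x, [0, 12, 0, 12, 4]))
```

Because the statistics are taken over exactly the pairs used for the distance, the imputed value lands on 0 rather than 4, and imputing 4 instead leaves a clearly non-zero distance.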

Implementation-wise there is probably no reason to perform the normalization back and forth. Instead, one could calculate the correlation distance using AveragerPairWeighted and thereby obtain the nearest neighbors. The trick is then to calculate the imputation value directly using the averages and variances of x and y.

`y = s_y * y' + m_y = s_y * x' + m_y = s_y * (x-m_x) / s_x + m_y`

which has the form of an LS regression line of y on x (the slope `s_y/s_x` equals the LS slope when the correlation is 1).
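A direct calculation without normalizing back and forth might look like the sketch below. It follows the z-score steps (`y' = x'`, then rescaling with the mean and scale of y), so the prediction is `s_y * (x - m_x) / s_x + m_y`. The function names are hypothetical, and combining several neighbors by a weighted average of their per-neighbor predictions is an assumption made here in the spirit of eq. 10, not something the ticket specifies.

```python
import math

def predict_from_neighbor(x, y, j):
    """Predict the missing y[j] from neighbor x as s_y * (x[j] - m_x) / s_x + m_y.

    Means and standard deviations use only positions present in both vectors.
    Assumes x is not constant over those positions (s_x > 0).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    m_x = sum(a for a, _ in pairs) / n
    m_y = sum(b for _, b in pairs) / n
    s_x = math.sqrt(sum((a - m_x) ** 2 for a, _ in pairs) / n)
    s_y = math.sqrt(sum((b - m_y) ** 2 for _, b in pairs) / n)
    return s_y * (x[j] - m_x) / s_x + m_y

def impute(y, neighbors, j, weights=None):
    """Weighted average of the per-neighbor predictions (uniform by default)."""
    if weights is None:
        weights = [1.0] * len(neighbors)
    preds = [predict_from_neighbor(x, y, j) for x in neighbors]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

y = [0, 12, 0, 12, None]
neighbors = [[4, 8, 4, 8, 4], [1, 3, 1, 3, 1]]  # second row is made up for illustration
value = impute(y, neighbors, 4)
print(value)
```

Both neighbors are perfectly correlated with y on the shared positions, so each predicts 0 for the missing entry even though their raw values there are 4 and 1.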
