Opened 16 years ago

Closed 16 years ago

# weighted euclidean in euclidean_vector_distance.h has some weird properties

Reported by: Peter
Owned by: Peter
Priority: major
Milestone: yat 0.4
Component: statistics
Version: trunk

It returns `sqrt(ap.sum_squared_deviation())`, that is, sqrt(sum [w_x w_y (x-y)^2]), which is a bit weird. In WeNNI we use the slightly different sum [w_x w_y (x-y)^2] / sum [w_x w_y]. Otherwise, speaking in WeNNI terms, a gene of very poor quality (small weights) will be close to all other genes, so your nearest neighbor will very likely be a gene with poor quality. The same happens, of course, for the nearest neighbor classifier (NNC): samples with poor quality tend to be nearest neighbors, which must be unsound.

I suggest we use the distance we use in WeNNI (or the sqrt of it).
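To make the effect concrete, here is a minimal C++ sketch of the two formulas quoted above. The `Weighted` struct and both function names are hypothetical helpers written for this illustration. They are not yat's API and do not reproduce its implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of yat): one data point and its weight.
struct Weighted {
  double value;
  double weight;
};

// Behaviour described in the ticket: sqrt(sum [w_x w_y (x-y)^2]).
double unnormalized_distance(const std::vector<Weighted>& x,
                             const std::vector<Weighted>& y)
{
  assert(x.size() == y.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = x[i].weight * y[i].weight;
    const double d = x[i].value - y[i].value;
    sum += w * d * d;
  }
  return std::sqrt(sum);
}

// WeNNI-style quantity: sum [w_x w_y (x-y)^2] / sum [w_x w_y].
// Returns NaN when the total weight is zero (a choice made for this sketch).
double normalized_squared_distance(const std::vector<Weighted>& x,
                                   const std::vector<Weighted>& y)
{
  assert(x.size() == y.size());
  double sum = 0.0;
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = x[i].weight * y[i].weight;
    const double d = x[i].value - y[i].value;
    sum += w * d * d;
    sum_w += w;
  }
  if (sum_w == 0.0)
    return std::numeric_limits<double>::quiet_NaN();
  return sum / sum_w;
}
```

With a target gene {1, 2, 3} (weights 1), a good-quality gene {1.5, 2.5, 3.5} (weights 1) and a poor-quality gene {5, 6, 7} (weights 0.01), the unnormalized distance ranks the poor-quality gene as the nearer one, while the normalized quantity ranks the good-quality gene as nearer.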

### comment:1 Changed 16 years ago by Peter

Description: modified (diff)

### comment:2 Changed 16 years ago by Peter

Description: modified (diff)

fixing superscript

### comment:3 Changed 16 years ago by Markus Ringnér

Fine. When I did it I was going to ask you to look at it because I was not sure, but I forgot to mention it. It should, however, work in a way that is "analogous" to the weighted Pearson distance. Does it?

### comment:4 Changed 16 years ago by Peter

The weighted Pearson I think we have taken from Patrik. It has some nice properties that we try to have for all weighted statistics:

1. A weight equal to zero is the same as removing that data point

2. All weights equal to unity -> non-weighted version

3. Invariant under rescaling all weights

`sum_squared_deviation` breaks number 3, and that is the source of the strange behaviour. If we use `msd()` it will behave very much like Pearson. My only concern is that we break number 2, because in the non-weighted case the squared distance is clearly `n*msd()`. One could add a weighted version of `n` to compensate for this, but that would probably make it too complicated (at least that is what I decided when I implemented Euclidean).
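The rescaling argument can be written out explicitly. The derivation below uses the notation of the ticket description and assumes that `msd()` in the weighted case is the WeNNI-style ratio, sum [w_x w_y (x-y)^2] / sum [w_x w_y]. Multiplying every weight by a constant c > 0 gives:

```latex
% Sum of squared deviations picks up a factor c^2 (breaks property 3):
\sum_i (c\,w_{x,i})(c\,w_{y,i})(x_i - y_i)^2
  = c^2 \sum_i w_{x,i}\,w_{y,i}\,(x_i - y_i)^2

% The weighted mean squared deviation is unchanged (satisfies property 3):
\frac{\sum_i (c\,w_{x,i})(c\,w_{y,i})(x_i - y_i)^2}
     {\sum_i (c\,w_{x,i})(c\,w_{y,i})}
  = \frac{\sum_i w_{x,i}\,w_{y,i}\,(x_i - y_i)^2}
         {\sum_i w_{x,i}\,w_{y,i}}

% With all weights equal to one the ratio becomes
% \frac{1}{n}\sum_i (x_i - y_i)^2, which is the squared
% Euclidean distance divided by n (the concern about property 2).
```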

### comment:5 Changed 16 years ago by Markus Ringnér

I think we should make sure number 2 is satisfied; otherwise users of things downstream, such as NCC, will be very confused.

Note: when I implemented it there was no `sum_squared_deviation` in Weighted, only `msd` (with an explicit implementation). All I did was look at Unweighted, where `msd` was implemented as `sum_squared_deviation/n`, so I tried to change Weighted analogously. There was no more thinking behind it than that.

### comment:6 Changed 16 years ago by Peter

Ok, either we implement euclidean as `sqrt(n()*msd())` or as `sqrt(msd())`. The latter is a bit confusing since that is not what we normally mean by mean distance, so I guess we go for the first alternative.

`n()` here is not `n` but a weighted version (see `AveragerWeighted`) fulfilling 1-3. This is a quick fix for me, so if you just give me the green light...
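A minimal sketch of the `sqrt(n()*msd())` alternative follows. The ticket does not spell out how the weighted `n()` is defined. This sketch assumes the effective count (sum w)^2 / sum w^2 with w = w_x w_y, which is one definition that fulfils properties 1-3. The `PairSums` struct is a hypothetical stand-in, not yat's `AveragerWeighted` or its pair averager.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical accumulator for paired, weighted data (not yat's class).
struct PairSums {
  double sum_w = 0.0;    // sum of w_i, where w_i = w_x,i * w_y,i
  double sum_ww = 0.0;   // sum of w_i^2
  double sum_wdd = 0.0;  // sum of w_i * (x_i - y_i)^2

  void add(double x, double y, double wx, double wy)
  {
    assert(wx >= 0.0 && wy >= 0.0);
    const double w = wx * wy;
    const double d = x - y;
    sum_w += w;
    sum_ww += w * w;
    sum_wdd += w * d * d;
  }

  // ASSUMPTION: weighted count defined as (sum w)^2 / sum w^2.
  // Equals the number of points when all weights are one, ignores
  // zero-weight points, and is unchanged when all weights are rescaled.
  double n() const { return sum_w * sum_w / sum_ww; }

  // Weighted mean squared deviation: sum w d^2 / sum w.
  double msd() const { return sum_wdd / sum_w; }

  // Proposed distance; NaN if every weight is zero (0/0 in n() and msd()).
  double euclidean() const { return std::sqrt(n() * msd()); }
};
```

Under this assumption, unit weights reproduce the ordinary Euclidean distance, doubling every weight leaves the result unchanged, and adding a point with zero weight has no effect.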

### comment:7 Changed 16 years ago by Markus Ringnér

You have thought much more about weighted statistics than me. If you can get something that fulfills all 3 requirements, then certainly go ahead.

### comment:8 follow-up:  10 Changed 16 years ago by Peter

Status: new → assigned

Yes, but my question was whether we should keep non-weighted Euclidean as Euclid himself once defined it, or if we could have a normalizing factor n. I think we go for not modifying the definition of non-weighted Euclidean distance; otherwise, speak up.

### comment:9 Changed 16 years ago by Peter

Resolution: → fixed

Status: assigned → closed

fixed in [889]