## #253 closed defect (fixed)

# weighted euclidean in euclidean_vector_distance.h has some weird properties

| Reported by: | Peter | Owned by: | Peter |
|---|---|---|---|
| Priority: | major | Milestone: | yat 0.4 |
| Component: | statistics | Version: | trunk |
| Keywords: | | Cc: | |

### Description

It returns `sqrt(ap.sum_squared_deviation())`, that is, sqrt(sum [w_x w_y (x-y)^{2}]), which is a bit weird. In WeNNI we use the slightly different sum [w_x w_y (x-y)^{2}] / sum [w_x w_y]. Otherwise, speaking in WeNNI terms, a gene that has very poor quality (small weights) will be close to all other genes, so your nearest neighbor will very likely be a gene with poor quality. The same happens, of course, for the nearest neighbor classifier (NNC): samples with poor quality tend to be nearest neighbors, which must be unsound.

I suggest we use the distance we use in WeNNI (or the sqrt of it).
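To make the effect concrete, here is a minimal sketch of the two quantities compared above. The function names are hypothetical stand-ins written for this illustration, not the yat implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not yat code): sum_i wx[i]*wy[i]*(x[i]-y[i])^2.
// All four vectors are assumed to have the same length.
double weighted_sum_squared_deviation(const std::vector<double>& x,
                                      const std::vector<double>& wx,
                                      const std::vector<double>& y,
                                      const std::vector<double>& wy)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double d = x[i] - y[i];
    sum += wx[i] * wy[i] * d * d;
  }
  return sum;
}

// Hypothetical helper: the same sum divided by sum_i wx[i]*wy[i],
// i.e. the WeNNI-style normalized form. The caller must make sure the
// sum of weight products is positive.
double weighted_msd(const std::vector<double>& x,
                    const std::vector<double>& wx,
                    const std::vector<double>& y,
                    const std::vector<double>& wy)
{
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    sum_w += wx[i] * wy[i];
  return weighted_sum_squared_deviation(x, wx, y, wy) / sum_w;
}
```

Take x = (1, 2, 3) with unit weights, a nearby vector (1.5, 2.5, 3.5) with unit weights, and a distant vector (5, 6, 7) with all weights 0.01. The unnormalized sums are 0.75 and 0.48, so the poor-quality vector looks closer. The normalized values are 0.25 and 16, which puts the nearby vector first.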

### Change History (10)

### comment:1 Changed 13 years ago by

Description: | modified (diff) |
---|

### comment:2 Changed 13 years ago by

Description: | modified (diff) |
---|

### comment:3 Changed 13 years ago by

Fine. When I did it I was going to ask you to look at it because I was not sure, but I forgot to mention it. It should, however, work in a way that is "analogous" to the weighted Pearson distance. Does it?

### comment:4 Changed 13 years ago by

The weighted Pearson I think we have taken from Patrik. It has some nice properties that we try to have for all weighted statistics:

1. A weight equal to zero is the same as removing that data point

2. All weights equal to unity -> non-weighted version

3. Invariant under rescaling all weights

`sum_squared_deviation` breaks number 3, and that is the source of the strange behaviour. If we use `msd()` it will behave very much like Pearson. My only concern is that we break number 2, because in the non-weighted case the squared distance is clearly `n*msd()`. One could add a weighted version of `n` to compensate for this, but that would probably make it too complicated (at least that is what I decided when I implemented Euclidean).

### comment:5 Changed 13 years ago by

I think we should make sure number 2 is satisfied; otherwise users of things downstream, such as NCC, will be very confused.

Note: when I implemented it there was no summed_squared_deviation in Weighted, only msd (with an explicit implementation). All I did was look at Unweighted, where msd was implemented as summed_squared_deviation/n, so I tried to change Weighted analogously. There was no more thinking behind it than that.

### comment:6 Changed 13 years ago by

Ok, either we implement euclidean as `sqrt(n()*msd())` or as `sqrt(msd())`. The latter is a bit confusing since that is not what we normally mean by mean distance, so I guess we go for the first alternative.

`n()` here is not `n` but a weighted version (see AveragerWeighted) fulfilling 1-3. This is a quick fix for me, so if you just give me green light...
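A minimal sketch of the first alternative follows. The ticket does not spell out which weighted `n` is meant, so the effective count (sum w)^2 / sum w^2 used here is an assumption chosen because it satisfies 1-3; it may differ from what AveragerWeighted actually computes. The function name `weighted_euclidean` is hypothetical, and `w[i]` again stands for the product w_x w_y.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of sqrt(n()*msd()) with an assumed effective count
// n_eff = (sum w)^2 / (sum w^2). Not the yat implementation.
// Assumes at least one weight is nonzero.
double weighted_euclidean(const std::vector<double>& x,
                          const std::vector<double>& y,
                          const std::vector<double>& w)
{
  double sum_w = 0.0;    // sum of weights
  double sum_ww = 0.0;   // sum of squared weights
  double sum_wdd = 0.0;  // weighted sum of squared deviations
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double d = x[i] - y[i];
    sum_w += w[i];
    sum_ww += w[i] * w[i];
    sum_wdd += w[i] * d * d;
  }
  const double n_eff = sum_w * sum_w / sum_ww;  // equals n for unit weights
  const double msd = sum_wdd / sum_w;           // weighted mean squared deviation
  return std::sqrt(n_eff * msd);
}
```

Under this assumption a zero weight drops out of all three sums, unit weights give the ordinary Euclidean distance, and a common factor in the weights cancels in both `n_eff` and `msd`.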

### comment:7 Changed 13 years ago by

You have thought much more about weighted statistics than I have. If you can get something that fulfills all 3 requirements, then certainly go ahead.

### comment:8 follow-up: 10 Changed 13 years ago by

Status: | new → assigned |
---|

Yes, but my question was whether we should keep non-weighted euclidean as himself once defined it or if we could have a normalizing factor n. I think we go for not modifying the definition of non-weighted Euclidean distance - else speak up.

### comment:10 Changed 13 years ago by

Replying to peter:

> Yes, but my question was whether we should keep non-weighted euclidean as himself once defined it or if we could have a normalizing factor n. I think we go for not modifying the definition of non-weighted Euclidean distance - else speak up.

I agree: unweighted Euclidean distance is a standard definition and we should not return some other value.
