Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#253 closed defect (fixed)

weighted euclidean in euclidean_vector_distance.h has some weird properties

Reported by: Peter
Owned by: Peter
Priority: major
Milestone: yat 0.4
Component: statistics
Version: trunk
Keywords:
Cc:

Description (last modified by Peter)

It returns sqrt(ap.sum_squared_deviation()), that is, sqrt(sum [w_x w_y (x-y)^2]), which is a bit weird. In WeNNI we use the slightly different sum [w_x w_y (x-y)^2] / sum [w_x w_y]. Otherwise - speaking in WeNNI terms - a gene that has very poor quality (small weights) will be close to all other genes, so your nearest neighbor will very likely be a gene with poor quality. The same happens, of course, for the nearest neighbor classifier (NNC): samples with poor quality tend to be nearest neighbors, which must be unsound.

I suggest we use the distance we use in WeNNI (or the sqrt of it).
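For illustration, here is a minimal standalone sketch of the two variants (plain C++; the function names and loop layout are mine, not the interface in euclidean_vector_distance.h):

#include <cmath>
#include <cstddef>
#include <vector>

// Current behaviour described above: sqrt( sum[ w_x*w_y*(x-y)^2 ] ).
// Small weights shrink the sum, so a poor-quality point looks close to
// everything.
double unnormalized_weighted_euclidean(const std::vector<double>& x,
                                       const std::vector<double>& y,
                                       const std::vector<double>& wx,
                                       const std::vector<double>& wy)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double d = x[i] - y[i];
    sum += wx[i] * wy[i] * d * d;
  }
  return std::sqrt(sum);
}

// WeNNI-style variant: divide by the summed weights, so the result is a
// weighted mean squared deviation and no longer shrinks with the weights.
double wenni_weighted_distance(const std::vector<double>& x,
                               const std::vector<double>& y,
                               const std::vector<double>& wx,
                               const std::vector<double>& wy)
{
  double sum = 0.0;
  double wsum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double d = x[i] - y[i];
    sum += wx[i] * wy[i] * d * d;
    wsum += wx[i] * wy[i];
  }
  return sum / wsum;
}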

Change History (10)

comment:1 Changed 14 years ago by Peter

Description: modified (diff)

comment:2 Changed 14 years ago by Peter

Description: modified (diff)

fixing superscript

comment:3 Changed 14 years ago by Markus Ringnér

Fine. When I did it I was going to ask you to look at it because I was not sure, but I forgot to mention it. It should, however, work in a way that is "analogous" to the weighted Pearson distance. Does it?

comment:4 Changed 14 years ago by Peter

I think we have taken the weighted Pearson from Patrik. It has some nice properties that we try to have for all weighted statistics:

  1. A weight equal to zero is the same as removing that data point

  2. All weights equal to unity -> non-weighted version

  3. Invariant under rescaling all weights

sum_squared_deviation breaks number 3, and that is the source of the strange behaviour. If we use msd() it will behave very much like Pearson. My only concern is that we break number 2, because in the non-weighted case the distance is clearly n*msd(). One could add a weighted version of n to compensate for this, but that probably makes it too complicated (at least that is what I decided when I implemented Euclidean).
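To make requirement 3 concrete, here is a small self-contained check (the variable names are mine, not the AveragerPair interface): rescaling all weights changes the summed squared deviation but leaves the weighted mean squared deviation untouched.

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
  std::vector<double> x = {1.0, 2.0, 4.0};
  std::vector<double> y = {0.0, 2.5, 3.0};
  std::vector<double> w = {1.0, 0.5, 0.2};    // combined weights w_x*w_y

  for (double scale : {1.0, 10.0}) {          // rescale all weights
    double ssd = 0.0, wsum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
      double d = x[i] - y[i];
      ssd  += scale * w[i] * d * d;
      wsum += scale * w[i];
    }
    // ssd scales with the weights (breaks requirement 3);
    // msd = ssd / wsum is invariant under the rescaling.
    std::printf("scale %4.1f  ssd %8.4f  msd %8.4f\n", scale, ssd, ssd / wsum);
  }
}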

comment:5 Changed 14 years ago by Markus Ringnér

I think we should make sure number 2 is satisfied; otherwise users of things downstream, such as NCC, will be very confused.

Note: when I implemented it there was no sum_squared_deviation in Weighted, only msd (with an explicit implementation). All I did was look at Unweighted, where msd was implemented as sum_squared_deviation/n, so I tried to change Weighted analogously. There was no more thinking behind it than that.

comment:6 Changed 14 years ago by Peter

Ok, either we implement Euclidean as sqrt(n()*msd()) or as sqrt(msd()). The latter is a bit confusing since that is not what we normally mean by mean distance, so I guess we go for the first alternative.

n() here is not n but a weighted version (see AveragerWeighted) fulfilling 1-3. This is a quick fix for me, so if you just give me the green light...
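A hedged sketch of the first alternative, sqrt(n()*msd()). The weighted n() used here, (sum w)^2 / (sum w^2), is only one example that fulfills 1-3 and is not necessarily the definition in AveragerWeighted:

#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the proposal sqrt(n()*msd()).  The weighted n()
// below is just one choice that fulfills requirements 1-3; it is NOT
// claimed to be what AveragerWeighted actually implements.
double weighted_euclidean(const std::vector<double>& x,
                          const std::vector<double>& y,
                          const std::vector<double>& wx,
                          const std::vector<double>& wy)
{
  double ssd = 0.0, wsum = 0.0, w2sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double w = wx[i] * wy[i];
    double d = x[i] - y[i];
    ssd   += w * d * d;
    wsum  += w;
    w2sum += w * w;
  }
  double msd = ssd / wsum;            // invariant under weight rescaling
  double n_w = wsum * wsum / w2sum;   // equals n when all weights are 1
  return std::sqrt(n_w * msd);
}

With unit weights this reduces to sqrt(sum (x-y)^2), i.e. the ordinary Euclidean distance, so requirement 2 holds as well.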

comment:7 Changed 14 years ago by Markus Ringnér

You have thought much more about weighted statistics than me. If you can get something that fulfills all 3 requirements, then certainly go ahead.

comment:8 Changed 14 years ago by Peter

Status: new -> assigned

Yes, but my question was whether we should keep non-weighted Euclidean as Euclid himself once defined it, or if we could have a normalizing factor n. I think we go for not modifying the definition of non-weighted Euclidean distance - else speak up.

comment:9 Changed 14 years ago by Peter

Resolution: fixed
Status: assigned -> closed

fixed in [889]

comment:10 in reply to:  8 Changed 14 years ago by Markus Ringnér

Replying to peter:

Yes, but my question was whether we should keep non-weighted Euclidean as Euclid himself once defined it, or if we could have a normalizing factor n. I think we go for not modifying the definition of non-weighted Euclidean distance - else speak up.

I agree: unweighted Euclidean distance is a standard definition and we should not return some other value.
