Opened 13 years ago

Closed 13 years ago

#366 closed request (fixed)

specialize percentile (and thereby median indirectly) for weighted iterators

Reported by: Peter Owned by: Peter
Priority: major Milestone: yat 0.5
Component: statistics Version: trunk
Keywords: Cc:

Description (last modified by Peter)

need ticket:369

Not really needed now since we have no weighted mutable iterators.

However, if ticket:363 is gonna be implemented we'll have it.

Change History (11)

comment:1 Changed 13 years ago by Peter

Description: modified (diff)

This is needed for ticket:20.

comment:2 Changed 13 years ago by Peter

Description: modified (diff)

fixed typo in descr.

comment:3 Changed 13 years ago by Peter

Status: newassigned

comment:4 Changed 13 years ago by Peter

Description: modified (diff)

comment:5 Changed 13 years ago by Peter

A mutable weighted iterator is needed to test this, for example ticket:363 or ticket:368

comment:6 Changed 13 years ago by Peter

(In [1404]) refs #366 - weighted percentile Added a class Percentile that can calculate the percentile for both unweighted and weighted ranges. In order to make sense of the weighted case, I had to modify the definition of percentile (slightly). The old function percentile is using the old definition, but a new function, percentile2, is using the new function and is simply calling the Percentile class. The median is the same for the two definitions and therefore it makes no difference which function to call, but to enable calculation of weighted median percentile2 is called.

comment:7 Changed 13 years ago by Peter

When extending the tests, I realized I'm not too happy with the implementation in [1404]. Consider some weighted data: {1,1}, {2,1}, {4,2}. What is the median? I would like it to be three because it would make it equivalent with unweighted data: 1, 2, 4, 4. Have to think more here. Perhaps the old definition was not that bad after all...

comment:8 Changed 13 years ago by Peter

I've looked around for a different definitions of percentile, and there seems to be no consensus agreement.

What is clear is that median is (a bit sloppy) defined as "the middle point" and in case of an even number of data points one often define it as the mean of the two middle points. Also it makes sense that the 25th percentile (aka 1st quartile) should be the median of the lower half of the data, and similar 75th percentile (aka 3rd quartile) should be median of upper half of the data.

That is not the case for the current implementation in percentile (taken from GSL), in which x={0,1,2,3,4,5} gives a 25th percentile 1.25 when I would expect 1.0 based on the reasoning above. percentile2 gives the desired answer in this case.

OTOH, percentile2 behaves weird when having weights, as mentioned in previous post. To dissect that behaviour we need to go back to out three corner stones of weighted statistics:

  1. all w=1 should give the unweighetd case
  2. invariant under rescaling w -> lw
  3. one weight w_i = 0 equivalent to excluding that data point

Combining 1) and 2) we can see that, for example, duplicating data should have no effect, since it essentially means doubling all weights. That implies that {1,2,3,4,5} and {1,1,2,2,3,3,4,4,5,5} should give the same answer. For the median, 3, that is clearly the case, so let's look at 1st quartile instead. For the larger set the lower half is {1,1,2,2,3} of which the median is 2, which implies that the 1st quartile should be 2 also for {1,2,3,4,5}. In other words, we should not interpolate some value between 1 and 2. Instead we just take the nearest number. In some sense, having five number means that 1st number corresponds to 10th percentile, 2nd to the 30th percentile, 3rd to 50th percentile and so on and so forth. Most often we are not lucky to hit exactly at thos numbers, and then one take the Nearest Number, which in the example above was the 30th percentile which is

  1. In the borde case when ending up in the middle of to points, one

can take the average as is standard for the median in case of an even number of data points.

Numerically, there is a problem though. The case when we exactly in between is not a good thing when having finite accuracy and rounding errors. Instead one must allow some errors, but how much?

I will change the implementation in Percentiler to follow this definition/algorithm.

comment:9 Changed 13 years ago by Peter

(In [1418]) refs #366 - need to polish on some special cases

comment:10 Changed 13 years ago by Peter

(In [1420]) fixing the weighted implementation of Percentiler - refs #366. Some documentation is needed before closing the ticket.

comment:11 Changed 13 years ago by Peter

Resolution: fixed
Status: assignedclosed

(In [1422]) docs - fixes #366

Note: See TracTickets for help on using tickets.