Opened 14 years ago

Closed 14 years ago

#210 closed enhancement (wontfix)

Distance should take DataLookupWeighted1D

Reported by: Peter Owned by: Peter
Priority: major Milestone: yat 0.4
Component: statistics Version: trunk
Keywords: Cc:

Description (last modified by Markus Ringnér)

relates to ticket:237

could be implemented in base class as a virtual by copying and data into vectors and calling other function. Leave function virtual just liek the other functions implemented in Distance.

Change History (14)

comment:1 Changed 14 years ago by Peter

Description: modified (diff)

ticket:237 is marked as related

comment:2 Changed 14 years ago by Markus Ringnér

I do not think we should add a function taking a DataLookupWeighted1D because: 1) Then one should perhaps add a multitude of functions supporting things in the future. 2) statistics should depend minimally on classifier.

Instead I think we should go for a templatized member function that calculates the distance between two classes T1,T2 both having "double operator()". Since the weighted functions return weight*data through their operators such a templatized member function will then claculate distance between pairs of many different vector-like objects..

comment:3 Changed 14 years ago by Peter

I don't understand how you can templatize that and make it work.

You cant just simply replace data with weight*data. That is not even true for the mean/weighted mean. Weighted statistics is slightly more complex than that.

comment:4 Changed 14 years ago by Markus Ringnér

OK, so operator() is not a sufficent requirement. But I still think we should make a templatized function that can calculate distances for anything that can return both a value and a weight through two well-defined functions/operators. And another templatized function for unweighted things. Rather than a long list of functions taking different vector-like objects.

comment:5 Changed 14 years ago by Peter

Ok, this sounds like a situation for traits!!!

On general template for operator() or should we go for operator[] (to allow also std::vector)

And one specialized template for classes having data() and weight() functions.

comment:6 Changed 14 years ago by Markus Ringnér

Description: modified (diff)

1) PearsonDistance? makes use of AveragerPair? so just making templates in PearsonDistance? does not work. In this case the templates stuff would have to be in AveragerPair? also. Right?

2) I have been thinking about traits a little bit and I do not see an easy way to do this. Even if we can get it to work it seems as if we need individual traits specializations for every vector-like object we are intending to get to work with Distance (maybe more elegant to implement such a creature than adding a member function to Distance but ...). If this really can be solved with traits maybe we should start sketching on a way to do it here, as an exercise to really learn traits. But we should also think about if some alternative template strategy is optimal for this problem.

So what should template <class vector_T> struct vector_traits {}; look like default and how do we specialize it for DataLookupWeighted? ?

comment:7 Changed 14 years ago by Peter

Component: classifierstatistics

1) Well, Euclidean makes use of AveragerPair too. Both classes are filling up AveragerPair and calculating the distance based on them. I thought about utilizing this overlap, but cant see a good way of doing that without losing generality in the base class.

2) There is a good item in one of H. Sutter's books (I think it is item 30 in more exceptional C++) describing a similar situation. I'll bring the book to work tomorrow and I can be more specific.

It is a bit unclear to me though, what is the idea here? What are we aiming at templatizing? What will be template, and what will be virtual? It will be good if you could clarified...

comment:8 in reply to:  7 Changed 14 years ago by Markus Ringnér

Replying to peter:

1) Well, Euclidean makes use of AveragerPair too. Both classes are filling up AveragerPair and calculating the distance based on them. I thought about utilizing this overlap, but cant see a good way of doing that without losing generality in the base class.

So it is not so easy to get templatized member functions ...

2) There is a good item in one of H. Sutter's books (I think it is item 30 in more exceptional C++) describing a similar situation. I'll bring the book to work tomorrow and I can be more specific.

I tried to find it but am not sure which item you are referring to.

It is a bit unclear to me though, what is the idea here? What are we aiming at templatizing? What will be template, and what will be virtual? It will be good if you could clarified...

Well you wrote earlier:

Ok, this sounds like a situation for traits!!! On general template for operator() or should we go for operator[] (to allow also std::vector) And one specialized template for classes having data() and weight() functions.

So I thought you had a clear idea... I was originally thinking that we should just have a templatized member function, double operator()(T1,T2), in Distance that calculated the distance between vector-types T1 and T2. Then traits or some other template baggage checking method would be nice, if one could get it to call either vector operator()/[] or data+weight depending on the traits. This function would then have to be overloaded in derived classes to calculate the proper distances. Does this work? An alternative would be to templatize the whole class Distance, but then I am not sure how to get any class hierarchy in. In this latter case I guess we have to make everything a template both the vector types and the distance. However the structure for this is messy with both e.g. PearsonDistance? and Pearson and this would require a big but perhaps interesting redesign.

comment:9 Changed 14 years ago by Markus Ringnér

An alternative which is appealing at first sight is to do this with iterators instead. I think if we have an iterator that returns operator() and unity for weights for some vector types and a derived iterator that returns data() and weight() then all we would have to do with Distance is to have an "double operator()(Iterator&,Iterator&)"

comment:10 Changed 14 years ago by Peter

Ok, so we are talking about a templatized pure virtual member function in Distance that is implmented in daughter classes.

I would like to factorize the issues a bit:

  1. Passing Iterators or Lookups (well any container)?
  2. Should we support both weighted and non-weighted?

On question 1. I think the alternative are very similar. Iterators solution is perhaps a bit more general. It is more STL-alike, which is appealing. If we are careful it will perhaps even be possible to calculate the distance between 2 std::map. I like the Iterator, but then we should aim at implementing iterators for all 1D containers, in other words, also for utility::vector.

On question 2 an alternative is, of course, NOT to throw away weighted statistics, but we could ignore the non-weighted and just accept that non-weighted statistics is a special case of non-weighted statistics (all w=1). So why are we struggling with having two implementations? The only reason I can remember is speed (runtime not development). I have concidered to suggest that we should trash MatrixLookup and all other functionality that are pure non-weighted, but I'm not sure it is a good idea. After all the only design decision we have taken in this project is that "missing values should be handled with weights equal zero not with NaNs?", which sort of implies that we prefer having a non-weighted GSL-style version that is efficient (in runtime). In summary I think we should keep both weighted and non-weighted.

Ok, so what does this imply on this Distance issue? Well, it means that if you pass DataLookup1d::Iterator to Distance then non-weighted statistics should be used. This is here traits enter the stage. I think it should be possible to have one general template for nor-weighted iterator just requiring ForwardIterator? and on more specialized template requring that operator* returns std::pair<double, double>. The item I'm refering to is item 31 in Sutter's more exceptional C++. There is a section named Applying Traits (starting on page 194 in my version).

Another thing: I think we should try to follow the advice to make public function non-virtual, i.e., separating interface from implementation. The public non-virtual function could of course call a private/protected virtual function.

Yet another thing: If we implement iterators, I think we should change the interface of the Averagers as well. They have functions taking containers should be iterators.

comment:11 in reply to:  10 Changed 14 years ago by Markus Ringnér

Replying to peter:

Ok, so we are talking about a templatized pure virtual member function in Distance that is implmented in daughter classes.

I would like to factorize the issues a bit:

  1. Passing Iterators or Lookups (well any container)?
  2. Should we support both weighted and non-weighted?

On question 1. I think the alternative are very similar. Iterators solution is perhaps a bit more general. It is more STL-alike, which is appealing. If we are careful it will perhaps even be possible to calculate the distance between 2 std::map. I like the Iterator, but then we should aim at implementing iterators for all 1D containers, in other words, also for utility::vector.

I think 1 is nicer because then the classes in statistics (Distance,Averager,...) could be written with an interface that accepts iterators (to std, lookups, utility::vector,...) and hence statistics is more "independent of " classifier. In particular, as more vector-like objects appear ideally nothing has to be changed in statistics to make statistics use of the new vectors. The question then is where do the iterators belong?

On question 2 an alternative is, of course, NOT to throw away weighted statistics, but we could ignore the non-weighted and just accept that non-weighted statistics is a special case of non-weighted statistics (all w=1). So why are we struggling with having two implementations? The only reason I can remember is speed (runtime not development). I have concidered to suggest that we should trash MatrixLookup and all other functionality that are pure non-weighted, but I'm not sure it is a good idea. After all the only design decision we have taken in this project is that "missing values should be handled with weights equal zero not with NaNs?", which sort of implies that we prefer having a non-weighted GSL-style version that is efficient (in runtime). In summary I think we should keep both weighted and non-weighted.

I agree (we made this decision earlier).

Ok, so what does this imply on this Distance issue? Well, it means that if you pass DataLookup1d::Iterator to Distance then non-weighted statistics should be used. This is here traits enter the stage. I think it should be possible to have one general template for nor-weighted iterator just requiring ForwardIterator? and on more specialized template requring that operator* returns std::pair<double, double>. The item I'm refering to is item 31 in Sutter's more exceptional C++. There is a section named Applying Traits (starting on page 194 in my version).

All I am sure of is, unless there are two Distance funtions with different names or signatures, traits is the best way to go: they can only both be "double operator()(Iterator&,Iterator&) without traits if we let if-then-else-ugliness due to poor design come into play and then we are back to slow, poor separation of weighted and unweighted.

Another thing: I think we should try to follow the advice to make public function non-virtual, i.e., separating interface from implementation. The public non-virtual function could of course call a private/protected virtual function.

OK, but we do not do this consistently in yat.

Yet another thing: If we implement iterators, I think we should change the interface of the Averagers as well. They have functions taking containers should be iterators.

Of course, see above.

comment:12 Changed 14 years ago by Markus Ringnér

Remember that we now have different functions with different signatures for weighted and unweighted in Distance and other places. So the iterator solution with different signatures and without traits: one iterator hierarchy for unweighted containers and one for weighted would provide a nice separation of containers (utility,classifier) from algorithms (statistics), and allow for statistics to be used with future containers without modifications to statistics, without making things more cluttered than they are presently. The nice thing with using traits is that it to the next level: weighted and unweighted in a common fashion. This would be in line with how we have a common hierarchy for weighted and unweighted lookups in classifier. To summarize: I think going to iterators and going to traits address (and hopefully solves) two different problems. Should both be solved? Priorities?

comment:13 Changed 14 years ago by Peter

Milestone: later0.4
Priority: minormajor

This ticket has been splitted into ticket:244 and ticket:245 (and more)

comment:14 Changed 14 years ago by Peter

Resolution: wontfix
Status: newclosed

Distance has been replaced by template functions (see ticket:250)

Note: See TracTickets for help on using tickets.