Opened 15 years ago
Closed 15 years ago
#210 closed enhancement (wontfix)
Distance should take DataLookupWeighted1D
Reported by: | Peter | Owned by: | Peter |
---|---|---|---|
Priority: | major | Milestone: | yat 0.4 |
Component: | statistics | Version: | trunk |
Keywords: | Cc: |
Description (last modified by )
relates to ticket:237
could be implemented in base class as a virtual by copying and data into vectors and calling other function. Leave function virtual just liek the other functions implemented in Distance.
Change History (14)
comment:1 Changed 15 years ago by
Description: | modified (diff) |
---|
comment:2 Changed 15 years ago by
I do not think we should add a function taking a DataLookupWeighted1D because: 1) Then one should perhaps add a multitude of functions supporting things in the future. 2) statistics should depend minimally on classifier.
Instead I think we should go for a templatized member function that calculates the distance between two classes T1,T2 both having "double operator()". Since the weighted functions return weight*data through their operators such a templatized member function will then claculate distance between pairs of many different vector-like objects..
comment:3 Changed 15 years ago by
I don't understand how you can templatize that and make it work.
You cant just simply replace data
with weight*data
. That is not even true for the mean/weighted mean. Weighted statistics is slightly more complex than that.
comment:4 Changed 15 years ago by
OK, so operator() is not a sufficent requirement. But I still think we should make a templatized function that can calculate distances for anything that can return both a value and a weight through two well-defined functions/operators. And another templatized function for unweighted things. Rather than a long list of functions taking different vector-like objects.
comment:5 Changed 15 years ago by
Ok, this sounds like a situation for traits!!!
On general template for operator() or should we go for operator[] (to allow also std::vector)
And one specialized template for classes having data() and weight() functions.
comment:6 Changed 15 years ago by
Description: | modified (diff) |
---|
1) PearsonDistance? makes use of AveragerPair? so just making templates in PearsonDistance? does not work. In this case the templates stuff would have to be in AveragerPair? also. Right?
2) I have been thinking about traits a little bit and I do not see an easy way to do this. Even if we can get it to work it seems as if we need individual traits specializations for every vector-like object we are intending to get to work with Distance (maybe more elegant to implement such a creature than adding a member function to Distance but ...). If this really can be solved with traits maybe we should start sketching on a way to do it here, as an exercise to really learn traits. But we should also think about if some alternative template strategy is optimal for this problem.
So what should template <class vector_T> struct vector_traits {}; look like default and how do we specialize it for DataLookupWeighted? ?
comment:7 follow-up: 8 Changed 15 years ago by
Component: | classifier → statistics |
---|
1) Well, Euclidean
makes use of AveragerPair
too. Both classes are filling up AveragerPair
and calculating the distance based on them. I thought about utilizing this overlap, but cant see a good way of doing that without losing generality in the base class.
2) There is a good item in one of H. Sutter's books (I think it is item 30 in more exceptional C++) describing a similar situation. I'll bring the book to work tomorrow and I can be more specific.
It is a bit unclear to me though, what is the idea here? What are we aiming at templatizing? What will be template, and what will be virtual? It will be good if you could clarified...
comment:8 Changed 15 years ago by
Replying to peter:
1) Well,
Euclidean
makes use ofAveragerPair
too. Both classes are filling upAveragerPair
and calculating the distance based on them. I thought about utilizing this overlap, but cant see a good way of doing that without losing generality in the base class.
So it is not so easy to get templatized member functions ...
2) There is a good item in one of H. Sutter's books (I think it is item 30 in more exceptional C++) describing a similar situation. I'll bring the book to work tomorrow and I can be more specific.
I tried to find it but am not sure which item you are referring to.
It is a bit unclear to me though, what is the idea here? What are we aiming at templatizing? What will be template, and what will be virtual? It will be good if you could clarified...
Well you wrote earlier:
Ok, this sounds like a situation for traits!!! On general template for operator() or should we go for operator[] (to allow also std::vector) And one specialized template for classes having data() and weight() functions.
So I thought you had a clear idea... I was originally thinking that we should just have a templatized member function, double operator()(T1,T2), in Distance that calculated the distance between vector-types T1 and T2. Then traits or some other template baggage checking method would be nice, if one could get it to call either vector operator()/[] or data+weight depending on the traits. This function would then have to be overloaded in derived classes to calculate the proper distances. Does this work? An alternative would be to templatize the whole class Distance, but then I am not sure how to get any class hierarchy in. In this latter case I guess we have to make everything a template both the vector types and the distance. However the structure for this is messy with both e.g. PearsonDistance? and Pearson and this would require a big but perhaps interesting redesign.
comment:9 Changed 15 years ago by
An alternative which is appealing at first sight is to do this with iterators instead. I think if we have an iterator that returns operator() and unity for weights for some vector types and a derived iterator that returns data() and weight() then all we would have to do with Distance is to have an "double operator()(Iterator&,Iterator&)"
comment:10 follow-up: 11 Changed 15 years ago by
Ok, so we are talking about a templatized pure virtual member function in Distance that is implmented in daughter classes.
I would like to factorize the issues a bit:
- Passing Iterators or Lookups (well any container)?
- Should we support both weighted and non-weighted?
On question 1. I think the alternative are very similar. Iterators solution is perhaps a bit more general. It is more STL-alike, which is appealing. If we are careful it will perhaps even be possible to calculate the distance between 2 std::map
. I like the Iterator
, but then we should aim at implementing iterators for all 1D containers, in other words, also for utility::vector
.
On question 2 an alternative is, of course, NOT to throw away weighted statistics, but we could ignore the non-weighted and just accept that non-weighted statistics is a special case of non-weighted statistics (all w=1
). So why are we struggling with having two implementations? The only reason I can remember is speed (runtime not development). I have concidered to suggest that we should trash MatrixLookup
and all other functionality that are pure non-weighted, but I'm not sure it is a good idea. After all the only design decision we have taken in this project is that "missing values should be handled with weights equal zero not with NaNs?", which sort of implies that we prefer having a non-weighted GSL-style version that is efficient (in runtime). In summary I think we should keep both weighted and non-weighted.
Ok, so what does this imply on this Distance
issue? Well, it means that if you pass DataLookup1d::Iterator
to Distance
then non-weighted statistics should be used. This is here traits enter the stage. I think it should be possible to have one general template for nor-weighted iterator just requiring ForwardIterator? and on more specialized template requring that operator* returns std::pair<double, double>
. The item I'm refering to is item 31 in Sutter's more exceptional C++. There is a section named Applying Traits (starting on page 194 in my version).
Another thing: I think we should try to follow the advice to make public function non-virtual, i.e., separating interface from implementation. The public non-virtual function could of course call a private/protected virtual function.
Yet another thing: If we implement iterators, I think we should change the interface of the Averagers as well. They have functions taking containers should be iterators.
comment:11 Changed 15 years ago by
Replying to peter:
Ok, so we are talking about a templatized pure virtual member function in Distance that is implmented in daughter classes.
I would like to factorize the issues a bit:
- Passing Iterators or Lookups (well any container)?
- Should we support both weighted and non-weighted?
On question 1. I think the alternative are very similar. Iterators solution is perhaps a bit more general. It is more STL-alike, which is appealing. If we are careful it will perhaps even be possible to calculate the distance between 2
std::map
. I like theIterator
, but then we should aim at implementing iterators for all 1D containers, in other words, also forutility::vector
.
I think 1 is nicer because then the classes in statistics (Distance,Averager,...) could be written with an interface that accepts iterators (to std, lookups, utility::vector,...) and hence statistics is more "independent of " classifier. In particular, as more vector-like objects appear ideally nothing has to be changed in statistics to make statistics use of the new vectors. The question then is where do the iterators belong?
On question 2 an alternative is, of course, NOT to throw away weighted statistics, but we could ignore the non-weighted and just accept that non-weighted statistics is a special case of non-weighted statistics (all
w=1
). So why are we struggling with having two implementations? The only reason I can remember is speed (runtime not development). I have concidered to suggest that we should trashMatrixLookup
and all other functionality that are pure non-weighted, but I'm not sure it is a good idea. After all the only design decision we have taken in this project is that "missing values should be handled with weights equal zero not with NaNs?", which sort of implies that we prefer having a non-weighted GSL-style version that is efficient (in runtime). In summary I think we should keep both weighted and non-weighted.
I agree (we made this decision earlier).
Ok, so what does this imply on this
Distance
issue? Well, it means that if you passDataLookup1d::Iterator
toDistance
then non-weighted statistics should be used. This is here traits enter the stage. I think it should be possible to have one general template for nor-weighted iterator just requiring ForwardIterator? and on more specialized template requring that operator* returnsstd::pair<double, double>
. The item I'm refering to is item 31 in Sutter's more exceptional C++. There is a section named Applying Traits (starting on page 194 in my version).
All I am sure of is, unless there are two Distance funtions with different names or signatures, traits is the best way to go: they can only both be "double operator()(Iterator&,Iterator&) without traits if we let if-then-else-ugliness due to poor design come into play and then we are back to slow, poor separation of weighted and unweighted.
Another thing: I think we should try to follow the advice to make public function non-virtual, i.e., separating interface from implementation. The public non-virtual function could of course call a private/protected virtual function.
OK, but we do not do this consistently in yat.
Yet another thing: If we implement iterators, I think we should change the interface of the Averagers as well. They have functions taking containers should be iterators.
Of course, see above.
comment:12 Changed 15 years ago by
Remember that we now have different functions with different signatures for weighted and unweighted in Distance and other places. So the iterator solution with different signatures and without traits: one iterator hierarchy for unweighted containers and one for weighted would provide a nice separation of containers (utility,classifier) from algorithms (statistics), and allow for statistics to be used with future containers without modifications to statistics, without making things more cluttered than they are presently. The nice thing with using traits is that it to the next level: weighted and unweighted in a common fashion. This would be in line with how we have a common hierarchy for weighted and unweighted lookups in classifier. To summarize: I think going to iterators and going to traits address (and hopefully solves) two different problems. Should both be solved? Priorities?
comment:13 Changed 15 years ago by
Milestone: | later → 0.4 |
---|---|
Priority: | minor → major |
This ticket has been splitted into ticket:244 and ticket:245 (and more)
comment:14 Changed 15 years ago by
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Distance has been replaced by template functions (see ticket:250)
ticket:237 is marked as related