Opened 16 years ago

Closed 16 years ago

#250 closed enhancement (fixed)

The new vector_distance structure should replace the old Distance structure.

Reported by: Markus Ringnér Owned by: Markus Ringnér
Priority: major Milestone: yat 0.4
Component: classifier Version:
Keywords: Cc:

Description

NCC, IGP and perhaps other things use the old Distance classes. They should use vector_distance instead.

Once yat does not use Distance internally these classes shoud be removed.

Change History (16)

comment:1 Changed 16 years ago by Markus Ringnér

Status: newassigned

comment:2 Changed 16 years ago by Markus Ringnér

[898] contains a sketch for a solving this.

We want the distance calculation functions to be completely general with respect to both container types and distance measures. This is what vector_distance provides. On the other hand classes such as NCC does not need this. NCC only needs the distance measure flexibility but applies the distance measure to well-defined containers such as DataLookupWeighted1D. To summarize, NCC does not need complete general access to distance measures between container types.

So vector_distance_ptr.h provides typedefs to vector_distance functions templatized only on the distance measure. If the restriction to one type of container is too restrictive, one could imagine a struct templatized on the distance measure provides pointers to a set of vector_distance functions. Any comments on this solution? However keep in mind that alternative solutions must make sure to initialize and use the correct template specializations.

comment:3 Changed 16 years ago by Peter

Here is a simple example how we could keep the old structure. I'm not checking in this to the repo because may be a step backwards. I have to look more detailed into Markus's solution before I know what I think.

#ifndef _theplu_yat_statistics_distance_
#define _theplu_yat_statistics_distance_

namespace theplu{
namespace yat{

	namespace utility {
		class vector;
	}
	namespace classifier {
		class DataLookup1D;
	}

namespace statistics{

	///
	/// @brief Interface class for calculating distances between arrays.
	///
	class Distance
	{
	public:

		///
		/// @brief The default constructor
		///
		Distance(void);

		///
		/// @brief The destructor
		///
		virtual ~Distance(void);

		///
		/// @return distance
		///
		template <class Iter1, class Iter2>
		double operator()(Iter1, first1, Iter1 last1, Iter2 first2) const;
		{
			reset();
			for ( ; first!=last1 ; ++first1, ++first2 )
				add(*first1, *first2);
			return result();
		}

	protected:
		mutable Averager a_;

	private:
		virtual reset(void) const { a_.reset(); }
		virtual add(double x, double y) const { a_.add(x,y); }
		virtual result(void) const=0;

	};

} // statistics namespace
} // yat namespace
} // theplu namespace 

#endif

comment:4 Changed 16 years ago by Markus Ringnér

I will try to outline my thoughts on distances and iterators.

Distances between two vectors should be fast and I think the fastest way is to try to get them done mimicking the standard library algorithms. vector_distance is attempt at this. So that things like

  double d=statistics::vector_distance(a.begin(),
      a.end(),b.begin(),statistics::euclidean_vector_distance_tag())

is now supported. The names were never good (end even worse now if we make them work when a and b are lists etc.). Surely, the structure needs some polishing but I find it nice if we could get algorithms using iterators to work like in std.

The major problems with this structure comes when passing things to another class such as NCC. Alternatives, I see:

1) NCC is templatized on distance.

2) It would have been nice if NCC could have kept a reference to a vector_distance_tag. However, a method of distinguishing reference types with certainty for partial specialization remains unaddressed in C++ according to Boost programmers. So I guess we will not make this happen.

3) Getting an inherited structure similar to the old one work. It think it should utilize vector_distance, but I do not really like it because it would mean keeping two parallel structures. The simple approach:

class MyDistance
{
    public:
      template <class Iter1, class Iter2>
      double operator()(Iter1, first1, Iter1 last1, Iter2 first2) const;
      {
	return statistics::vector_distance(first1,last1,first2,
                              my_vector_distance_tag());   
      }
};

does not work because since operator() would have to be virtual, we would disobeying "no virtual template functions". But maybe there is an inherited structure that can make use of vector_distance? Either way an inherited structure means slower distance calculations.

4) The last approach is some sort of function adaptor. I made a very simple one with a pointer to the specialization of the template function (pointers to functions of course requires all template arguments to be of a specified type, otherwise there is no address).

I like 4) as std uses a lot of adaptors to make use of their algorithms, so it would be nice if we could design a really nice adaptor for this.

comment:5 Changed 16 years ago by Peter

1) I don't like it.

3) I guess my approach goes under this item. I like it because it is simple and very easy to extend. I have not compiled my code though.

4) Could be the way to go, but I'm not convinced yet. From your comments it sounds like 4) is much faster than 3). Why is that?

comment:6 in reply to:  5 Changed 16 years ago by Markus Ringnér

4) Could be the way to go, but I'm not convinced yet. From your comments it sounds like 4) is much faster than 3). Why is that?

Because decisions are made at compile-time, not run-time.

comment:7 in reply to:  5 Changed 16 years ago by Markus Ringnér

Replying to peter:

1) I don't like it.

I have my concerns too.

3) I guess my approach goes under this item. I like it because it is simple and very easy to extend.

The other structure (4) is also very easy to extend (=adding new distance measures requires the minimal, two functions, plus a struct with a typedef). However to be able to pass things to classes etc. a nice set of adaptors have to exist (these are implemented once and for all: no changes are required when new distance measures are added). I will have to read some more on adaptors to understand how to do it nicely (there are some nice concepts in Boost that perhaps could be mimicked if understood).

I have not compiled my code though.

I guess there will be some things to think about when it comes to handling both weighted and unweighted?

comment:8 Changed 16 years ago by Peter

Ok, if we can make the adapters good looking (I don't like function pointers), let's go for your solution.

I have one concern though (independent of approach). What if Iter1 is non-weighted and Iter2 is weighted? Is that allowed? Will it compile? Will it work as expected? How do we want it to work?

comment:9 Changed 16 years ago by Peter

Owner: changed from Markus Ringnér to Peter
Status: assignednew

Yes, but my question was whether we should keep non-weighted euclidean as himself once defined it or if we could have a normalizing factor n. I think we go for not modifying the definition of non-weighted Euclidean distance - else speak up.

comment:10 Changed 16 years ago by Peter

Owner: changed from Peter to Markus Ringnér

sorry I thought I was in ticket:253

comment:11 in reply to:  8 ; Changed 16 years ago by Markus Ringnér

Replying to peter:

Ok, if we can make the adapters good looking (I don't like function pointers), let's go for your solution.

It is widely used (through typedefs) but we could wrap them in a struct or class to hide nasty details.

I have one concern though (independent of approach). What if Iter1 is non-weighted and Iter2 is weighted? Is that allowed? Will it compile? Will it work as expected? How do we want it to work?

I thought about it when I made the example but did not find time to go through the consequences. It is dangerous to only check the trait of one of the iterators as is done now. But I did not want to restrict to one Iter template parameter: now we can use 2 different containers as long as both are weighted or both are unweighted.

I think if one of them is weighted and one of them is not it goes to a well-defined function but the specialization of it will not compile, as for example the add in the Averagers will call functions in Iter that do not exist (maybe some combinations exist now when IteratorWeighted? operator* returns a double just as Iterator, but no more when this changes).

comment:12 Changed 16 years ago by Markus Ringnér

In [901] and [902] I added a K-nearest neighbor classifier templatized on distance. As a test to see how this approach would work. It was very easy to produce the code and I like the functionality. What are the disadvantages?

comment:13 in reply to:  11 Changed 16 years ago by Markus Ringnér

I have one concern though (independent of approach). What if Iter1 is non-weighted and Iter2 is weighted? Is that allowed? Will it compile? Will it work as expected? How do we want it to work?

I thought about it when I made the example but did not find time to go through the consequences. It is dangerous to only check the trait of one of the iterators as is done now. But I did not want to restrict to one Iter template parameter: now we can use 2 different containers as long as both are weighted or both are unweighted.

There is a nice discussion about this in the motivation for the Boost Concept Check Library:

http://www.boost.org/libs/concept_check/concept_check.htm

I think we should go for specifying requirements in the documentation and if we want something better then use Boost (because we do not want to make an implementation of this checking ourselves).

comment:14 Changed 16 years ago by Markus Ringnér

We have decided to templatize classes on Distance. In [925] this has been done for NCC and IGP.

Now, within yat, the old Distance classes are only used in distance_test.cc. Should I remove them from yat?

comment:15 Changed 16 years ago by Markus Ringnér

Status: newassigned

comment:16 Changed 16 years ago by Markus Ringnér

Resolution: fixed
Status: assignedclosed

The old Distance structure is removed in [931].

Note: See TracTickets for help on using tickets.