Opened 13 years ago

Closed 13 years ago

#287 closed specification (fixed)

avoid using dynamic_cast in Subset and SupervisedClassifer

Reported by: Peter Owned by: Peter
Priority: critical Milestone: yat 0.4
Component: classifier Version: trunk
Keywords: Cc:

Description (last modified by Peter)

Daughter tickets: #268 #309 #310 #311 #312 #313 #314 #315 #316 #234 #317 #318 #333

There are two major problems in the EnsembleBuilder structure, and they both arise from that we have not succeeded making a common interface for all DataLookup2D inherited classes. This is most obvious in Subset, where we generate the subset lookup using three cases, one case for each of the Lookups. Another problem is that we try to make a common interface for SupervisedClassifier; but they are different. Significantly different. The biggest problem is that predict functions. At time being, SupervisedClassifer has a predict function taking DataLookup2D, which was a compromise between SVM who wants a KernelLookup and NCC who wants a MatrixLookup. Passing a MatrixLookup to a kernel-based method like SVM would make sense and should probably be supported (but for now it is not). Passing a KernelLookup to a data-based method such as NCC does not make sense.

To solve these problems I have sketched a possible solution. I'm happy about suggestions on improvements or critique. But before you simply say this is shit, take a look at the current shit.

If we agree on doing something like this, this ticket should be broken down to several smaller ones.

I start with some code showing how I suggest it to be:

ifstream is("data.txt");
MatrixLookup ml(is);
is.close();
is.open("labels.txt");
Target target(is);
is.close();
NCC<pearson_vector_distance_tag> ncc;
CrossValidationSampler sampler(target, target.size(), target.size());
EnsembleBuilder<NCC<pearson_vector_distance_tag>,MatrixLookup> 
  eb(ncc, ml, sampler);
eb.build();
...

The main idea is to templatize EnsembeBuilder on type of classifier and type of data, so the constructor will like

template<class SC, class Data>
EnsembleBuilder(const SC& prototyp, const Data& data, const Sampler&)

First thing to notice is that, I have removed data and target from NCC constructor. This is to avoid having data both expicitely passed to EB and also contained in NCC. This implies that data/target are removed from constructor interaface (and make_classifier) and added to train() interface. An alternative would be to not pass any data to EB, but get the data from NCC. However that that implies get functions for each data type (supported in that classifier).

So NCC needs two overloaded train functions

void train(const MatrixLookup&, const Target&);
void train(const MatrixLookupWeighted&, const Target&);

and similarily SVM have one train function

void train(const KernelLookup&, const Target&);

I removed the returning bool from train. That was old style and I think we should throw an exception if training is not succesful.

So let's look inside EnsembleBuilder. As mentioned above we are templatizing on Classifier type and data type. Everything that used to be DataLookup2D is now on T. Also we have a member Subset, which we need to change to be Subset<T>. I think this will solve the problems with dynamic_cast in Subset. Otherwise EnsembleBuilder works pretty much as before. Signaure of make_classifer is changed so we provide data and target through training rather than creation.

If we should keep SupervisedClassifier, it should be bias more towards matrix based versions. For example, predict functions should take MatrixLookup and weighted version separately. KernelLookup should not be supported here, but be an extension for SVM.

Also I think we should remove DataLookup2D. In principle, one never really wanna use that signature to pass around objects as DataLookuo2D&. Here are two questions. First, what signature is useful. It might be useful to have a common interface for MatrixLookup and MatrixLookupWeighted. The other thing is some functionality that is useful that now is hidden in DataLookup2D that is all the reference counting and related stuff. But I think it is bad to let !KernelLookup2D inherit from DataLookup2D only to employ some nice members in DataLookup2D. It probably makes more sense to to lift out that functionality and put it in a supporting class.

Any questions, ideas, opinions, suggestions?

Change History (14)

comment:1 Changed 13 years ago by Peter

Description: modified (diff)

comment:2 Changed 13 years ago by Peter

Description: modified (diff)

comment:3 in reply to:  description ; Changed 13 years ago by Markus Ringnér

Regarding the Lookup part.

1) I think making EnsembleBuilder? a template on Lookup is very good.

2) For SubsetGenerator? I have not thought about the consequences yet, but Peter is more familiar with it than me.

3) When 1-2 are implemented, should the base class go? I agree that if only MatrixLookup? and MatrixLookupWeighted? were around then they would perhaps benefit from having a common base class. Together with KernelLookup? it is not so clear and instead it is probably better if the common things could be separated out into another class. However, if there is a case where a single function could take either a MatrixLookup? or a MatrixLookupWeighted? and have a common implementation they could have a common base class. Is there? It seems not to be the case and it is better to have two functions (weighted unweighted) then one function with an if-else and two internal implementations.

Regarding SupervisedClassifer?.

1) I think it makes better sense to let EnsembleBuilder? handle the data. It is then also natural to move the data argument from the classifier constructor to its train method.

2) If we are planning to support matrix-based train and predict in SVM then SupervisedClassifier? could be a common interface for matrix-based classifiers and EnsembleBuilder? could utilize that interface withou making EnsembleBuilder? a template on classifier.

3) Even if we do not follow two and make EnsembleBuilder? also a template on classifier, it may still make sense to keep a SupervisedClassifier? base class to allow some other future ensemble class to contain a mixture of classifier types (all matrix-based).

comment:4 in reply to:  3 Changed 13 years ago by Peter

Replying to markus:

Regarding the Lookup part.

1) I think making EnsembleBuilder? a template on Lookup is very good.

2) For SubsetGenerator? I have not thought about the consequences yet, but Peter is more familiar with it than me.

My feeling is that SubsetGenerator? is the easy part.

3) When 1-2 are implemented, should the base class go? I agree that if only MatrixLookup? and MatrixLookupWeighted? were around then they would perhaps benefit from having a common base class. Together with KernelLookup? it is not so clear and instead it is probably better if the common things could be separated out into another class. However, if there is a case where a single function could take either a MatrixLookup? or a MatrixLookupWeighted? and have a common implementation they could have a common base class. Is there? It seems not to be the case and it is better to have two functions (weighted unweighted) then one function with an if-else and two internal implementations.

There are three areas for the base class. Three questions to solve. First, the are common intrinsic functionality such as reference counting. Could easily be encapsulated in a class. Second, it's got a virtual interface with training_data and so on. In principle we could just let that be as it is. Third, the DataLookup2D is used in DataLookup1D. DataLookup1D is not good as it is, because one would like to get the weighted version from MatrixLookupWeighted. Could we replace the 1D Lookups with iterators?

Regarding SupervisedClassifer?.

1) I think it makes better sense to let EnsembleBuilder? handle the data. It is then also >natural to move the data argument from the classifier constructor to its train method.

So what you are saying is that we should do SupervisedClassifier? a pure interface. No members. Then it is up to every daughter class whether they wanna store training data and target.

2) If we are planning to support matrix-based train and predict in SVM then SupervisedClassifier? could be a common interface for matrix-based classifiers and EnsembleBuilder? could utilize that interface withou making EnsembleBuilder? a template on classifier.

I think we should support matrix based predict for SVM, because it is convenient (and expected). But this is aimed for pure testing or even in a future we hope to support that you save you svm on a memory stick, generate some new data, and predict new data. It is not aimed for situations e.g. cross-validation. In many situations, especially when dealing with high-D data, the most expensive calculation is the calculating the kernel. Therefore, it is crucial to do this once in the beginning of all looping of cross-validation, cross-testing, permutation test.

I do not like the idea of having matrix based training. Currently the SVM talks with the KernelLookup? and only the KernelLookup?. Training on a matrix you still have to decide what kind of KernelFunction you want, what kind of Kernel (Kernel_SEV or Kernel_MEV) you want. If the gain is huge in doing so, it is probably possible to make it happen, but it was not my intention.

3) Even if we do not follow two and make EnsembleBuilder? also a template on classifier, it may still make sense to keep a SupervisedClassifier? base class to allow some other future ensemble class to contain a mixture of classifier types (all matrix-based).

Yes, I think we should keep it, and perhaps SVM should not inherit. We have to think about the signatures of predict functions. DataLookup2D is clearly not good. Keeping the concrete classes (MatrixLookup and MatrixLookupWeighted?) separated is tempting. But it implies that one gotta templatize EB on concrete data type, which means one can not mix weighted and unweighted data (in training and testing). I think the choice here, strongly depends on how we manage to solve the DataLookup1D/Iterator problem above.

comment:5 Changed 13 years ago by Markus Ringnér

The status of this ticket is that Peter has promised to split it up into a set of more specified smaller entities that we all can comment upon and implement.

comment:6 Changed 13 years ago by Peter

Description: modified (diff)

added some daughter tickets - more to come

comment:7 Changed 13 years ago by Peter

Description: modified (diff)

comment:8 Changed 13 years ago by Peter

Description: modified (diff)

added #333 in list of daughter tickets

comment:9 Changed 13 years ago by Markus Ringnér

As of [1160] there are no dynamic_casts in SupervisedClassifier. The only dynamic_casts left in the entire yat are now in SubsetGenerator.h.

comment:10 in reply to:  9 Changed 13 years ago by Markus Ringnér

Replying to markus:

The only dynamic_casts left in the entire yat are now in SubsetGenerator.h.

I made a mistake and only checked .h files. In relation to Lookups there are dynamic_casts still in SubsetGenerator.h and KernelLookup.cc.

comment:11 Changed 13 years ago by Peter

Removed some dynamic_cast in Kernel i [1163].

comment:12 Changed 13 years ago by Peter

In order to remove casts in SubsetGenerator and KernelLookup we need to replace function

const DataLookup2D& data(void) const;

in Kernel. Instead we would have two function one for MatrixLokup and one for MatrixLookupWeighted. Should we return const& or const pointers?

The thing is that when Kernel is weighted, there is no MatrixLookup available, so we need to return NULL or throw an exception. Opinions?

comment:13 Changed 13 years ago by Peter

Implemented in [1165]. I chose to return const& to be clear that Kernel owns the Lookup.

comment:14 Changed 13 years ago by Peter

Description: modified (diff)
Resolution: fixed
Status: newclosed

List of daughter tickets will not be extended.

Note: See TracTickets for help on using tickets.