Opened 13 years ago

Closed 13 years ago

#236 closed defect (fixed)

Predictions using Ensemble with feature selection does not work properly

Reported by: Markus Ringnér Owned by: Peter
Priority: critical Milestone: yat 0.4
Component: classifier Version: trunk
Keywords: Cc:

Description (last modified by Markus Ringnér)

EnsembleBuilder? does not use the features selected in predict. The EnsembleBuilder? can use a SubsetGenerator? which keeps track of which features should be used for each subset and consequently for each classifier in the ensemble. However, this is not utilized in the predict member function and hence incorrect features are used.

Change History (11)

comment:1 Changed 13 years ago by Markus Ringnér

Status: newassigned
Version: trunk

comment:2 Changed 13 years ago by Markus Ringnér

Owner: changed from Markus Ringnér to Peter
Status: assignednew

I have now fixed the bug for matrix-based classifiers and the behaviour is as I expect. For kernel-based classifiers the issues are as follows (keep in mind that I have tested nothing with e.g. SVMs)

1) The features used for the respective classifiers are not used for kernel-based predictions. subset_->training_features should probably be used instead of subset_->training_index.

2) The way it now works for matrix-based classifiers is that the test-set can be completely independent of training+validation data and no mapping to any indices in a total original data set (training+validation+test) is not performed. I think this is the way things should be addressed since an EnsembleBuilder? not necessarily is linked to a specific external test.

comment:3 Changed 13 years ago by Peter

Status: newassigned

Ok, I will look into this.

Without looking into the code, I think predictions with ensemble of kernel-based method should work as follows.

1) Exactly like matrix-based, i.e., user passes a matrix with test data.

2) If cross-testing or similar approach is used it may be preferable to pass a kernel, since the kernel is already at hand and one thereby can avoid the expensive calculations of the kernel. Note, this is only useful when no feature selection is done.

I get back with more discussion when I have looked into the code.

comment:4 Changed 13 years ago by Peter

After looking into the code, I think this needs to be rewritten. I dont like the dynamic_cast. It is ugly.

A quick fix would be to replace EnsembleBuilder::predict(const DataLookup2D&... with EnsembleBuilder::predict(const MatrixLookup&..., EnsembleBuilder::predict(const MatrixLookupWeighted&..., and EnsembleBuilder::predict(const KernelLookup2D&.... The argument against that is that caller has to know what type of Lookup he holds, i.e., he cant pass a DataLookup2D&.

Another solution would be to move the diversity of code into a virtual function. What we need here is a function that creates a new Lookup based on a subset of features. I think this type of function has existed but was removed earlier this year (cant remember if we had a good reason for removing it). The function still exists in KernelLook? as

const KernelLookup* selected(const std::vector<size_t>& index) const;

Declaration in basclass should look something like

const DataLookup2D* selected(const std::vector<size_t>& index) const=0;

and then this lookup could be passed to SupervisedClassifier::predict. I think this is the way to go but would love to hear some "second opinion".

comment:5 Changed 13 years ago by Markus Ringnér

Description: modified (diff)

I agree with the option with dynamically allocated lookups with diversity enabled through a virtual function. One thing I was concerned about was the responsibility of these dynamically allocated objects. But looking at the selected function in KernelLookup?, I think it is fine to keep it that way: the caller of selected is responsible for the destruction of the returned object. So if I qualify as a "second opinion" let us go this way.

Left-overs: 1) Is selected a good name given all other undocumented terminology in classifier? The way it works is similar to a clone method but the result is a sub-matrix/kernel and not a clone. Something like sublookup, create_lookup, ... would be nicer, or?

2) We have copy constructors of Lookups taking arguments to only copy selections of row and column indices. Should the "selected" method be a set of methods that map onto these, so there are method corresponding to each of these sub copy constructors returning a dynamically allocated new lookup object, allowing also to select columns?

3) I have not looked at the details so perhaps it is obvious but why is the selected in KernelLookup? not implemented using the copy constructor taking a vector to select features?

comment:6 Changed 13 years ago by Markus Ringnér

When we agree on this this ticket should really be forked. One new ticket for Lookup functionality and this ticket should just say that EB::predict should be implemented in terms of this functionality once the new ticket is closed.

comment:7 in reply to:  5 ; Changed 13 years ago by Peter

Replying to markus:

I agree with the option with dynamically allocated lookups with diversity enabled through a virtual function. One thing I was concerned about was the responsibility of these dynamically allocated objects. But looking at the selected function in KernelLookup?, I think it is fine to keep it that way: the caller of selected is responsible for the destruction of the returned object. So if I qualify as a "second opinion" let us go this way.

Left-overs: 1) Is selected a good name given all other undocumented terminology in classifier? The way it works is similar to a clone method but the result is a sub-matrix/kernel and not a clone. Something like sublookup, create_lookup, ... would be nicer, or?

I agree selected is not a good name; sublookup is misleading though. It is not a sublookup in kernel case, but a kernel based on a subset of features. The name should reflect this. Something with features, perhaps? I cant come up with something good right now.

2) We have copy constructors of Lookups taking arguments to only copy selections of row and column indices. Should the "selected" method be a set of methods that map onto these, so there are method corresponding to each of these sub copy constructors returning a dynamically allocated new lookup object, allowing also to select columns?

For the moment I vote NO for this. Perhaps later. I don't see the need for it right now, and prefer keep the interface down.

3) I have not looked at the details so perhaps it is obvious but why is the selected in KernelLookup? not implemented using the copy constructor taking a vector to select features?

Because in a KernelLookup? is a lookup into a Kernel. A Kernel is a matrix in which element K_ij is scalar product between sample i and j, in other words K_ij = K(x_i, x_j). This means that selecting features is not a simple lookup (like in matrix case), but each element must be recalculated.

comment:8 Changed 13 years ago by Peter

Created a new ticket:236 that takes care of the DataLookup2D functionality.

comment:9 Changed 13 years ago by Peter

Correction: should be ticket:238

comment:10 in reply to:  7 Changed 13 years ago by Markus Ringnér

Replying to peter:

Replying to markus:

I agree with the option with dynamically allocated lookups with diversity enabled through a virtual function. One thing I was concerned about was the responsibility of these dynamically allocated objects. But looking at the selected function in KernelLookup?, I think it is fine to keep it that way: the caller of selected is responsible for the destruction of the returned object. So if I qualify as a "second opinion" let us go this way.

Left-overs: 1) Is selected a good name given all other undocumented terminology in classifier? The way it works is similar to a clone method but the result is a sub-matrix/kernel and not a clone. Something like sublookup, create_lookup, ... would be nicer, or?

I agree selected is not a good name; sublookup is misleading though. It is not a sublookup in kernel case, but a kernel based on a subset of features. The name should reflect this. Something with features, perhaps? I cant come up with something good right now.

I added my thoughts on this to #238

2) We have copy constructors of Lookups taking arguments to only copy selections of row and column indices. Should the "selected" method be a set of methods that map onto these, so there are method corresponding to each of these sub copy constructors returning a dynamically allocated new lookup object, allowing also to select columns?

For the moment I vote NO for this. Perhaps later. I don't see the need for it right now, and prefer keep the interface down.

I agree with this no. I did not think about feature selection being different from row selection in the Kernel case.

3) I have not looked at the details so perhaps it is obvious but why is the selected in KernelLookup? not implemented using the copy constructor taking a vector to select features?

Because in a KernelLookup? is a lookup into a Kernel. A Kernel is a matrix in which element K_ij is scalar product between sample i and j, in other words K_ij = K(x_i, x_j). This means that selecting features is not a simple lookup (like in matrix case), but each element must be recalculated.

OK, now I get it.

comment:11 Changed 13 years ago by Peter

Resolution: fixed
Status: assignedclosed

([868]) fixes ticket:236

Note: See TracTickets for help on using tickets.