Opened 14 years ago

Closed 13 years ago

#248 closed enhancement (wontfix)

avoid using GSL sort algo in vector::sort

Reported by: Peter Owned by: Jari Häkkinen
Priority: major Milestone: yat 0.4
Component: utility Version: trunk
Keywords: Cc:

Description

There is a bug in underlying GSL sort (I know GSL people call it a feature), which we use in utility::vector::sort

Either we should use std::sort or simply remove the function and let the user do the call to

std::sort(vec.begin(), vec.end())

Change History (6)

comment:1 Changed 14 years ago by Peter

gsl replaced with std in [889]

comment:2 Changed 14 years ago by Markus Ringnér

There are also some functions (sort_index, sort_smallest_index, sort_largest_index) based on gsl functions. Probably they have the same bug (never returning for a vector with nan in it), but I have not tested. Nevertheless, we should perhaps check if there are std functions that provide the same functionality (without making a slow std version?).

comment:3 Changed 14 years ago by Peter

Can you (even in theory) do it much faster than that?

template<typename T> 
sort_index(T1 first, T1 last, vector<size_t>::iterator result){
  T iter(first);
  vector<pair<T, size_t> > pair_vec;
  pair_vec.reserve(distance(first, last));
  for ( ; iter!=last; ++first)
    p.push_back(pair<T, size_t>(iter, distance(first, iter)));
  sort(pair_vec.begin(), pair_vec.end());
  for (vector<pair<T, size_t> >::const_iterator i(pair_vec.begin());
       i!=pair_vec.end(); ++i, ++result)
    *result = i->second;  
}

comment:4 Changed 14 years ago by Markus Ringnér

Yes it can be much faster. There are at least two things that I would expect being different in a C numerical library such as gsl.

1) Using a dynamically allocated vector inside the algorithm introduces overhead. pair_vec is filled twice and objects to fill it are constructed twice: first when the call to reserve is made and secondly in the loop over push_back calls (see A).

2) Also once the sort has been done in linear time there is another loop in linear time to copy the result into 'result'. With a different design (no vector of pairs) in which the sorted indices are stored in contiguous memory the last copy could have been done in constant time with memcopy (see B).

So yes, I think we are running a risk of replacing libraries such as gsl with our slower std-based algorithms.

Boost has tried to address both of these issues:

A) Boost.array. There have been many versions of classes that are safer than ordinary C arrays but introduce no overhead. Boost.array seems to be the consensus version, incidentally written by Josuttis (http://www.boost.org/doc/html/array.html).

B) Boost.type_traits. One of the traits this library provides is has_trivial_assign<T>. For types for which has_trivial_assign is true_type, a call to assignment operators can safely be replaced by a call to memcopy. As an example this library provides a faster std::copy algorithm. (http://www.boost.org/doc/html/boost_typetraits.html).

A final comment so things end up in this trac:

I addressed some of these issues in an e-mail about std::vector and ordinary arrays in the context of calling C library functions. The message was: we should keep in mind that assignment , copy and more are slow for std::vector (linear time when it could have been done in constant time: in cases where static size arrays are fine and dynamically allocated vectors are not needed). The only way to get constant time with std::vector is swap, but it is not applicable to many of the things we want to do.

comment:5 in reply to:  4 Changed 14 years ago by Markus Ringnér

1) Using a dynamically allocated vector inside the algorithm introduces overhead. pair_vec is filled twice and objects to fill it are constructed twice: first when the call to reserve is made and secondly in the loop over push_back calls (see A).

I confused reserve with resize. A call to reserve does not construct objects. Nevertheless, dynamic allocation introduces overhead that should not be in a (low-level) sort function.

2) Also once the sort has been done in linear time there is another loop in linear time to copy the result into 'result'. With a different design (no vector of pairs) in which the sorted indices are stored in contiguous memory the last copy could have been done in constant time with memcopy (see B).

The sort is of course O(N*log(N)) and not linear, but the point is that the copy could be made in constant time.

comment:6 Changed 13 years ago by Jari Häkkinen

Resolution: wontfix
Status: newclosed

We have decided to keep GSL sort for indexes (vector::sort_index, ...). Using theses sorts may cause the program to never exit.

Note: See TracTickets for help on using tickets.