Opened 7 years ago

A less memory hungry Aligner

Reported by: Owned by: Peter Peter major yat 0.x+ utility

Description

The utility::Aligner class is quite hungry wrt memory. Aligned two sequences of length N and M, respectively you need to create two matrices of size NxM. This becomes a problem quite quickly.

```void operator()(const Matrix& score, Matrix& x)
```

Here are just some ideas that crossed my mind. Are the two matrices really needed? The first Matrix in which score(i,j) describes the similarity between element i in first sequence and element j in second sequence. That could easily be replaced by two ranges and a functor that calculates the element when needed. Similar to how classifier::Kernel_MEV works (see ticket #381). A typical element is accessed three times during the alignment so the slowdown should be too bad as long as kernel function is not too complex.

The second Matrix is more complicated. It holds all partial alignment scores. But looking at the current implementation one notices we are iterating over the matrix row by row:

```for (r =1 to n)
for (c=1 to m)
```

and we only refer to x(r,c), x(r,c-1), x(r-1,c), and x(r-1,c-1), when we have reached row r, we won't need rows [0-r-2] anymore and there is no need to keep these in memory. We only need to keep to rows in memory since [rows+1,n] are not even computed yet.

There is a third matrix too, which holds information about the best path. For every positions in x, it holds a direction value telling whether we achieved x by a diagonal step in dot-matrix or vertical or horizontal step. If we x (and the scores) it's trivial infer this direction matrix since you know x(i,j) and you just need to check what you would get with the three different paths. The whole point with ticket is, however, that we wanna get rid of x so that idea is not super. Another way would be to reaplce the direction matrix with a cigar array similar to the one used in BAM where for example 23D:4H:34D means 24 diagonal steps, 4 horizontal, and 34 diagonal and all that information can be stored in a vector<int> of size 3. If we store a cigar array for every element in current row and previous row it should be possible to figure out what's needed.

Another complexity is that different users want different end element of the alignment, in other words, for global alignment e.g. one returns the x in the right lower corner whereas in local alignment one returns the max(x). That means that in SW one can actually not delete "old" rows of x and cigar matrix but need to keep the one that currently has the best value.

comment:1 Changed 10 months ago by Peter

Rather than replacing the first matrix with two ranges and a function, it would be more flexible to replace it with a functor that has an

```double operator()(size_t i, size_t j) const
```

i.e., it looks like the Matrix and that operator can be implemented such that e.g. one of the ranges being a BAM, taking into account the base quality of each base.

Note: See TracTickets for help on using tickets.