Opened 9 years ago

Closed 8 years ago

#746 closed request (fixed)

non-const access to data in BamRead

Reported by: Peter Owned by: Peter
Priority: major Milestone: yat 0.11
Component: omic Version: trunk
Keywords: Cc:

Description

I want to modify my bam reads. Everything in core is modifiable via core(), but data in data is not accessible via the interface other than some const functions. Specifically I want to modify quality, but should think of modifying any part of the void* data. For the other variables (int l_aux, data_len, m_data), I see no need to allow modifications, ATM.

Change History (15)

comment:1 Changed 9 years ago by Peter

(In [2985]) refs #746. added function to modify one element in sequence

comment:2 Changed 9 years ago by Peter

(In [2987]) add test for modify element in sequence. refs #746

comment:3 Changed 8 years ago by Peter

Status: newassigned

comment:4 Changed 8 years ago by Peter

(In [3026]) add function to modify a base quality (BamRead?). refs #746

comment:5 Changed 8 years ago by Peter

The uint8_t* data is concatenated from the following strings:

  • name
  • cigar
  • sequence
  • quality
  • aux

where we currently (in trunk) can modify the cigar, a sequence element (not the size), and quality element (not size). There is no way to change the name, modify size of sequence (or quality), or modify the aux [sequence]. There are several ways to allow these things. One would be to give full access to the data pointer, which probably is a bad idea because I think the fact that these five strings are concatenated into one jumbo data array should be considered an implementation detail and hidden from public interface. We should remember that since data is concatenated into data*, modifying name etc implies copying the entire array into a new array (see the cigar function for example). We could have a big function taking name, cigar, etc which updates the data array accordingly, but I'm not sure how to interface that. Should one take vector<uint8_t> or should one take uint8_t* with an int telling length of array. Difficult to say without having any use cases from real world. I can see that I would like to modify aux, so I think we should have a function doing that similar to the cigar function. Then I could see that one would like to trim the sequence, but doing that for sure means trimming the quality as well - and changing the length of sequence implies the cigar is invalid. Just thinking loud here, but I think the conclusion is that I add an aux function and then wait closer to release to see if learn what would make sense. Feedback and suggestions are obviously most welcome here.

comment:6 Changed 8 years ago by Peter

Bam API has append and del functions, so probably better to wrap them than a single set function.

comment:7 Changed 8 years ago by Peter

(In [3028]) functions to access, append, and remove aux field. refs #746

comment:8 Changed 8 years ago by Peter

(In [3029]) refs #746. Add docs and throw in aux_del if tag is absent (rather than undefined behaviour)

comment:9 Changed 8 years ago by Peter

(In [3030]) refs #746. avoid false positive throws

comment:10 Changed 8 years ago by Peter

The aux interface is almost complete. There is not way to access a specific tag and modify in place, but one has to access, remove and re-append modified data. A bit cumbersome, but at least doable. The other thought I had implementing this was if one should have type specific get and append functions. For get I decided no because it is easy to use bam_aux2? functions to convert uint8_t* to corresponding type. For the append case, I'm still considering adding append functions for supported types. The for argument is that it is cumbersome to convert the data to uint8_t* and is easier to just add pass a double or whather type the data is. Also it is redundant to calculate the length of the data ant the argument char that describes which type data is. The against argument is that int8_t and char are typically same type on systems and can therefore not be overloaded while in SAM spec char is a printable character and `int8_t' is a one-byte integer. Also tt's a bit bloating to add a function for each type but that could shaped up by a templatized interface. It's easier to add later than to remove later so I'm probably going for "wait and see..." option.

comment:11 Changed 8 years ago by Peter

(In [3031]) new function BamRead::sequence(2). refs #746

comment:12 Changed 8 years ago by Peter

(In [3032]) refs #746. Simplify cigar function using logic similar to sequence(2). Avoid bug as pointers to data get invalidated when calling reserve. Forgot to set data_len and avoid compiler warning using char as index to array

comment:13 Changed 8 years ago by Peter

Resolution: fixed
Status: assignedclosed

comment:14 Changed 8 years ago by Peter

Resolution: fixed
Status: closedreopened

Since everything else can be modified, I think we should allow modifying the name as well.

comment:15 Changed 8 years ago by Peter

Resolution: fixed
Status: reopenedclosed

(In [3055]) closes #746. New function BamRead::name(1)

Note: See TracTickets for help on using tickets.