Opened 4 months ago

Last modified 4 months ago

#958 new specification

Read - process - write MT pattern

Reported by: Peter
Owned by: Jari Häkkinen
Priority: major
Milestone: yat 0.x+
Component: utility
Version: trunk
Keywords:
Cc:


A common pattern when working with large data such as VCF and BAM files is to read one entry, process it, and (possibly) write it. The processing step might alter the data element, e.g. by adding an annotation, or filter the element out.

A single-threaded implementation is easily written, e.g. with std::transform or std::copy_if, and if the input is sorted the output will be sorted as well.
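As a minimal sketch of the single-threaded case, assuming a hypothetical Record type standing in for a VCF or BAM entry (Record, filter_records and annotate_records are illustrative names, not yat functions):

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical record type standing in for a VCF or BAM entry.
struct Record {
  int position;
  std::string annotation;
};

// Filter step: keep only records at or after min_position.
// std::copy_if preserves the relative order of the kept elements,
// so sorted input gives sorted output.
std::vector<Record> filter_records(const std::vector<Record>& in,
                                   int min_position)
{
  std::vector<Record> out;
  std::copy_if(in.begin(), in.end(), std::back_inserter(out),
               [min_position](const Record& r) {
                 return r.position >= min_position;
               });
  return out;
}

// Alter step: return a copy of each record with its annotation set to tag.
// Positions are untouched, so the order is preserved here as well.
std::vector<Record> annotate_records(const std::vector<Record>& in,
                                     const std::string& tag)
{
  std::vector<Record> out;
  out.reserve(in.size());
  std::transform(in.begin(), in.end(), std::back_inserter(out),
                 [&tag](Record r) {
                   r.annotation = tag;
                   return r;
                 });
  return out;
}
```

In a real application the input and output iterators would presumably wrap file streams rather than vectors; the vectors here only keep the sketch self-contained.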

A way to multithread this is to split the task into 1) reading the data, 2) processing the data, and 3) writing the data, and either run the three steps in three different threads or, if the processing involves some heavy lifting, split the processing across multiple workers.

A common case is that the input is sorted and that one wants the output to be sorted as well. It would be nice to have some machinery for this, with components that are flexible. The components would be:
1) A Reader that e.g. reads from a file and sends the data to a queue for processing.
2) Workers that do the processing; they share an input queue that is fed by the Reader. Each worker has its own output queue into which it feeds its products.
3) A Writer that has access to the workers' n output queues. When every worker has produced a product, it compares them and writes the smallest one, then waits for the associated worker to produce a new element, compares all elements again, writes the smallest, and so on.

There must be a way for the Reader to tell the workers "Hey, I've reached end-of-file, end of input range", and likewise there must be a way for the workers to tell the Writer that they are done. If the queues are fed with pointers, a null pointer can probably signal the end, and it is probably better to use some sort of smart pointer than a raw pointer.
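The reader, workers and writer described above could be sketched roughly as follows. This is only one possible reading of the idea, not a proposed yat interface: BlockingQueue, Item and run_pipeline are hypothetical names, the elements are plain ints, the queues are unbounded, and the writer runs in the calling thread and collects into a vector instead of writing to a file. A null std::unique_ptr marks end of stream; the reader pushes one such marker per worker, and each worker pushes one to its own output queue. The sketch also assumes that process keeps the ordering of the elements it lets through and is safe to call from several threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal unbounded blocking queue (hypothetical helper, not a yat class).
template <typename T>
class BlockingQueue {
public:
  void push(T value)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(value));
    }
    ready_.notify_one();
  }
  // Blocks until an element is available.
  T pop()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !queue_.empty(); });
    T value = std::move(queue_.front());
    queue_.pop();
    return value;
  }
private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<T> queue_;
};

using Item = std::unique_ptr<int>;  // a null pointer marks end of stream

// process returns false to filter an element out; it may modify the element.
std::vector<int> run_pipeline(const std::vector<int>& input,
                              std::size_t n_workers,
                              const std::function<bool(int&)>& process)
{
  assert(n_workers > 0);
  BlockingQueue<Item> in;                         // shared by all workers
  std::vector<BlockingQueue<Item>> out(n_workers);  // one per worker

  std::thread reader([&] {
    for (int x : input)
      in.push(std::make_unique<int>(x));
    for (std::size_t i = 0; i < n_workers; ++i)
      in.push(Item());  // one end marker per worker
  });

  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < n_workers; ++i)
    workers.emplace_back([&, i] {
      while (Item item = in.pop())  // a null item ends the loop
        if (process(*item))
          out[i].push(std::move(item));
      out[i].push(Item());  // tell the writer this worker is done
    });

  // Writer: hold the current head of every worker queue, emit the smallest.
  std::vector<int> result;
  std::vector<Item> head(n_workers);
  for (std::size_t i = 0; i < n_workers; ++i)
    head[i] = out[i].pop();
  for (;;) {
    std::size_t best = n_workers;
    for (std::size_t i = 0; i < n_workers; ++i)
      if (head[i] && (best == n_workers || *head[i] < *head[best]))
        best = i;
    if (best == n_workers)
      break;  // every worker has signalled end of stream
    result.push_back(*head[best]);
    head[best] = out[best].pop();  // wait for that worker's next product
  }

  reader.join();
  for (std::thread& w : workers)
    w.join();
  return result;
}
```

The merge stays sorted because each worker takes elements from the shared queue in input order, so every individual output queue is itself sorted, and the writer never emits an element before it has seen the current head (or the end marker) of every queue.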

Change History (1)

comment:1 Changed 4 months ago by Peter

One crucial element is a mechanism to limit the size of the different buffers/queues. Both the Reader and the Workers should take a break when their output queue reaches a certain limit.
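One way to get this behaviour is a queue whose push blocks while the queue is full. The sketch below is an assumption about how such a queue might look, not an existing yat class: BoundedQueue, peak_size and transfer are hypothetical names, and peak_size exists only so that the limit can be observed.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded blocking queue; not an existing yat class.
template <typename T>
class BoundedQueue {
public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity), peak_(0)
  {
    assert(capacity > 0);
  }

  // Blocks while the queue is full, so a fast producer takes a break.
  void push(T value)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
    queue_.push(std::move(value));
    if (queue_.size() > peak_)
      peak_ = queue_.size();
    lock.unlock();
    not_empty_.notify_one();
  }

  // Blocks while the queue is empty.
  T pop()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !queue_.empty(); });
    T value = std::move(queue_.front());
    queue_.pop();
    lock.unlock();
    not_full_.notify_one();  // a blocked producer may continue
    return value;
  }

  // Diagnostic only: largest number of elements ever queued at once.
  std::size_t peak_size()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return peak_;
  }

private:
  const std::size_t capacity_;
  std::size_t peak_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::queue<T> queue_;
};

// Usage sketch: a producer thread pushes n integers through a small buffer
// while the calling thread consumes them. Returns the observed peak size.
std::size_t transfer(int n, std::size_t capacity, std::vector<int>& received)
{
  BoundedQueue<int> queue(capacity);
  std::thread producer([&queue, n] {
    for (int i = 0; i < n; ++i)
      queue.push(i);
  });
  for (int i = 0; i < n; ++i)
    received.push_back(queue.pop());
  producer.join();
  return queue.peak_size();
}
```

With a queue like this as the output queue of the Reader and of each worker, a producer that runs ahead simply waits in push until the consumer has caught up.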
