Opened 3 years ago

Closed 3 years ago

#958 closed request (fixed)

Read - process - write MT pattern

Reported by: Peter
Owned by: Peter
Priority: major
Milestone: yat 0.19
Component: utility
Version: trunk
Keywords:
Cc:

Description

A common pattern when working on large data such as VCF and BAM files is to read one entry, process it, and (possibly) write it. The process step might alter the data element, e.g. by adding some annotation, or filter out the element.

A single-threaded implementation can easily be done e.g. with std::transform or std::copy_if, and if the input is sorted the output will be sorted.
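As an illustration of the single-threaded case, here is a minimal sketch using stream iterators. The Record type, its stream operators, and the two function names are hypothetical stand-ins for a real VCF or BAM entry and are not part of yat.

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical record type standing in for e.g. one line of a VCF file.
struct Record {
  int position;
  std::string annotation;
};

std::istream& operator>>(std::istream& is, Record& r)
{
  return is >> r.position >> r.annotation;
}

std::ostream& operator<<(std::ostream& os, const Record& r)
{
  return os << r.position << ' ' << r.annotation;
}

// Read, filter, write: keep only records at even positions.
void filter_records(std::istream& in, std::ostream& out)
{
  std::copy_if(std::istream_iterator<Record>(in),
               std::istream_iterator<Record>(),
               std::ostream_iterator<Record>(out, "\n"),
               [](const Record& r) { return r.position % 2 == 0; });
}

// Read, alter, write: append a marker to every annotation.
void annotate_records(std::istream& in, std::ostream& out)
{
  std::transform(std::istream_iterator<Record>(in),
                 std::istream_iterator<Record>(),
                 std::ostream_iterator<Record>(out, "\n"),
                 [](Record r) { r.annotation += "_seen"; return r; });
}
```

Both algorithms visit the elements in input order, which is why sorted input gives sorted output here.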

A way to multithread is to split the task into 1) reading the data, 2) processing the data, and 3) writing the data, and either have three different threads or, if the processing involves some heavy lifting, split the processing across multiple workers.

A common case is that the input is sorted and that one wants the output to be sorted as well. It would be nice to have some machinery for this with flexible components. The components would be:

1) A Reader that e.g. reads from a file and sends the data to a queue for processing.
2) Workers that do the processing; they share an input queue that is fed by the Reader. Each worker has its own output queue into which it feeds its products.
3) A Writer that has access to the workers' n output queues. When all workers have produced a product, it compares them and writes the smallest one, then waits for the associated worker to produce a new element, compares all elements again, writes the smallest, and so on and so forth.
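The compare-and-take-smallest step of the third component can be sketched as follows. This is a single-threaded simplification, not the implementation that closed this ticket: the function name merge_worker_outputs is hypothetical, the elements are plain ints, and an empty queue is treated as a finished worker, whereas a real writer would block until that worker either produces an element or signals that it is done.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical helper: merges the per-worker output queues into one sorted
// sequence. Each queue must itself be sorted, which holds when the workers
// take elements from a sorted shared input in order.
std::vector<int> merge_worker_outputs(std::vector<std::queue<int>> queues)
{
  std::vector<int> written;
  while (true) {
    // Find the worker whose front element is the smallest.
    std::size_t best = queues.size();
    for (std::size_t i = 0; i < queues.size(); ++i) {
      if (queues[i].empty())
        continue; // simplification: empty means this worker is done
      if (best == queues.size() ||
          queues[i].front() < queues[best].front())
        best = i;
    }
    if (best == queues.size())
      break; // every worker is done
    written.push_back(queues[best].front()); // "write" the smallest element
    queues[best].pop(); // next round compares that worker's next element
  }
  return written;
}
```

The linear scan over the queue fronts is fine for a handful of workers; with many workers a heap of fronts would be the usual alternative.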

There must be a way for the Reader to tell the workers "Hey, I've reached end-of-file, end of input range", and likewise there must be a way for the workers to tell the Writer that they are done. If the queues are fed with pointers, a null pointer can probably signal the end, and it's probably better to use some sort of smart pointer than a raw pointer.
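A minimal sketch of the null-pointer signal, assuming std::unique_ptr as the smart pointer and C++14 for std::make_unique. PtrQueue and sum_with_workers are hypothetical names for illustration, not yat classes. Because the workers share one input queue, this sketch has the reader push one null pointer per worker so that each of them sees the signal.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal blocking queue of owning pointers.
template <typename T>
class PtrQueue {
public:
  void push(std::unique_ptr<T> p)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(p));
    }
    ready_.notify_one();
  }

  // Blocks until an element is available; a null result means end of input.
  std::unique_ptr<T> pop()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !queue_.empty(); });
    std::unique_ptr<T> p = std::move(queue_.front());
    queue_.pop();
    return p;
  }

private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<std::unique_ptr<T>> queue_;
};

// The reader (here the calling thread) pushes 1..n and then one null
// pointer per worker. Each worker sums what it receives and stops at the
// first null pointer. Returns the sum over all workers.
long sum_with_workers(int n, int n_workers)
{
  PtrQueue<int> input;
  std::vector<long> partial(n_workers, 0);
  std::vector<std::thread> workers;
  for (int w = 0; w < n_workers; ++w)
    workers.emplace_back([&input, &partial, w] {
      while (std::unique_ptr<int> p = input.pop())
        partial[w] += *p; // each worker writes only its own slot
    });
  for (int i = 1; i <= n; ++i)
    input.push(std::make_unique<int>(i));
  for (int w = 0; w < n_workers; ++w)
    input.push(nullptr); // end-of-input signal, one per worker
  for (std::thread& t : workers)
    t.join();
  long total = 0;
  for (long s : partial)
    total += s;
  return total;
}
```

The same convention would work in the other direction: a worker pushes a null pointer into its own output queue once it has seen the end-of-input signal.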

Change History (4)

comment:1 Changed 3 years ago by Peter

One crucial element is a mechanism to limit the size of the different buffers/queues. Both the Reader and the Workers should take a break if their output queue reaches a certain limit.
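One way to get this behaviour is a queue whose push blocks while the queue is full. The sketch below assumes a mutex and two condition variables. BoundedQueue, pump, and the high-water mark are hypothetical; the mark is there only to show that the limit holds.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical bounded blocking queue; not a yat class.
template <typename T>
class BoundedQueue {
public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity)
  {
    assert(capacity > 0);
  }

  // Blocks while the queue is full, so a fast producer takes a break.
  void push(T value)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
    queue_.push(std::move(value));
    if (queue_.size() > high_water_mark_)
      high_water_mark_ = queue_.size();
    not_empty_.notify_one();
  }

  // Blocks while the queue is empty.
  T pop()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !queue_.empty(); });
    T value = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return value;
  }

  // Largest size the queue has reached so far.
  std::size_t high_water_mark() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return high_water_mark_;
  }

private:
  const std::size_t capacity_;
  mutable std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::queue<T> queue_;
  std::size_t high_water_mark_ = 0;
};

// Pushes 0..n-1 from a producer thread through a queue of the given
// capacity and returns the consumed sum and the observed high-water mark.
std::pair<long, std::size_t> pump(int n, std::size_t capacity)
{
  BoundedQueue<int> queue(capacity);
  std::thread producer([&queue, n] {
    for (int i = 0; i < n; ++i)
      queue.push(i);
  });
  long sum = 0;
  for (int i = 0; i < n; ++i)
    sum += queue.pop();
  producer.join();
  return {sum, queue.high_water_mark()};
}
```

With such a queue on every edge of the pipeline, a slow writer eventually stalls the workers, and stalled workers eventually stall the reader, so memory use stays bounded.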

comment:2 Changed 3 years ago by Peter

Milestone: yat 0.x+ → yat 0.19
Type: specification → request

comment:3 Changed 3 years ago by Peter

Owner: changed from Jari Häkkinen to Peter
Status: new → accepted

comment:4 Changed 3 years ago by Peter

Resolution: fixed
Status: accepted → closed

In 4044:

new function multiprocess; closes #958
