Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#533 closed task (fixed)

Add secondary analysis section to Reggie

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: critical
Milestone: Reggie v2.16
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

This is the master ticket for adding secondary analysis registration functionality to Reggie. The secondary analysis consists of the steps from sequencing until expression values have been generated, including demux and alignment against a reference genome.

Note! Primary analysis is the base calling performed by the Illumina software during the sequencing.

The pipeline will be something like this. See the other tickets (to be created) for more information about each step:

  1. (#545) Register sequencing as ended. Part of the "Library preparation wizards" section and done by someone in the lab.
  2. (#546) Confirm sequencing as completed. First wizard in the "Secondary analysis wizards" section. Used to decide if the sequenced data is ok or not. If ok, continue with demuxing, otherwise flag pools for re-sequencing.
  3. (#547) Start demux and merge. This wizard starts the demux and merge operations.
  4. (#548) Register demux and merge as ended. At the end we have one "MergedSequences" item for each "Library" from the flow cells that were sequenced. The number of reads for each library must be recorded and is used to determine if the library needs to be re-sequenced or not. FASTQ files for each library are stored on the server.
  5. (#593) Start filtering and alignment. Bowtie and TopHat are used to first filter and then align against a pre-defined set of transcripts.
  6. (#595) Register filtering and alignment as ended. At the end we have one "AlignedSequences" item for each "Library" from the flow cells that were sequenced. BAM files for each library are stored on the server.

The remaining issues are postponed to a later release.

  1. Start feature extraction. Cufflinks is used to calculate expression values.
  2. Register feature extraction. At the end we have one "RawBioAssay" item for each "Library" from the flow cells that were sequenced. FPKM files are uploaded to BASE and imported into the database.

Attachments (2)

secondary-analysis-overview-v1.pdf (72.6 KB) - added by Nicklas Nordborg 6 years ago.
Overview PDF v1
secondary-analysis-overview-v2.pdf (71.5 KB) - added by Nicklas Nordborg 6 years ago.
Second version of the overview


Change History (47)

comment:1 Changed 6 years ago by Nicklas Nordborg

Milestone: Reggie v2.x → Reggie v2.15

comment:2 Changed 6 years ago by Nicklas Nordborg

Description: modified (diff)
Summary: Add primary analysis section to Reggie → Add secondary analysis section to Reggie

comment:3 Changed 6 years ago by Nicklas Nordborg

Description: modified (diff)

comment:4 Changed 6 years ago by Nicklas Nordborg

Description: modified (diff)

comment:5 Changed 6 years ago by Nicklas Nordborg

Description: modified (diff)

comment:6 Changed 6 years ago by Nicklas Nordborg

Description: modified (diff)

Changed 6 years ago by Nicklas Nordborg

Overview PDF v1

comment:7 Changed 6 years ago by Nicklas Nordborg

(In [2196]) References #533: Add secondary analysis section to Reggie

Re-arranged the index page to make room for the "Secondary analysis wizards" section. The page is now divided in three columns with the new section as the middle column.

Added link to "Register sequencing ended" wizard as the last library preparation wizard. Created two new annotations so that the counter can show a correct value.

Changed 6 years ago by Nicklas Nordborg

Second version of the overview

comment:8 Changed 6 years ago by Nicklas Nordborg

(In [2225]) References #533: Register sequencing as ended

New version of the 'sequencing ended' wizard. Added an option to specify whether the first base report failed. If so, the 'sequencing startup' wizard is enabled again for the flow cell. Most other changes follow from the fact that there is only a single parent FlowCell for each SequencingRun.

comment:9 Changed 6 years ago by Nicklas Nordborg

Milestone: Reggie v2.15 → Reggie v2.16

comment:10 Changed 5 years ago by Nicklas Nordborg

(In [2292]) References #533: Add secondary analysis section to Reggie

Added SSHJ lib.

comment:11 Changed 5 years ago by Nicklas Nordborg

(In [2293]) References #533: Add secondary analysis section to Reggie

Adding some classes for keeping track of SSH/Open Grid Scheduler hosts. With a properly set up configuration file (reggie-ogs-hosts.xml) it seems to be possible to connect and execute a simple command on the remote server. Not much error handling though.

comment:12 Changed 5 years ago by Nicklas Nordborg

Status: new → assigned

comment:13 Changed 5 years ago by Nicklas Nordborg

(In [2295]) References #533: Add secondary analysis section to Reggie

Added a page listing connected OGS clusters.

Added some more error handling when sending commands to servers via SSH.

comment:14 Changed 5 years ago by Nicklas Nordborg

(In [2296]) References #533: Add secondary analysis section to Reggie

Added a service extension for OpenGridService so that we can control it from the web interface (eg. reload settings after a configuration change).

Added OpenGridSignalHandlerFactory for taking care of ABORT and STATUS updates for jobs on a cluster. The demux servlet is used as a test bed but currently just faking communication with the cluster.

comment:15 Changed 5 years ago by Nicklas Nordborg

(In [2297]) References #533: Add secondary analysis section to Reggie

Added a "console"-like page for manually executing commands via ssh on a remote server.

comment:16 Changed 5 years ago by Nicklas Nordborg

(In [2298]) References #533: Add secondary analysis section to Reggie

Reorganize a few files.

comment:17 Changed 5 years ago by Nicklas Nordborg

(In [2299]) References #533: Add secondary analysis section to Reggie

Test for adding a "real" fake job to the cluster queue. The job id is stored as the Job.externalId in BASE and in the 'signal handler'.

The OpenGridSignalHandler class has been updated with "real" support for the "ABORT" signal. Seems to work, but error handling is non-existent.

Added logging to make debugging easier.

comment:18 Changed 5 years ago by Nicklas Nordborg

(In [2300]) References #533: Add secondary analysis section to Reggie

First attempts to keep track of job status. Added OpenGridCluster.submitJob() to start jobs.

Added OpenGridService.jobStatusTimer and OpenGridCluster.updateJobStatus(), which are used for regularly checking with the cluster. Sending 'qstat -xml' should get a list of queued and running jobs. Internally we keep a list of job ids that the BASE server has been interested in, and if some of those are not listed by 'qstat' it is probably because the job has ended. We need to get information about that using 'qacct' (to be implemented) and possibly with other commands. For debugging purposes we simply set those to DONE since they would otherwise be left in EXECUTING state forever.
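
The check for jobs that have disappeared from 'qstat' amounts to a set difference. A minimal sketch, assuming job ids are plain strings; the class and method names are hypothetical and not taken from the Reggie code:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Reggie: finds tracked jobs that
// the latest 'qstat' listing no longer reports.
class FinishedJobFinder {

    /**
     * Return the tracked job ids that are absent from the qstat listing.
     * Those jobs have probably ended and need a follow-up check.
     */
    static Set<String> findMissing(Set<String> tracked, Set<String> listedByQstat) {
        Set<String> missing = new TreeSet<>(tracked); // copy, so the input is not modified
        missing.removeAll(listedByQstat);
        return missing;
    }
}
```

Ids that appear in 'qstat' but are not tracked (jobs submitted by someone else) are ignored by this approach.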

comment:19 Changed 5 years ago by Nicklas Nordborg

(In [2301]) References #533: Add secondary analysis section to Reggie

Avoid duplicating status checks with the cluster (when 'qacct' has been implemented) if the BASE server is submitting status update requests with higher frequency than the actual status updating is happening on our end (eg. if getJobStatus() is called by another thread while updateJobStatus() is executed by the timer thread).
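
One possible way to avoid duplicate checks is to skip a refresh when another thread is already doing one, or when the last refresh is recent enough. This is only a sketch: the class name, the minimum-interval parameter and the call counter are assumptions for illustration and do not describe the actual implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not the Reggie implementation.
class StatusRefresher {
    private final ReentrantLock lock = new ReentrantLock();
    private final long minIntervalMillis;
    private long lastRefresh;        // guarded by lock
    private boolean refreshedOnce;   // guarded by lock

    // Counts how often the (simulated) cluster query was made.
    final AtomicInteger clusterCalls = new AtomicInteger();

    StatusRefresher(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns true if this call actually queried the cluster. */
    boolean refreshIfNeeded(long nowMillis) {
        if (!lock.tryLock()) {
            return false; // another thread is refreshing right now
        }
        try {
            if (refreshedOnce && nowMillis - lastRefresh < minIntervalMillis) {
                return false; // cached status is still fresh enough
            }
            clusterCalls.incrementAndGet(); // stand-in for the real cluster query
            lastRefresh = nowMillis;
            refreshedOnce = true;
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

Callers that get false simply use the status stored by the previous refresh.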

comment:20 Changed 5 years ago by Nicklas Nordborg

(In [2302]) References #533: Add secondary analysis section to Reggie

Get job results via 'qacct' command.

comment:21 Changed 5 years ago by Nicklas Nordborg

(In [2305]) References #533: Add secondary analysis section to Reggie

Generating the script for 'qsub' on the BASE server side; it is transmitted to the cluster via scp before it is added to the queue. Need to think a bit about naming conventions and where to place stdout and stderr files.
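
A rough sketch of what server-side script generation could look like. The builder class is hypothetical, and the use of an embedded `#$ -N` directive to name the job is an assumption; the changeset does not say how the real script is laid out.

```java
import java.util.List;

// Hypothetical sketch, not the Reggie implementation.
class JobScriptBuilder {

    /**
     * Build the text of a shell script suitable for 'qsub'.
     * Lines starting with "#$" are read by Grid Engine as submit options.
     */
    static String build(String jobName, List<String> commands) {
        StringBuilder sb = new StringBuilder();
        sb.append("#!/bin/sh\n");
        sb.append("#$ -N ").append(jobName).append('\n'); // job name shown in the queue
        for (String cmd : commands) {
            sb.append(cmd).append('\n'); // Unix line endings, since the script runs on the cluster
        }
        return sb.toString();
    }
}
```

The resulting string would then be copied to the cluster before the job is submitted.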

comment:22 Changed 5 years ago by Nicklas Nordborg

(In [2309]) References #533: Add secondary analysis section to Reggie

Using the new ability to access BASE from a service extension (http://base.thep.lu.se/ticket/1799) to update job status.

Improved error handling a bit. If a job fails the first part of stderr is used as an error message to the user.

comment:23 Changed 5 years ago by Nicklas Nordborg

(In [2310]) References #533: Add secondary analysis section to Reggie

Appending node name to the cluster name when the job is running. 't' status is appearing a short time between 'qw' and 'r' and we handle it the same as 'qw'.
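
The state handling described above can be sketched as a small mapping. The enum and class names are hypothetical; only the 'qw', 't' and 'r' codes come from this comment, and everything else is mapped to UNKNOWN here as an assumption.

```java
// Hypothetical sketch, not the Reggie implementation.
class QstatStateMapper {

    enum JobState { WAITING, EXECUTING, UNKNOWN }

    /** Map a state code reported by 'qstat' to a simplified job state. */
    static JobState map(String qstatState) {
        if (qstatState == null) {
            return JobState.UNKNOWN;
        }
        switch (qstatState) {
            case "qw": // queued and waiting
            case "t":  // short-lived state between 'qw' and 'r', handled like 'qw'
                return JobState.WAITING;
            case "r":  // running
                return JobState.EXECUTING;
            default:
                return JobState.UNKNOWN;
        }
    }
}
```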

comment:24 Changed 5 years ago by Nicklas Nordborg

(In [2323]) References #533: Add secondary analysis section to Reggie

Adding possibility to set configuration values for programs used on the cluster.

comment:25 Changed 5 years ago by Nicklas Nordborg

(In [2324]) References #533: Add secondary analysis section to Reggie

Creating a separate working folder for each job. The configuration file needs to specify a base directory in the attribute 'job-folder'. Each job gets a subdirectory in this folder: <job-folder>/job-name. The auto-generated job script is always named 'job.sh' and the output streams 'stdout' and 'stderr'.
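
The folder layout can be illustrated with a small helper that derives the remote paths from the 'job-folder' value and the job name. The class is hypothetical, and the example folder used below is made up. Paths are built as plain strings with '/' because they refer to files on the cluster, not on the BASE server.

```java
// Hypothetical sketch, not the Reggie implementation.
class JobFolderLayout {
    final String jobDir;

    JobFolderLayout(String jobFolder, String jobName) {
        // Tolerate a trailing slash in the configured base directory.
        String base = jobFolder.endsWith("/")
            ? jobFolder.substring(0, jobFolder.length() - 1)
            : jobFolder;
        this.jobDir = base + "/" + jobName;
    }

    String script() { return jobDir + "/job.sh"; }
    String stdout() { return jobDir + "/stdout"; }
    String stderr() { return jobDir + "/stderr"; }
}
```

For example, `new JobFolderLayout("/scratch/jobs", "demux-1").script()` gives `/scratch/jobs/demux-1/job.sh`.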

comment:26 Changed 5 years ago by Nicklas Nordborg

(In [2325]) References #533: Add secondary analysis section to Reggie

Waiting 10 seconds after parsing qstat information before calling qacct since it has happened that recently finished jobs are not seen by qacct (eg. it returns error: job id <id> not found). I hope 10 seconds is enough.

comment:27 Changed 5 years ago by Nicklas Nordborg

(In [2336]) References #533: Add secondary analysis section to Reggie

Last checkin [2325] didn't solve the problem of missing jobs. Seems like there is a delay before information about finished jobs is written to the log file (https://arc.liv.ac.uk/pipermail/gridengine-users/2005-August/006253.html). The default value for the flush_time seems to be 15 seconds.

Instead of changing the Thread.sleep time, the code now accepts a first "job not found" error, but if it happens a second time (which it shouldn't since we only check once per minute) the job is set to error status.
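
The "tolerate the first miss" rule can be sketched with a per-job counter. The class and method names are hypothetical and the real code may track this differently:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the Reggie implementation.
class NotFoundTracker {
    private final Map<String, Integer> misses = new HashMap<>();

    /**
     * Record a "job not found" reply for the given job.
     * Returns true when the job should be set to error status,
     * which is on the second miss.
     */
    boolean recordNotFound(String jobId) {
        int count = misses.merge(jobId, 1, Integer::sum);
        return count >= 2;
    }

    /** Forget a job once a result has been returned for it. */
    void clear(String jobId) {
        misses.remove(jobId);
    }
}
```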

comment:28 Changed 5 years ago by Nicklas Nordborg

Description: modified (diff)

comment:29 Changed 5 years ago by Nicklas Nordborg

(In [2376]) References #533: Add secondary analysis section to Reggie

Use 'x' instead of 'dx' as suffix for DemuxedSequences.

comment:30 Changed 5 years ago by Nicklas Nordborg

(In [2390]) References #533: Add secondary analysis section to Reggie

Use the word 'confirm' instead of 'register'

comment:31 Changed 5 years ago by Nicklas Nordborg

(In [2392]) References #533: Add secondary analysis section to Reggie

Added counter for unconfirmed alignment results.

comment:32 Changed 5 years ago by Nicklas Nordborg

(In [2400]) References #533: Add secondary analysis section to Reggie

Change '_end' in jsp names to '_confirm'.

comment:33 Changed 5 years ago by Nicklas Nordborg

(In [2420]) References #533, #547, #548, #593, #595. Renamed FilteredSequences subtype to MaskedSequences and the related software and protocol type. Renamed annotations NumReads to READS and PassedFilterReads to PF_READS and added new annotations for the number of reads at the masked (PM_READS) and aligned (ALIGNED_PAIRS) levels.

Lots of related changes in the code to make class and variable names match the new names.

comment:34 Changed 5 years ago by Nicklas Nordborg

(In [2510]) References #533: Add secondary analysis section to Reggie

Include reggie-ogs-hosts.xml in distribution.

comment:35 Changed 5 years ago by Nicklas Nordborg

(In [2512]) References #533: Add secondary analysis section to Reggie

Added support for using InputStream/OutputStream when reading/writing files to remote hosts.

comment:36 Changed 5 years ago by Nicklas Nordborg

(In [2534]) References #533 and BASE ticket http://base.thep.lu.se/ticket/1824. Call Services.restart() so that the BASE core can catch any exceptions.

comment:37 Changed 5 years ago by Nicklas Nordborg

(In [2562]) References #533: Add secondary analysis section to Reggie

Installing an SFTP file server item 'ProjectArchive'. This contains login information for connecting to the file server where the 'project archive' is located. The file server can then be used to create external file links to FASTQ and other data files and then link them to the corresponding bioassay items.

comment:38 Changed 5 years ago by Nicklas Nordborg

(In [2563]) References #533: Add secondary analysis section to Reggie

Use fingerprint instead of BASE64-encoded public key to verify connections to SSH servers. Fingerprints are shorter, easier to handle, and more compatible with the new feature in BASE FileServer items.
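
For illustration, a fingerprint can be derived from a BASE64-encoded public key using only the Java standard library. This sketch assumes the classic MD5 format (16 colon-separated hex pairs); the changeset does not say which fingerprint format is used, and the class name is hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical sketch, not the Reggie implementation.
class SshFingerprint {

    /** Compute an MD5 fingerprint, e.g. "ab:cd:...", from a BASE64-encoded key blob. */
    static String md5Fingerprint(String base64PublicKey) {
        byte[] keyBlob = Base64.getDecoder().decode(base64PublicKey);
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is unavailable", e);
        }
        byte[] digest = md5.digest(keyBlob);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < digest.length; i++) {
            if (i > 0) {
                sb.append(':');
            }
            sb.append(String.format("%02x", digest[i] & 0xff)); // two lowercase hex digits per byte
        }
        return sb.toString();
    }
}
```

The result is always 47 characters long, compared to several hundred for a typical BASE64-encoded RSA key.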

comment:39 Changed 5 years ago by Nicklas Nordborg

(In [2564]) References #533: Add secondary analysis section to Reggie

Forgot this file as part of [2562].

comment:40 Changed 5 years ago by Nicklas Nordborg

(In [2565]) References #533: Add secondary analysis section to Reggie

Installing file types for FASTQ/BAM and associating them with the MergedSequences and AlignedSequences bioassay types.

comment:41 Changed 5 years ago by Nicklas Nordborg

Description: modified (diff)

comment:42 Changed 5 years ago by Nicklas Nordborg

Resolution: fixed
Status: assigned → closed

comment:43 Changed 5 years ago by Nicklas Nordborg

(In [2704]) References #533: Add secondary analysis section to Reggie

Failure to download result files should result in error status for the job.

comment:44 Changed 5 years ago by Nicklas Nordborg

(In [2709]) References #533: Add secondary analysis section to Reggie

Making the installation wizard more robust in case items have been created in a non-optimal order. Eg. data file type items and some subtypes must be created in a special order, but a developer may already have created some items that are not in the expected order.

comment:45 Changed 5 years ago by Nicklas Nordborg

(In [2715]) References #533: Add secondary analysis section to Reggie

Installation wizard should only warn if username, etc. is not configured on ProjectArchive file server.
