Opened 10 years ago

Closed 9 years ago

#807 closed (fixed)

PRIDE export of MGF-based searches

Reported by: Fredrik Levander
Owned by: olle
Milestone: Proteios SE 2.19.0
Keywords:
Cc:

Description

Currently PRIDE export only supports mzData-based searches. However, MGF is nowadays the most frequently used peak list format, and the PRIDE export should be updated to generate the mzData part based on the MGF in cases where MGF was used as peak list.

Probably the code needs to be updated so that, after the call to writePrideXmlHeader(dc, xmlCrudeWriter) in doExport, the file type of the peak list is checked. If it is mzData, the current code can be used. If it is MGF, the new conversion needs to be called. For any other file type, an exception should be thrown. The writer should also be able to write custom sample and contact blocks, which are options in the PRIDE export and can currently be written when mzData peak lists are used.
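The check described above could look roughly like the following. This is only a sketch: the class name, the enum, and the decision to detect the file type from the file name extension are assumptions made for illustration, and the real code would presumably use the file type stored for the file item in Proteios SE.

```java
import java.util.Locale;

/** Hypothetical sketch: decide how the mzData part of the PRIDE XML is produced. */
final class PeakListDispatchSketch {
    enum Source { COPY_MZDATA, CONVERT_MGF }

    /** Returns the strategy for a peak list file name, or throws for unsupported types. */
    static Source chooseSource(String fileName) {
        // Assumption: the type is recognised by extension only.
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".mzdata") || lower.endsWith(".mzdata.xml")) {
            return Source.COPY_MZDATA;   // existing code path
        }
        if (lower.endsWith(".mgf")) {
            return Source.CONVERT_MGF;   // new conversion path
        }
        throw new IllegalArgumentException(
                "Unsupported peak list type for PRIDE export: " + fileName);
    }
}
```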

There is functionality for writing mzData based on peak lists in the PKL to mzData converter plugin. One change in the current PRIDE XML is that the MS ontology is used instead of the old PSI ontology. This means that the instrument part can be retrieved from an associated mzML file, if present. The last instrumentConfiguration block in the file should be used (there is probably a referenceable param group holding the instrument name). Furthermore, other PSI terms should be updated to MS terms, as can be seen in recent PRIDE XML files.

A caveat is that spectrum IDs in the mzData part are integers, starting with one, while the hits in the hits table refer to spectra by string ids that match the MGF TITLE lines. The spectrum id in the hits table is not equal to the spectrum number in the MGF file. This means that a map from mzData spectrum id to MGF TITLE will have to be generated when converting the MGF and writing the mzData part of the file, and this map will then be needed for writing the correct SpectrumReference for peptides.
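To illustrate the mapping problem, here is a minimal sketch that assigns 1-based integer ids to the spectra of an MGF file in order of appearance and records which TITLE belongs to which id. The class and method names are made up for this example, and the MGF content is passed in as a list of lines to keep the sketch self-contained.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: map MGF TITLE strings to sequential mzData spectrum ids. */
final class MgfTitleIndexSketch {
    static Map<String, Integer> titleToSpectrumId(List<String> mgfLines) {
        Map<String, Integer> map = new LinkedHashMap<>();
        int spectrumId = 0;
        for (String raw : mgfLines) {
            String line = raw.trim();
            if (line.equals("BEGIN IONS")) {
                spectrumId++;   // mzData spectrum ids are integers starting with one
            } else if (line.startsWith("TITLE=") && spectrumId > 0) {
                map.put(line.substring("TITLE=".length()), spectrumId);
            }
        }
        return map;
    }
}
```

A hit whose spectrum string id is the TITLE text can then be resolved to the integer id with a plain map lookup when the SpectrumReference is written.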

Currently, protein assembly needs to be done per file for the export to work properly. This is due to limitations in the PRIDE XML, which only allows one peak list per PRIDE XML file. A warning about this should probably turn up somewhere.

Change History (32)

comment:1 Changed 10 years ago by olle

Status: new → assigned

Ticket accepted.

comment:2 Changed 10 years ago by olle

Traceability note:

  • PRIDE XML export was introduced in Ticket #405 (Support export for publication).
  • PRIDE XML export was updated in Ticket #694 (Non-gel PRIDE export fails).
  • PRIDE XML export was updated in Ticket #701 (Add more info to PRIDE export).
  • PRIDE XML export was updated in Ticket #805 (Support for simplifying ProteomeXchange submission) to annotate the created PRIDE XML file with the filename of the peaklist file.
  • A PKL to mzData converter was introduced in Ticket #188 (Batch conversion of peaklists).
  • An MGF to mzData converter was introduced in Ticket #808 (MGF to mzData Converter).
  • Automatic file conversion before basic processing was introduced in Ticket #591 (Workflow for using mzData files with OMSSA), where mzData files are converted to MGF files before OMSSA search.
Last edited 10 years ago by olle

comment:3 Changed 10 years ago by olle

(In [4478]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated to add the contents of the TITLE= header line in the MGF file to a comments tag in the SpectrumDesc tag of the corresponding spectrum in the created mzData file. The title string will be prefixed by "title=" in the comments tag:

  1. Public method void doConvert(InputStream instream, OutputStream outstream, ProgressReporter progress) updated to read the contents of the TITLE= header line in the MGF file and transfer it to private method void writeMzDataSpectrumDescBlock(...) through new argument String title.
  2. Private method void writeMzDataSpectrumDescBlock(...) updated with new argument String title. It will add the title to a comments tag in the SpectrumDesc tag of the spectrum, after adding prefix "title=".

comment:4 Changed 10 years ago by olle

Design update to allow automatic conversion of peaklist MGF files to mzData files for PRIDE export:

  • OMSSA search from Proteios SE already supports automatic conversion of input peaklist files, see Ticket #591 (Workflow for using mzData files with OMSSA), where mzData files are optionally converted to MGF before the search, i.e. conversion in the opposite direction to the one of interest here. The design here will be modeled on the one for OMSSA search.

Update of class/file action/hit/CreatePrideExportJob.java in client/servlet/:

  1. If the peaklist file is an MGF file, a flag is set to create an mzData file, provided a converted file does not already exist. The value of the flag variable is transferred to the created job as a job parameter.
  2. If the conversion flag is set, a conversion job is created using plug-in class MgfToMzDataPlugin. The conversion job is set as blocker to the PRIDE XML export job, ensuring that the former is executed first.
  3. Three private convenience methods are added:

    a. boolean convertedMgfToMzDataFileExists(DbControl dc, File spectrumFile)
    b. File fetchConvertedMgfToMzDataFile(DbControl dc, File spectrumFile)
    c. Job createMgfToMzDataConvertJob(DbControl dc, File spectrumFile)

Update of class/file plugins/PrideExportPlugin.java in plugin/:

  1. For clarity, a new variable File mzDataFile is introduced for the file from which the mzData XML block is to be copied, in addition to previous variable File peakListFile, which is the file for which peptide identification searches have been made. If the file conversion flag from job parameters is false, the value of mzDataFile is set equal to peakListFile, otherwise new private method File fetchConvertedMgfToMzDataFile(ItemFactory factory, File spectrumFile) is called to retrieve the mzData file.
  2. If an MGF to mzData file conversion has been performed, the latter file is parsed for comments tags with spectrum title strings from the MGF file, in order to create a spectrum string id/spectrum id hash map.
  3. The input peaklist file is used in references to search hits in the PRIDE XML file, while the mzData file is used to copy the mzData part of the PRIDE XML file from.
  4. If an MGF to mzData file conversion has been performed, the spectrum id in the PRIDE XML SpectrumReference tag is obtained from the created spectrum string id/spectrum id hash map, using the spectrum string id of the search hit in the Hits table as key.
  5. Two private convenience methods are added:

    a. File fetchConvertedMgfToMzDataFile(ItemFactory factory, File spectrumFile)
    b. private HashMap<String, Integer> createSpectrumStringIdSpectrumIdHashMap(DbControl dc, Integer sourceFileId)
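
As an illustration of steps 2 and 4 for PrideExportPlugin, the following sketch builds the spectrum string id/spectrum id map from a converted mzData file by looking for comments tags that carry the "title=" prefix. It is a simplified, line-based sketch under the assumption that each tag sits on one line. The class name and regular expressions are hypothetical, XML entities in titles are not unescaped, and production code would more likely use an XML parser.

```java
import java.util.HashMap;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: build a title -> spectrum id map from mzData lines. */
final class SpectrumTitleMapSketch {
    private static final Pattern SPECTRUM_ID =
            Pattern.compile("<spectrum\\s+id=\"(\\d+)\"");
    private static final Pattern TITLE =
            Pattern.compile("<comments>title=(.*?)</comments>");

    static HashMap<String, Integer> build(List<String> mzDataLines) {
        HashMap<String, Integer> map = new HashMap<>();
        Integer currentId = null;
        for (String line : mzDataLines) {
            Matcher id = SPECTRUM_ID.matcher(line);
            if (id.find()) {
                currentId = Integer.valueOf(id.group(1));   // remember enclosing spectrum
            }
            Matcher title = TITLE.matcher(line);
            if (title.find() && currentId != null) {
                map.put(title.group(1), currentId);
            }
        }
        return map;
    }
}
```

The spectrum id for a SpectrumReference tag is then obtained with map.get(spectrumStringId), where spectrumStringId comes from the search hit in the Hits table.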
Last edited 10 years ago by olle

comment:5 Changed 10 years ago by olle

(In [4479]) Refs #807. PRIDE XML export is updated to support automatic conversion of MGF files to mzData for use in the mzData part of the PRIDE XML file:

Update of class/file action/hit/CreatePrideExportJob.java in client/servlet/:

  1. If the peaklist file is an MGF file, a flag is set to create an mzData file, provided a converted file does not already exist. The value of the flag variable is transferred to the created job as a job parameter.
  2. If the conversion flag is set, a conversion job is created using plug-in class MgfToMzDataPlugin. The conversion job is set as blocker to the PRIDE XML export job, ensuring that the former is executed first.
  3. Three private convenience methods are added:

    a. boolean convertedMgfToMzDataFileExists(DbControl dc, File spectrumFile)
    b. File fetchConvertedMgfToMzDataFile(DbControl dc, File spectrumFile)
    c. Job createMgfToMzDataConvertJob(DbControl dc, File spectrumFile)

Update of class/file plugins/PrideExportPlugin.java in plugin/:

  1. For clarity, a new variable File mzDataFile is introduced for the file from which the mzData XML block is to be copied, in addition to previous variable File peakListFile, which is the file for which peptide identification searches have been made. If the file conversion flag from job parameters is false, the value of mzDataFile is set equal to peakListFile, otherwise new private method File fetchConvertedMgfToMzDataFile(ItemFactory factory, File spectrumFile) is called to retrieve the mzData file.
  2. If an MGF to mzData file conversion has been performed, the latter file is parsed for comments tags with spectrum title strings from the MGF file, in order to create a spectrum string id/spectrum id hash map.
  3. The input peaklist file is used in references to search hits in the PRIDE XML file, while the mzData file is used to copy the mzData part of the PRIDE XML file from.
  4. If an MGF to mzData file conversion has been performed, the spectrum id in the PRIDE XML SpectrumReference tag is obtained from the created spectrum string id/spectrum id hash map, using the spectrum string id of the search hit in the Hits table as key.
  5. Two private convenience methods are added:

    a. File fetchConvertedMgfToMzDataFile(ItemFactory factory, File spectrumFile)
    b. private HashMap<String, Integer> createSpectrumStringIdSpectrumIdHashMap(DbControl dc, Integer sourceFileId)

comment:6 Changed 10 years ago by olle

(In [4480]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated by retracting the update to MS ontology in changeset [4477], since the PSI ontology is needed for valid mzData file.

comment:7 Changed 10 years ago by olle

(In [4481]) Refs #807. PRIDE XML export is updated to convert selected cvParam tags from PSI to MS ontology, when copying XML data from mzData file to the mzData part of the PRIDE XML file:

Update of class/file plugins/PrideExportPlugin.java in plugin/:

  1. Public method void doExport(DbControl dc, File outCoreFile, ProgressReporter progress) updated to call private method void copySelectedXMLBlocks(..., boolean convertPsiToMsOntology) with value of new argument boolean convertPsiToMsOntology set to true when copying XML blocks, that may contain the selected cvParam tags.
  2. Private method copySelectedXMLBlocks(...) updated with a variant having an extra argument boolean convertPsiToMsOntology. If the value of the latter is true, new private method String exchangePsiForMsOntology(String line) is called to convert selected cvParam tags from PSI to MS ontology.
  3. New private method String exchangePsiForMsOntology(String line) added. It converts selected cvParam tags from PSI to MS ontology. Lines that do not contain the selected cvParam tags are returned unmodified.
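
A minimal sketch of such a line-based exchange is given below. It is not the code from the changeset: the class name, the extra set argument, and in particular the rule that only the cvLabel and the accession prefix change while the number is kept are assumptions for illustration. Some terms need a different accession or name in the MS ontology, which a real mapping table would have to cover.

```java
import java.util.Set;

/** Hypothetical sketch: rewrite selected cvParam lines from PSI to MS ontology. */
final class OntologyExchangeSketch {
    /**
     * Lines without one of the selected accessions are returned unmodified.
     * Assumption: every selected accession starts with "PSI:".
     */
    static String exchangePsiForMs(String line, Set<String> selectedPsiAccessions) {
        if (!line.contains("<cvParam")) {
            return line;
        }
        for (String psi : selectedPsiAccessions) {
            String psiAttr = "accession=\"" + psi + "\"";
            if (line.contains(psiAttr)) {
                String ms = "MS:" + psi.substring("PSI:".length());
                return line.replace("cvLabel=\"PSI\"", "cvLabel=\"MS\"")
                           .replace(psiAttr, "accession=\"" + ms + "\"");
            }
        }
        return line;
    }
}
```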

comment:8 Changed 10 years ago by olle

(In [4482]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated to ensure that DbControl instances are closed after use. Private method void writeSpectraData(XMLCrudeWriter2 xmlCWriter, ...) updated to use existing DbControl instance, instead of creating new one:

  1. Public method void run(Request request, Response response, ProgressReporter progress) updated by closing DbControl used to get directory information from input file item.
  2. Public method void doConvert(InputStream instream, Directory outCoreDir, String filename, ProgressReporter progress) updated in call of public method void doConvert(... InputStream instream, OutputStream outstream, ProgressReporter progress) to set new argument DbControl dc to existing DbControl instance.
  3. Public method void doConvert(InputStream instream, OutputStream outstream, ProgressReporter progress) updated with new initial argument DbControl dc, whose value is used in call of private method void writeSpectraData(... XMLCrudeWriter2 xmlCWriter, List<Double> peakMassData, List<Double> peakIntensityData, boolean mz_double_precision, boolean inten_double_precision) to set new argument DbControl dc to existing DbControl instance.
  4. Private method void writeSpectraData(XMLCrudeWriter2 xmlCWriter, List<Double> peakMassData, List<Double> peakIntensityData, boolean mz_double_precision, boolean inten_double_precision) updated with new initial argument DbControl dc, whose value is used instead of creating new DbControl instance.

comment:9 Changed 10 years ago by olle

(In [4484]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated in private method void writeMzDataSpectrumDescBlock(XMLCrudeWriter2 xmlCWriter, Float massToChargeRatio, Float intensity, Integer chargeState, Float rtInSeconds, String title) to put the comments tag in the SpectrumDesc tag of the corresponding spectrum in the created mzData file after the spectrumSettings and precursorList tags, in order to be valid according to the mzData XSD (XML Schema Definition).

comment:10 Changed 10 years ago by olle

Design update to retrieve instrument part in the mzData XML block in the PRIDE XML file from an mzML file associated with the peaklist file:

  • Use of class XMLCrudeWriter3Impl instead of XMLCrudeWriterImpl for writing XML data, since the former allows the indentation level to be specified explicitly.
  • New instance variables to store the instrument data retrieved from an associated mzML file.
  • New private method File fetchAlternativeSpectrumFile(ItemFactory factory, File spectrumFile, String alternativeFileExtension) to fetch an alternative spectrum file.
  • Private method void copySelectedXMLBlocks(...) updated with new argument int firstLineExtraIndentation.
  • New private method String fetchInstrumentName(File sourceFile) for retrieving the instrument name from a cvParam tag in "referenceableParamGroup" XML tag in a source mzML file.
  • New private method List<String> fetchInstrumentCvParamList(File sourceFile, String tagName) for retrieving a list of inside contents of cvParam tags in selected XML tag in a source mzML file. Only data from the last block of the selected XML tag is returned.
  • New private method void writeInstrumentBlock(XMLCrudeWriter3Impl xmlCrudeWriter) to write an mzData instrument XML block with data from instance variables, whose values have been retrieved from an associated mzML file.
  • Public method void doExport(DbControl dc, File outCoreFile, ProgressReporter progress) updated to write instrument data in mzData part of PRIDE XML file from information retrieved from an associated mzML file. First part of method handling different options also rewritten, in order to increase clarity and simplify future additions.
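
The "last block only" behaviour of fetchInstrumentCvParamList could be sketched as follows. This is a simplified illustration, not the planned implementation: the class name is invented, the file content is passed as a list of lines, and it assumes one XML tag per line with no self-closing block tags.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: collect cvParam lines from the last block of a given tag. */
final class LastBlockCvParamSketch {
    static List<String> cvParamsOfLastBlock(List<String> lines, String tagName) {
        List<String> current = new ArrayList<>();
        List<String> last = new ArrayList<>();
        boolean inside = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("<" + tagName + " ") || line.startsWith("<" + tagName + ">")) {
                inside = true;
                current = new ArrayList<>();   // a later block replaces an earlier one
            } else if (line.startsWith("</" + tagName + ">")) {
                inside = false;
                last = current;
            } else if (inside && line.startsWith("<cvParam")) {
                current.add(line);
            }
        }
        return last;
    }
}
```

Calling it with tagName "instrumentConfiguration" returns the cvParam lines of the last such block and ignores the enclosing instrumentConfigurationList tag, because the start-tag test requires a space or ">" directly after the tag name.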
Last edited 10 years ago by olle

comment:11 Changed 10 years ago by olle

(In [4486]) Refs #807. PRIDE XML export is updated to retrieve instrument part in the mzData XML block in the PRIDE XML file from an mzML file associated with the peaklist file:

Class/file plugins/PrideExportPlugin.java in plugin/ update:

  1. Use of class XMLCrudeWriter3Impl instead of XMLCrudeWriterImpl for writing XML data, since the former allows the indentation level to be specified explicitly.
  2. New instance variables to store the instrument data retrieved from an associated mzML file.
  3. New private method File fetchAlternativeSpectrumFile(ItemFactory factory, File spectrumFile, String alternativeFileExtension) to fetch an alternative spectrum file.
  4. Private method void copySelectedXMLBlocks(...) updated with new argument int firstLineExtraIndentation.
  5. New private method String fetchInstrumentName(File sourceFile) for retrieving the instrument name from a cvParam tag in "referenceableParamGroup" XML tag in a source mzML file.
  6. New private method List<String> fetchInstrumentCvParamList(File sourceFile, String tagName) for retrieving a list of inside contents of cvParam tags in selected XML tag in a source mzML file. Only data from the last block of the selected XML tag is returned.
  7. New private method void writeInstrumentBlock(XMLCrudeWriter3Impl xmlCrudeWriter) to write an mzData instrument XML block with data from instance variables, whose values have been retrieved from an associated mzML file.
  8. Public method void doExport(DbControl dc, File outCoreFile, ProgressReporter progress) updated to write instrument data in mzData part of PRIDE XML file from information retrieved from an associated mzML file. First part of method handling different options also rewritten, in order to increase clarity and simplify future additions.
Last edited 10 years ago by olle

comment:12 Changed 10 years ago by olle

Design update for adding sample and contact information to PRIDE XML export file:

Class/file plugins/PrideExportPlugin.java in plugin/ update:

  1. Public method void doExport(DbControl dc, File outCoreFile, ProgressReporter progress) update:
    a. Bug fix: Start tag after adding sample information changed from "contact" to "sourceFile".
    b. First line extra indentation specified for the different cases.
    c. Old code that was commented out is now removed.
  2. Private method void writeContactBlock(...) updated to use class XMLCrudeWriter3Impl instead of XMLCrudeWriterImpl for writing XML data, since the former allows the indentation level to be specified explicitly. Extra indentation specified.
  3. Private method void writeSampleBlock(...) updated to use class XMLCrudeWriter3Impl instead of XMLCrudeWriterImpl for writing XML data, since the former allows the indentation level to be specified explicitly. Extra indentation specified.

comment:13 Changed 10 years ago by olle

(In [4487]) Refs #807. PRIDE XML export is updated for adding sample and contact information to PRIDE XML export file:

Class/file plugins/PrideExportPlugin.java in plugin/ update:

  1. Public method void doExport(DbControl dc, File outCoreFile, ProgressReporter progress) update:
    a. Bug fix: Start tag after adding sample information changed from "contact" to "sourceFile".
    b. First line extra indentation specified for the different cases.
    c. Old code that was commented out is now removed.
  2. Private method void writeContactBlock(...) updated to use class XMLCrudeWriter3Impl instead of XMLCrudeWriterImpl for writing XML data, since the former allows the indentation level to be specified explicitly. Extra indentation specified.
  3. Private method void writeSampleBlock(...) updated to use class XMLCrudeWriter3Impl instead of XMLCrudeWriterImpl for writing XML data, since the former allows the indentation level to be specified explicitly. Extra indentation specified.

comment:14 Changed 10 years ago by olle

(In [4488]) Refs #807. Class/file plugins/PrideExportPlugin.java in plugin/ updated in public method void run(Request request, Response response, ProgressReporter progress) to throw a BaseException if no mzData file can be found (whether original or converted from peaklist file).

Last edited 10 years ago by olle

comment:15 Changed 10 years ago by olle

Design update:

  • PRIDE XML export should be updated to allow addition of species information in the sampleDescription XML tag in the mzData part of the PRIDE XML file. The species information will be entered by the user in the form used to create the PRIDE XML export job. Species ontology, accession number, and name are all required for the information to be added to the PRIDE XML file.
  1. Class/file action/hit/PrideExport.java in client/servlet/ updated by adding new fields related to species information to the sample section of the form. The fields are coupled to new valid parameters defined in class CreatePrideExportJob. New private method Fieldset getSampleFieldset() added to increase code readability.
  2. Class/file action/hit/SelectPrideProtocolFileStep1a.java in client/servlet/ updated to retrieve values of new valid parameters related to species information from the request and save them as session attributes.
  3. Class/file action/hit/CreatePrideExportJob.java in client/servlet/ updated with new valid parameters related to species information. The values of the parameters are retrieved from session attributes and transferred to the created job as job parameters.
  4. Class/file plugins/PrideExportPlugin.java in plugin/ updated to retrieve the values of new variables related to species information from job parameters. Private method void writeSampleBlock(XMLCrudeWriter3Impl xmlCrudeWriter) updated to write a cvParam tag in the sampleDescription tag with the species information, provided that this option has been selected and species ontology, accession number, and name have all been specified.
  5. English dictionary file locale/en/dictionary in client/servlet/ updated with new entries for various string keys.
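
The "all three values required" rule for the species cvParam can be sketched like this. The class and method names are hypothetical, the result is returned as a string instead of being written through XMLCrudeWriter3Impl, and the exact attribute layout of the tag is an assumption.

```java
/** Hypothetical sketch: build the species cvParam only when all values are present. */
final class SpeciesCvParamSketch {
    /** Returns the cvParam tag, or null if any of the three values is missing. */
    static String speciesCvParam(String ontology, String accession, String name) {
        if (isBlank(ontology) || isBlank(accession) || isBlank(name)) {
            return null;   // incomplete species information is not written
        }
        return "<cvParam cvLabel=\"" + escape(ontology)
                + "\" accession=\"" + escape(accession)
                + "\" name=\"" + escape(name) + "\"/>";
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** Minimal escaping for XML attribute values. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

For example, ontology "NEWT", accession "9606" and name "Homo sapiens (Human)" give one cvParam tag, while leaving any of the three fields empty gives null, so nothing is added to the sampleDescription tag.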

comment:16 Changed 10 years ago by olle

(In [4489]) Refs #807. PRIDE XML export updated to allow addition of species information in the sampleDescription XML tag in the mzData part of the PRIDE XML file. The species information will be entered by the user in the form used to create the PRIDE XML export job. Species ontology, accession number, and name are all required for the information to be added to the PRIDE XML file.

  1. Class/file action/hit/PrideExport.java in client/servlet/ updated by adding new fields related to species information to the sample section of the form. The fields are coupled to new valid parameters defined in class CreatePrideExportJob. New private method Fieldset getSampleFieldset() added to increase code readability.
  2. Class/file action/hit/SelectPrideProtocolFileStep1a.java in client/servlet/ updated to retrieve values of new valid parameters related to species information from the request and save them as session attributes.
  3. Class/file action/hit/CreatePrideExportJob.java in client/servlet/ updated with new valid parameters related to species information. The values of the parameters are retrieved from session attributes and transferred to the created job as job parameters.
  4. Class/file plugins/PrideExportPlugin.java in plugin/ updated to retrieve the values of new variables related to species information from job parameters. Private method void writeSampleBlock(XMLCrudeWriter3Impl xmlCrudeWriter) updated to write a cvParam tag in the sampleDescription tag with the species information, provided that this option has been selected and species ontology, accession number, and name have all been specified.
  5. English dictionary file locale/en/dictionary in client/servlet/ updated with new entries for various string keys.

comment:17 Changed 10 years ago by olle

(In [4490]) Refs #807. Class/file action/hit/CreatePrideExportJob.java in client/servlet/ updated by removal of definition of valid parameter VString VSAMPLENAME, that was erroneously added in changeset [4489].

comment:18 Changed 10 years ago by olle

(In [4491]) Refs #807. PRIDE XML export updated to exclude Hit table entries with score type "Proteios aligned", that are created using "Propagate Feature Sequences":

  1. Class/file plugins/PrideExportPlugin.java in plugin/ updated in private method ItemQuery<Hit> createPeptideHitQuery(...) to exclude hits with score type "Proteios aligned".

comment:19 Changed 10 years ago by olle

(In [4492]) Refs #515. Refs #807. Classes/files io/XMLCrudeWriterImpl.java and io/XMLCrudeWriter3Impl.java in api/core/ updated to not throw an exception when attempting to write a null string:

  1. Private method void writeStr(String string) updated to return directly if the input string is null.

comment:20 Changed 10 years ago by Fredrik Levander

(In [4493]) Refs #809. Refs #807. Update of PRIDE xml peptide modification writing to use help functions in core classes for parsing and conversion to PSI MOD. Update of mzML instrument retrieval to support both referenceable param groups and in-block cvParam representations for instrument name.

comment:21 Changed 10 years ago by Fredrik Levander

(In [4494]) Refs #807. Throwing exception when no mzData file to use for export of spectra. PRIDE XML export of identifications only is thus disabled.

comment:22 Changed 10 years ago by Fredrik Levander

(In [4495]) Refs #807. Fix to not retrieve instrument serial number instead of instrument name from mzML file.

comment:23 Changed 10 years ago by olle

(In [4496]) Refs #807. PRIDE XML export updated to exclude Hit table entries with score type "Proteios aligned" when finding peaklist files, for which to create export jobs:

  1. Class/file action/hit/CreatePrideExportJob.java in client/servlet/ updated in protected method void runMe() to exclude Hit table entries with score type "Proteios aligned" when finding peaklist files, for which to create export jobs.

comment:24 Changed 10 years ago by olle

(In [4497]) Refs #807. PRIDE XML export updated to search for an alternative mzML file in other directories in project, if not found in same directory as the peaklist file:

  1. Class/file plugins/PrideExportPlugin.java in plugin/ updated:
    a. Public method void run(Request request, Response response, ProgressReporter progress) updated to call new private method File fetchAlternativeSpectrumFileInProject(ItemFactory factory, File spectrumFile, String alternativeFileExtension) to obtain an alternative mzML file, if none is found in same directory as the peaklist file.
    b. New private method File fetchAlternativeSpectrumFileInProject(ItemFactory factory, File spectrumFile, String alternativeFileExtension) added. It searches for an alternative spectrum file in the project. If several exist, the first found is returned.
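
The lookup order (same directory first, then the rest of the project, first match wins) might be sketched as below. Everything here is illustrative: files are represented by plain path strings rather than Proteios File items, and the assumption that the alternative file shares the base name of the peaklist file is mine, not stated in the changeset.

```java
import java.util.List;

/** Hypothetical sketch: find an alternative spectrum file by extension. */
final class AlternativeFileSketch {
    static String findAlternative(String peakListPath, String altExt, List<String> projectPaths) {
        int slash = peakListPath.lastIndexOf('/');
        String dir = peakListPath.substring(0, slash + 1);
        String fileName = peakListPath.substring(slash + 1);
        int dot = fileName.lastIndexOf('.');
        String base = dot >= 0 ? fileName.substring(0, dot) : fileName;
        String wanted = base + "." + altExt;

        // 1. Prefer a file in the same directory as the peaklist file.
        if (projectPaths.contains(dir + wanted)) {
            return dir + wanted;
        }
        // 2. Otherwise take the first match anywhere in the project.
        for (String path : projectPaths) {
            if (path.endsWith("/" + wanted)) {
                return path;
            }
        }
        return null;   // no alternative file found
    }
}
```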

comment:25 Changed 10 years ago by olle

(In [4498]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated in public method void doConvert(DbControl dc, InputStream instream, OutputStream outstream, ProgressReporter progress) to set total intensity for a spectrum to 0 if no value exists in spectrum header line starting with "PEPMASS=".
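
For illustration, a PEPMASS line with an optional intensity value could be handled as in this sketch. The class and method names are hypothetical and the returned array layout is chosen only for the example.

```java
/** Hypothetical sketch: parse an MGF PEPMASS line with optional intensity. */
final class PepmassSketch {
    /** Returns {m/z, intensity}; intensity is 0 when the line has no second value. */
    static double[] parsePepmass(String line) {
        String value = line.substring("PEPMASS=".length()).trim();
        String[] parts = value.split("\\s+");
        double mz = Double.parseDouble(parts[0]);
        double intensity = parts.length > 1 ? Double.parseDouble(parts[1]) : 0.0;
        return new double[] { mz, intensity };
    }
}
```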

comment:26 Changed 10 years ago by olle

(In [4499]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated in private method void writeMzDataSoftwareBlock(XMLCrudeWriter2 xmlCWriter) to specify that the data has been converted from MGF format using Proteios SE.

comment:27 Changed 10 years ago by olle

(In [4500]) Refs #807. PRIDE XML export updated to show link to NEWT ontology look-up service in form:

  1. Class/file action/hit/PrideExport.java in client/servlet/ updated in private method Fieldset getSampleFieldset() by adding help text with link to NEWT ontology look-up service at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=NEWT.

comment:28 Changed 9 years ago by Fredrik Levander

(In [4510]) Refs #807. Exchanged ontology term for m/z in PRIDE output.

comment:29 Changed 9 years ago by Fredrik Levander

(In [4511]) Refs #808. Refs #807. Updated converter to output as many digits as in the input file, which also avoids locale-dependent decimal point/comma problems.

comment:30 Changed 9 years ago by Fredrik Levander

(In [4514]) Refs #807. Added some new required fields to the generated PRIDE XML. Parsing database version.

comment:31 Changed 9 years ago by Fredrik Levander

(In [4516]) Refs #808. Refs #807. Changed writing of zero length data arrays to write one 0 value instead, since PRIDE does not support zero length arrays.
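
The workaround amounts to something like the following sketch, where the class and method names are made up for the example and the peak data is represented as a plain list.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: replace an empty data array with a single 0 value. */
final class EmptyArraySketch {
    static List<Double> padIfEmpty(List<Double> values) {
        // PRIDE does not support zero length arrays, so write one 0 instead.
        return values.isEmpty() ? Collections.singletonList(0.0) : values;
    }
}
```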

comment:32 Changed 9 years ago by olle

Resolution: fixed
Status: assigned → closed

Ticket closed as first version of PRIDE export of MGF-based searches has been added.
