Opened 10 years ago

Closed 10 years ago

Last modified 9 years ago

#808 closed (fixed)

MGF to mzData Converter

Reported by: olle
Owned by: olle
Milestone: Proteios SE 2.19.0
Keywords:
Cc:

Description

A plug-in for converting MGF files to mzData files could be useful in cases where one needs a spectrum file in mzData format, but only an MGF file is available.

Change History (17)

comment:1 Changed 10 years ago by olle

Status: new → assigned

Ticket accepted.

comment:2 Changed 10 years ago by olle

Design description:

  • MGF (Mascot Generic Format) stores data in a plain, non-XML text format. Each spectrum begins with a "BEGIN IONS" line followed by a number of header lines, continues with lines of mass peak data (value pairs of m/z and intensity), and ends with an "END IONS" line. The mass peak lines are similar to those in PKL files, where each spectrum begins with a line containing base peak mass, total intensity, and charge, followed by lines of mass peak data (value pairs of m/z and intensity), and ends with an empty line (except for the last spectrum). The converter will therefore be based on the existing "PKL to mzData" converter plug-in developed in Ticket #188 (Batch conversion of peaklists).
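The block structure described above can be sketched as follows. This is only an illustration, not the plug-in's parser: the class `MgfBlockSketch`, its nested `Spectrum` type, and the rule "a line starting with a digit is a peak line" are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating the MGF block layout; not the plug-in's code. */
final class MgfBlockSketch {

    /** One spectrum: KEY=VALUE header lines plus raw peak tokens. */
    static final class Spectrum {
        final Map<String, String> headers = new LinkedHashMap<>();
        final List<String[]> peaks = new ArrayList<>();
    }

    static List<Spectrum> parse(String mgfText) {
        List<Spectrum> spectra = new ArrayList<>();
        Spectrum current = null;
        for (String raw : mgfText.split("\\R")) {
            String line = raw.trim();
            if (line.equals("BEGIN IONS")) {
                current = new Spectrum();
            } else if (line.equals("END IONS")) {
                if (current != null) {
                    spectra.add(current);
                }
                current = null;
            } else if (current == null || line.isEmpty()) {
                continue; // outside a spectrum block, or a blank line
            } else if (Character.isDigit(line.charAt(0))) {
                // Assumption: peak lines start with a digit ("m/z intensity").
                current.peaks.add(line.split("\\s+"));
            } else {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    current.headers.put(line.substring(0, eq), line.substring(eq + 1));
                }
            }
        }
        return spectra;
    }
}
```

Peak tokens are kept as text here so that later steps can decide how to interpret or re-emit them.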

Modifications made for the MGF to mzData converter, other than those needed to parse the MGF file:

  1. Values from the spectrum header line starting with "RTINSECONDS=" are added to the spectrumInstrument tag as a cvParam tag "TimeInSeconds" with PSI accession number 1000039.
  2. Values for base peak mass and total intensity are obtained from the spectrum header line starting with "PEPMASS=".
  3. The value for charge state is obtained from the spectrum header line starting with "CHARGE=". Note that in MGF files the charge sign is added as a suffix, e.g. "CHARGE=2+".
  4. Peak m/z float values are allowed to omit the decimal point, e.g. "646.0000000" may be written as "646".
  5. The name of the converted file is constructed by adding ".mzData" to the input filename, including the file extension ".mgf". The converted file will therefore have file extension ".mgf.mzData", clearly indicating that the file was converted from an MGF file.
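
Points 3 to 5 above can be illustrated with a few small helpers. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and a CHARGE value listing several charge states is not handled.

```java
/** Hypothetical helpers for list items 3 to 5; names are not taken from the plug-in. */
final class MgfHeaderSketch {

    /** Parses "2+" or "1-" (sign as suffix); a bare "2" is assumed to be positive. */
    static int parseCharge(String value) {
        String v = value.trim();
        int sign = 1;
        if (v.endsWith("+")) {
            v = v.substring(0, v.length() - 1);
        } else if (v.endsWith("-")) {
            sign = -1;
            v = v.substring(0, v.length() - 1);
        }
        return sign * Integer.parseInt(v.trim());
    }

    /** Accepts m/z tokens with or without a decimal point, e.g. "646.0000000" or "646". */
    static double parseMz(String token) {
        return Double.parseDouble(token.trim());
    }

    /** Appends ".mzData" to the full input name, keeping the ".mgf" extension. */
    static String outputName(String inputName) {
        return inputName + ".mzData";
    }
}
```

For example, `MgfHeaderSketch.outputName("sample.mgf")` yields a name ending in ".mgf.mzData", matching the naming rule in point 5.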

comment:3 Changed 10 years ago by olle

(In [4476]) Refs #808. Added MGF to mzData plug-in class/file plugins/MgfToMzDataPlugin.java in plugin/. It is based on PKL to mzData plug-in class/file plugins/PklToMzDataPlugin.java in plugin/.

comment:4 Changed 10 years ago by olle

Design update:

  • One of the main purposes of this converter is to make it possible to create a PRIDE XML file when the peak list file is in MGF format instead of mzData, the format supported by PRIDE XML. Since the current PRIDE XML standard prefers the MS ontology over the old PSI ontology, the MS ontology should preferably be used in the mzData file.

Currently four PSI ontology terms occur in the MGF to mzData converter, where all accession numbers exist in both the PSI and MS ontology:

| PSI Term | PSI Name | MS Term | MS Name | Comment |
|---|---|---|---|---|
| PSI:1000039 | TimeInSeconds | MS:1000039 | second | This MS term is obsolete, as it has been replaced by MS:1000016, "scan time" |
| PSI:1000040 | MassToChargeRatio | MS:1000040 | m/z | |
| PSI:1000041 | ChargeState | MS:1000041 | charge state | |
| PSI:1000042 | Intensity | MS:1000042 | peak intensity | |

Simply replacing the obsolete term MS:1000039 with MS:1000016 leads to problems, since both PSI:1000038, "TimeInMinutes", and PSI:1000039, "TimeInSeconds", correspond to MS:1000016, "scan time", and must be supplemented with a unit accession number in order to be unambiguous. However, it is not clear whether an mzData cvParam tag accepts separate unit specifications (the mzData 1.05 specification did not). Until this has been resolved, the obsolete term MS:1000039, "second", will be used as the replacement for PSI:1000039, "TimeInSeconds".
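
The term mapping described in this comment could be held in a small lookup table. The following is only a sketch: the class `OntologyMappingSketch` and its methods are hypothetical, and the entries are limited to the four terms listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical lookup from PSI terms to MS terms; not the plug-in's code. */
final class OntologyMappingSketch {

    /** PSI accession -> { MS accession, MS name }. */
    private static final Map<String, String[]> PSI_TO_MS = new LinkedHashMap<>();

    static {
        // MS:1000039 is obsolete, but is kept because MS:1000016 needs a unit.
        PSI_TO_MS.put("PSI:1000039", new String[] {"MS:1000039", "second"});
        PSI_TO_MS.put("PSI:1000040", new String[] {"MS:1000040", "m/z"});
        PSI_TO_MS.put("PSI:1000041", new String[] {"MS:1000041", "charge state"});
        PSI_TO_MS.put("PSI:1000042", new String[] {"MS:1000042", "peak intensity"});
    }

    /** Returns the MS accession for a PSI accession, or null if unmapped. */
    static String msAccession(String psiAccession) {
        String[] entry = PSI_TO_MS.get(psiAccession);
        return entry == null ? null : entry[0];
    }

    /** Returns the MS term name for a PSI accession, or null if unmapped. */
    static String msName(String psiAccession) {
        String[] entry = PSI_TO_MS.get(psiAccession);
        return entry == null ? null : entry[1];
    }
}
```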


comment:5 Changed 10 years ago by olle

(In [4477]) Refs #808. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated to use MS instead of PSI ontology in cvParam tags. PSI term PSI:1000039, "TimeInSeconds" is exchanged for the obsolete MS term MS:1000039, "second", instead of the preferred term MS:1000016, "scan time", since the latter is ambiguous without a unit accession number.

comment:6 Changed 10 years ago by olle

Design update:

  • Each spectrum in the MGF file may have a title line with prefix "TITLE=" in the header lines block. Since the order of spectra in the MGF files does not always correspond to the "Spectrum ID" in the Hits table, the title string should be added to a comments tag in the SpectrumDesc tag of the created mzData file, in order to unambiguously identify the spectra in the latter. The title string will be prefixed by "title=" in the comments tag.

comment:7 Changed 10 years ago by olle

(In [4478]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated to add the contents of the TITLE= header line in the MGF file to a comments tag in the SpectrumDesc tag of the corresponding spectrum in the created mzData file. The title string will be prefixed by "title=" in the comments tag:

  1. Public method void doConvert(InputStream instream, OutputStream outstream, ProgressReporter progress) updated to read the contents of the TITLE= header line in the MGF file and transfer it to private method void writeMzDataSpectrumDescBlock(...) through new argument String title.
  2. Private method void writeMzDataSpectrumDescBlock(...) updated with new argument String title. It will add the title to a comments tag in the SpectrumDesc tag of the spectrum, after adding prefix "title=".

comment:8 Changed 10 years ago by olle

(In [4480]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated by retracting the update to MS ontology in changeset [4477], since the PSI ontology is needed for a valid mzData file.

comment:9 Changed 10 years ago by olle

(In [4482]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated to ensure that DbControl instances are closed after use. Private method void writeSpectraData(XMLCrudeWriter2 xmlCWriter, ...) updated to use existing DbControl instance, instead of creating new one:

  1. Public method void run(Request request, Response response, ProgressReporter progress) updated by closing DbControl used to get directory information from input file item.
  2. Public method void doConvert(InputStream instream, Directory outCoreDir, String filename, ProgressReporter progress) updated in call of public method void doConvert(... InputStream instream, OutputStream outstream, ProgressReporter progress) to set new argument DbControl dc to existing DbControl instance.
  3. Public method void doConvert(InputStream instream, OutputStream outstream, ProgressReporter progress) updated with new initial argument DbControl dc, whose value is used in call of private method void writeSpectraData(... XMLCrudeWriter2 xmlCWriter, List<Double> peakMassData, List<Double> peakIntensityData, boolean mz_double_precision, boolean inten_double_precision) to set new argument DbControl dc to existing DbControl instance.
  4. Private method void writeSpectraData(XMLCrudeWriter2 xmlCWriter, List<Double> peakMassData, List<Double> peakIntensityData, boolean mz_double_precision, boolean inten_double_precision) updated with new initial argument DbControl dc, whose value is used instead of creating new DbControl instance.

comment:10 Changed 10 years ago by olle

(In [4483]) Refs #808. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated in public method void doConvert(DbControl dc, InputStream instream, OutputStream outstream, ProgressReporter progress) to only write log message with value of numberOfSpectra at end of conversion, instead of for each spectrum.

comment:11 Changed 10 years ago by olle

(In [4484]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated in private method void writeMzDataSpectrumDescBlock(XMLCrudeWriter2 xmlCWriter, Float massToChargeRatio, Float intensity, Integer chargeState, Float rtInSeconds, String title) to put the comments tag in the SpectrumDesc tag of the corresponding spectrum in the created mzData file after the spectrumSettings and precursorList tags, in order to be valid according to the mzData XSD (XML Schema Definition).
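
The element order required here can be sketched with plain string assembly. This does not use XMLCrudeWriter2, whose API is not shown in this ticket. The class `SpectrumDescSketch` is hypothetical, and the element names (spectrumDesc, spectrumSettings, precursorList, comments) are an assumption based on the tag names mentioned in this ticket.

```java
/** Hypothetical sketch of the spectrumDesc element order; not the plug-in's writer. */
final class SpectrumDescSketch {

    /** Minimal escaping for XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Assembles a spectrumDesc element from already serialized child blocks,
     * placing the comments element last, after spectrumSettings and precursorList.
     */
    static String spectrumDesc(String spectrumSettingsXml, String precursorListXml, String title) {
        StringBuilder sb = new StringBuilder();
        sb.append("<spectrumDesc>");
        sb.append(spectrumSettingsXml);
        sb.append(precursorListXml);
        if (title != null) {
            // The title is prefixed by "title=" as described in comment:6.
            sb.append("<comments>title=").append(escapeXml(title)).append("</comments>");
        }
        sb.append("</spectrumDesc>");
        return sb.toString();
    }
}
```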

comment:12 Changed 10 years ago by olle

Resolution: fixed
Status: assigned → closed

Ticket closed as the requested functionality has been added.

comment:13 Changed 10 years ago by olle

(In [4498]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated in public method void doConvert(DbControl dc, InputStream instream, OutputStream outstream, ProgressReporter progress) to set total intensity for a spectrum to 0 if no value exists in spectrum header line starting with "PEPMASS=".
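
A minimal sketch of this fallback, assuming the PEPMASS value holds the mass optionally followed by whitespace and an intensity. The class and method names are hypothetical.

```java
/** Hypothetical PEPMASS parser illustrating the default intensity; not the plug-in's code. */
final class PepmassSketch {

    /**
     * Parses the text after "PEPMASS=".
     * Returns { mass, intensity }, where intensity is 0 when the line carries no second value.
     */
    static double[] parsePepmass(String value) {
        String[] tokens = value.trim().split("\\s+");
        double mass = Double.parseDouble(tokens[0]);
        double intensity = tokens.length > 1 ? Double.parseDouble(tokens[1]) : 0.0;
        return new double[] {mass, intensity};
    }
}
```

An empty value still raises a NumberFormatException in this sketch, since a missing mass cannot be defaulted sensibly.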

comment:14 Changed 10 years ago by olle

(In [4499]) Refs #808. Refs #807. MGF to mzData converter in class/file plugins/MgfToMzDataPlugin.java in plugin/ updated in private method void writeMzDataSoftwareBlock(XMLCrudeWriter2 xmlCWriter) to specify that the data has been converted from MGF format using Proteios SE.

comment:15 Changed 9 years ago by Fredrik Levander

(In [4511]) Refs #808. Refs #807. Updated converter to output as many digits as in the input file, which also avoids locale-dependent decimal point/comma problems.
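
The changeset message does not say how this was implemented. One possible approach, sketched below with hypothetical names, is to carry the input token through as text (here via BigDecimal, which keeps the scale of its string argument) instead of re-formatting a floating point value with a locale-sensitive formatter.

```java
import java.math.BigDecimal;
import java.util.Locale;

/** Hypothetical illustration of digit preservation; not the plug-in's code. */
final class DigitPreservationSketch {

    /** Locale-sensitive formatting: the decimal separator depends on the locale. */
    static String localeFormatted(double value, Locale locale) {
        return String.format(locale, "%.3f", value);
    }

    /** Keeps exactly the digits of the input token and always uses a decimal point. */
    static String preserved(String token) {
        return new BigDecimal(token.trim()).toPlainString();
    }
}
```

With a German locale, `localeFormatted(646.1, Locale.GERMANY)` produces a decimal comma, whereas `preserved("646.10")` returns the token's digits unchanged.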

comment:16 Changed 9 years ago by Fredrik Levander

(In [4513]) Refs #808. Updated converter to display hostname in path to sourcefile.

comment:17 Changed 9 years ago by Fredrik Levander

(In [4516]) Refs #808. Refs #807. Changed writing of zero length data arrays to write one 0 value instead, since PRIDE does not support zero length arrays.
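
The workaround can be sketched as follows, assuming the data arrays are written as Base64-encoded 32-bit little-endian floats. The precision, byte order, and all names here are assumptions for illustration, not taken from the changeset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

/** Hypothetical data array encoder illustrating the zero-length workaround. */
final class DataArraySketch {

    /** Replaces an empty array by a single 0 value; other arrays are returned as-is. */
    static float[] padIfEmpty(float[] values) {
        return values.length == 0 ? new float[] {0f} : values;
    }

    /** Encodes the (padded) values as Base64 text of 32-bit little-endian floats. */
    static String encode(float[] values) {
        float[] data = padIfEmpty(values);
        ByteBuffer buf = ByteBuffer.allocate(4 * data.length).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : data) {
            buf.putFloat(v);
        }
        return Base64.getEncoder().encodeToString(buf.array());
    }
}
```

Any length attribute written alongside the encoded data would then need to reflect the padded array, i.e. 1 rather than 0.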
