Opened 11 years ago

Closed 11 years ago

#790 closed (fixed)

Quantitative hits comparison optimisation

Reported by: Fredrik Levander
Owned by: olle
Milestone: Proteios SE 2.18.0
Keywords:
Cc:

Description

The quantitative hits comparison (#736) needs optimisation to support large datasets.

Change History (11)

comment:1 Changed 11 years ago by olle

Owner: set to olle

Ticket assigned to olle.

comment:2 Changed 11 years ago by olle

Status: new → assigned

Ticket accepted.

comment:3 Changed 11 years ago by olle

Traceability note:

  • Quantitative hits comparison was introduced in Ticket #736 (quantitative comparison report).

comment:4 Changed 11 years ago by olle

Design discussion.

  • Currently, quantitative hits comparison is defined only for non-gel-based data, so the design will be limited to this case.
  • Quantitative hits comparison is managed by class/file plugins/HitsComparisonQuantitativeReportPlugin.java in plugin/.
  • For convenience, the data is copied to an instance of the inner class HitsComparisonData, so that a single argument of this type can be passed to methods that need access to different parts of the data. However, this is currently used only for two doExport(...) method calls in sequence, one of which only performs file checks and can be replaced by a simpler method. To avoid using more memory than necessary, use of the inner class HitsComparisonData should be eliminated, at the expense of a longer argument list in the remaining doExport(...) method.
  • The generation of the comparison list contains a double loop: for every external id (for proteins) or sequence value (for peptides), all hits in each of the two hit selections are checked. Both lists can be large, but only a few hits in the hit selections contribute to the result for a specific external id or sequence value. It is therefore desirable to make this process more efficient. The hit selections are already partly sorted by the values of the outer-loop variable. If the hit selections were fully sorted by these values, efficient search algorithms could be used to find the hits of interest, eliminating the need to check all hits in the hit selections.

comment:5 Changed 11 years ago by olle

Design update:

  • Class/file plugins/HitsComparisonQuantitativeReportPlugin.java in plugin/ should be updated to no longer use inner class HitsComparisonData, in order to not require more memory than necessary, and to reduce the code size.
  • Each hit selection list should be created by a single database query, with the results sorted by external id for protein comparison and by sequence value for peptide comparison. In the main report generation loop, an algorithm exploiting the sort order will be used to find the first list item (if any) with the current value of the outer-loop variable, instead of checking every item in the hit list. Since the list is sorted, all items with that value can be found by checking subsequent items in the list until an item with a different value is found or the end of the list is reached. This should reduce the report generation time for large data sets.
  • As a minor issue, Boolean variable modificationsIncluded should be renamed includeModifications, to clarify that it represents a logical value, and not a collection like a list or a set.
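
The search step in the design above can be sketched as a lower-bound binary search. This is only an illustrative sketch, not the plugin code: the class SortedKeyLookup and its methods are hypothetical, plain strings stand in for the external id or sequence value of each hit, and the list is assumed to be sorted in the same order that String.compareTo uses (whether the database sort order matches this is an assumption).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Proteios: looks up keys in a list
// that is already sorted in ascending String.compareTo order.
final class SortedKeyLookup {

    // Returns the index of the first element equal to key, or -1 if absent.
    static int firstIndexOf(List<String> sortedKeys, String key) {
        int low = 0;
        int high = sortedKeys.size(); // exclusive upper bound
        while (low < high) {
            int mid = (low + high) >>> 1;
            if (sortedKeys.get(mid).compareTo(key) < 0) {
                low = mid + 1; // the first match, if any, lies to the right
            } else {
                high = mid;    // mid may be the first match, so keep it in range
            }
        }
        boolean found = low < sortedKeys.size() && sortedKeys.get(low).equals(key);
        return found ? low : -1;
    }

    // Counts the run of consecutive elements equal to key,
    // starting at the first match and stopping at the first different value.
    static int countMatches(List<String> sortedKeys, String key) {
        int first = firstIndexOf(sortedKeys, key);
        if (first < 0) {
            return 0;
        }
        int i = first;
        while (i < sortedKeys.size() && sortedKeys.get(i).equals(key)) {
            i++;
        }
        return i - first;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("P01", "P02", "P02", "P02", "P07");
        System.out.println(firstIndexOf(keys, "P02"));
        System.out.println(countMatches(keys, "P02"));
        System.out.println(firstIndexOf(keys, "P05"));
    }
}
```

With this approach each outer-loop value costs a logarithmic search plus a scan over its own matches, rather than a pass over the whole hit list.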
Last edited 11 years ago by olle

comment:6 Changed 11 years ago by olle

(In [4405]) Refs #790. Refs #736. Class/file plugins/HitsComparisonQuantitativeReportPlugin.java in plugin/ updated:

  1. Inner class HitsComparisonData will no longer be used as a temporary copy of the data, in order to not require more memory than necessary, and to reduce the code size.
  2. Public method void doExport(DbControl dc, HitsComparisonData hcd, Directory outCoreDir, String filename, ProgressReporter progress) is removed, and the output file is now obtained by new public method File fetchOutCoreFile(DbControl dc, Directory outCoreDir, String filename).

comment:7 Changed 11 years ago by olle

(In [4406]) Refs #790. Refs #736. Class/file plugins/HitsComparisonQuantitativeReportPlugin.java in plugin/ updated:

  1. Boolean variable modificationsIncluded is renamed includeModifications, to clarify that it represents a logical value, and not a collection like a list or a set.

comment:8 Changed 11 years ago by olle

(In [4407]) Refs #790. Refs #736. Class/file plugins/HitsComparisonQuantitativeReportPlugin.java in plugin/ updated to have comparison report generation optimized for large data sets. Each hit selection list is now created by a single database query, with the results sorted by external id for protein comparison and by sequence value for peptide comparison. In the main report generation loop, an algorithm exploiting the sort order is used to find the first list item (if any) with the current value of the outer-loop variable, instead of checking every item in the hit list. Since the list is sorted, all items with that value can be found by checking subsequent items in the list until an item with a different value is found or the end of the list is reached. This should reduce the report generation time for large data sets.

  1. Private method List<Hit> fetchHitSelectionHitList(...) updated with two new arguments, boolean gelBasedComparison and Boolean includeModifications. The hit list for each selection will now be created by a single database query obtained from updated private method ItemQuery<Hit> createBasicHitQuery(...).
  2. Private method ItemQuery<Hit> createBasicHitQuery(...) updated to take lists of gel external ids, local sample ids, and fraction ids as arguments, instead of specific values for the variables. The lists are used directly to create the query.
  3. Private methods void createQuantitativeReportPeptideTable(...) and void createQuantitativeReportProteinTable(...) updated to call new private method List<Hit> updateHitList(...) to create the hit list used for the report.
  4. New private method List<Hit> updateHitList(List<Hit> hitList, List<Hit> sortedHitList, boolean gelBasedComparison, String comparisonType, Boolean includeModifications, String comparisonValue, Integer charge) added. It updates the hit list with hits from the sorted hit list. The hits to process are located by calling new private method int firstSortItemIndexInSortedHitList(...), which returns the first index in the sorted hit list for the given value.
  5. New private method int firstSortItemIndexInSortedHitList(List<Hit> sortedHitList, boolean gelBasedComparison, String comparisonType, Boolean includeModifications, String comparisonValue) added. It uses a binary search algorithm to find the first index in the hit list for an item having a given value.
  6. New private method void createHitDebugTable(PrintWriter writer, List<Hit> hitList, String comparisonType, String quantityVariable) added for debug purposes. It is not used in production code, but left in the code base for convenience.
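
The collection step performed by updateHitList(...) can be pictured roughly as follows. This is a sketch under stated assumptions, not the committed code: SimpleHit and HitRunCollector are hypothetical stand-ins (the real Hit class is not shown in this ticket), the first index is assumed to have been found already by the binary search, and treating a null charge as "accept any charge" is a guess at the meaning of the Integer charge argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a hit: a sort key (external id or sequence) and a charge.
final class SimpleHit {
    final String key;
    final int charge;

    SimpleHit(String key, int charge) {
        this.key = key;
        this.charge = charge;
    }
}

// Hypothetical helper illustrating the forward scan over a sorted hit list.
final class HitRunCollector {

    // Appends to target every hit with the given key, scanning forward from
    // firstIndex and stopping at the first hit with a different key.
    // A negative firstIndex means the key is absent; target is then unchanged.
    // Assumption: a null charge disables the charge filter.
    static List<SimpleHit> appendMatches(List<SimpleHit> target,
                                         List<SimpleHit> sortedHits,
                                         int firstIndex,
                                         String key,
                                         Integer charge) {
        if (firstIndex < 0) {
            return target;
        }
        for (int i = firstIndex; i < sortedHits.size(); i++) {
            SimpleHit hit = sortedHits.get(i);
            if (!hit.key.equals(key)) {
                break; // sorted list: no further matches can follow
            }
            if (charge == null || hit.charge == charge.intValue()) {
                target.add(hit);
            }
        }
        return target;
    }

    public static void main(String[] args) {
        List<SimpleHit> sorted = Arrays.asList(
            new SimpleHit("AAK", 2),
            new SimpleHit("LSK", 2),
            new SimpleHit("LSK", 3),
            new SimpleHit("LSK", 2),
            new SimpleHit("VTR", 1));
        // Index 1 is where the run of "LSK" hits starts in this example list.
        List<SimpleHit> result =
            appendMatches(new ArrayList<SimpleHit>(), sorted, 1, "LSK", 2);
        System.out.println(result.size());
    }
}
```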

comment:9 Changed 11 years ago by olle

(In [4408]) Refs #790. Refs #736. Class/file plugins/HitsComparisonQuantitativeReportPlugin.java in plugin/ updated in public method void run(Request request, Response response, ProgressReporter progress) to set the value of the outfile string used in the job completion comment. It was originally set in public method void doExport(DbControl dc, HitsComparisonData hcd, Directory outCoreDir, String filename, ProgressReporter progress), which was removed in [4405].

comment:10 Changed 11 years ago by olle

(In [4409]) Refs #790. Refs #736. Class/file plugins/HitsComparisonQuantitativeReportPlugin.java in plugin/ updated to report progress percentage when generating the comparison report:

  1. Private methods void createQuantitativeReportPeptideTable(...) and void createQuantitativeReportProteinTable(...) updated with new argument ProgressReporter progress. Its display(int percent, String message) method is called to show an updated progress percentage whenever the outer loop of the report generation reaches a new integer percentage value.
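
One way to report only on a change of the integer percentage is sketched below. The interface SimpleProgress is a hypothetical stand-in that mirrors only the display(int percent, String message) signature quoted above; the class ProgressLoop, the loop body, and the message text are illustrative assumptions, not the plugin code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the display(int, String) signature.
interface SimpleProgress {
    void display(int percent, String message);
}

// Hypothetical loop that reports progress only when the integer percentage changes.
final class ProgressLoop {

    static void run(int totalItems, SimpleProgress progress) {
        int lastPercent = -1; // guarantees the first computed value is reported
        for (int i = 0; i < totalItems; i++) {
            // ... process outer-loop item i here ...

            // Long arithmetic avoids int overflow for very large item counts.
            int percent = (int) (100L * (i + 1) / totalItems);
            if (percent != lastPercent) {
                lastPercent = percent;
                progress.display(percent, "Processed " + (i + 1) + " of " + totalItems);
            }
        }
    }

    public static void main(String[] args) {
        final List<Integer> reported = new ArrayList<Integer>();
        run(1000, (percent, message) -> reported.add(percent));
        // Far fewer display calls than loop iterations.
        System.out.println(reported.size());
    }
}
```

Reporting on every iteration would add one display call per outer-loop item; comparing against the last reported value caps the number of calls at 101 regardless of data set size.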

comment:11 Changed 11 years ago by olle

Resolution: fixed
Status: assigned → closed

Ticket closed, as the update should improve the performance of quantitative hits comparison for large data sets.
