Opened 7 years ago

Closed 7 years ago

#2006 closed enhancement (fixed)

Use parallel gzip implementation when compressing files

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: BASE 3.9
Component: core
Version:
Keywords:
Cc:

Description

Just found this: https://github.com/shevek/parallelgzip

Which seems interesting since we already know that pigz is a good performance booster: http://baseplugins.thep.lu.se/ticket/809#comment:5

I think we should try this out for the "Store compressed" option when saving files to the BASE file system. It may not improve things when loading files over the network, but there are cases when files are created locally.

The release export wizard developed for reggie (http://baseplugins.thep.lu.se/ticket/887) can generate files several GB in size. On my dev computer the throughput is 3 MB/s when storing compressed and 20 MB/s when storing uncompressed.
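To illustrate why a parallel implementation can help, here is a minimal, self-contained sketch that uses only `java.util.zip`. It is not the parallelgzip library and not necessarily how that library or pigz work internally. It uses a simpler stand-in technique: each chunk is compressed on a worker thread as an independent gzip member and the members are concatenated, which the gzip format allows and which `GZIPInputStream` can read back. The class name, method names, and the chunk size and thread count in the usage note are assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: chunk-wise parallel gzip using concatenated members.
final class ChunkedParallelGzipSketch {

    // Compress data[from, to) into one complete gzip member.
    static byte[] gzipChunk(byte[] data, int from, int to) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data, from, to - from);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compress all chunks on a thread pool and concatenate them in input order.
    static byte[] compress(byte[] data, int chunkSize, int threads) {
        if (data.length == 0) return gzipChunk(data, 0, 0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> parts = new ArrayList<>();
            for (int from = 0; from < data.length; from += chunkSize) {
                final int start = from;
                final int end = Math.min(data.length, from + chunkSize);
                parts.add(pool.submit(() -> gzipChunk(data, start, end)));
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Future<byte[]> part : parts) {
                byte[] member = part.get(); // waits for this chunk to finish
                out.write(member, 0, member.length);
            }
            return out.toByteArray();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Read the concatenated members back with the standard GZIPInputStream.
    static byte[] decompress(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf, 0, buf.length)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `ChunkedParallelGzipSketch.compress(data, 64 * 1024, 4)` followed by `decompress(...)` should return the original bytes. Independent members cost a little compression ratio because each chunk starts with an empty dictionary, so the chunk size is a trade-off.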

Attachments (1)

pgzip-test.png (15.4 KB) - added by Nicklas Nordborg 7 years ago.
Using 7z to test a file that was created with PGZip


Change History (12)

comment:1 Changed 7 years ago by Nicklas Nordborg

Owner: changed from everyone to Nicklas Nordborg
Status: new → assigned

comment:2 Changed 7 years ago by Nicklas Nordborg

Tested the parallel implementation with the release exporter. Throughput is up to 8 MB/s. Considering that the compressed file size is about 25% of the uncompressed size, this means that we also gain time compared to the uncompressed alternative.
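The arithmetic behind this comparison can be sketched as follows. It assumes that the measured rates refer to compressed bytes written, which the ticket does not state explicitly, and it takes the 25% ratio and the 20 MB/s uncompressed rate from the figures above. The class and method names are hypothetical.

```java
// Hypothetical sketch of the throughput comparison, under the stated assumptions.
final class ThroughputSketch {

    // Rate at which uncompressed input is consumed when the output
    // is written at compressedMBps and shrinks to `ratio` of the input.
    static double effectiveInputRate(double compressedMBps, double ratio) {
        return compressedMBps / ratio;
    }

    // Seconds needed to store a file of the given uncompressed size.
    static double secondsToStore(double sizeMB, double inputRateMBps) {
        return sizeMB / inputRateMBps;
    }
}
```

Under these assumptions the parallel implementation consumes input at 8 / 0.25 = 32 MB/s, compared with 3 / 0.25 = 12 MB/s before and 20 MB/s when storing uncompressed.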

comment:3 Changed 7 years ago by Nicklas Nordborg

Resolution: fixed
Status: assigned → closed

(In [7152]) Fixes #2006: Use parallel gzip implementation when compressing files

comment:4 Changed 7 years ago by Nicklas Nordborg

Reopened due to #2016.

comment:5 Changed 7 years ago by Nicklas Nordborg

Resolution: fixed → (none)
Status: closed → reopened

Changed 7 years ago by Nicklas Nordborg

Attachment: pgzip-test.png added

Using 7z to test a file that was created with PGZip

comment:6 Changed 7 years ago by Nicklas Nordborg

Digging up the file from the internal BASE storage and testing it with 7z results in an error (see the attachment pgzip-test.png, "Using 7z to test a file that was created with PGZip").

comment:7 Changed 7 years ago by Nicklas Nordborg

In theory it should be possible to handle the error while reading the file, since we can compare the actual number of bytes that have been read with the known size of the original file. Any errors that happen after that can be ignored. For example, wrapping the GZIPInputStream with something like this seems to work:

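@Override
public int read(byte[] buf, int start, int len)
  throws IOException
{
  try
  {
    return super.read(buf, start, len);
  }
  catch (EOFException ex)
  {
    // Ignore the error only if all bytes of the original file have
    // already been decompressed; otherwise the data is truly incomplete
    if (inf.getBytesWritten() != getSize()) throw ex;
  }
  // Everything has been read; report a normal end of stream
  return -1;
}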

However, personally I am not so happy about this solution, which is more or less a "hack" for working around the problem of creating corrupt files to begin with. I think we should either abandon the PGZip implementation or try to fix the writing of the file.
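For reference, the workaround described in this comment could look roughly like this as a complete class. This is a sketch under assumptions, not the code used in BASE: the class name, the `expectedSize` constructor parameter (standing in for `getSize()`) and the two helper methods are made up for the example. It relies on the protected `inf` field that `GZIPInputStream` inherits from `InflaterInputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: tolerate a premature end of the gzip stream
// when the known uncompressed size has already been produced.
final class SizeCheckedGzipInputStream extends GZIPInputStream {

    private final long expectedSize;

    SizeCheckedGzipInputStream(InputStream in, long expectedSize) throws IOException {
        super(in);
        this.expectedSize = expectedSize;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        try {
            return super.read(buf, off, len);
        } catch (EOFException ex) {
            // Only swallow the error if every original byte was decompressed
            if (inf.getBytesWritten() != expectedSize) throw ex;
            return -1;
        }
    }

    // Helper for the example: gzip a byte array with the standard implementation.
    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data, 0, data.length);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the example: read everything through the size-checked stream.
    static byte[] readAll(byte[] gz, long expectedSize) {
        try (InputStream in = new SizeCheckedGzipInputStream(
                new ByteArrayInputStream(gz), expectedSize)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf, 0, buf.length)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One way to try it is to gzip some data, cut off the last 8 bytes (the CRC and size trailer), and read the result with the correct expected size. Note that the trailer check is skipped in that case, so corruption in a file that still decompresses to the right length would go unnoticed, which is another reason to prefer fixing the writer.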

comment:8 Changed 7 years ago by Nicklas Nordborg

(In [7170]) References #2006 and #2016.

Removed the parallelgzip jar file and added the source files to the BASE core package instead. The intention is to fix the file size problem. The current code is the original code as downloaded from https://github.com/shevek/parallelgzip (version 1.0.1). The code does not compile because it uses non-standard annotations from the "javax.annotation" package.

comment:9 Changed 7 years ago by Nicklas Nordborg

(In [7171]) References #2006 and #2016.

Fixed compilation errors by removing @Nonnull and @Nonnegative annotations.

comment:10 Changed 7 years ago by Nicklas Nordborg

(In [7172]) References #2006 and #2016.

Changed the bytesWritten variable from int to long. This seems to trigger a proper close(). The saved file is 10 bytes larger than before and there are no errors when reading it from BASE or when testing it with 7z.
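The ticket does not say why the wider type helps. One possible explanation, given that the exported files are several GB, is that an int byte counter wraps around once more than 2^31 - 1 bytes have been written. The sketch below only demonstrates that general Java behaviour and the 32-bit size field that the gzip trailer stores. It is not taken from the parallelgzip source, and the class name, method names and chunk size are hypothetical.

```java
// Hypothetical sketch: int versus long byte counters for multi-GB streams.
final class ByteCounterSketch {

    static final int CHUNK = 64 * 1024 * 1024; // 64 MiB per simulated write

    // Counting with an int silently wraps around past Integer.MAX_VALUE.
    static int countWithInt(int chunks) {
        int total = 0;
        for (int i = 0; i < chunks; i++) total += CHUNK;
        return total;
    }

    // Counting with a long keeps the true number of bytes.
    static long countWithLong(int chunks) {
        long total = 0;
        for (int i = 0; i < chunks; i++) total += CHUNK;
        return total;
    }

    // The gzip trailer stores the uncompressed size modulo 2^32.
    static long gzipSizeField(long totalBytes) {
        return totalBytes & 0xffffffffL;
    }
}
```

With 48 chunks (3 GiB) the int counter ends up negative while the long counter holds 3221225472, so any logic that compares the counter against a non-negative size would behave differently between the two.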

comment:11 Changed 7 years ago by Nicklas Nordborg

Resolution: fixed
Status: reopened → closed