rzo1 commented on PR #1952: URL: https://github.com/apache/stormcrawler/pull/1952#issuecomment-4716786147
Thanks for the PR. I see the need for it :) Two things worth addressing before merge: **1. bz2 concatenated streams are silently truncated.** `new BZip2CompressorInputStream(is)` only reads the first stream in a multi-stream bzip2 file. Concatenated bzip2 files are valid and common, for example when seed lists are built with `cat part1.bz2 part2.bz2 > seeds.bz2` or produced by parallel compressors like `pbzip2`/`lbzip2`. With the single-arg constructor, everything after the first stream is dropped silently (no exception, no warning), so the spout emits fewer seeds than the file contains. To reproduce: ```bash printf 'https://a.example\nhttps://b.example\n' | bzip2 > part1.bz2 printf 'https://c.example\nhttps://d.example\n' | bzip2 > part2.bz2 cat part1.bz2 part2.bz2 > seeds.bz2 bzip2 -dc seeds.bz2 # CLI reads all 4 URLs ``` The current code reads only `a` and `b` from that file. Note the asymmetry with the gzip path: `GZIPInputStream` decodes concatenated members automatically, so `.gz` returns all 4 URLs while `.bz2` returns 2 from equivalent input. Suggested fix, enable concatenated stream support: ```java is = new BZip2CompressorInputStream(is, /* decompressConcatenated */ true); ``` Single-stream `.bz2` files (the normal `bzip2 file.txt` case) are unaffected either way. The bug only affects multi-stream files, which are exactly what you tend to get when assembling the large seed lists this PR targets. **2. Missing BufferedInputStream on the compressed path.** `GZIPInputStream` (512-byte default read buffer) and `BZip2CompressorInputStream` both issue many small reads directly against the raw `FileInputStream`, one syscall each. On the uncompressed path this is fine because `InputStreamReader` has its own ~8 KB buffer between the file and the reader, but on the compressed path the decompressor sits in front of that, so there is nothing absorbing the small reads. For large compressed seed files this is a lot of tiny reads. Cheap fix, buffer the file before decompressing: ```java InputStream is = new BufferedInputStream(new FileInputStream(inputPath.toFile())); ``` Not a correctness issue, just I/O efficiency, but worth it given the large-seed-list motivation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
