rzo1 commented on PR #1952:
URL: https://github.com/apache/stormcrawler/pull/1952#issuecomment-4716786147

   Thanks for the PR. I see the need for it :)
   
   Two things worth addressing before merge:
   
   **1. bz2 concatenated streams are silently truncated.**
   
   `new BZip2CompressorInputStream(is)` only reads the first stream in a 
multi-stream bzip2 file. Concatenated bzip2 files are valid and common, for 
example when seed lists are built with `cat part1.bz2 part2.bz2 > seeds.bz2` or 
produced by parallel compressors like `pbzip2`/`lbzip2`. With the single-arg 
constructor, everything after the first stream is dropped silently (no 
exception, no warning), so the spout emits fewer seeds than the file contains.
   
   To reproduce:
   
   ```bash
   printf 'https://a.example\nhttps://b.example\n' | bzip2 > part1.bz2
   printf 'https://c.example\nhttps://d.example\n' | bzip2 > part2.bz2
   cat part1.bz2 part2.bz2 > seeds.bz2
   
   bzip2 -dc seeds.bz2   # CLI reads all 4 URLs
   ```
   
   The current code reads only `a` and `b` from that file.
   
   Note the asymmetry with the gzip path: `GZIPInputStream` decodes 
concatenated members automatically, so `.gz` returns all 4 URLs while `.bz2` 
returns 2 from equivalent input.
   
   Suggested fix, enable concatenated stream support:
   
   ```java
   is = new BZip2CompressorInputStream(is, /* decompressConcatenated */ true);
   ```
   
   Single-stream `.bz2` files (the normal `bzip2 file.txt` case) are unaffected 
either way. The bug only affects multi-stream files, which are exactly what you 
tend to get when assembling the large seed lists this PR targets.
   
   **2. Missing BufferedInputStream on the compressed path.**
   
   `GZIPInputStream` (512-byte default read buffer) and 
`BZip2CompressorInputStream` both issue many small reads directly against the 
raw `FileInputStream`, one syscall each. On the uncompressed path this is fine 
because `InputStreamReader` has its own ~8 KB buffer between the file and the 
reader, but on the compressed path the decompressor sits in front of that, so 
there is nothing absorbing the small reads. For large compressed seed files 
this is a lot of tiny reads.
   
   Cheap fix, buffer the file before decompressing:
   
   ```java
   InputStream is = new BufferedInputStream(new 
FileInputStream(inputPath.toFile()));
   ```
   
   Not a correctness issue, just I/O efficiency, but worth it given the 
large-seed-list motivation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to