Hi Stephan (switching to dev list),

> On Aug 29, 2019, at 2:52 AM, Stephan Ewen <se...@apache.org> wrote:
> 
> That is a good point.
> 
> Which way would you suggest to go? Not relying on the FS block size at all, 
> but using a fixed (configurable) block size?

There’s value in not requiring a fixed block size, since a file that’s moved 
between different file systems can then be read using whatever block size is 
optimal for that system.

Hadoop handles this in sequence files by storing a unique “sync marker” value 
in the file header (essentially a 16-byte UUID), injecting one of these every 
2K bytes or so (in between records), and then readers can scan for it to find 
record boundaries without relying on a block size. The idea is that 2^128 is a 
Big Number, so the odds of finding a false-positive sync marker in the data are 
low enough to be ignorable.
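
If it helps to picture it, here’s a rough sketch of the idea in plain Java. This 
is not Hadoop’s actual SequenceFile format; the marker value, the 2K interval, 
and the length-prefixed record framing are just illustrative.

import java.io.DataOutputStream;
import java.io.IOException;
import java.security.SecureRandom;
import java.util.Arrays;

// Sketch of the sync-marker idea: a random 16-byte marker goes in the header
// and is re-emitted periodically between records, so a reader starting at an
// arbitrary offset can scan forward to the next marker and resume at a record
// boundary, with no dependence on a block size.
public class SyncMarkerSketch {

    static final byte[] SYNC = new byte[16];
    static { new SecureRandom().nextBytes(SYNC); }
    static final int SYNC_INTERVAL = 2000; // roughly every 2K bytes

    static void write(DataOutputStream out, byte[][] records) throws IOException {
        out.write(SYNC);                       // header carries the marker
        long bytesSinceSync = 0;
        for (byte[] record : records) {
            if (bytesSinceSync >= SYNC_INTERVAL) {
                out.write(SYNC);               // marker only goes between records
                bytesSinceSync = 0;
            }
            out.writeInt(record.length);       // simple length-prefixed framing
            out.write(record);
            bytesSinceSync += 4 + record.length;
        }
    }

    // Scan forward from 'start' to the first byte after the next sync marker;
    // a false positive requires 16 bytes of record data matching the marker.
    static int seekToNextRecord(byte[] data, int start) {
        for (int i = start; i + SYNC.length <= data.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + SYNC.length), SYNC)) {
                return i + SYNC.length;
            }
        }
        return -1;                             // no marker left, so no more records
    }
}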

But that’s a bigger change. Simpler would be to put a header in each part file 
being written, with some signature bytes to aid in detecting old-format files.
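
Roughly something like this (a hypothetical header layout, not what 
SerializedOutputFormat writes today; the magic bytes, version, and recorded 
block size are made up for illustration):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical per-file header: a few signature bytes so readers can detect
// old-format files, a version number, and the block size the writer used, so
// the reader no longer has to guess it from the file system default.
public class PartFileHeader {

    static final byte[] MAGIC = {'F', 'L', 'S', 'F'}; // made-up signature
    static final int VERSION = 1;

    static void writeHeader(DataOutputStream out, long blockSize) throws IOException {
        out.write(MAGIC);
        out.writeInt(VERSION);
        out.writeLong(blockSize);
    }

    // Returns the block size recorded by the writer, or -1 if the signature is
    // missing (a real reader would rewind and treat the file as old-format data).
    static long readHeader(DataInputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!java.util.Arrays.equals(magic, MAGIC)) {
            return -1;
        }
        int version = in.readInt(); // only useful once there's more than one version
        return in.readLong();
    }
}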

Or maybe deprecate SerializedOutputFormat/SerializedInputFormat, and provide 
some wrapper glue to make it easier to write/read Hadoop SequenceFiles that 
use a null key and store the POJO as the value. Then you could also leverage 
Hadoop support for compression at either the record or block level.
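
For reference, here’s a minimal sketch of that approach using the stock Hadoop 
SequenceFile API, with NullWritable keys, BytesWritable values, and block 
compression. How the POJO becomes a byte[] is left open; it would be whatever 
serializer the job already uses.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class SequenceFileSketch {

    // Write already-serialized POJOs as values, with a null key and block compression.
    static void writePojos(Configuration conf, Path path, Iterable<byte[]> serializedPojos)
            throws Exception {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(NullWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (byte[] bytes : serializedPojos) {
                writer.append(NullWritable.get(), new BytesWritable(bytes));
            }
        }
    }

    // Read the values back; the file's own sync markers handle record boundaries.
    static void readPojos(Configuration conf, Path path) throws Exception {
        try (SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            NullWritable key = NullWritable.get();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                byte[] bytes = java.util.Arrays.copyOf(value.getBytes(), value.getLength());
                // hand 'bytes' to the same serializer to rebuild the POJO
            }
        }
    }
}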

— Ken

> 
> On Thu, Aug 29, 2019 at 4:49 AM Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Hi all,
> 
> Wondering if anyone else has run into this.
> 
> We write files to S3 using the SerializedOutputFormat<OurCustomPOJO>. When we 
> read them back, sometimes we get deserialization errors where the data seems 
> to be corrupt.
> 
> After a lot of logging, the weathervane of blame pointed towards the block 
> size somehow not being the same between the write (where it’s 64MB) and the 
> read (unknown).
> 
> When I added a call to SerializedInputFormat.setBlockSize(64MB), the problems 
> went away.
> 
> It looks like both input and output formats use fs.getDefaultBlockSize() to 
> set this value by default, so maybe the root issue is S3 somehow reporting 
> different values.
> 
> But it does feel a bit odd that we’re relying on this default setting, versus 
> it being recorded in the file during the write phase.
> 
> And it’s awkward to try to set the block size on the write, as you need to 
> set it in the environment conf, which means it applies to all output files in 
> the job.
> 
> — Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
