Hi Stephan (switching to dev list),

> On Aug 29, 2019, at 2:52 AM, Stephan Ewen <se...@apache.org> wrote:
>
> That is a good point.
>
> Which way would you suggest to go? Not relying on the FS block size at all,
> but using a fixed (configurable) block size?
There’s value to not requiring a fixed block size, as then a file that’s moved between different file systems can be read using whatever block size is optimal for that system.

Hadoop handles this in sequence files by storing a unique “sync marker” value in the file header (essentially a 16-byte UUID), injecting one of these every 2K bytes or so (in between records); code can then scan for this marker to find record boundaries without relying on a block size. The idea is that 2^128 is a Big Number, so the odds of finding a false-positive sync marker in the data are low enough to be ignorable. But that’s a bigger change.

Simpler would be to put a header in each part file being written, with some signature bytes to aid in detecting old-format files.

Or maybe deprecate SerializedOutputFormat/SerializedInputFormat, and provide some wrapper glue to make it easier to write/read Hadoop SequenceFiles that have a null key, and store the POJO as the data value (rough sketch at the end of this email). Then you could also leverage Hadoop’s support for compression at either the record or block level.

— Ken

> On Thu, Aug 29, 2019 at 4:49 AM Ken Krugler <kkrugler_li...@transpac.com> wrote:
>
> Hi all,
>
> Wondering if anyone else has run into this.
>
> We write files to S3 using the SerializedOutputFormat<OurCustomPOJO>. When we
> read them back, sometimes we get deserialization errors where the data seems
> to be corrupt.
>
> After a lot of logging, the weathervane of blame pointed towards the block
> size somehow not being the same between the write (where it’s 64MB) and the
> read (unknown).
>
> When I added a call to SerializedInputFormat.setBlockSize(64MB), the problems
> went away.
>
> It looks like both input and output formats use fs.getDefaultBlockSize() to
> set this value by default, so maybe the root issue is S3 somehow reporting
> different values.
>
> But it does feel a bit odd that we’re relying on this default setting, versus
> it being recorded in the file during the write phase.
>
> And it’s awkward to try to set the block size on the write, as you need to
> set it in the environment conf, which means it applies to all output files in
> the job.
>
> — Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
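P.S. To make the SequenceFile wrapper idea above a bit more concrete, here’s a rough sketch using the stock Hadoop SequenceFile APIs (not actual Flink glue). MyPojo and the Java-serialization helpers are placeholders I made up for the example; a real wrapper would presumably plug in Flink’s own serializers instead:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.SequenceFile.Writer;

    public class PojoSequenceFileSketch {

        // Stand-in POJO; the real thing would be whatever record type the job uses.
        public static class MyPojo implements Serializable {
            public String name;
            public long count;
        }

        // Write: null key, serialized POJO as the value, block-level compression.
        static void write(Configuration conf, Path path, Iterable<MyPojo> pojos) throws IOException {
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    Writer.file(path),
                    Writer.keyClass(NullWritable.class),
                    Writer.valueClass(BytesWritable.class),
                    Writer.compression(CompressionType.BLOCK))) {
                for (MyPojo pojo : pojos) {
                    writer.append(NullWritable.get(), new BytesWritable(serialize(pojo)));
                }
            }
        }

        // Read: the reader relies on the file's own sync markers to find record
        // boundaries, so nothing has to agree on a block size between write and read.
        static void read(Configuration conf, Path path) throws IOException, ClassNotFoundException {
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(path))) {
                NullWritable key = NullWritable.get();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    MyPojo pojo = deserialize(value.copyBytes());
                    // ... hand pojo to downstream processing
                }
            }
        }

        // Plain Java serialization as a stand-in for whatever the wrapper glue settles on.
        static byte[] serialize(MyPojo pojo) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(pojo);
            }
            return bos.toByteArray();
        }

        static MyPojo deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (MyPojo) ois.readObject();
            }
        }
    }

Because the reader scans for the file’s sync markers, splitting and reading don’t depend on any block-size setting, and CompressionType.BLOCK (or RECORD) gives the compression mentioned above.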