Sounds reasonable. I am adding Arvid to the thread - IIRC he authored that tool in his Stratosphere days. And by a stroke of luck, he is now working on Flink again.
@Arvid - what are your thoughts on Ken's suggestions?

On Fri, Aug 30, 2019 at 8:56 PM Ken Krugler <kkrugler_li...@transpac.com> wrote:

> Hi Stephan (switching to dev list),
>
> On Aug 29, 2019, at 2:52 AM, Stephan Ewen <se...@apache.org> wrote:
>
> That is a good point.
>
> Which way would you suggest to go? Not relying on the FS block size at
> all, but using a fixed (configurable) block size?
>
> There’s value to not requiring a fixed block size, as then a file that’s
> moved between different file systems can be read using whatever block size
> is optimal for that system.
>
> Hadoop handles this in sequence files by storing a unique “sync marker”
> value in the file header (essentially a 16 byte UUID), injecting one of
> these every 2K bytes or so (in between records), and then code can scan for
> this to find record boundaries without relying on a block size. The idea is
> that 2^128 is a Big Number, so the odds of finding a false-positive sync
> marker in data are low enough to be ignorable.
>
> But that’s a bigger change. Simpler would be to put a header in each part
> file being written, with some signature bytes to aid in detecting
> old-format files.
>
> Or maybe deprecate SerializedOutputFormat/SerializedInputFormat, and
> provide some wrapper glue to make it easier to write/read Hadoop
> SequenceFiles that have a null key value, and store the POJO as the data
> value. Then you could also leverage Hadoop support for compression at
> either record or block level.
>
> — Ken
>
>
> On Thu, Aug 29, 2019 at 4:49 AM Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
>
>> Hi all,
>>
>> Wondering if anyone else has run into this.
>>
>> We write files to S3 using the SerializedOutputFormat<OurCustomPOJO>.
>> When we read them back, sometimes we get deserialization errors where the
>> data seems to be corrupt.
>>
>> After a lot of logging, the weathervane of blame pointed towards the
>> block size somehow not being the same between the write (where it’s 64MB)
>> and the read (unknown).
>>
>> When I added a call to SerializedInputFormat.setBlockSize(64MB), the
>> problems went away.
>>
>> It looks like both input and output formats use fs.getDefaultBlockSize()
>> to set this value by default, so maybe the root issue is S3 somehow
>> reporting different values.
>>
>> But it does feel a bit odd that we’re relying on this default setting,
>> versus it being recorded in the file during the write phase.
>>
>> And it’s awkward to try to set the block size on the write, as you need
>> to set it in the environment conf, which means it applies to all output
>> files in the job.
>>
>> — Ken
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
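P.S. For anyone else hitting this before we change anything, a minimal sketch of the read-side workaround Ken describes - pinning the block size on the SerializedInputFormat instead of relying on fs.getDefaultBlockSize(). OurCustomPOJO (assumed to implement IOReadableWritable) and the S3 path are placeholders:

import org.apache.flink.api.common.io.SerializedInputFormat;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.Path;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

SerializedInputFormat<OurCustomPOJO> in = new SerializedInputFormat<>();
in.setFilePath(new Path("s3://bucket/output/"));   // illustrative path
// Must match the block size that was in effect when the files were written
// (64MB in Ken's case), instead of whatever the reading FS reports.
in.setBlockSize(64 * 1024 * 1024L);

DataSet<OurCustomPOJO> data =
    env.createInput(in, TypeInformation.of(OurCustomPOJO.class));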
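And Ken's SequenceFile suggestion would look roughly like this on the plain Hadoop side - NullWritable keys, the serialized POJO as a BytesWritable value, with the file's built-in sync markers handling record boundaries independent of any block size. serialize()/deserialize() and 'records' are placeholders for whatever POJO serialization we would pick:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

Configuration conf = new Configuration();
Path path = new Path("s3a://bucket/part-00001.seq");   // illustrative path

// Write: null key, serialized POJO as the value, block-level compression.
try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(NullWritable.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
    for (OurCustomPOJO pojo : records) {
        writer.append(NullWritable.get(), new BytesWritable(serialize(pojo)));
    }
}

// Read: the embedded sync markers let the reader find record boundaries,
// so no block-size assumption is needed.
try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(path))) {
    BytesWritable value = new BytesWritable();
    while (reader.next(NullWritable.get(), value)) {
        OurCustomPOJO pojo = deserialize(value.copyBytes());
        // ...
    }
}

The wrapper glue on the Flink side would then mostly be about mapping the POJO to and from bytes; record- or block-level compression comes along for free from the SequenceFile settings, as Ken points out.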