Sounds reasonable.

I am adding Arvid to the thread - IIRC he authored that tool in his
Stratosphere days. And by a stroke of luck, he is now working on Flink
again.

@Arvid - what are your thoughts on Ken's suggestions?

On Fri, Aug 30, 2019 at 8:56 PM Ken Krugler <kkrugler_li...@transpac.com>
wrote:

> Hi Stephan (switching to dev list),
>
> On Aug 29, 2019, at 2:52 AM, Stephan Ewen <se...@apache.org> wrote:
>
> That is a good point.
>
> Which way would you suggest to go? Not relying on the FS block size at
> all, but using a fix (configurable) block size?
>
>
> There’s value to not requiring a fixed block size, as then a file that’s
> moved between different file systems can be read using whatever block size
> is optimal for that system.
>
> Hadoop handles this in sequence files by storing a unique “sync marker”
> value in the file header (essentially a 16 byte UUID), injecting one of
> these every 2K bytes or so (in between records), and then code can scan for
> this to find record boundaries without relying on a block size. The idea is
> that 2^128 is a Big Number, so the odds of finding a false-positive sync
> marker in the data are low enough to be ignorable.
>
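As a point of reference, realigning on those sync markers looks roughly
like this with Hadoop's SequenceFile.Reader (untested sketch; it assumes
NullWritable keys and BytesWritable values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

// Read an arbitrary byte range [start, end) of a SequenceFile by first
// seeking to the next sync marker after 'start' - no block size needed.
public class SyncMarkerScan {

  public static void readSplit(Path file, long start, long end) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(file))) {
      reader.sync(start);  // jump to the first record boundary after 'start'
      NullWritable key = NullWritable.get();
      BytesWritable value = new BytesWritable();
      while (reader.getPosition() < end && reader.next(key, value)) {
        // value.copyBytes() holds one serialized record
      }
    }
  }
}
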
> But that’s a bigger change. Simpler would be to put a header in each part
> file being written, with some signature bytes to aid in detecting
> old-format files.
>
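Just to sketch what that could look like (purely hypothetical format,
nothing that exists today): a few magic bytes plus the write-side block
size at the head of each part file, so a reader can detect old-format
files and never has to guess the block size.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical part-file header: magic bytes + a version + the block size
// that was used on the write side.
public final class PartFileHeader {
  private static final int MAGIC = 0x53455246;  // arbitrary "SERF" marker
  private static final int VERSION = 1;

  public static void write(OutputStream out, long blockSize) throws IOException {
    DataOutputStream dos = new DataOutputStream(out);
    dos.writeInt(MAGIC);
    dos.writeInt(VERSION);
    dos.writeLong(blockSize);
  }

  // Returns the recorded block size, or -1 for an old-format file without a header.
  public static long read(InputStream in) throws IOException {
    DataInputStream dis = new DataInputStream(in);
    if (dis.readInt() != MAGIC) {
      return -1;  // caller should reopen the stream and fall back to the old read path
    }
    dis.readInt();  // version, unused for now
    return dis.readLong();
  }
}
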
> Or maybe deprecate SerializedOutputFormat/SerializedInputFormat, and
> provide some wrapper glue to make it easier to write/read Hadoop
> SequenceFiles that have a null key value, and store the POJO as the data
> value. Then you could also leverage Hadoop support for compression at
> either record or block level.
>
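To make the wrapper idea concrete, the write side might look something
like this (hypothetical sketch, not an existing Flink API; it assumes the
POJO has already been serialized to a byte[]):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

// Write already-serialized POJO bytes as SequenceFile values, with a
// NullWritable key and block-level compression.
public class PojoSequenceFileWriter implements AutoCloseable {

  private final SequenceFile.Writer writer;

  public PojoSequenceFileWriter(Configuration conf, Path path) throws Exception {
    writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(NullWritable.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
  }

  public void write(byte[] serializedPojo) throws Exception {
    writer.append(NullWritable.get(), new BytesWritable(serializedPojo));
  }

  @Override
  public void close() throws Exception {
    writer.close();
  }
}
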
> — Ken
>
>
> On Thu, Aug 29, 2019 at 4:49 AM Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
>
>> Hi all,
>>
>> Wondering if anyone else has run into this.
>>
>> We write files to S3 using the SerializedOutputFormat<OurCustomPOJO>.
>> When we read them back, sometimes we get deserialization errors where the
>> data seems to be corrupt.
>>
>> After a lot of logging, the weathervane of blame pointed towards the
>> block size somehow not being the same between the write (where it’s 64MB)
>> and the read (unknown).
>>
>> When I added a call to SerializedInputFormat.setBlockSize(64MB), the
>> problems went away.
>>
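For anyone else who hits this, the read-side workaround looks roughly
like the following (untested sketch against the DataSet API; the S3 path
is a placeholder and OurCustomPOJO stands for the user's
IOReadableWritable type):

import org.apache.flink.api.common.io.SerializedInputFormat;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadSerializedFiles {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Pin the read-side block size to the value used when the files were
    // written, instead of relying on fs.getDefaultBlockSize().
    SerializedInputFormat<OurCustomPOJO> format = new SerializedInputFormat<>();
    format.setFilePath("s3://bucket/path/");   // placeholder path
    format.setBlockSize(64 * 1024 * 1024L);    // must match the write-side block size

    DataSet<OurCustomPOJO> records =
        env.createInput(format, TypeInformation.of(OurCustomPOJO.class));
    records.print();
  }
}
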
>> It looks like both input and output formats use fs.getDefaultBlockSize()
>> to set this value by default, so maybe the root issue is S3 somehow
>> reporting different values.
>>
>> But it does feel a bit odd that we’re relying on this default setting,
>> versus it being recorded in the file during the write phase.
>>
>> And it’s awkward to try to set the block size on the write, as you need
>> to set it in the environment conf, which means it applies to all output
>> files in the job.
>>
>> — Ken
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>
>
