Re: Potential block size issue with S3 binary files

2019-09-04 Thread Arvid Heise
Hi Ken, as far as I understood, you are using the format to overcome some shortcomings in Flink. There is no need to even look at the data or even to create it if the join would work decently. If so, then it would make sense to keep the format, as I expect similar issues to always appear and pro

Re: Potential block size issue with S3 binary files

2019-09-03 Thread Ken Krugler
Hi Arvid, Thanks for following up… > On Sep 2, 2019, at 3:09 AM, Arvid Heise wrote: > > Hi Ken, > > that's indeed a very odd issue that you found. I had a hard time to connect > block size with S3 in the beginning and had to dig into the code. I still > cannot fully understand why you got two

Re: Potential block size issue with S3 binary files

2019-09-02 Thread Arvid Heise
Hi Ken, that's indeed a very odd issue that you found. I had a hard time to connect block size with S3 in the beginning and had to dig into the code. I still cannot fully understand why you got two different block size values from the S3 FileSystem. Looking into Hadoop code, I found the following s

Re: Potential block size issue with S3 binary files

2019-09-01 Thread Stephan Ewen
Sounds reasonable. I am adding Arvid to the thread - IIRC he authored that tool in his Stratosphere days. And by a stroke of luck, he is now working on Flink again. @Arvid - what are your thoughts on Ken's suggestions? On Fri, Aug 30, 2019 at 8:56 PM Ken Krugler wrote: > Hi Stephan (switching

Re: Potential block size issue with S3 binary files

2019-08-30 Thread Ken Krugler
Hi Stephan (switching to dev list), > On Aug 29, 2019, at 2:52 AM, Stephan Ewen wrote: > > That is a good point. > > Which way would you suggest to go? Not relying on the FS block size at all, > but using a fixed (configurable) block size? There’s value to not requiring a fixed block size, as t