Re: Streaming to Parquet Files in HDFS

2018-10-11 Thread Averell
Hi Kostas, Thanks for the info. That error caused by I built your code along with not up-to-date baseline. I rebased my branch build, and there's no more such issue. I've been testing, and until now have some questions/issues as below: 1. I'm not able to write to S3 with the following URI format:

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Kostas Kloudas
Hi, Yes, please enable DEBUG to streaming to see all the logs also from the StreamTask. A checkpoint is “valid” as soon as it get acknowledged. As the documentation says, the job will restart from “ the last **successful** checkpoint” which is the most recent acknowledged one. Cheers, Kostas

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Averell
Hi Kostas, Yes, I set the level to DEBUG, but for the /org.apache.flink.streaming.api.functions.sink.filesystem.bucket/ only. Will try to enable for /org.apache.flink.streaming/. I just found one (possibly) issue with my build is that I had not used the latest master branch when merging with your

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Kostas Kloudas
Hi, I just saw that you have already set the level to DEBUG. These are all your DEBUG logs of the TM when running on YARN? Also did you try to wait a bit more to see if the acknowledgements of the checkpoints arrive a bit later? Checkpoints and acknowledgments are not necessarily aligned. Kost

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Kostas Kloudas
Hi Averell, Could you set your logging to DEBUG? This may shed some light on what is happening as it will contain more logs. Kostas > On Oct 7, 2018, at 11:03 AM, Averell wrote: > > Hi Kostas, > > I'm using a build with your PR. However, it seemed the issue is not with S3, > as when I tried t

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Averell
Hi Kostas, I'm using a build with your PR. However, it seemed the issue is not with S3, as when I tried to write to local file system (file:///, not HDFS), I also got the same problem - only the first part published. All remaining parts were in inprogress and had names prefixed with "." >From Fli

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Kostas Kloudas
Hi Averell, From the logs, only checkpoint 2 was acknowledged (search for “eceived completion notification for checkpoint with id=“) and this is why no more files are finalized. So only checkpoint 2 was successfully completed. BTW you are using the PR you mentioned before or Flink 1.6? I am as

Re: Streaming to Parquet Files in HDFS

2018-10-06 Thread Averell
Hi Kostas, Please help ignore my previous email about the issue with security. It seems to I had mixed version of shaded and unshaded jars. However, I'm now facing another issue with writing parquet files: only the first part is closed. All the subsequent parts are kept in in-progress state forev

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Averell
Hi Kostas, I tried your PR - trying to write to S3 from Flink running on AWS, and I got the following error. I copied the three jar files flink-hadoop-fs-1.7-SNAPSHOT.jar, flink-s3-fs-base-1.7-SNAPSHOT.jar, flink-s3-fs-hadoop-1.7-SNAPSHOT.jar to lib/ directory. Do I need to make any change to HADO

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Averell
What a great news. Thanks for that, Kostas. Regards, Averell -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Kostas Kloudas
Hi Averell, There is no such “out-of-the-box” solution, but there is an open PR for adding S3 support to the StreamingFileSink [1]. Cheers, Kostas [1] https://github.com/apache/flink/pull/6795 > On Oct 5, 2018, at 11:14 AM, Averell wrote: > > Hi K

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Averell
Hi Kostas, Thanks for the info. Just one more question regarding writing parquet. I need to write my stream as parquet to S3. As per this ticket https://issues.apache.org/jira/browse/FLINK-9752 , it is now not supported. Is there any ready-to-us

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Kostas Kloudas
Hi Averell, You are right that for Bulk Formats like Parquet, we roll on every checkpoint. This is currently a limitation that has to do with the fact that bulk formats gather and rely on metadata that they keep internally and which we cannot checkpoint in Flink,as they do not expose them. Set

Re: Streaming to Parquet Files in HDFS

2018-10-04 Thread Averell
Hi Fabian, Kostas, >From the description of this ticket https://issues.apache.org/jira/browse/FLINK-9753, I understand that now my output parquet file with StreamingFileSink will span multiple checkpoints. However, when I tried (as in the here below code snippet) I still see that one "part-X-X" fi

Re: Streaming to Parquet Files in HDFS

2018-10-01 Thread Biswajit Das
Nice to see this finally! On Mon, Oct 1, 2018 at 1:53 AM Fabian Hueske wrote: > Hi Bill, > > Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the > previously mentioned StreamingFileSink [1], [2]. > > Best, Fabian > > [1] https://issues.apache.org/jira/browse/FLINK-9753 > [

Re: Streaming to Parquet Files in HDFS

2018-10-01 Thread Fabian Hueske
Hi Bill, Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the previously mentioned StreamingFileSink [1], [2]. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-9753 [2] https://issues.apache.org/jira/browse/FLINK-9750 Am Fr., 28. Sep. 2018 um 23:36 Uhr schrieb

Re: Streaming to Parquet Files in HDFS

2018-09-28 Thread hao gao
Hi Bill, I wrote those two medium posts you mentioned above. But clearly, the techlab one is much better I would suggest just "close the file when checkpointing" which is the easiest way. If you use BucketingSink, you can modify the code to make it work. Just replace the code from line 691 to 693

Streaming to Parquet Files in HDFS

2018-09-28 Thread William Speirs
I'm trying to stream log messages (syslog fed into Kafak) into Parquet files on HDFS via Flink. I'm able to read, parse, and construct objects for my messages in Flink; however, writing to Parquet is tripping me up. I do *not* need to have this be real-time; a delay of a few minutes, even up to an