Hi Bill,

Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the
previously mentioned StreamingFileSink [1], [2].
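In rough outline it looks like this (an untested sketch; LogEvent stands in
for your Avro-generated SpecificRecord class, and the paths and interval are
made up):

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Bulk formats like Parquet roll their part files on checkpoint, so
// checkpointing must be enabled for files to ever be finalized.
env.enableCheckpointing(60_000);

// LogEvent is a placeholder for your Avro-generated SpecificRecord class.
DataStream<LogEvent> events = ...; // e.g. parsed syslog records from Kafka

StreamingFileSink<LogEvent> sink = StreamingFileSink
    .forBulkFormat(
        new Path("hdfs:///data/syslog"),
        ParquetAvroWriters.forSpecificRecord(LogEvent.class))
    .build();

events.addSink(sink);

Since part files are only completed when a checkpoint completes, the
checkpoint interval also controls how quickly data becomes visible to
readers, which fits the "a few minutes is fine" latency budget you describe
below.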
Best,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9753
[2] https://issues.apache.org/jira/browse/FLINK-9750

On Fri, Sep 28, 2018 at 11:36 PM hao gao <hao.x....@gmail.com> wrote:

> Hi Bill,
>
> I wrote the two Medium posts you mentioned above, but clearly the techlab
> one is much better. I would suggest just "close the file when
> checkpointing", which is the easiest way. If you use BucketingSink, you
> can modify the code to make it work: just replace the code from line 691
> to 693 with closeCurrentPartFile()
>
> https://github.com/apache/flink/blob/release-1.3.2-rc1/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L691
>
> This should guarantee exactly-once. You may be left with some files with
> an underscore prefix when the Flink job fails, but those files are
> usually ignored by query engines/readers, for example Presto.
>
> If you use 1.6 or later, I think the issue is already addressed:
> https://issues.apache.org/jira/browse/FLINK-9750
>
> Thanks
> Hao
>
> On Fri, Sep 28, 2018 at 1:57 PM William Speirs <wspe...@apache.org> wrote:
>
>> I'm trying to stream log messages (syslog fed into Kafka) into Parquet
>> files on HDFS via Flink. I'm able to read, parse, and construct objects
>> for my messages in Flink; however, writing to Parquet is tripping me up.
>> I do *not* need this to be real-time; a delay of a few minutes, even up
>> to an hour, is fine.
>>
>> I've found the following articles describing this as very difficult:
>> * https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
>> * https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
>> * https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
>> * http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html
>>
>> All of these posts describe trouble reconciling Flink's checkpointing
>> mechanism with Parquet's need to perform batch writes. I'm not
>> experienced enough with Flink's checkpointing or Parquet's file format
>> to completely understand the issue. So my questions are as follows:
>>
>> 1) Is this possible in Flink in an exactly-once way? If not, is it
>> possible in a way that _might_ cause duplicates during an error?
>>
>> 2) Is there another/better format to use other than Parquet that offers
>> compression and the ability to be queried by something like Drill or
>> Impala?
>>
>> 3) Any further recommendations for solving the overall problem:
>> ingesting syslogs and writing them to file(s) that are searchable by an
>> SQL(-like) framework?
>>
>> Thanks!
>>
>> Bill-
>
>
> --
> Thanks
> - Hao
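P.S.: For anyone stuck on a pre-1.6 version, the "close the file when
checkpointing" idea hao describes above can also be sketched as a standalone
sink instead of patching BucketingSink. This is a rough, untested sketch:
the class name and file-naming scheme are made up, it writes plain text to
the local filesystem for brevity (real code would use Hadoop's FileSystem
API and a Parquet writer), and error handling is omitted:

import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class CheckpointRollingSink extends RichSinkFunction<String>
        implements CheckpointedFunction {

    private final String baseDir;
    private transient java.io.BufferedWriter writer;
    private transient java.nio.file.Path inProgressFile;

    public CheckpointRollingSink(String baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public void invoke(String value) throws Exception {
        if (writer == null) {
            // The underscore prefix marks the file as in-progress; query
            // engines such as Presto skip underscore-prefixed files.
            inProgressFile = java.nio.file.Paths.get(
                    baseDir, "_part-" + System.nanoTime());
            writer = java.nio.file.Files.newBufferedWriter(inProgressFile);
        }
        writer.write(value);
        writer.newLine();
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        if (writer != null) {
            // Close and publish the current part file on every checkpoint
            // by dropping the underscore prefix. After a crash, replayed
            // records land in a fresh file, and the stale in-progress file
            // keeps its underscore and stays invisible to readers.
            writer.close();
            java.nio.file.Files.move(
                    inProgressFile,
                    inProgressFile.resolveSibling(
                            inProgressFile.getFileName().toString().substring(1)));
            writer = null;
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        java.nio.file.Files.createDirectories(java.nio.file.Paths.get(baseDir));
    }

    @Override
    public void close() throws Exception {
        if (writer != null) {
            writer.close();
        }
    }
}

The trade-off is one (possibly small) file per subtask per checkpoint, so a
longer checkpoint interval keeps the file count manageable.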