Hi Bill,

Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the
previously mentioned StreamingFileSink [1], [2].
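In rough outline it looks like this (an untested sketch; LogEvent stands in
for your Avro-generated SpecificRecord class, and the paths and interval are
made up):

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Bulk formats like Parquet roll their part files on checkpoint, so
// checkpointing must be enabled for files to ever be finalized.
env.enableCheckpointing(60_000);

// LogEvent is a placeholder for your Avro-generated SpecificRecord class.
DataStream<LogEvent> events = ...; // e.g. parsed syslog records from Kafka

StreamingFileSink<LogEvent> sink = StreamingFileSink
    .forBulkFormat(
        new Path("hdfs:///data/syslog"),
        ParquetAvroWriters.forSpecificRecord(LogEvent.class))
    .build();

events.addSink(sink);

Since part files are only completed when a checkpoint completes, the
checkpoint interval also controls how quickly data becomes visible to
readers, which fits the "a few minutes is fine" latency budget you describe
below.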
Best,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9753
[2] https://issues.apache.org/jira/browse/FLINK-9750

On Fri, Sep 28, 2018 at 11:36 PM hao gao <hao.x....@gmail.com> wrote:

> Hi Bill,
>
> I wrote the two Medium posts you mentioned above, but clearly the techlab
> one is much better. I would suggest just "close the file when
> checkpointing", which is the easiest way. If you use BucketingSink, you
> can modify the code to make it work: just replace the code from line 691
> to 693 with closeCurrentPartFile()
>
> https://github.com/apache/flink/blob/release-1.3.2-rc1/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L691
>
> This should guarantee exactly-once. You may be left with some files with
> an underscore prefix when the Flink job fails, but those files are
> usually ignored by query engines/readers, for example Presto.
>
> If you use 1.6 or later, I think the issue is already addressed:
> https://issues.apache.org/jira/browse/FLINK-9750
>
> Thanks
> Hao
>
> On Fri, Sep 28, 2018 at 1:57 PM William Speirs <wspe...@apache.org> wrote:
>
>> I'm trying to stream log messages (syslog fed into Kafka) into Parquet
>> files on HDFS via Flink. I'm able to read, parse, and construct objects
>> for my messages in Flink; however, writing to Parquet is tripping me up.
>> I do *not* need this to be real-time; a delay of a few minutes, even up
>> to an hour, is fine.
>>
>> I've found the following articles describing this as very difficult:
>> * https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
>> * https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
>> * https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
>> * http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html
>>
>> All of these posts describe trouble reconciling Flink's checkpointing
>> mechanism with Parquet's need to perform batch writes. I'm not
>> experienced enough with Flink's checkpointing or Parquet's file format
>> to completely understand the issue. So my questions are as follows:
>>
>> 1) Is this possible in Flink in an exactly-once way? If not, is it
>> possible in a way that _might_ cause duplicates during an error?
>>
>> 2) Is there another/better format to use other than Parquet that offers
>> compression and the ability to be queried by something like Drill or
>> Impala?
>>
>> 3) Any further recommendations for solving the overall problem:
>> ingesting syslogs and writing them to file(s) that are searchable by an
>> SQL(-like) framework?
>>
>> Thanks!
>>
>> Bill-
>
>
> --
> Thanks
> - Hao
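P.S.: For anyone stuck on a pre-1.6 version, the "close the file when
checkpointing" idea hao describes above can also be sketched as a standalone
sink instead of patching BucketingSink. This is a rough, untested sketch:
the class name and file-naming scheme are made up, it writes plain text to
the local filesystem for brevity (real code would use Hadoop's FileSystem
API and a Parquet writer), and error handling is omitted:

import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class CheckpointRollingSink extends RichSinkFunction<String>
        implements CheckpointedFunction {

    private final String baseDir;
    private transient java.io.BufferedWriter writer;
    private transient java.nio.file.Path inProgressFile;

    public CheckpointRollingSink(String baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public void invoke(String value) throws Exception {
        if (writer == null) {
            // The underscore prefix marks the file as in-progress; query
            // engines such as Presto skip underscore-prefixed files.
            inProgressFile = java.nio.file.Paths.get(
                    baseDir, "_part-" + System.nanoTime());
            writer = java.nio.file.Files.newBufferedWriter(inProgressFile);
        }
        writer.write(value);
        writer.newLine();
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        if (writer != null) {
            // Close and publish the current part file on every checkpoint
            // by dropping the underscore prefix. After a crash, replayed
            // records land in a fresh file, and the stale in-progress file
            // keeps its underscore and stays invisible to readers.
            writer.close();
            java.nio.file.Files.move(
                    inProgressFile,
                    inProgressFile.resolveSibling(
                            inProgressFile.getFileName().toString().substring(1)));
            writer = null;
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        java.nio.file.Files.createDirectories(java.nio.file.Paths.get(baseDir));
    }

    @Override
    public void close() throws Exception {
        if (writer != null) {
            writer.close();
        }
    }
}

The trade-off is one (possibly small) file per subtask per checkpoint, so a
longer checkpoint interval keeps the file count manageable.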