I'm not sure if I got your question right.

Do you want to know if it is possible to implement a Flink program that
reads several files and writes their data into a Parquet format?
Or are you asking how such a job could be scheduled for execution based on
some external event (such as a file appearing)?

Both should be possible.

The job would be a simple pipeline: a source that reads the input files,
optional transformations depending on the required logic, and a Parquet data
sink.
The job execution can be triggered from outside of Flink, for example by a
monitoring process or a cron job that calls the CLI client with the right
parameters.
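
To give you a rough idea, an (untested) sketch of such a job with the DataSet
API could look like the following. The schema, paths, and class names are just
placeholders, and the Parquet classes (parquet-avro's AvroParquetOutputFormat
wrapped in Flink's HadoopOutputFormat) may live in different packages depending
on the Parquet version you use:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import parquet.avro.AvroParquetOutputFormat; // org.apache.parquet.avro in newer versions

public class CompactToParquet {

  // Placeholder Avro schema with a single string field; replace with your own record schema.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Line\",\"fields\":"
      + "[{\"name\":\"value\",\"type\":\"string\"}]}";

  public static void main(String[] args) throws Exception {
    final String inputDir = args[0];   // e.g. hdfs:///bursts/2015-05-22/
    final String outputDir = args[1];  // e.g. hdfs:///compacted/2015-05-22/

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // readTextFile on a directory reads all files it contains;
    // the map function turns every line into an Avro GenericRecord.
    DataSet<Tuple2<Void, GenericRecord>> records =
        env.readTextFile(inputDir).map(new ToRecord(SCHEMA_JSON));

    // Parquet sink: parquet-avro's Hadoop output format wrapped in Flink's
    // HadoopOutputFormat (raw type because the generics differ between Parquet versions).
    Job job = Job.getInstance();
    AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(SCHEMA_JSON));
    FileOutputFormat.setOutputPath(job, new Path(outputDir));
    @SuppressWarnings({"rawtypes", "unchecked"})
    HadoopOutputFormat<Void, GenericRecord> parquetFormat =
        new HadoopOutputFormat(new AvroParquetOutputFormat(), job);

    records.output(parquetFormat);
    env.execute("Compact burst to Parquet");
  }

  // Parses the schema lazily in open() because Avro's Schema is not serializable.
  public static class ToRecord extends RichMapFunction<String, Tuple2<Void, GenericRecord>> {
    private final String schemaJson;
    private transient Schema schema;

    public ToRecord(String schemaJson) { this.schemaJson = schemaJson; }

    @Override
    public void open(Configuration parameters) {
      schema = new Schema.Parser().parse(schemaJson);
    }

    @Override
    public Tuple2<Void, GenericRecord> map(String line) {
      GenericRecord r = new GenericData.Record(schema);
      r.put("value", line);
      return new Tuple2<Void, GenericRecord>(null, r);
    }
  }
}

Once packaged, the cron job (or the process that wrote the burst) would just
call something like "bin/flink run -c CompactToParquet compact.jar <inputDir>
<outputDir>" (class and jar names made up here).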

Best, Fabian



2015-05-22 14:55 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> Hi to all,
>
> in my use case I have bursts of data to store into HDFS and, once a burst is
> finished, compact them into a single directory (as Parquet). From what I know,
> the current approach is to use Flume, which automatically ingests data and
> compacts it based on some configurable policy.
> However, I'd like to avoid adding Flume to my architecture because these
> bursts are not long-lived processes, so I just want to write each batch of
> rows as a single file in some directory and, once the process finishes, read
> all of them and compact them into a single output directory as Parquet.
> It's something similar to a streaming process, but (for the moment) I'd like
> to avoid having a long-lived Flink process listening for incoming data.
>
> Do you have any suggestions for such a process, or is there an example in the
> Flink code?
>
>
> Best,
> Flavio
>
