Hi,

I know Jingsong worked on Flink/Hive filesystem integration in the
Table/SQL API. Maybe he can shed some light on your questions.

Best,

Dawid

On 02/03/2021 21:03, Theo Diefenthal wrote:
> Hi there,
>
> Currently, I have a Flink 1.11 job which writes Parquet files via the
> StreamingFileSink to HDFS (simply using the DataStream API). I commit
> roughly every 3 minutes and thus end up with many small files in HDFS.
> Downstream, the generated table is consumed by Spark jobs and Impala
> queries. HDFS doesn't cope well with too many small files, and writing
> Parquet with low latency while still wanting large files is a rather
> common problem; solutions were suggested recently on the mailing list
> [1] and in Flink Forward talks [2]. Cloudera also posted two possible
> approaches in their blog posts [3], [4]. Mostly, it comes down to
> asynchronously compacting the many small files into larger ones,
> ideally non-blocking and in an occasionally running batch job.
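>
> For reference, a simplified sketch of what the sink roughly looks like
> (the event type, path and bucketing are placeholders; the real job uses
> a custom event-time bucket assigner):
>
> import org.apache.flink.core.fs.Path;
> import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
> import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;
> import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;
>
> public class ParquetSinkSketch {
>
>     // Placeholder POJO standing in for the real event type.
>     public static class MyEvent {
>         public String id;
>         public long eventTime;
>     }
>
>     public static void attachSink(DataStream<MyEvent> events) {
>         // Bulk formats like Parquet can only roll on checkpoint, so the
>         // ~3 minute commit interval is simply the checkpoint interval.
>         StreamExecutionEnvironment env = events.getExecutionEnvironment();
>         env.enableCheckpointing(3 * 60 * 1000L);
>
>         StreamingFileSink<MyEvent> sink = StreamingFileSink
>             .forBulkFormat(
>                 new Path("hdfs:///data/events"),                    // placeholder path
>                 ParquetAvroWriters.forReflectRecord(MyEvent.class)) // POJO -> Parquet via Avro reflection
>             .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd")) // stand-in for the event-time assigner
>             .withRollingPolicy(OnCheckpointRollingPolicy.build())
>             .build();
>
>         events.addSink(sink);
>     }
> }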
>
> I am now about to implement something like what is suggested in the
> Cloudera blog [4], but from Parquet to Parquet. To me, it doesn't seem
> straightforward but rather involved, especially as my data is
> partitioned by event time and I need the compaction to be non-blocking
> (my users query Impala and expect near-real-time performance for each
> query). When starting to work on that, I noticed that Hive already
> includes a compaction mechanism and that the Flink community has put a
> lot of work into Hive integration in the latest releases. Some of my
> questions are not directly related to Flink, but I guess many of you
> also have experience with Hive, and writing from Flink to Hive is
> rather common nowadays.
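>
> To illustrate what I mean by Hive's built-in compaction: as far as I
> understand the Hive docs, a major compaction of a single partition of a
> transactional table can be requested explicitly, e.g. from a scheduled
> job via HiveServer2 JDBC (untested sketch on my side; host, database and
> table names are placeholders):
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.Statement;
>
> public class TriggerHiveCompaction {
>     public static void main(String[] args) throws Exception {
>         // Needs the hive-jdbc driver on the classpath; placeholder connection URL.
>         Class.forName("org.apache.hive.jdbc.HiveDriver");
>         try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
>              Statement stmt = conn.createStatement()) {
>             // Enqueue a major compaction for one (event-time) partition of an ACID
>             // table; the compaction itself runs asynchronously in Hive's compactor.
>             stmt.execute("ALTER TABLE events PARTITION (dt='2021-03-02') COMPACT 'major'");
>         }
>     }
> }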
>
> I read online that Spark should integrate nicely with Hive tables,
> i.e. querying a Hive table performs the same as querying the HDFS
> files directly [5]. We also all know that Impala integrates nicely
> with Hive, so overall I'd expect that writing to Hive-managed
> (internal) tables instead of plain HDFS Parquet has no disadvantages
> for me.
>
> My questions:
> 1. Can I use Flink to "streaming write" to Hive? (A sketch of my
> current understanding from the docs follows after the questions.)
> 2. For compaction, I need "transactional tables", and according to the
> Hive docs, transactional tables must be fully managed by Hive (i.e.,
> they are not external). Does Flink support writing to those out of the
> box? (I only have Hive 2 available.)
> 3. Does Flink use the "Hive Streaming Data Ingest" APIs?
> 4. Do you see any downsides in writing to Hive compared to writing
> Parquet directly? (Especially in my use case with only Impala and
> Spark consumers.)
> 5. Not Flink related: Have you ever experienced performance issues
> when using Hive transactional tables compared to writing Parquet
> directly? I guess there must be a reason why "transactional" is off by
> default in Hive. I won't use any features except compaction, i.e.
> there are only streaming inserts, no updates, no deletes. (Deletes
> only happen after a given retention period and always drop entire
> partitions.)
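>
> Regarding question 1, for what it's worth: from the Flink 1.11 docs on
> streaming writes into Hive tables, my current understanding is that it
> would look roughly like the sketch below (untested on my end; catalog
> name, hive conf dir, table names and the source table are placeholders,
> and as far as I can tell this targets a regular, non-ACID Hive table):
>
> import org.apache.flink.table.api.EnvironmentSettings;
> import org.apache.flink.table.api.SqlDialect;
> import org.apache.flink.table.api.TableEnvironment;
> import org.apache.flink.table.catalog.hive.HiveCatalog;
>
> public class HiveStreamingWriteSketch {
>     public static void main(String[] args) {
>         TableEnvironment tEnv = TableEnvironment.create(
>             EnvironmentSettings.newInstance().inStreamingMode().build());
>
>         // Placeholder catalog name, default database and hive-site.xml directory.
>         HiveCatalog hive = new HiveCatalog("myhive", "default", "/etc/hive/conf");
>         tEnv.registerCatalog("myhive", hive);
>         tEnv.useCatalog("myhive");
>
>         // Create the target table with the Hive dialect; the partition commit
>         // options are taken from the Flink filesystem/Hive streaming sink docs.
>         tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
>         tEnv.executeSql(
>             "CREATE TABLE IF NOT EXISTS events_hive (id STRING, payload STRING) " +
>             "PARTITIONED BY (dt STRING) STORED AS PARQUET TBLPROPERTIES (" +
>             " 'partition.time-extractor.timestamp-pattern'='$dt 00:00:00'," +
>             " 'sink.partition-commit.trigger'='partition-time'," +
>             " 'sink.partition-commit.delay'='1 h'," +
>             " 'sink.partition-commit.policy.kind'='metastore,success-file')");
>
>         // Continuously insert from some streaming source table (placeholder name).
>         tEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
>         tEnv.executeSql("INSERT INTO events_hive SELECT id, payload, dt FROM kafka_events");
>     }
> }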
>
>
> Best regards
> Theo
>
> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Streaming-data-to-parquet-td38029.html
> [2] https://www.youtube.com/watch?v=eOQ2073iWt4
> [3] https://blog.cloudera.com/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
> [4] https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/
> [5] https://stackoverflow.com/questions/51190646/spark-dataset-on-hive-vs-parquet-file
