What about using Apache Hudi or Apache Iceberg?

On Thu, Mar 4, 2021 at 10:15 AM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:

> Hi,
>
> I know Jingsong worked on Flink/Hive filesystem integration in the
> Table/SQL API. Maybe he can shed some light on your questions.
>
> Best,
>
> Dawid
> On 02/03/2021 21:03, Theo Diefenthal wrote:
>
> Hi there,
>
> Currently, I have a Flink 1.11 job which writes parquet files via the
> StreamingFileSink to HDFS (simply using the DataStream API). I commit roughly
> every 3 minutes and thus end up with many small files in HDFS. Downstream, the
> generated table is consumed by Spark jobs and Impala queries. HDFS
> doesn't cope well with too many small files, and wanting to write parquet
> quickly while still ending up with large files is a rather common problem;
> solutions were suggested recently on the mailing list [1] and in Flink Forward
> talks [2]. Cloudera also described two possible approaches in their blog posts
> [3], [4]. Mostly, it comes down to asynchronously compacting the many small
> files into larger ones, ideally non-blocking and in an occasionally running
> batch job.
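>
> For illustration, the sink is currently set up roughly like this (a minimal
> sketch; MyEvent, the path and the bucket pattern are placeholders):
>
>     import org.apache.flink.core.fs.Path;
>     import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
>     import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
>     import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;
>
>     // Bulk formats such as Parquet roll a new part file on every checkpoint,
>     // so a ~3 minute checkpoint interval produces many small files per partition.
>     StreamingFileSink<MyEvent> sink = StreamingFileSink
>         .forBulkFormat(
>             new Path("hdfs:///data/events"),
>             ParquetAvroWriters.forReflectRecord(MyEvent.class))
>         // stand-in assigner; the real job derives the dt=... partition from event time
>         .withBucketAssigner(new DateTimeBucketAssigner<>("'dt='yyyy-MM-dd"))
>         .build();
>     // events is the DataStream<MyEvent> built earlier in the job
>     events.addSink(sink);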
>
> I am now about to implement something like what is suggested in the Cloudera
> blog [4], but from parquet to parquet. To me, it seems not straightforward but
> rather involved, especially as my data is partitioned by event time and I need
> the compaction to be non-blocking (my users query Impala and expect near
> real-time performance for every query). When starting the work on that, I
> noticed that Hive already has a compaction mechanism built in, and the Flink
> community has put a lot of work into integrating with Hive in the latest
> releases. Some of my questions are not directly related to Flink, but I guess
> many of you also have experience with Hive, and writing from Flink to Hive is
> rather common nowadays.
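>
> As I understand it, Hive's built-in compaction can be triggered per partition,
> e.g. via JDBC like below (just a sketch; host, table and partition are
> placeholders, and the table has to be transactional):
>
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.Statement;
>
>     // Ask Hive to rewrite the many small files of one partition into larger ones.
>     // The compaction itself runs in the background on the Hive side.
>     try (Connection conn = DriverManager.getConnection(
>              "jdbc:hive2://hive-server:10000/default", "etl_user", "");
>          Statement stmt = conn.createStatement()) {
>         stmt.execute("ALTER TABLE events PARTITION (dt='2021-03-01') COMPACT 'major'");
>     }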
>
> I read online that Spark integrates nicely with Hive tables, i.e. querying a
> Hive table has the same performance as querying the HDFS files directly [5].
> We also all know that Impala integrates nicely with Hive, so overall I expect
> that writing to Hive internal tables instead of plain HDFS parquet has no
> disadvantages for me.
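>
> (On the Spark side, the consumers would then simply read via the metastore,
> something like the following; database and table names are made up:)
>
>     import org.apache.spark.sql.Dataset;
>     import org.apache.spark.sql.Row;
>     import org.apache.spark.sql.SparkSession;
>
>     // Resolve the table through the Hive metastore instead of raw HDFS paths.
>     SparkSession spark = SparkSession.builder()
>         .appName("read-events")
>         .enableHiveSupport()
>         .getOrCreate();
>     Dataset<Row> events = spark.table("mydb.events");
>     events.filter("dt = '2021-03-01'").show();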
>
> My questions:
> 1. Can I use Flink to "streaming write" to Hive? (A rough sketch of what I
> have in mind follows after this list.)
> 2. For compaction, I need "transactional tables", and according to the Hive
> docs, transactional tables must be fully managed by Hive (i.e., they are not
> external). Does Flink support writing to those out of the box? (I only have
> Hive 2 available.)
> 3. Does Flink use the "Hive Streaming Data Ingest" APIs?
> 4. Do you see any downsides in writing to Hive compared to writing parquet
> directly? (Especially in my use case with only Impala and Spark consumers.)
> 5. Not Flink related: Have you ever experienced performance issues with Hive
> transactional tables compared to writing parquet directly? I guess there must
> be a reason why "transactional" is off by default in Hive? I won't use any
> features except for compaction, i.e. there are only streaming inserts, no
> updates, no deletes. (Deletes only happen after a given retention period and
> always drop entire partitions.)
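>
> Regarding question 1, the kind of setup I have in mind looks roughly like the
> sketch below (based on my reading of the Flink 1.11 Table API Hive connector;
> catalog, table names and the commit options are only illustrative):
>
>     import org.apache.flink.table.api.EnvironmentSettings;
>     import org.apache.flink.table.api.SqlDialect;
>     import org.apache.flink.table.api.TableEnvironment;
>     import org.apache.flink.table.catalog.hive.HiveCatalog;
>
>     TableEnvironment tEnv = TableEnvironment.create(
>         EnvironmentSettings.newInstance().inStreamingMode().build());
>
>     // Register the Hive metastore so Flink can create the table and commit partitions.
>     HiveCatalog hive = new HiveCatalog("myhive", "default", "/etc/hive/conf", "2.3.4");
>     tEnv.registerCatalog("myhive", hive);
>     tEnv.useCatalog("myhive");
>
>     // Create a partitioned parquet table with streaming partition-commit options.
>     tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
>     tEnv.executeSql(
>         "CREATE TABLE IF NOT EXISTS events_hive (id STRING, payload STRING) " +
>         "PARTITIONED BY (dt STRING) STORED AS PARQUET TBLPROPERTIES (" +
>         "  'partition.time-extractor.timestamp-pattern'='$dt 00:00:00'," +
>         "  'sink.partition-commit.trigger'='partition-time'," +
>         "  'sink.partition-commit.delay'='1 h'," +
>         "  'sink.partition-commit.policy.kind'='metastore,success-file')");
>
>     // Continuously insert from a streaming source table registered elsewhere.
>     tEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
>     tEnv.executeSql("INSERT INTO events_hive SELECT id, payload, dt FROM kafka_events");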
>
>
> Best regards
> Theo
>
> [1]
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Streaming-data-to-parquet-td38029.html
> [2] https://www.youtube.com/watch?v=eOQ2073iWt4
> [3]
> https://blog.cloudera.com/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
> [4]
> https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/
> [5]
> https://stackoverflow.com/questions/51190646/spark-dataset-on-hive-vs-parquet-file
>
>
