What about using Apache Hudi or Apache Iceberg?
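Both table formats ship with their own small-file compaction/rewrite mechanisms, so you would not have to build a custom merge job. Just to make the idea concrete, here is a rough, untested sketch of what pointing the existing DataStream pipeline at an Iceberg table could look like. The path, the RowData conversion and the table itself are placeholders; it assumes the iceberg-flink runtime jar is on the classpath and the target table already exists:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class IcebergSinkSketch {

    public static void attachSink(DataStream<RowData> events) {
        // Load the target table from its HDFS location (a Hive-catalog based
        // TableLoader would work as well). Path is a placeholder.
        TableLoader tableLoader =
            TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/events");

        // Append the stream to the Iceberg table; files are committed per checkpoint.
        FlinkSink.forRowData(events)
            .tableLoader(tableLoader)
            .append();
    }
}

The small files written per checkpoint can then be compacted later with Iceberg's rewrite actions without blocking readers, which is essentially the asynchronous compaction described below.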

On Thu, Mar 4, 2021 at 10:15 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

> Hi,
>
> I know Jingsong worked on the Flink/Hive filesystem integration in the
> Table/SQL API. Maybe he can shed some light on your questions.
>
> Best,
>
> Dawid
>
> On 02/03/2021 21:03, Theo Diefenthal wrote:
>
> Hi there,
>
> Currently, I have a Flink 1.11 job which writes parquet files via the
> StreamingFileSink to HDFS (simply using the DataStream API). I commit roughly
> every 3 minutes and thus end up with many small files in HDFS. Downstream, the
> generated table is consumed by Spark jobs and Impala queries. HDFS doesn't
> cope well with too many small files, and writing parquet with low latency
> while still producing large files is a rather common problem; solutions have
> been suggested recently on the mailing list [1] and in Flink Forward talks
> [2]. Cloudera also posted two possible scenarios in their blog posts [3],
> [4]. Mostly, it comes down to asynchronously compacting the many small files
> into larger ones, at best non-blocking and in an occasionally running batch
> job.
>
> I am now about to implement something like what is suggested in the Cloudera
> blog [4], but from parquet to parquet. To me, it seems not straightforward
> but rather involved, especially as my data is partitioned by event time and I
> need the compaction to be non-blocking (my users query Impala and expect near
> real-time performance for each query). When starting the work on that, I
> noticed that Hive already includes a compaction mechanism and that the Flink
> community has worked a lot on integrating with Hive in the latest releases.
> Some of my questions are not directly related to Flink, but I guess many of
> you also have experience with Hive, and writing from Flink to Hive is rather
> common nowadays.
>
> I read online that Spark integrates nicely with Hive tables, i.e. instead of
> querying HDFS files, querying a Hive table has the same performance [5]. We
> also all know that Impala integrates nicely with Hive, so that overall I can
> expect that writing to Hive internal tables instead of HDFS parquet doesn't
> have any disadvantages for me.
>
> My questions:
> 1. Can I use Flink to "streaming write" to Hive?
> 2. For compaction, I need "transactional tables", and according to the Hive
> docs, transactional tables must be fully managed by Hive (i.e., they are not
> external). Does Flink support writing to those out of the box? (I only have
> Hive 2 available)
> 3. Does Flink use the "Hive Streaming Data Ingest" APIs?
> 4. Do you see any downsides to writing to Hive compared to writing parquet
> directly? (Especially in my use case with only Impala and Spark consumers)
> 5. Not Flink related: Have you ever experienced performance issues when using
> Hive transactional tables compared to writing parquet directly? I guess there
> must be a reason why "transactional" is off by default in Hive. I won't use
> any features except for compaction, i.e. there are only streaming inserts, no
> updates, no deletes. (Delete only after a given retention period, and always
> delete entire partitions.)
>
> Best regards
> Theo
>
> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Streaming-data-to-parquet-td38029.html
> [2] https://www.youtube.com/watch?v=eOQ2073iWt4
> [3] https://blog.cloudera.com/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
> [4] https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/
> [5] https://stackoverflow.com/questions/51190646/spark-dataset-on-hive-vs-parquet-file
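Regarding questions 1 and 2 in the quoted mail: below is a rough, untested sketch of what a streaming insert into a partitioned Hive table through the Table API could look like, roughly following the Hive connector documentation. The catalog name, Hive conf dir, table definition and the kafka_events source table are placeholders, and the partition-commit properties are only meant to show where such options would go:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveStreamingWriteSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register the Hive metastore as a catalog (names/paths are placeholders).
        HiveCatalog hive = new HiveCatalog("myhive", "default", "/etc/hive/conf");
        tEnv.registerCatalog("myhive", hive);
        tEnv.useCatalog("myhive");

        // Create the target table with Hive DDL; the partition-commit options
        // control when partitions become visible to Impala/Spark readers.
        tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS events (" +
            "  id STRING, payload STRING" +
            ") PARTITIONED BY (dt STRING, hr STRING) STORED AS PARQUET TBLPROPERTIES (" +
            "  'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00'," +
            "  'sink.partition-commit.trigger'='partition-time'," +
            "  'sink.partition-commit.delay'='1 h'," +
            "  'sink.partition-commit.policy.kind'='metastore,success-file')");

        // Continuously insert from a streaming source table registered elsewhere
        // (kafka_events is a placeholder).
        tEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        tEnv.executeSql(
            "INSERT INTO events SELECT id, payload, dt, hr FROM kafka_events");
    }
}

As far as I understand, this path writes parquet files directly into the table location and commits partitions to the metastore; it does not go through Hive's transactional "Streaming Data Ingest" API, so compaction of the small files is still a separate concern (whether via Hive, a batch rewrite job, or a table format like Hudi/Iceberg).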