Hi, I know Jingsong worked on Flink/Hive filesystem integration in the Table/SQL API. Maybe he can shed some light on your questions.
Best,
Dawid

On 02/03/2021 21:03, Theo Diefenthal wrote:
> Hi there,
>
> Currently, I have a Flink 1.11 job which writes Parquet files via the
> StreamingFileSink to HDFS (simply using the DataStream API). I commit roughly
> every 3 minutes and thus have many small files in HDFS. Downstream, the
> generated table is consumed by Spark jobs and Impala queries. HDFS doesn't
> cope well with many small files, and writing Parquet quickly while still
> producing large files is a rather common problem; solutions were suggested
> recently on the mailing list [1], in Flink Forward talks [2], and in two
> Cloudera blog posts [3], [4]. Mostly, it comes down to asynchronously
> compacting the many small files into larger ones, ideally non-blocking and
> in an occasionally running batch job.
>
> I am now about to implement something like the Cloudera blog suggests [4],
> but from Parquet to Parquet. To me it seems not straightforward but rather
> involved, especially as my data is partitioned by event time and I need the
> compaction to be non-blocking (my users query Impala and expect near
> real-time performance for each query). When starting work on that, I noticed
> that Hive already includes a compaction mechanism, and the Flink community
> has invested a lot in Hive integration in the latest releases. Some of my
> questions are not directly related to Flink, but I guess many of you also
> have experience with Hive, and writing from Flink to Hive is rather common
> nowadays.
>
> I read online that Spark integrates nicely with Hive tables, i.e. querying a
> Hive table performs the same as querying the HDFS files directly [5]. We also
> all know that Impala integrates nicely with Hive, so overall I expect that
> writing to Hive-managed tables instead of HDFS Parquet has no disadvantages
> for me.
>
> My questions:
> 1. Can I use Flink to "streaming write" to Hive?
> 2. For compaction, I need "transactional tables", and according to the Hive
>    docs, transactional tables must be fully managed by Hive (i.e., they are
>    not external). Does Flink support writing to those out of the box? (I only
>    have Hive 2 available.)
> 3. Does Flink use the "Hive Streaming Data Ingest" APIs?
> 4. Do you see any downsides to writing to Hive compared to writing Parquet
>    directly? (Especially in my use case with only Impala and Spark consumers.)
> 5. Not Flink related: Have you ever experienced performance issues when using
>    Hive transactional tables compared to writing Parquet directly? I guess
>    there must be a reason why "transactional" is off by default in Hive? I
>    won't use any features except for compaction, i.e. there are only streaming
>    inserts, no updates, no deletes. (Deletes only happen after a given
>    retention period and always remove entire partitions.)
>
> Best regards
> Theo
>
> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Streaming-data-to-parquet-td38029.html
> [2] https://www.youtube.com/watch?v=eOQ2073iWt4
> [3] https://blog.cloudera.com/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
> [4] https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/
> [5] https://stackoverflow.com/questions/51190646/spark-dataset-on-hive-vs-parquet-file
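For readers finding this thread later: the setup Theo describes (DataStream API, StreamingFileSink, Parquet, a commit on every checkpoint) roughly corresponds to the sketch below. The event type, source, target path and checkpoint interval are illustrative assumptions, not his actual job; the point it shows is that with bulk formats the sink can only roll files on checkpoints, which is why a short commit interval inevitably produces many small files.

    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

    public class ParquetToHdfsJob {

        /** Placeholder event type; the real schema is not part of the thread. */
        public static class Event {
            public long eventTime;
            public String payload;

            public Event() {}

            public Event(long eventTime, String payload) {
                this.eventTime = eventTime;
                this.payload = payload;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Parquet is a bulk format, so StreamingFileSink can only roll files on
            // checkpoints: a 3-minute checkpoint interval means a fresh batch of
            // (small) files roughly every 3 minutes.
            env.enableCheckpointing(3 * 60 * 1000);

            // Stand-in for the real source (e.g. Kafka).
            DataStream<Event> events =
                    env.fromElements(new Event(System.currentTimeMillis(), "demo"));

            StreamingFileSink<Event> sink = StreamingFileSink
                    .forBulkFormat(
                            new Path("hdfs:///data/events"),                 // assumed target directory
                            ParquetAvroWriters.forReflectRecord(Event.class))
                    // one bucket (directory) per hour of processing time; the real job
                    // partitions by event time
                    .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd--HH"))
                    .build();

            events.addSink(sink);
            env.execute("parquet-to-hdfs");
        }
    }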
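Regarding question 1: since 1.11, Flink can stream into Hive tables through the HiveCatalog and the streaming file sink with partition commit; it writes plain files and publishes partitions to the metastore rather than going through Hive's transactional / Streaming Data Ingest path. Below is a minimal Table API sketch of that feature; the catalog name, conf dir, table and column names, and the datagen stand-in source are assumptions for illustration, not a drop-in config.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.SqlDialect;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class StreamIntoHiveJob {

        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

            // Register the Hive metastore; catalog name, database and conf dir are assumptions.
            HiveCatalog hive = new HiveCatalog("myhive", "default", "/etc/hive/conf");
            tEnv.registerCatalog("myhive", hive);
            tEnv.useCatalog("myhive");

            // Create the target table with Hive DDL (one-time setup). The partition-commit
            // properties tell the streaming sink when a partition is complete and how to
            // publish it (add it to the metastore and drop a _SUCCESS file).
            tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
            tEnv.executeSql(
                    "CREATE TABLE IF NOT EXISTS events_hive (id STRING, payload STRING) " +
                    "PARTITIONED BY (dt STRING, hr STRING) STORED AS PARQUET TBLPROPERTIES (" +
                    "  'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00'," +
                    "  'sink.partition-commit.trigger'='partition-time'," +
                    "  'sink.partition-commit.delay'='1 h'," +
                    "  'sink.partition-commit.policy.kind'='metastore,success-file')");

            // Self-contained stand-in source; the real job would read from Kafka and
            // derive dt/hr from the event timestamp (partition-time commit also needs
            // a watermark on that source).
            tEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
            tEnv.executeSql(
                    "CREATE TEMPORARY TABLE gen_events (id STRING, payload STRING) " +
                    "WITH ('connector'='datagen', 'rows-per-second'='5')");

            // Continuous streaming insert into the Hive table.
            tEnv.executeSql(
                    "INSERT INTO events_hive " +
                    "SELECT id, payload, DATE_FORMAT(LOCALTIMESTAMP, 'yyyy-MM-dd'), " +
                    "       DATE_FORMAT(LOCALTIMESTAMP, 'HH') " +
                    "FROM gen_events");
        }
    }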