Hi there, 

Currently, I have a Flink 1.11 job which writes Parquet files to HDFS via the 
StreamingFileSink (simply using the DataStream API). I commit roughly every 3 
minutes and thus end up with many small files in HDFS. Downstream, the generated 
table is consumed by Spark jobs and Impala queries. HDFS doesn't cope well with 
too many small files, and wanting both low-latency Parquet writes and large files 
is a rather common problem; solutions were suggested recently on the mailing 
list [1] and in Flink Forward talks [2], and Cloudera also described two possible 
approaches in their blog posts [3], [4]. Mostly, it comes down to asynchronously 
compacting the many small files into larger ones, ideally non-blocking and in an 
occasionally running batch job. 
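
For context, my current sink setup is roughly the following (a stripped-down 
sketch; MyEvent, the paths and the bucket assigner are placeholders, my real job 
buckets by event time):

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class ParquetSinkSketch {
    // Placeholder POJO; my real event type is richer.
    public static class MyEvent {
        public String id;
        public long eventTime;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bulk formats roll files on every checkpoint, so this ~3 minute
        // checkpoint interval is what produces the many small files.
        env.enableCheckpointing(3 * 60 * 1000);

        StreamingFileSink<MyEvent> sink = StreamingFileSink
                .forBulkFormat(new Path("hdfs:///data/my_table"),
                        ParquetAvroWriters.forReflectRecord(MyEvent.class))
                // The real job uses a custom event-time based bucket assigner here.
                .withBucketAssigner(new DateTimeBucketAssigner<>("'dt='yyyy-MM-dd"))
                .build();

        env.fromElements(new MyEvent()) // the real job reads from Kafka
           .addSink(sink);

        env.execute("parquet-writer-sketch");
    }
}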

I am now about to implement something like what is suggested in the Cloudera 
blog [4], but from Parquet to Parquet. To me, it seems not straightforward but 
rather involved, especially as my data is partitioned by event time and I need 
the compaction to be non-blocking (my users query Impala and expect near-real-time 
performance for each query). When starting the work on that, I noticed that 
Hive already ships with a compaction mechanism, and that the Flink community has 
put a lot of work into Hive integration in the latest releases. Some of my 
questions are not directly related to Flink, but I guess many of you also have 
experience with Hive, and writing from Flink to Hive is rather common nowadays. 
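
To make it concrete what I mean by Hive's built-in compaction: if I understood 
the Hive wiki correctly, on a transactional table I could request a per-partition 
compaction from any client, e.g. via JDBC, roughly like this (host, database and 
table names are made up, and I haven't tested this against our Hive 2 cluster yet):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveCompactionSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 URL and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/mydb", "theo", "");
             Statement stmt = conn.createStatement()) {
            // If I read the docs right, this only enqueues a compaction request;
            // the merge itself runs asynchronously in the Hive compactor, which is
            // what makes it attractive as a non-blocking solution for my case.
            stmt.execute(
                "ALTER TABLE my_events PARTITION (dt='2020-08-01') COMPACT 'major'");
        }
    }
}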

I read online that Spark integrates nicely with Hive tables, i.e. querying a 
Hive table performs about the same as querying the underlying HDFS files 
directly [5]. We also all know that Impala integrates nicely with Hive, so 
overall I would expect that writing to Hive-managed tables instead of raw HDFS 
Parquet has no disadvantages for me. 
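
Concretely, the comparison I have in mind (as in [5]) is reading the same data 
both ways from Spark, e.g. (Java API; names and paths are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkReadComparison {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-vs-parquet")
                .enableHiveSupport() // resolve tables through the Hive metastore
                .getOrCreate();

        // Today: downstream jobs read the raw Parquet files written by Flink.
        Dataset<Row> fromFiles = spark.read().parquet("hdfs:///data/my_table");

        // Alternative: read the same data as a Hive table.
        Dataset<Row> fromHive = spark.table("mydb.my_events");

        System.out.println(fromFiles.count() + " vs " + fromHive.count());
    }
}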

My questions: 
1. Can I use Flink to "streaming write" to Hive? (Below the list I sketch what I 
imagine this would look like, based on the 1.11 docs.) 
2. For compaction, I need "transactional tables", and according to the Hive 
docs, transactional tables must be fully managed by Hive (i.e., they are not 
external). Does Flink support writing to those out of the box? (I only have 
Hive 2 available.) 
3. Does Flink use the "Hive Streaming Data Ingest" APIs? 
4. Do you see any downsides to writing to Hive compared to writing Parquet 
directly? (Especially in my use case with only Impala and Spark consumers.) 
5. Not Flink related: Have you ever experienced performance issues with Hive 
transactional tables compared to writing Parquet directly? I guess there must be 
a reason why "transactional" is off by default in Hive. I won't use any 
features except for compaction, i.e. there are only streaming inserts, no 
updates, no deletes. (I delete only after a given retention period, and always 
entire partitions.) 
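
For question 1, what I have in mind is roughly the following, pieced together 
from the Flink 1.11 Hive documentation, so please correct me if the options or 
the overall approach are wrong (catalog, table and column names are made up, and 
the registration of the source table is omitted):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveStreamingWriteSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        // Catalog name, default database and conf dir are placeholders.
        tEnv.registerCatalog("myhive",
                new HiveCatalog("myhive", "mydb", "/etc/hive/conf"));
        tEnv.useCatalog("myhive");

        // Create the target table with the Hive dialect; the partition-commit
        // options are taken from the 1.11 docs' streaming-write example.
        tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS my_events_hive (id STRING, payload STRING) " +
            "PARTITIONED BY (dt STRING, hr STRING) STORED AS PARQUET TBLPROPERTIES (" +
            "  'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00'," +
            "  'sink.partition-commit.trigger'='partition-time'," +
            "  'sink.partition-commit.delay'='1 h'," +
            "  'sink.partition-commit.policy.kind'='metastore,success-file')");

        tEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        // "kafka_events" is a placeholder for my actual streaming source table.
        tEnv.executeSql(
            "INSERT INTO my_events_hive " +
            "SELECT id, payload, DATE_FORMAT(event_time, 'yyyy-MM-dd'), " +
            "       DATE_FORMAT(event_time, 'HH') FROM kafka_events");
    }
}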


Best regards 
Theo 

[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Streaming-data-to-parquet-td38029.html 
[2] https://www.youtube.com/watch?v=eOQ2073iWt4 
[3] https://blog.cloudera.com/how-to-ingest-and-query-fast-data-with-impala-without-kudu/ 
[4] https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/ 
[5] https://stackoverflow.com/questions/51190646/spark-dataset-on-hive-vs-parquet-file 
