I'm familiar with the Hive source but don't have much knowledge about BigQuery.
Still, from my side, approach number three sounds the most reasonable.

Option 1 sounds a little complex and may be time-consuming when generating splits.
Option 2 seems inflexible and too coarse-grained.
Option 4 needs extra effort to write the data out again.

Best regards,
Yuxia

----- Original Message -----
From: "Lavkesh Lahngir" <lavk...@linux.com>
To: "dev" <dev@flink.apache.org>
Sent: Monday, October 17, 2022 10:42:29 PM
Subject: SplitEnumerator for Bigquery Source.

Hi everybody,
We are trying to implement a Google BigQuery source for Flink. We are
thinking of taking the time partition and column information as config, and I am
working out how to parallelize the source and how to generate splits. I
read the code of the Hive source, where we can generate Hadoop file splits
based on partitions, but there is no way to access file-level information in BQ.
What would be a good way to generate splits for a BQ source?

Currently, most of our tables are partitioned daily. Assuming the columns
and time range are taken as config, here are some ideas for generating splits:
1. Calculate the approximate number of rows and total size and divide them
equally. This would require some way to add a marker for the division points.
2. For each daily partition, create one split.
3. Take the time partition granularity (minute/hour/day) as config and make
buckets. For example, with hour granularity and 7 days of data, this makes
7*24 = 168 splits. In a custom split class we can save the start and end
timestamps for the reader to execute (see the sketch after this list).
4. Scan all the data into a distributed file system like HDFS or GCS, then
just use a file splitter.
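To make option 3 concrete, here is a minimal sketch of what such a time-range
split could look like, assuming Flink's SourceSplit interface. The class and
member names (TimeRangeSplit, start, end) are hypothetical, not part of any
existing BigQuery connector:

    import java.time.Instant;
    import org.apache.flink.api.connector.source.SourceSplit;

    // Hypothetical split that carries only a time range; the reader would
    // turn this into a BigQuery query filtered on the partition column.
    public class TimeRangeSplit implements SourceSplit {
        private final Instant start; // inclusive lower bound of the bucket
        private final Instant end;   // exclusive upper bound of the bucket

        public TimeRangeSplit(Instant start, Instant end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String splitId() {
            return start + "_" + end;
        }

        public Instant getStart() { return start; }
        public Instant getEnd() { return end; }
    }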

I am thinking of going with approach number three, because the calculation of
splits is purely config-based: it doesn't require reading any data, unlike,
for example, option four.
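To show what "config-based" means here, split generation reduces to plain
arithmetic over the configured range and granularity. A sketch (the class and
method names are mine, for illustration only):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    // Enumerates one split per time bucket; nothing is read from BigQuery
    // here, so split generation is cheap and deterministic.
    public final class TimeBucketSplits {
        public static List<TimeRangeSplit> generate(
                Instant rangeStart, Instant rangeEnd, Duration granularity) {
            List<TimeRangeSplit> splits = new ArrayList<>();
            Instant cursor = rangeStart;
            while (cursor.isBefore(rangeEnd)) {
                Instant next = cursor.plus(granularity);
                if (next.isAfter(rangeEnd)) {
                    next = rangeEnd; // clamp the last, possibly partial bucket
                }
                splits.add(new TimeRangeSplit(cursor, next));
                cursor = next;
            }
            return splits;
        }
    }

With Duration.ofHours(1) and a 7-day range, this yields the 7*24 = 168 splits
mentioned above.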

Any suggestions are welcome.

Thank you!
~lav
