Hi Lavkesh,

I'm not familiar with BigQuery, but when looking through the BQ API I noticed that the `Table` resource provides both a timePartitioning and a rangePartitioning spec [1]. Couldn't you use that to drive split generation?
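I haven't run this, but with the google-cloud-bigquery Java client, reading the partitioning spec of a table should look roughly like the sketch below (dataset and table names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.RangePartitioning;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TimePartitioning;

public class PartitionInfo {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Placeholder dataset/table; getTable returns null if it doesn't exist.
        Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));
        StandardTableDefinition def = table.getDefinition();

        TimePartitioning timePartitioning = def.getTimePartitioning();
        if (timePartitioning != null) {
            // Type is e.g. DAY or HOUR; field is the partition column
            // (null means ingestion-time partitioning).
            System.out.println(timePartitioning.getType()
                    + " on " + timePartitioning.getField());
        }

        RangePartitioning rangePartitioning = def.getRangePartitioning();
        if (rangePartitioning != null) {
            RangePartitioning.Range range = rangePartitioning.getRange();
            System.out.println("range on " + rangePartitioning.getField()
                    + " [" + range.getStart() + ", " + range.getEnd()
                    + ") step " + range.getInterval());
        }
    }
}

The enumerator could use this to discover the table's native granularity instead of requiring it as config.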
Best regards,

Martijn

[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table

On Tue, Oct 18, 2022 at 3:44 AM yuxia <luoyu...@alumni.sjtu.edu.cn> wrote:

> I'm familiar with the Hive source but don't have much knowledge about
> BigQuery. From my side, though, approach number three sounds the most
> reasonable.
>
> Option 1 sounds a little complex and may be time-consuming when
> generating splits.
> Option 2 seems inflexible and too coarse-grained.
> Option 4 needs extra effort to write the data out again.
>
> Best regards,
> Yuxia
>
> ----- Original Message -----
> From: "Lavkesh Lahngir" <lavk...@linux.com>
> To: "dev" <dev@flink.apache.org>
> Sent: Monday, October 17, 2022, 10:42:29 PM
> Subject: SplitEnumerator for Bigquery Source.
>
> Hi everybody,
> We are trying to implement a Google BigQuery source for Flink. We are
> thinking of taking the time partition and column information as config.
> I was thinking about how to parallelize the source and how to generate
> splits. I read the code of the Hive source, where we can generate
> Hadoop file splits based on partitions. There is no way to access
> file-level information on BQ.
> What would be a good way to generate splits for a BQ source?
>
> Currently, most of our tables are partitioned daily. Assume the columns
> and time range are taken as config.
> Some ideas from me for generating splits:
> 1. Calculate the approximate number of rows and size and divide them
> equally. This would require some way to add a marker for the division.
> 2. Create one split per daily partition.
> 3. Take the time partition granularity of minute/hour/day as config and
> make buckets. For example, with hour granularity and 7 days of data,
> this would make 7*24 splits. In a custom split class we can store the
> start and end timestamps for the reader to use.
> 4. Scan all the data into a distributed file system like Hadoop or GCS,
> then just use a file splitter.
>
> I am leaning towards approach number three: the calculation of splits
> is purely config-based, so it doesn't require reading any data, unlike,
> for example, option four.
>
> Any suggestions are welcome.
>
> Thank you!
> ~lav
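For what it's worth, here is a rough, untested sketch of how approach three could be expressed as a FLIP-27 split. Apart from Flink's SourceSplit interface, every class, method, and field name below is made up for illustration:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.connector.source.SourceSplit;

// Hypothetical split holding one time bucket; the reader would turn it
// into a predicate on the partition column, e.g.
// "WHERE ts >= @start AND ts < @end".
public class BigQueryTimeSplit implements SourceSplit {
    private final Instant startInclusive;
    private final Instant endExclusive;

    public BigQueryTimeSplit(Instant startInclusive, Instant endExclusive) {
        this.startInclusive = startInclusive;
        this.endExclusive = endExclusive;
    }

    @Override
    public String splitId() {
        return startInclusive + "_" + endExclusive;
    }

    public Instant getStart() { return startInclusive; }
    public Instant getEnd() { return endExclusive; }

    // Enumerator-side helper: buckets [start, end) at a fixed granularity,
    // so 7 days at hour granularity yields 7 * 24 splits. The granularity
    // must be positive or the loop will not advance.
    public static List<BigQueryTimeSplit> bucket(
            Instant start, Instant end, Duration granularity) {
        List<BigQueryTimeSplit> splits = new ArrayList<>();
        for (Instant cur = start; cur.isBefore(end); ) {
            Instant next = cur.plus(granularity);
            if (next.isAfter(end)) {
                next = end; // clamp the last, possibly partial, bucket
            }
            splits.add(new BigQueryTimeSplit(cur, next));
            cur = next;
        }
        return splits;
    }
}

Since the buckets are derived purely from config, the enumerator can compute them up front without touching BigQuery, which is the property that makes this option attractive.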