Hi Lavkesh,

I'm not familiar with BigQuery, but when looking through the BQ API I noticed that the `Table` resource provides both a timePartitioning and a rangePartitioning spec [1]. Couldn't you use that to drive split generation?
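I haven't run this, but with the google-cloud-bigquery Java client, reading the partitioning spec of a table should look roughly like the sketch below (dataset and table names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.RangePartitioning;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TimePartitioning;

public class PartitionInfo {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Placeholder dataset/table; getTable returns null if it doesn't exist.
        Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));
        StandardTableDefinition def = table.getDefinition();

        TimePartitioning timePartitioning = def.getTimePartitioning();
        if (timePartitioning != null) {
            // Type is e.g. DAY or HOUR; field is the partition column
            // (null means ingestion-time partitioning).
            System.out.println(timePartitioning.getType()
                    + " on " + timePartitioning.getField());
        }

        RangePartitioning rangePartitioning = def.getRangePartitioning();
        if (rangePartitioning != null) {
            RangePartitioning.Range range = rangePartitioning.getRange();
            System.out.println("range on " + rangePartitioning.getField()
                    + " [" + range.getStart() + ", " + range.getEnd()
                    + ") step " + range.getInterval());
        }
    }
}

The enumerator could use this to discover the table's native granularity instead of requiring it as config.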
Best regards,

Martijn

[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table

On Tue, Oct 18, 2022 at 3:44 AM yuxia <luoyu...@alumni.sjtu.edu.cn> wrote:

> I'm familiar with the Hive source but don't have much knowledge about
> BigQuery. From my side, though, approach number three sounds the most
> reasonable.
>
> Option 1 sounds a little complex and may be time-consuming when
> generating splits.
> Option 2 seems inflexible and too coarse-grained.
> Option 4 needs extra effort to write the data out again.
>
> Best regards,
> Yuxia
>
> ----- Original Message -----
> From: "Lavkesh Lahngir" <lavk...@linux.com>
> To: "dev" <dev@flink.apache.org>
> Sent: Monday, October 17, 2022, 10:42:29 PM
> Subject: SplitEnumerator for Bigquery Source.
>
> Hi everybody,
> We are trying to implement a Google BigQuery source for Flink. We are
> thinking of taking the time partition and column information as config.
> I was thinking about how to parallelize the source and how to generate
> splits. I read the code of the Hive source, where we can generate
> Hadoop file splits based on partitions. There is no way to access
> file-level information on BQ.
> What would be a good way to generate splits for a BQ source?
>
> Currently, most of our tables are partitioned daily. Assume the columns
> and time range are taken as config.
> Some ideas from me for generating splits:
> 1. Calculate the approximate number of rows and size and divide them
> equally. This would require some way to add a marker for the division.
> 2. Create one split per daily partition.
> 3. Take the time partition granularity of minute/hour/day as config and
> make buckets. For example, with hour granularity and 7 days of data,
> this would make 7*24 splits. In a custom split class we can store the
> start and end timestamps for the reader to use.
> 4. Scan all the data into a distributed file system like Hadoop or GCS,
> then just use a file splitter.
>
> I am leaning towards approach number three: the calculation of splits
> is purely config-based, so it doesn't require reading any data, unlike,
> for example, option four.
>
> Any suggestions are welcome.
>
> Thank you!
> ~lav
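For what it's worth, here is a rough, untested sketch of how approach three could be expressed as a FLIP-27 split. Apart from Flink's SourceSplit interface, every class, method, and field name below is made up for illustration:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.connector.source.SourceSplit;

// Hypothetical split holding one time bucket; the reader would turn it
// into a predicate on the partition column, e.g.
// "WHERE ts >= @start AND ts < @end".
public class BigQueryTimeSplit implements SourceSplit {
    private final Instant startInclusive;
    private final Instant endExclusive;

    public BigQueryTimeSplit(Instant startInclusive, Instant endExclusive) {
        this.startInclusive = startInclusive;
        this.endExclusive = endExclusive;
    }

    @Override
    public String splitId() {
        return startInclusive + "_" + endExclusive;
    }

    public Instant getStart() { return startInclusive; }
    public Instant getEnd() { return endExclusive; }

    // Enumerator-side helper: buckets [start, end) at a fixed granularity,
    // so 7 days at hour granularity yields 7 * 24 splits. The granularity
    // must be positive or the loop will not advance.
    public static List<BigQueryTimeSplit> bucket(
            Instant start, Instant end, Duration granularity) {
        List<BigQueryTimeSplit> splits = new ArrayList<>();
        for (Instant cur = start; cur.isBefore(end); ) {
            Instant next = cur.plus(granularity);
            if (next.isAfter(end)) {
                next = end; // clamp the last, possibly partial, bucket
            }
            splits.add(new BigQueryTimeSplit(cur, next));
            cur = next;
        }
        return splits;
    }
}

Since the buckets are derived purely from config, the enumerator can compute them up front without touching BigQuery, which is the property that makes this option attractive.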