Hi,

We are trying to implement a BigQuery source for Flink. There is an existing JIRA <https://issues.apache.org/jira/browse/FLINK-22665>, but there has been no progress on it. I see there is a PoC by Mat. We are also thinking of using the DynamicTable interface for the implementation. We can use this mailing thread to discuss ideas.
I had a few questions:

1. For ScanRuntimeProvider, should I implement InputFormat as in the PoC, or is the current recommendation to implement the Source interface itself, as the Kafka and file sources do?

2. BigQuery supports timestamp-based partitioning similar to Hive, but no filesystem-level information is available; we can only read data through the BigQuery clients. HiveSource extends AbstractFileSource. Would it make sense to build on FileSource, or should we write a new source from scratch, given that there is no filesystem information on the BQ side?

3. How should we create splits in the source? (I already asked about splits in another email; any other suggestions are welcome.) I guess the answer is a little different for bounded and unbounded reads. For a bounded source, we could take the time-partition granularity (minute/hour/day) as a config option and bucket the time range accordingly. For example, with hour granularity and 7 days of data, that would produce 7 * 24 = 168 splits. In a custom split class we can store the start and end timestamps for the reader to execute.

4. What properties should the source take? Currently I am thinking of columns and a time range. Maybe we can implement pushdown too, at least for filters and projections.

5. Should we work towards merging the BigQuery connector into the main repo? What challenges do we see there?

We will try to add more questions on this thread. Feel free to reply to any of these. We can also shift the conversation to the JIRA, if that will help :)

Thanks
~Lav
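To make question 3 concrete, here is a minimal sketch of the bucketing idea for a bounded read. The class and method names are made up for illustration and are not from the PoC; a real implementation would implement Flink's SourceSplit and a SplitEnumerator around this logic:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical split type holding a [start, end) timestamp range for one reader.
public class BigQueryTimeSplit {
    final Instant start;
    final Instant end;

    BigQueryTimeSplit(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    // Bucket the configured time range by the partition granularity.
    // E.g. 7 days of data at hour granularity -> 7 * 24 = 168 splits.
    static List<BigQueryTimeSplit> enumerate(
            Instant rangeStart, Instant rangeEnd, Duration granularity) {
        List<BigQueryTimeSplit> splits = new ArrayList<>();
        Instant cursor = rangeStart;
        while (cursor.isBefore(rangeEnd)) {
            Instant next = cursor.plus(granularity);
            if (next.isAfter(rangeEnd)) {
                next = rangeEnd; // last bucket may be shorter
            }
            splits.add(new BigQueryTimeSplit(cursor, next));
            cursor = next;
        }
        return splits;
    }
}
```

Each reader would then issue a BigQuery read scoped to its split's [start, end) range, so splits can be processed in parallel and retried independently.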
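On question 4: whatever properties the source takes, the projected columns and the time range would eventually need to be rendered into the SQL sent to BigQuery. A rough sketch of that translation (the helper name and signature are hypothetical, and filter pushdown would extend the WHERE clause in the same way):

```java
import java.time.Instant;
import java.util.List;

public class BigQueryQueryBuilder {
    // Hypothetical helper: render the projected columns and the time-range
    // filter into a standard-SQL query string for BigQuery.
    static String buildQuery(
            String table, List<String> columns, String tsColumn,
            Instant start, Instant end) {
        return "SELECT " + String.join(", ", columns)
                + " FROM `" + table + "`"
                + " WHERE " + tsColumn + " >= TIMESTAMP('" + start + "')"
                + " AND " + tsColumn + " < TIMESTAMP('" + end + "')";
    }
}
```

Projection pushdown would simply shrink the column list here, and filter pushdown would append the accepted predicates, so both reduce the bytes BigQuery scans rather than filtering in Flink.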