Hi,

We are trying to implement a BigQuery source for Flink. There is an existing JIRA <https://issues.apache.org/jira/browse/FLINK-22665>, but there has been no progress on it. I see there is a PoC by Mat. We are also thinking of using the DynamicTable interface for the implementation. We can use this mailing thread to discuss ideas.
I had a few questions:

1. For ScanRuntimeProvider, should I implement InputFormat as in the PoC, or is the current recommendation to implement the Source interface itself, as the Kafka and file sources do?

2. BigQuery supports timestamp-based partitioning similar to Hive, but no filesystem-level information is available; we can only read data through the BigQuery clients. HiveSource extends AbstractFileSource. Would it make sense to build on FileSource, or should we write a new source from scratch, given that there is no filesystem information on the BQ side?

3. How should we create splits in the source? (I already asked about splits in another email; any other suggestions are welcome.) I guess the answer is a little different for bounded and unbounded reads. For a bounded source, we could take the time-partition granularity (minute/hour/day) as a config option and bucket the time range accordingly. For example, with hour granularity and 7 days of data, that would produce 7 * 24 = 168 splits. In a custom split class we can store the start and end timestamps for the reader to execute.

4. What properties should the source take? Currently I am thinking of columns and a time range. Maybe we can implement pushdown too, at least for filters and projections.

5. Should we work towards merging the BigQuery connector into the main repo? What challenges do we see there?

We will try to add more questions on this thread. Feel free to reply to any of these. We can also shift the conversation to the JIRA, if that will help :)

Thanks
~Lav
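To make question 3 concrete, here is a minimal sketch of the bucketing idea for a bounded read. The class and method names are made up for illustration and are not from the PoC; a real implementation would implement Flink's SourceSplit and a SplitEnumerator around this logic:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical split type holding a [start, end) timestamp range for one reader.
public class BigQueryTimeSplit {
    final Instant start;
    final Instant end;

    BigQueryTimeSplit(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    // Bucket the configured time range by the partition granularity.
    // E.g. 7 days of data at hour granularity -> 7 * 24 = 168 splits.
    static List<BigQueryTimeSplit> enumerate(
            Instant rangeStart, Instant rangeEnd, Duration granularity) {
        List<BigQueryTimeSplit> splits = new ArrayList<>();
        Instant cursor = rangeStart;
        while (cursor.isBefore(rangeEnd)) {
            Instant next = cursor.plus(granularity);
            if (next.isAfter(rangeEnd)) {
                next = rangeEnd; // last bucket may be shorter
            }
            splits.add(new BigQueryTimeSplit(cursor, next));
            cursor = next;
        }
        return splits;
    }
}
```

Each reader would then issue a BigQuery read scoped to its split's [start, end) range, so splits can be processed in parallel and retried independently.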
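On question 4: whatever properties the source takes, the projected columns and the time range would eventually need to be rendered into the SQL sent to BigQuery. A rough sketch of that translation (the helper name and signature are hypothetical, and filter pushdown would extend the WHERE clause in the same way):

```java
import java.time.Instant;
import java.util.List;

public class BigQueryQueryBuilder {
    // Hypothetical helper: render the projected columns and the time-range
    // filter into a standard-SQL query string for BigQuery.
    static String buildQuery(
            String table, List<String> columns, String tsColumn,
            Instant start, Instant end) {
        return "SELECT " + String.join(", ", columns)
                + " FROM `" + table + "`"
                + " WHERE " + tsColumn + " >= TIMESTAMP('" + start + "')"
                + " AND " + tsColumn + " < TIMESTAMP('" + end + "')";
    }
}
```

Projection pushdown would simply shrink the column list here, and filter pushdown would append the accepted predicates, so both reduce the bytes BigQuery scans rather than filtering in Flink.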