Hi,
   We have a use case to stream files from GCS time-partitioned folders and
perform structured streaming queries on top of them. I have detailed the use
case and requirements in this Stack Overflow question
<https://stackoverflow.com/questions/66590057/spark-structured-streaming-source-watching-time-partitioned-gcs-partitions>,
but at a high level the problems I am facing are listed below, and I would
like guidance on the best approach to take.

   - The custom source APIs for Structured Streaming are undergoing major
   changes (including the new Table API support), and the documentation does
   not go into much detail on building custom sources. I was wondering
   whether the current APIs are expected to remain stable through the
   targeted 3.2 release, and whether there are examples of how to use them
   for my use case (see the first sketch after this list).
   - The default FileStream source
   <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets>
   looks up a static glob path, which may not scale once the job has run for
   days and accumulated many time partitions. But it has some really useful
   file-handling features - it supports all the major source formats (Avro,
   Parquet, JSON, etc.), and it takes care of compression and of splitting
   large files into sub-tasks - all of which I would have to implement again
   with the custom source APIs as they currently stand. I was wondering if I
   can still reuse that machinery somehow while solving the file-globbing
   scaling issue for time partitions (see the second sketch after this
   list).
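
For the first point, here is a rough, untested sketch of how I currently
understand the DataSource V2 interface chain for a micro-batch streaming
source in Spark 3.x. The GcsPartitionSource / GcsPartitionTable /
GcsPartitionMicroBatchStream names are just placeholders for my use case;
only the Spark interfaces and TableCapability.MICRO_BATCH_READ are real, and
I may well be misreading how they are meant to fit together:

import java.util
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point referenced via format(...): infers the schema and hands out a Table.
class GcsPartitionSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType()  // would be derived from the files / options

  override def getTable(schema: StructType,
                        partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new GcsPartitionTable(schema)
}

// Table advertising micro-batch read support and producing the scan.
class GcsPartitionTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "gcs-partition-source"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.MICRO_BATCH_READ)

  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder {
      override def build(): Scan = new Scan {
        override def readSchema(): StructType = tableSchema
        override def toMicroBatchStream(checkpointLocation: String): MicroBatchStream =
          new GcsPartitionMicroBatchStream(checkpointLocation)
      }
    }
}

// Offsets would track which time-partition folders / files have been processed,
// so that latestOffset() only needs to list the newest partitions.
class GcsPartitionMicroBatchStream(checkpointLocation: String) extends MicroBatchStream {
  override def initialOffset(): Offset = ???
  override def latestOffset(): Offset = ???
  override def deserializeOffset(json: String): Offset = ???
  override def planInputPartitions(start: Offset, end: Offset): Array[InputPartition] = ???
  override def createReaderFactory(): PartitionReaderFactory = ???
  override def commit(end: Offset): Unit = ()
  override def stop(): Unit = ()
}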
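
For the second point, this is roughly how we use the built-in file source
today (bucket name, layout, and options are illustrative only). As far as I
understand, the glob is fixed when the query starts and every micro-batch
re-lists everything it matches, which is where my scaling concern comes from:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("gcs-file-source").getOrCreate()

// Schema of the incoming files; file streams generally need an explicit schema.
val schema = new StructType()
  .add("id", StringType)
  .add("payload", StringType)

// Static glob over every date/hour partition that will ever exist.
val events = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", "1000")  // caps files per micro-batch, not the listing cost
  .parquet("gs://my-bucket/events/date=*/hour=*/")

events.writeStream
  .format("console")
  .option("checkpointLocation", "gs://my-bucket/checkpoints/events")
  .start()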

Thanks
