Hi,

We have a use case to stream files from time-partitioned GCS folders and run Structured Streaming queries on top of them. I have detailed the use case and requirements in this Stack Overflow question <https://stackoverflow.com/questions/66590057/spark-structured-streaming-source-watching-time-partitioned-gcs-partitions>, but at a high level the problems I am facing are listed below, and I would like guidance on the best approach to take.
- The custom source APIs for Structured Streaming are undergoing major changes (including the new Table API support), and the documentation does not cover building custom sources in much detail. Are the current APIs expected to remain stable through the targeted 3.2 release, and are there examples of using them for a use case like mine? A sketch of the interfaces I believe I would have to implement follows this list.

- The default FileStream source <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets> watches a static glob path, which might not scale once the job has been running for days across many time partitions (see the second sketch below). But it has some really useful file-handling features: it supports all the major source formats (Avro, Parquet, JSON, etc.), it handles compressed files, and it splits large files into sub-tasks. As the custom source APIs stand, I would have to reimplement all of that myself. Is there some way I can still reuse that machinery while solving the scaling issue with globbing over time partitions?
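To make the first point concrete, here is a minimal sketch of the DataSource V2 interfaces a custom micro-batch source appears to require, assuming the Spark 3.1 connector API. All class names (GcsPartitionProvider and friends) are hypothetical and the method bodies are stubbed out; my point is the amount of surface area that has to be implemented by hand:

import java.util

import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point, used via its fully qualified class name in .format(...).
class GcsPartitionProvider extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    ??? // e.g. sample files from the newest time partition
  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new GcsPartitionTable(schema)
}

class GcsPartitionTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "gcs_time_partitioned"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.MICRO_BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder {
      override def build(): Scan = new GcsPartitionScan(tableSchema)
    }
}

class GcsPartitionScan(tableSchema: StructType) extends Scan {
  override def readSchema(): StructType = tableSchema
  override def toMicroBatchStream(checkpointLocation: String): MicroBatchStream =
    new GcsPartitionMicroBatchStream
}

class GcsPartitionMicroBatchStream extends MicroBatchStream {
  // The offset would track (current partition, last file seen) so that only
  // the newest partitions need to be listed, instead of a static glob.
  override def initialOffset(): Offset = ???
  override def latestOffset(): Offset = ???
  override def deserializeOffset(json: String): Offset = ???
  override def planInputPartitions(start: Offset, end: Offset): Array[InputPartition] = ???
  override def createReaderFactory(): PartitionReaderFactory = ???
  override def commit(end: Offset): Unit = ()
  override def stop(): Unit = ()
}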
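For the second point, this is roughly the setup today. The bucket, partition layout, and schema below are made up for illustration, and reading gs:// paths assumes the GCS connector is on the classpath. The static glob spans every date/hour partition up front, so each micro-batch has to re-list everything it matches, which gets slower as partitions accumulate:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

val spark = SparkSession.builder().appName("gcs-glob-stream").getOrCreate()

// Streaming file sources require an explicit schema.
val eventSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("payload", StringType),
  StructField("ts", TimestampType)))

val events = spark.readStream
  .format("parquet") // the same source also handles JSON, CSV, Avro, ORC, text
  .schema(eventSchema)
  .load("gs://my-bucket/events/date=*/hour=*")

events.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/gcs-glob-ckpt")
  .start()
  .awaitTermination()

Thanks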