Our online services running in GCP collect data from our clients and write
it to GCS under time-partitioned folders such as /mm/dd/hh/mm
(based on the current time) or similar. We need these files to be processed
in real time by Spark. As for the runtime, we plan to run it either on
Dataproc or on Kubernetes.
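
Roughly, we imagine the consuming side looking something like the sketch
below (a sketch only: the bucket name my-bucket, the newline-delimited JSON
format, and the schema fields are placeholders, and the gs:// paths assume
the GCS connector is on the classpath, which Dataproc bundles):

// Sketch only (Scala). Assumed names: bucket "my-bucket" and the
// clientId/payload/eventTime fields; adjust to the real layout.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object GcsFileStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gcs-file-stream")
      .getOrCreate()

    // Streaming file sources require an explicit schema.
    val schema = new StructType()
      .add("clientId", StringType)
      .add("payload", StringType)
      .add("eventTime", TimestampType)

    val events = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", "100") // bound each micro-batch
      .json("gs://my-bucket/*/*/*/*/")     // four glob levels span /mm/dd/hh/mm

    // Console sink for demonstration only; a real job would write elsewhere.
    val query = events.writeStream
      .format("console")
      .option("checkpointLocation", "gs://my-bucket/checkpoints/demo")
      .start()

    query.awaitTermination()
  }
}

One caveat we are aware of: the file source re-lists everything the glob
matches on every trigger, so listing time grows as the time-partitioned
folders accumulate.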
-
Hi,
I looked at the Stack Overflow reference.
The first question that comes to mind is: how are you populating these
GCS buckets? Are you shifting data from on-prem, landing it in the
buckets, and creating a new folder at the given interval?
Where will you be running your Spark Structured Streaming job?