Interesting.

If we go back to the classic on-premises Lambda architecture, you could use
the Flume API with Kafka to land files on HDFS on a time-series basis.

Most major CDC vendors do exactly that. Oracle GoldenGate (OGG) classic
captures data from the Oracle redo logs and sends it to subscribers. One can
deploy OGG for Big Data to enable these files to be read and processed for
Kafka, Hive, HDFS etc.

So let us assume that we create these files and stream them onto object
storage in the cloud. Then we can use Spark Structured Streaming (SSS) as
the ETL tool. Assuming the files arrive every 10 minutes, we can still read
them all but ensure that SSS is only triggered every 4 hours.
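
For example, the read side could look like the sketch below. The bucket,
path, file format and schema here are my assumptions (spark being the usual
SparkSession); adjust them to the CDC tool's actual output.

         from pyspark.sql.types import StructType, StructField, StringType, TimestampType

         # assumed layout of the CDC change records
         schema = StructType([
             StructField("op_type", StringType(), True),
             StructField("op_ts", TimestampType(), True),
             StructField("payload", StringType(), True)
         ])

         streamingDF = spark.readStream. \
                           format("csv"). \
                           schema(schema). \
                           option("header", "true"). \
                           load("s3a://some-bucket/cdc/landing/")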

         # chain the writeStream off streamingDF above; 14400 seconds = 4 hours
         # (the checkpoint path is an assumption; it lets the query resume where it left off)
         result = streamingDF. \
                      writeStream. \
                      outputMode('append'). \
                      option("truncate", "false"). \
                      option("checkpointLocation", "s3a://some-bucket/cdc/checkpoint/"). \
                      foreachBatch(sendToSink). \
                      trigger(processingTime='14400 seconds'). \
                      queryName('readFiles'). \
                      start()

This will ensure that Spark only processes the accumulated files every 4 hours.
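
Here sendToSink is a user-defined function invoked once per micro-batch. A
minimal sketch of what it might do (the target path and format are again my
assumptions):

         def sendToSink(df, batchId):
             # df is the micro-batch DataFrame, batchId a monotonically increasing id
             if len(df.take(1)) > 0:
                 df.write. \
                    mode('append'). \
                    format('parquet'). \
                    save('s3a://some-bucket/cdc/processed/')

Because the query checkpoints its progress, each 4-hourly trigger only picks
up the files that have landed since the previous run.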


HTH

   view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 23 Apr 2021 at 15:40, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> In one of the Spark Summit demos, it was alluded that we should think of
> batch jobs in a streaming pattern, using "run once" on a schedule.
> I find this idea very interesting and I understand how this can be
> achieved for sources like Kafka, Kinesis or similar. In fact, we have
> implemented this model for the Cosmos change feed.
>
> My question is: can this model extend to file-based sources? I understand
> it can for append-only file streams. The use case I have is: a CDC tool
> like AWS DMS or Shareplex or similar writing changes to a stream of files,
> in date-based folders. So it just goes on like T1, T2 etc. folders. Also,
> let's assume files are written every 10 mins, but I want to process them
> every 4 hours.
> Can I use the streaming method so that it can manage checkpoints on its own?
>
> Best - Ayan
> --
> Best Regards,
> Ayan Guha
>
