Hello Rajat,
Look up the Spark *Pipelining* concept: any sequence of operations that
feed data directly into each other without needing a shuffle will be packed
into a single stage, e.g. select -> filter -> select (Spark SQL) or map ->
filter -> map (RDD). Any operation that requires shuffling (sort, groupBy,
join, and the like) breaks the pipeline and starts a new stage at the shuffle
boundary; a small sketch of this is below.
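A minimal sketch (Java; the local session, column names and numbers are made up
for illustration) showing where the stage boundary falls: the narrow
select/filter/select chain is pipelined into one stage, and the groupBy forces a
shuffle that starts a new one.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class PipeliningExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pipelining-demo")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.range(1000).toDF("id");

        // select -> filter -> select: all narrow, so Spark pipelines them into ONE stage
        Dataset<Row> narrow = df
                .select(col("id"))
                .filter(col("id").gt(10))
                .select(col("id").multiply(2).as("doubled"));

        // groupBy/count needs a shuffle, so a NEW stage begins at this boundary
        Dataset<Row> shuffled = narrow.groupBy(col("doubled")).count();

        shuffled.explain(); // the Exchange node in the plan marks the shuffle boundary
        spark.stop();
    }
}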
o/documentation/reference/stable/connectors/mysql.html>
>> to read Write-Ahead Logs (WAL) and send them to Kafka
>> - Kafka Connect to write to cloud storage -> Hive
>> - OR
>>
>> - Spark Streaming to parse the WAL -> storage -> Hive
>>
If you have room for a message log like Kafka, then you should try:
MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS ->
Hive
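For the Spark leg of that pipeline, here is a rough sketch of the Structured
Streaming job (broker address, topic name and storage/checkpoint paths are
placeholders, and parsing of the CDC payload is left out): it reads the change
events the CDC tool published to Kafka and lands them as Parquet under a path
that can be exposed as a Hive external table.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CdcToHiveStorage {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("mysql-cdc-to-hive")
                .getOrCreate();

        // Source: the CDC tool has already published the MySQL change log to this topic
        Dataset<Row> cdc = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
                .option("subscribe", "mysql.cdc.events")           // placeholder topic
                .option("startingOffsets", "earliest")
                .load();

        // Keep the raw key/value as strings; parsing the CDC payload would happen here
        Dataset<Row> payload = cdc.selectExpr(
                "CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS value");

        // Sink: Parquet on HDFS/S3/ADLS at a location registered as a Hive external table
        StreamingQuery query = payload.writeStream()
                .format("parquet")
                .option("path", "s3a://bucket/warehouse/mysql_cdc/")        // placeholder
                .option("checkpointLocation", "s3a://bucket/checkpoints/")  // placeholder
                .start();

        query.awaitTermination();
    }
}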
On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai wrote:
> Dear sir
>
> I have tried a lot on this; could you help me with this?
>
> Data ingestion from MySQL
Hi Folks,
I have created a UDF that queries a Confluent Schema Registry for a schema,
which is then used within a Dataset select with the from_avro function to
decode an Avro-encoded value (reading from a bunch of Kafka topics):
Dataset<Row> recordDF = df.select(
callUDF("getjsonSchemaUDF", col(