Hi,

IMHO this is not the best use of Spark. I would suggest using a simple Azure Function to unzip.
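For illustration only (not part of the original reply), a minimal sketch of such a blob-triggered Azure Function, assuming the Python v2 programming model and made-up container/connection names, could look like this:

    # function_app.py -- hypothetical sketch; container paths and the
    # "AzureWebJobsStorage" connection setting are assumptions.
    import gzip
    import logging

    import azure.functions as func

    app = func.FunctionApp()

    @app.blob_trigger(arg_name="blob",
                      path="raw-events/{name}",          # assumed input container
                      connection="AzureWebJobsStorage")
    @app.blob_output(arg_name="outblob",
                     path="unzipped-events/{name}",      # assumed output container
                     connection="AzureWebJobsStorage")
    def unzip_event(blob: func.InputStream, outblob: func.Out[bytes]):
        """Read a gzip-compressed blob and write the decompressed bytes back out."""
        data = gzip.decompress(blob.read())
        logging.info("Decompressed %s to %d bytes", blob.name, len(data))
        outblob.set(data)

The idea would be that the function fires whenever a gzip blob lands in the input container, and the Spark job only ever sees already-decompressed files.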
Is there any specific reason to use gzip over Event Hub? If you can wait 10-20 sec to process, you can use Event Hub Capture to write the data to storage and then process it. It all depends on the compute you are willing to pay for; running a scheduled job every 3 sec would not give you any benefit over streaming.

Best
Ayan

On Wed, 9 Mar 2022 at 5:42 am, Data Guy <dataengineer4...@gmail.com> wrote:

> Hi everyone,
>
> *<first time writing to this mailing list>*
>
> Context: I have events coming into Databricks from an Azure Event Hub in a
> Gzip compressed format. Currently, I extract the files with a UDF and send
> the unzipped data into the silver layer in my Delta Lake with .write. Note
> that even though data comes in continuously, I do not use .writeStream as of
> now.
>
> I have a few design-related questions that I hope someone with experience
> could help me with!
>
> 1. Is there a better way to extract Gzip files than a UDF?
> 2. Is Spark Structured Streaming or Batch with Databricks Jobs better?
> (The pipeline runs once every 3 hours, but the data is continuously coming
> from Event Hub)
> 3. Should I use Autoloader or just simply stream data into Databricks
> using Event Hubs?
>
> I am especially curious about the trade-offs and the best way forward. I
> don't have massive amounts of data.
>
> Thank you very much in advance!
>
> Best wishes,
> Maurizio Vancho Argall

--
Best Regards,
Ayan Guha
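To make the Capture-then-process path concrete, here is a hedged sketch (storage paths, checkpoint location, and the target table name are assumptions, not from the thread) that reads the Avro files Event Hub Capture writes using Auto Loader, decompresses the gzip payload in the Body column, and writes to a Delta table in a batch-style triggered run:

    # Hypothetical sketch: Event Hub Capture lands Avro files in storage,
    # Auto Loader picks them up incrementally. Assumes a Databricks notebook
    # where `spark` is already defined.
    import gzip
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # Capture stores the event payload as bytes in the "Body" column.
    gunzip = udf(lambda b: gzip.decompress(bytes(b)).decode("utf-8") if b else None,
                 StringType())

    raw = (spark.readStream
           .format("cloudFiles")
           .option("cloudFiles.format", "avro")
           .load("abfss://capture@<storageaccount>.dfs.core.windows.net/<eventhub>/"))  # assumed path

    events = raw.withColumn("json", gunzip(col("Body")))

    (events.writeStream
           .format("delta")
           .option("checkpointLocation", "/mnt/checkpoints/silver_events")  # assumed
           .trigger(availableNow=True)   # process whatever has arrived, then stop
           .toTable("silver.events"))    # assumed target table

Running this with trigger(availableNow=True) from a scheduled Databricks Job keeps a periodic cadence while still getting Auto Loader's incremental file discovery and checkpointing, and the gzip decode here is still a small UDF only because the payload itself is compressed.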