Hi,

IMHO this is not the best use of Spark. I would suggest using a simple Azure Function to unzip.
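For illustration only (not part of the original reply), a minimal sketch of such a blob-triggered Azure Function, assuming the Python v2 programming model and made-up container/connection names, could look like this:

    # function_app.py -- hypothetical sketch; container paths and the
    # "AzureWebJobsStorage" connection setting are assumptions.
    import gzip
    import logging

    import azure.functions as func

    app = func.FunctionApp()

    @app.blob_trigger(arg_name="blob",
                      path="raw-events/{name}",          # assumed input container
                      connection="AzureWebJobsStorage")
    @app.blob_output(arg_name="outblob",
                     path="unzipped-events/{name}",      # assumed output container
                     connection="AzureWebJobsStorage")
    def unzip_event(blob: func.InputStream, outblob: func.Out[bytes]):
        """Read a gzip-compressed blob and write the decompressed bytes back out."""
        data = gzip.decompress(blob.read())
        logging.info("Decompressed %s to %d bytes", blob.name, len(data))
        outblob.set(data)

The idea would be that the function fires whenever a gzip blob lands in the input container, and the Spark job only ever sees already-decompressed files.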
Is there any specific reason to use gzip over Event Hub? If you can wait 10-20 sec to process, you can use Event Hub Capture to write the data to storage and then process it. It all depends on the compute you are willing to pay for; running a scheduled job every 3 sec would not give you any benefit over streaming.

Best
Ayan

On Wed, 9 Mar 2022 at 5:42 am, Data Guy <dataengineer4...@gmail.com> wrote:

> Hi everyone,
>
> *<first time writing to this mailing list>*
>
> Context: I have events coming into Databricks from an Azure Event Hub in a
> Gzip compressed format. Currently, I extract the files with a UDF and send
> the unzipped data into the silver layer in my Delta Lake with .write. Note
> that even though data comes in continuously, I do not use .writeStream as of
> now.
>
> I have a few design-related questions that I hope someone with experience
> could help me with!
>
> 1. Is there a better way to extract Gzip files than a UDF?
> 2. Is Spark Structured Streaming or Batch with Databricks Jobs better?
> (The pipeline runs once every 3 hours, but the data is continuously coming
> from Event Hub)
> 3. Should I use Autoloader or just simply stream data into Databricks
> using Event Hubs?
>
> I am especially curious about the trade-offs and the best way forward. I
> don't have massive amounts of data.
>
> Thank you very much in advance!
>
> Best wishes,
> Maurizio Vancho Argall

--
Best Regards,
Ayan Guha
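To make the Capture-then-process path concrete, here is a hedged sketch (storage paths, checkpoint location, and the target table name are assumptions, not from the thread) that reads the Avro files Event Hub Capture writes using Auto Loader, decompresses the gzip payload in the Body column, and writes to a Delta table in a batch-style triggered run:

    # Hypothetical sketch: Event Hub Capture lands Avro files in storage,
    # Auto Loader picks them up incrementally. Assumes a Databricks notebook
    # where `spark` is already defined.
    import gzip
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # Capture stores the event payload as bytes in the "Body" column.
    gunzip = udf(lambda b: gzip.decompress(bytes(b)).decode("utf-8") if b else None,
                 StringType())

    raw = (spark.readStream
           .format("cloudFiles")
           .option("cloudFiles.format", "avro")
           .load("abfss://capture@<storageaccount>.dfs.core.windows.net/<eventhub>/"))  # assumed path

    events = raw.withColumn("json", gunzip(col("Body")))

    (events.writeStream
           .format("delta")
           .option("checkpointLocation", "/mnt/checkpoints/silver_events")  # assumed
           .trigger(availableNow=True)   # process whatever has arrived, then stop
           .toTable("silver.events"))    # assumed target table

Running this with trigger(availableNow=True) from a scheduled Databricks Job keeps a periodic cadence while still getting Auto Loader's incremental file discovery and checkpointing, and the gzip decode here is still a small UDF only because the payload itself is compressed.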