Hi Bryant,
The docs below are a good start on performance tuning:
https://spark.apache.org/docs/latest/sql-performance-tuning.html
Hope it helps!
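For example, a couple of the settings that page covers, as a rough PySpark sketch (the values are only illustrative, not recommendations for your workload):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive query execution re-optimises joins and shuffle partition counts at
# runtime (on by default in recent Spark versions; set here just to show the knob).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Tables smaller than this threshold are broadcast to every executor instead of
# being shuffle-joined. 10 MB is the default; tune it to your data.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))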
On Wed, Nov 29, 2023 at 9:32 AM Bryant Wright
wrote:
> Hi, I'm looking for a comprehensive list of Tuning Best Practices for
> spark.
>
> I did a se
Hi Haseeb,
I think the user mailing list is what you're looking for; people are usually
pretty active on here if you present a direct question about Apache Spark.
I've linked the community guidelines below, which explain which mailing
lists are for what, etc.
https://spark.apache.org/community.html
Th
Hi,
I've got a number of tables that I'm loading in from a SQL Server. The
timestamps in SQL Server are stored like 2003-11-24T09:02:32. I get these as
parquet files in our raw storage location and pick them up in Databricks.
When I load the data in Databricks, the dataframe/Spark assumes UTC or
+000
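If it helps, this is roughly how I'd make the interpretation explicit; it's only a sketch, and the path, column name and source time zone below are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The session time zone controls how timestamps are rendered and converted;
# the parquet timestamp values themselves are stored as UTC instants.
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.read.parquet("/mnt/raw/sqlserver/my_table")  # placeholder path

# If the SQL Server values were really local wall-clock times but got read as
# UTC, shift them back into the zone they were recorded in.
df = df.withColumn(
    "modified_ts_local",
    F.from_utc_timestamp(F.col("modified_ts"), "Pacific/Auckland"),  # placeholder column/zone
)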
Hi,
There is some good documentation here:
https://docs.databricks.com/structured-streaming/query-recovery.html
The “recovery after change in structured streaming query” heading gives good
general guidelines on what can be changed during a “pause” of a stream.
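The main thing in practice is keeping the same checkpoint location across the restart; something like this (a sketch, the paths and formats are just examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.format("delta").load("/mnt/bronze/events")  # example source

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")  # must stay the same across restarts
    .outputMode("append")
    .start("/mnt/silver/events")  # example sink
)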
On Thu, 16 Feb 2023 at
Hi,
I'm wanting to start contributing to the Spark project. Do I need a Jira
account at https://issues.apache.org/jira/projects/SPARK/summary before I'm
able to do this? If so, can one please be created with this email address?
Thank you
As far as I understand, you will need a GPU on each worker node, or you will
need to partition the GPU processing somehow across the nodes, which I think
would defeat the purpose. In Databricks, for example, when you select GPU
workers there is a GPU allocated to each worker. I assume this is the
“correct”
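On open-source Spark, the 3.x resource scheduling configs express the same one-GPU-per-executor idea. A rough sketch (the discovery script path is just the sample script Spark ships, and the task fraction is only an example):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # One GPU per executor, discovered via a script that reports the device IDs.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/examples/src/main/scripts/getGpusResources.sh")
    # Let four tasks share the executor's single GPU.
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)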
When reading in Gzip files, I’ve always read them into a data frame and then
written them out to parquet/delta more or less in their raw form, and then used
those files for my transformations, since the workloads are now parallelisable
from the split files. When reading in Gzips these will be read by th
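Roughly the pattern I mean, as a sketch (the paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A .csv.gz file is not splittable, so this read happens on a single core.
raw = spark.read.option("header", "true").csv("/mnt/landing/export.csv.gz")  # placeholder path

# Written back out as parquet, the data is split across many files,
# so everything downstream can run in parallel.
raw.write.mode("overwrite").parquet("/mnt/raw/export")  # placeholder path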