Hi Faiz,
We find G1GC works well for some of our workloads that are Parquet-read
intensive and we have been using G1GC with Spark on Java 8 already
(spark.driver.extraJavaOptions and spark.executor.extraJavaOptions set to
“-XX:+UseG1GC”), while currently we are mostly running Spark (3.3 and higher)
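For what it is worth, a minimal sketch of how this can be set from PySpark (the app name is just a placeholder; in client mode the driver option has to be passed at submit time, e.g. via spark-submit or spark-defaults.conf, since the driver JVM is already running):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("g1gc-example")  # placeholder app name
             # executor JVMs start after the session is created, so this takes effect
             .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
             # for the driver in client mode, prefer passing at submit time:
             #   --conf spark.driver.extraJavaOptions=-XX:+UseG1GC
             .getOrCreate())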
From: Abdeali Kothari
Sent: Friday, August 26, 2022 15:59
To: Luca Canali
Cc: Russell Jurney; Gourav Sengupta; Sean Owen; Takuya UESHIN; user; Subash Prabanantham
Subject: Re: Profiling PySpark Pandas UDF
Hi Luca, I see you pushed some code to the PR 3 hrs ago.
That's awesome. If
@Abdeali as for “lightweight profiling”, there is some work in progress on
instrumenting Python UDFs with Spark metrics, see
https://issues.apache.org/jira/browse/SPARK-34265
However it is a bit stuck at the moment, and needs to be revived I believe.
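In the meantime, a possible stopgap is Spark's built-in Python worker profiler, which in recent versions also covers pandas UDFs as far as I recall. A minimal sketch (the UDF and data are just placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf

    spark = (SparkSession.builder
             .config("spark.python.profile", "true")  # must be set before the SparkContext starts
             .getOrCreate())

    @udf("long")
    def plus_one(x):
        return None if x is None else x + 1

    spark.range(1000).select(plus_one("id")).count()  # run something that exercises the UDF
    spark.sparkContext.show_profiles()                # dump the accumulated cProfile stats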
Best,
Luca
From: Abdeali Kothari
Hi Mich,
With Spark 3.1.1 you need to use spark-measure built with Scala 2.12:
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
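Once the package is on the classpath, a minimal usage sketch from PySpark (this also assumes the Python wrapper is installed, e.g. pip install sparkmeasure; the query is just an example workload):

    from sparkmeasure import StageMetrics

    stagemetrics = StageMetrics(spark)   # 'spark' is the active SparkSession
    stagemetrics.begin()
    spark.sql("select count(*) from range(1000) cross join range(1000)").show()
    stagemetrics.end()
    stagemetrics.print_report()          # aggregated stage-level metrics for the measured block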
Best,
Luca
From: Mich Talebzadeh
Sent: Thursday, December 23, 2021 19:59
To: Luca Canali
Cc: user
Subject: Re: measure running
Hi,
I agree with Gourav that just measuring execution time is a simplistic approach
that may lead you to miss important details, in particular when running
distributed computations.
The WebUI, REST API, and metrics instrumentation in Spark can be quite useful for
further drill-down. For the REST API and the Spark
metrics system, see https://spark.apache.org/docs/latest/monitoring.html
Further information on the topic also at
https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring
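As a small illustration of the REST API side, a sketch that pulls stage-level metrics from a running application (host and port are placeholders for your driver UI, typically port 4040):

    import json
    import urllib.request

    base = "http://localhost:4040/api/v1"  # placeholder driver UI address
    apps = json.load(urllib.request.urlopen(f"{base}/applications"))
    app_id = apps[0]["id"]
    stages = json.load(urllib.request.urlopen(f"{base}/applications/{app_id}/stages"))
    for s in stages[:5]:
        print(s["stageId"], s["status"], s["executorRunTime"])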
Best,
Luca
-----Original Message-----
From: Arthur Li
Sent: Thursday, December
Hi Anil,
To recap: Apache Spark plugins are an interface and configuration that allow you
to inject code at executor start-up and, among other things, provide a hook into
the Spark metrics system. This provides a way to extend metrics collection beyond
what is available in Apache Spark.
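To make that concrete, a minimal sketch of wiring a plugin into a PySpark application; com.example.MyMetricsPlugin is a hypothetical JVM class implementing org.apache.spark.api.plugin.SparkPlugin, and its jar has to be on the driver/executor classpath (e.g. via --jars):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # hypothetical plugin class; Spark instantiates it on the driver and on each executor
             .config("spark.plugins", "com.example.MyMetricsPlugin")
             .getOrCreate())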
Instrumenti
The PrometheusServlet adds a servlet within the existing Spark UI to serve
metrics data in Prometheus format.
Similarly to what happens with the MetricsServlet, the Prometheus servlet does
not work on executors, as executors do not have a Spark UI endpoint to which
the servlet could attach.
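For reference, a sketch of how the Prometheus endpoints can be switched on from the application configuration (equivalent to entries in metrics.properties; spark.ui.prometheus.enabled is the separate, experimental knob that makes the driver UI serve executor metrics at /metrics/executors/prometheus):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # driver metrics in Prometheus format on the Spark UI
             .config("spark.metrics.conf.*.sink.prometheusServlet.class",
                     "org.apache.spark.metrics.sink.PrometheusServlet")
             .config("spark.metrics.conf.*.sink.prometheusServlet.path",
                     "/metrics/prometheus")
             # executor metrics served by the driver UI (since executors have no UI of their own)
             .config("spark.ui.prometheus.enabled", "true")
             .getOrCreate())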
proved memory instrumentation and improved
instrumentation for streaming, so you can profit from testing there too.
From: Eric Beabes
Sent: Friday, January 8, 2021 04:23
To: Luca Canali
Cc: spark-user
Subject: Re: Understanding Executors UI
So when I see this for 'Storage Memory': 3.3TB/
https://spark.apache.org/docs/latest/tuning.html#memory-management-overview
Additional resource: see also this diagram
https://canali.web.cern.ch/docs/SparkExecutorMemory.png and
https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring
Best,
Luca
From: Eric Beabes
Sent: Wednesday, January
Hi Filipa,
Spark JDBC data source has the option to add a "sessionInitStatement".
Documented in https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
and https://issues.apache.org/jira/browse/SPARK-21519
I guess you could use that option to "inject" a SET ISOLATION statement,
altho
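A sketch of what that could look like from PySpark; the connection details and the exact isolation statement are placeholders to adapt to your database:

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  # placeholder URL
          .option("dbtable", "myschema.mytable")                     # placeholder table
          .option("user", "myuser")
          .option("password", "mypassword")
          # executed on each database session before reading any data
          .option("sessionInitStatement", "SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
          .load())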
Hi Abhishek,
Just a few ideas/comments on the topic:
When benchmarking/testing I find it useful to collect a more complete view of
resource usage and Spark metrics, beyond just measuring query elapsed time.
Something like this:
https://github.com/cerndb/spark-dashboard
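The dashboard ingests the Spark metrics stream; a sketch of pointing an application at it through the Graphite sink (host, port and prefix are placeholders for wherever the dashboard's ingestion endpoint runs):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.metrics.conf.*.sink.graphite.class",
                     "org.apache.spark.metrics.sink.GraphiteSink")
             .config("spark.metrics.conf.*.sink.graphite.host", "dashboard-host")  # placeholder
             .config("spark.metrics.conf.*.sink.graphite.port", "2003")
             .config("spark.metrics.conf.*.sink.graphite.period", "10")
             .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
             .config("spark.metrics.conf.*.sink.graphite.prefix", "myapp")         # placeholder
             .getOrCreate())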
I'd rather not use dyn
Connecting to Oracle from Spark using the TCPS protocol works OK for me.
Maybe try to turn debug on with -Djavax.net.debug=all?
See also:
https://blogs.oracle.com/dev2dev/ssl-connection-to-oracle-db-using-jdbc%2c-tlsv12%2c-jks-or-oracle-wallets
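If it helps, a sketch of passing that flag to the executors from PySpark (for the driver in client mode it is usually passed at submit time instead):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # verbose TLS/JSSE handshake logging on the executors, where the JDBC reads run
             .config("spark.executor.extraJavaOptions", "-Djavax.net.debug=all")
             .getOrCreate())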
Regards,
L.
From: Richard Xin
Sent: Wednesday, June
I find that the Spark metrics system is quite useful to gather resource
utilization metrics of Spark applications, including CPU, memory and I/O.
If you are interested, there is an example of how this works for us at:
https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark
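As a small illustration, some of the optional metric sources need to be switched on explicitly; a sketch (configuration names from the Spark monitoring docs, availability may vary by version):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # register the JVM source (GC time, memory pools) on driver and executors
             .config("spark.metrics.conf.driver.source.jvm.class",
                     "org.apache.spark.metrics.source.JvmSource")
             .config("spark.metrics.conf.executor.source.jvm.class",
                     "org.apache.spark.metrics.source.JvmSource")
             # /proc-based process-tree memory metrics (Linux only, off by default)
             .config("spark.executor.processTreeMetrics.enabled", "true")
             .getOrCreate())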
If
We have a case where we interact with a Kerberized service and found a simple
workaround to distribute and make use of the driver’s Kerberos credential cache
file in the executors. Maybe some of the ideas there can be of help for this
case too? Our case is on Linux though. Details:
https://git
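The gist of it, as a rough sketch rather than the exact recipe in that note: ship the driver's ticket cache to the executors and point them at the local copy (paths are placeholders, and kinit is assumed to have been run on the driver host):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # distribute the driver's Kerberos credential cache file to each executor's working directory
             .config("spark.files", "/tmp/krb5cc_1000")             # placeholder cache path
             # make the Kerberos libraries on the executors use the shipped copy
             .config("spark.executorEnv.KRB5CCNAME", "krb5cc_1000")
             .getOrCreate())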