https://github.com/LucaCanali/sparkMeasure
A few microbenchmarks of Spark reading Parquet with different JDKs are at:
https://db-blog.web.cern.ch/node/192
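For illustration, here is a minimal sketch of such a measurement using sparkMeasure's
stage-level metrics from spark-shell (the Parquet path is a placeholder, and the library
needs to be on the classpath, e.g. via --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17):

val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
// runAndMeasure collects and prints aggregated stage/task metrics (elapsed time, CPU time,
// bytes read, shuffle metrics, ...) for the enclosed action
stageMetrics.runAndMeasure {
  spark.read.parquet("/path/to/parquet_table").count()  // placeholder path; count() forces a full read
}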
Best,
Luca
From: Faiz Halde
Sent: Thursday, December 7, 2023 23:25
To: user@spark.apache.org
Subject: Spark on Java 17
Hello,
We are planning to switch
ack to testing that at a later stage.
It definitely would be good to know if people using PySpark and Python UDFs
find this proposed improvement useful.
I see the proposed additional instrumentation as complementary to the
Python/Pandas UDF Profiler introduced in Spark 3.3.
Best,
Luca
@Abdeali as for “lightweight profiling”, there is some work in progress on
instrumenting Python UDFs with Spark metrics, see
https://issues.apache.org/jira/browse/SPARK-34265
However it is a bit stuck at the moment, and needs to be revived I believe.
Best,
Luca
From: Abdeali
Unsubscribe
Hi Mich,
With Spark 3.1.1 you need to use spark-measure built with Scala 2.12:
bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
Best,
Luca
From: Mich Talebzadeh
Sent: Thursday, December 23, 2021 19:59
To: Luca Canali
Cc: user
Subject: Re: measure running
https://spark.apache.org/docs/latest/monitoring.html
You can also have a look at this tool, which takes care of automating the
collection and aggregation of executor task metrics:
https://github.com/LucaCanali/sparkMeasure
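For reference, a minimal sketch of interactive use of the Scala API (the SQL query is just
an example workload; the package is added with --packages as in the command above):

val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()  // example workload
stageMetrics.end()
// Prints executor task metrics aggregated over the stages run between begin() and end()
stageMetrics.printReport()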
Best,
Luca
From: Gourav Sengupta
Sent: Thursday, December 23, 2021 14:23
API and the Spark
metrics system, see https://spark.apache.org/docs/latest/monitoring.html
Further information on the topic is also available at
https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring
Best,
Luca
-Original Message-
From: Arthur Li
Sent: Thursday, December
using Metrics and Plugins - Databricks
<https://databricks.com/session_na21/monitor-apache-spark-3-on-kubernetes-using-metrics-and-plugins>
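To give an idea of the Spark 3.x plugin API, here is a hedged minimal sketch of an executor
plugin that registers one custom metric; the class name and the metric are invented for
illustration, and the plugin would be enabled with --conf spark.plugins=com.example.DemoPlugin:

import java.util.{Map => JMap}
import com.codahale.metrics.Gauge
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class DemoPlugin extends SparkPlugin {
  // No driver-side component in this sketch
  override def driverPlugin(): DriverPlugin = null

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
      // Metrics registered here are reported through the regular Spark metrics system and its sinks
      ctx.metricRegistry().register("demoUsedHeapBytes", new Gauge[Long] {
        override def getValue(): Long =
          Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
      })
    }
  }
}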
Best,
Luca
From: Anil Dasari
Sent: Monday, December 20, 2021 07:02
To: user@spark.apache.org
Subject: Spark 3.0 plugins
Hello everyone,
.
Best,
Luca
-Original Message-
From: paulp
Sent: Monday, May 24, 2021 17:09
To: user@spark.apache.org
Subject: Spark Prometheus Metrics for Executors Not Working
Hi,
recently our team has evaluated the prometheusServlet configuration in order to
have Spark master, worker, driver and
proved memory instrumentation and improved
instrumentation for streaming, so you can profit from testing there too.
From: Eric Beabes
Sent: Friday, January 8, 2021 04:23
To: Luca Canali
Cc: spark-user
Subject: Re: Understanding Executors UI
So when I see this for 'Storage Memory': 3.3TB/
https://spark.apache.org/docs/latest/tuning.html#memory-management-overview
Additional resources: see also this diagram
https://canali.web.cern.ch/docs/SparkExecutorMemory.png and
https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring
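As a rough, hedged sketch of where that figure comes from with the unified memory manager
(the executor heap size below is only an example; off-heap memory is added on top when
spark.memory.offHeap.enabled is set):

// 'Storage Memory' in the Executors page is roughly the unified (execution + storage) memory,
// shown per executor and summed up in the totals row
val executorHeapBytes  = 10L * 1024 * 1024 * 1024   // example: spark.executor.memory=10g
val reservedBytes      = 300L * 1024 * 1024         // fixed reserved memory
val memoryFraction     = 0.6                        // spark.memory.fraction (default)
val unifiedPerExecutor = ((executorHeapBytes - reservedBytes) * memoryFraction).toLong
println(f"~${unifiedPerExecutor / math.pow(1024, 3)}%.2f GiB per executor")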
Best,
Luca
From: Eric Beabes
Sent: Wednesday, January
ISOLATION statement,
although I am not familiar with the details of DB2.
Would that be useful for your use case?
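For what it's worth, a hedged sketch of injecting such a statement with the JDBC data source
option sessionInitStatement (available since Spark 2.3); the connection details are placeholders
and the exact SET ISOLATION syntax should be checked against the DB2 documentation:

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://db2host:50000/MYDB")       // placeholder connection details
  .option("dbtable", "MYSCHEMA.MYTABLE")                // placeholder
  .option("user", "dbuser")
  .option("password", "...")
  // Executed on each database session right after it is opened, before Spark starts reading
  .option("sessionInitStatement", "SET CURRENT ISOLATION = UR")
  .load()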
Best,
Luca
-Original Message-
From: Filipa Sousa
Sent: Wednesday, September 2, 2020 16:34
To: user@spark.apache.org
Cc: Ana Sofia Martins
Subject: Adding isolation level when r
thing like this (for Spark 3.0):
val df = spark.read.parquet("/TPCDS/tpcds_1500/store_sales")
df.write.format("noop").mode("overwrite").save()
Best,
Luca
From: Rao, Abhishek (Nokia - IN/Bangalore)
Sent: Monday, August 24, 2020 13:50
To: user@spark.apache.org
Subject:
seems a Python-only problem that doesn't affect Scala. I
didn't find any outstanding bugs, so given the fact that 2.4.4 is very
recent I thought to report it here and ask for advice :)
Thanks in advance!
Luca
Hi,
I would like to add the applicationId to all logs produced by Spark through
Log4j. Consider that I have a cluster with several jobs running in it, so
the presence of the applicationId would be useful to logically divide them.
I have found a partial solution. If I change the layout of the
Patt
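One possible sketch along those lines, assuming Spark 2.x+ with the bundled Log4j 1.x: put the
applicationId into the MDC on the driver and reference it from the PatternLayout (executor logs
would need the MDC populated on the executors as well):

import org.apache.log4j.MDC
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mdc-example").getOrCreate()
// Makes the applicationId available to the layout as %X{appId}
MDC.put("appId", spark.sparkContext.applicationId)
// and in log4j.properties, for example:
// log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p [%X{appId}] %c{1}: %m%n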
Connecting to Oracle from Spark using the TCPS protocol works OK for me.
Maybe try to turn debug on with -Djavax.net.debug=all?
See also:
https://blogs.oracle.com/dev2dev/ssl-connection-to-oracle-db-using-jdbc%2c-tlsv12%2c-jks-or-oracle-wallets
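For reference, a hedged sketch of what a TCPS JDBC read can look like from Spark; host, port,
service name, credentials and truststore are placeholders, and the SSL setup (JKS vs. Oracle
wallet) follows the blog post above:

val url = "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=dbhost)(PORT=2484))" +
  "(CONNECT_DATA=(SERVICE_NAME=myservice)))"
val df = spark.read
  .format("jdbc")
  .option("url", url)
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "MYSCHEMA.MYTABLE")   // placeholder
  .option("user", "scott")                 // placeholder
  .option("password", "...")
  .load()
// Truststore settings (and -Djavax.net.debug=all when troubleshooting) can be passed via
// spark.driver.extraJavaOptions / spark.executor.extraJavaOptions, e.g.
//   -Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=...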
Regards,
L.
From: Richard Xin
Sent: Wednesday, June
I find that the Spark metrics system is quite useful to gather resource
utilization metrics of Spark applications, including CPU, memory and I/O.
If you are interested, there is an example of how this works for us at:
https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark
If
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Executors_Kerberos_HowTo.md
Regards,
Luca
From: Marcelo Vanzin
Sent: Monday, October 15, 2018 18:32
To: foster.langb...@riskfrontiers.com
Cc: user
Subject: Re: kerberos auth for MS SQL server jdbc driver
Spark only does Kerberos
a:484)
at
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
at
org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
at it.unimi.di.luca.SimpleApp.main(SimpleApp.java:53)
Any suggestions?
Cheers
Luca
Hi Jan,
It's a physical server. I launched the application with:
- "spark.cores.max": "12"
- "spark.executor.cores": "3"
- 2 GB RAM per worker
Spark version is 1.6.0; I don't use Hadoop.
Thanks,
Luca
-Mes
Hi Mich,
I have only 32 cores; I tested with 2 GB of memory per worker to force
spills to disk. My application had 12 cores and 3 cores per executor.
Thank you very much.
Luca
From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, April 15, 2016 18:56
To: Luca Guerra
Cc
Hi,
I'm testing the Livy server with Hue 3.9 and Spark 1.6.0 inside a kerberized
cluster (HDP 2.4). When I run the command
/usr/java/jdk1.7.0_71//bin/java -Dhdp.version=2.4.0.0-169 -cp
/usr/hdp/2.4.0.0-169/spark/conf/:/usr/hdp/2.4.0.0-169/spark/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0
> it up, you still need to copy all of the data to a single node. Is there
> something which forces you to only write from a single node?
>
>
> On Friday, September 11, 2015, Luca wrote:
>
>> Hi,
>> thanks for answering.
>>
>> With the *coalesce() *trans
Hi,
thanks for answering.
With the *coalesce()* transformation, a single worker is in charge of
writing to HDFS, but I noticed that the single write operation usually
takes too much time, slowing down the whole computation (this is
particularly true when 'unified' is made of several partitions). Be
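A small sketch of the trade-off being discussed, with a stand-in for the 'unified' RDD of this
thread:

// Stand-in for the 'unified' RDD from this thread
val unified = sc.parallelize(1 to 1000000).map(_.toString)

// Each partition is written in parallel as its own part-xxxxx file
unified.saveAsTextFile("hdfs:///output/parallel")

// coalesce(1) funnels all the data through a single task on a single worker:
// one output file, but that write becomes the bottleneck described above
unified.coalesce(1).saveAsTextFile("hdfs:///output/single")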
Thank you! :)
2015-08-10 19:58 GMT+02:00 Cody Koeninger :
> There's no long-running receiver pushing blocks of messages, so
> blockInterval isn't relevant.
>
> Batch interval is what matters.
>
> On Mon, Aug 10, 2015 at 12:52 PM, allonsy wrote:
>
>> Hi everyone,
>>
>> I recently started using th
Thanks a lot!
Can I ask why this code generates a uniform distribution?
If dist is N(0,1) data should be N(-1, 2).
Let me know.
Thanks,
Luca
2015-02-07 3:00 GMT+00:00 Burak Yavuz :
> Hi,
>
> You can do the following:
> ```
> import org.apache.spark.mllib.linalg.distributed.Row
Hi all,
this is my first email to this mailing list and I hope that I am not
doing anything wrong.
I am currently trying to define a distributed matrix with n rows and k
columns where each element is randomly sampled from a uniform distribution.
How can I do that?
It would also be nice if you can
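For what it's worth, a minimal sketch of one way to do this with MLlib's RandomRDDs (the
dimensions are placeholders):

import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val n = 1000L   // number of rows (example)
val k = 10      // number of columns (example)
// RDD of n vectors of length k, with i.i.d. entries uniform on (0, 1)
val rows = RandomRDDs.uniformVectorRDD(sc, n, k)
val mat = new RowMatrix(rows, n, k)
// To sample from U(a, b) instead, rescale each entry: x => a + (b - a) * x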