RE: Spark on Java 17

2023-12-09 Thread Luca Canali
https://github.com/LucaCanali/sparkMeasure A few microbenchmark tests of Spark reading Parquet with a few different JDKs are at: https://db-blog.web.cern.ch/node/192 Best, Luca From: Faiz Halde Sent: Thursday, December 7, 2023 23:25 To: user@spark.apache.org Subject: Spark on Java 17 Hello, We are planning to switch
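
For context, a minimal sketch of pointing a Spark installation at a specific JDK when comparing Java versions (the JDK path is illustrative):

    # conf/spark-env.sh
    export JAVA_HOME=/usr/lib/jvm/jdk-17
    # then check which JVM Spark actually picks up
    $SPARK_HOME/bin/spark-submit --version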

RE: Profiling PySpark Pandas UDF

2022-08-29 Thread Luca Canali
back to testing that at a later stage. It definitely would be good to know if people using PySpark and Python UDFs find this proposed improvement useful. I see the proposed additional instrumentation as complementary to the Python/Pandas UDF Profiler introduced in Spark 3.3. Best, Luca
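
For reference, a minimal sketch of the Python/Pandas UDF profiler mentioned above (Spark 3.3+): it is enabled via spark.python.profile and results are printed with show_profiles(); the UDF itself is illustrative.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    # Enable the Python/Pandas UDF profiler (Spark 3.3+)
    spark = (SparkSession.builder
             .config("spark.python.profile", "true")
             .getOrCreate())

    @pandas_udf("double")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1.0

    spark.range(1000).select(plus_one("id")).collect()
    spark.sparkContext.show_profiles()  # per-UDF cProfile output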

RE: Profiling PySpark Pandas UDF

2022-08-26 Thread Luca Canali
@Abdeali as for “lightweight profiling”, there is some work in progress on instrumenting Python UDFs with Spark metrics, see https://issues.apache.org/jira/browse/SPARK-34265 However, it is a bit stuck at the moment and needs to be revived, I believe. Best, Luca From: Abdeali

[no subject]

2022-02-24 Thread Luca Borin
Unsubscribe

RE: measure running time

2021-12-23 Thread Luca Canali
Hi Mich, With Spark 3.1.1 you need to use spark-measure built with Scala 2.12: bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17 Best, Luca From: Mich Talebzadeh Sent: Thursday, December 23, 2021 19:59 To: Luca Canali Cc: user Subject: Re: measure running
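
A minimal sketch of using sparkMeasure from PySpark once the package above is on the classpath (the Python wrapper also needs pip install sparkmeasure; API as in the sparkMeasure README; the query is illustrative):

    from sparkmeasure import StageMetrics

    stagemetrics = StageMetrics(spark)
    stagemetrics.begin()
    spark.sql("select count(*) from range(1000) cross join range(1000)").show()
    stagemetrics.end()
    stagemetrics.print_report()   # aggregated stage/task metrics, including elapsed time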

RE: measure running time

2021-12-23 Thread Luca Canali
https://spark.apache.org/docs/latest/monitoring.html You can also have a look at this tool that automates the collection and aggregation of some executor task metrics: https://github.com/LucaCanali/sparkMeasure Best, Luca From: Gourav Sengupta Sent: Thursday, December 23, 2021 14:23

RE: How to estimate the executor memory size according by the data

2021-12-23 Thread Luca Canali
API and the Spark metrics system, see https://spark.apache.org/docs/latest/monitoring.html Further information on the topic also at https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring Best, Luca -Original Message- From: Arthur Li Sent: Thursday, December
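
A sketch of pulling executor memory figures from the REST API described in the monitoring docs (the driver host/port and the presence of peakMemoryMetrics depend on your setup):

    import requests

    app_id = spark.sparkContext.applicationId
    url = "http://localhost:4040/api/v1/applications/%s/executors" % app_id
    for e in requests.get(url).json():
        print(e["id"], e["maxMemory"], e.get("peakMemoryMetrics", {}))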

RE: Spark 3.0 plugins

2021-12-20 Thread Luca Canali
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins - Databricks <https://databricks.com/session_na21/monitor-apache-spark-3-on-kubernetes-using-metrics-and-plugins> Best, Luca From: Anil Dasari Sent: Monday, December 20, 2021 07:02 To: user@spark.apache.org Subject: Spark 3.0 plugins Hello everyone,
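
As a sketch, a Spark 3.x plugin is loaded by class name via the spark.plugins configuration; the class and jar names below are illustrative:

    spark-submit \
      --jars my-plugin.jar \
      --conf spark.plugins=com.example.MyMetricsPlugin \
      ...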

RE: Spark Prometheus Metrics for Executors Not Working

2021-05-24 Thread Luca Canali
Best, Luca -Original Message- From: paulp Sent: Monday, May 24, 2021 17:09 To: user@spark.apache.org Subject: Spark Prometheus Metrics for Executors Not Working Hi, recently our team has evaluated the prometheusServlet configuration in order to have Spark master, worker, driver and
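
For reference, a sketch of the relevant configuration per the Spark 3.x monitoring docs: the PrometheusServlet sink in conf/metrics.properties covers the master, worker and driver instances, while executor metrics are exposed through the driver UI when spark.ui.prometheus.enabled is set.

    # conf/metrics.properties
    *.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
    *.sink.prometheusServlet.path=/metrics/prometheus

    # spark-defaults.conf (experimental): executor metrics at
    # /metrics/executors/prometheus on the driver UI
    spark.ui.prometheus.enabled true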

RE: Understanding Executors UI

2021-01-08 Thread Luca Canali
improved memory instrumentation and improved instrumentation for streaming, so you can profit from testing there too. From: Eric Beabes Sent: Friday, January 8, 2021 04:23 To: Luca Canali Cc: spark-user Subject: Re: Understanding Executors UI So when I see this for 'Storage Memory': 3.3TB/

RE: Understanding Executors UI

2021-01-06 Thread Luca Canali
https://spark.apache.org/docs/latest/tuning.html#memory-management-overview Additional resource: see also this diagram https://canali.web.cern.ch/docs/SparkExecutorMemory.png and https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring Best, Luca From: Eric Beabes Sent: Wednesday, January
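
As a worked sketch of the total shown under 'Storage Memory' in the Executors UI (formula per the memory-management docs linked above; the values are illustrative):

    # on-heap unified memory = (spark.executor.memory - 300 MB reserved) * spark.memory.fraction
    executor_memory = 10 * 1024**3        # spark.executor.memory = 10g
    reserved        = 300 * 1024**2       # fixed reserved memory
    fraction        = 0.6                 # spark.memory.fraction (default)
    unified = (executor_memory - reserved) * fraction
    print(round(unified / 1024**3, 2), "GiB")   # ~5.82 GiB; add spark.memory.offHeap.size if off-heap is enabled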

RE: Adding isolation level when reading from DB2 with spark.read

2020-09-02 Thread Luca Canali
ISOLATION statement, although I am not familiar with the details of DB2. Would that be useful for your use case? Best, Luca -Original Message- From: Filipa Sousa Sent: Wednesday, September 2, 2020 16:34 To: user@spark.apache.org Cc: Ana Sofia Martins Subject: Adding isolation level when r
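
A sketch of what that could look like with the JDBC source's sessionInitStatement option, which runs a statement on each JDBC session before reading (the DB2 URL and the exact SET CURRENT ISOLATION syntax are illustrative):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:db2://dbhost:50000/MYDB")
          .option("dbtable", "MYSCHEMA.MYTABLE")
          .option("user", "...").option("password", "...")
          .option("sessionInitStatement", "SET CURRENT ISOLATION = UR")
          .load())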

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-24 Thread Luca Canali
something like this (for Spark 3.0): val df = spark.read.parquet("/TPCDS/tpcds_1500/store_sales") df.write.format("noop").mode("overwrite").save Best, Luca From: Rao, Abhishek (Nokia - IN/Bangalore) Sent: Monday, August 24, 2020 13:50 To: user@spark.apache.org Subject:

Spark 2.4.4, RPC encryption and Python

2020-01-16 Thread Luca Toscano
seems a Python-only problem that doesn't affect Scala. I didn't find any outstanding bugs, so given that 2.4.4 is very recent I thought I would report it here to ask for advice :) Thanks in advance! Luca

Apache Spark Log4j logging applicationId

2019-07-23 Thread Luca Borin
Hi, I would like to add the applicationId to all logs produced by Spark through Log4j. Consider that I have a cluster with several jobs running in it, so the presence of the applicationId would be useful to logically divide them. I have found a partial solution. If I change the layout of the Patt

RE: tcps oracle connection from spark

2019-06-19 Thread Luca Canali
Connecting to Oracle from Spark using the TCPS protocol works OK for me. Maybe try to turn debug on with -Djavax.net.debug=all? See also: https://blogs.oracle.com/dev2dev/ssl-connection-to-oracle-db-using-jdbc%2c-tlsv12%2c-jks-or-oracle-wallets Regards, L. From: Richard Xin Sent: Wednesday, June
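
For reference, a sketch of passing that debug flag to both the driver and the executors:

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Djavax.net.debug=all" \
      --conf "spark.executor.extraJavaOptions=-Djavax.net.debug=all" \
      ...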

RE: Spark Profiler

2019-03-27 Thread Luca Canali
I find that the Spark metrics system is quite useful for gathering resource utilization metrics of Spark applications, including CPU, memory and I/O. If you are interested, there is an example of how this works for us at: https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark If
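
For reference, a sketch of the sink configuration behind that kind of dashboard, pushing metrics to a Graphite-protocol endpoint (for example an InfluxDB graphite listener; host, port and prefix are placeholders):

    # conf/metrics.properties
    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds
    *.sink.graphite.prefix=spark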

RE: kerberos auth for MS SQL server jdbc driver

2018-10-15 Thread Luca Canali
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Executors_Kerberos_HowTo.md Regards, Luca From: Marcelo Vanzin Sent: Monday, October 15, 2018 18:32 To: foster.langb...@riskfrontiers.com Cc: user Subject: Re: kerberos auth for MS SQL server jdbc driver Spark only does Kerberos
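
A rough sketch of the approach in that note: ship the keytab and a JAAS configuration to the executors and point the JVMs at it (file names are illustrative; see the link above for the full recipe):

    spark-submit \
      --files user.keytab,jaas.conf \
      --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
      ...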

Spark and Kafka direct approach problem

2016-05-04 Thread Luca Ferrari
a:484) at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607) at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala) at it.unimi.di.luca.SimpleApp.main(SimpleApp.java:53) Any suggestions? Cheers Luca

RE: How many disks for spark_local_dirs?

2016-04-18 Thread Luca Guerra
Hi Jan, It's a physical server. I launched the application with: - "spark.cores.max": "12" - "spark.executor.cores": "3" - 2 GB RAM per worker. The Spark version is 1.6.0; I don't use Hadoop. Thanks, Luca -Mes
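
For reference, a sketch of spreading shuffle and spill space over several physical disks (paths are illustrative):

    # conf/spark-env.sh
    export SPARK_LOCAL_DIRS=/data1/spark-tmp,/data2/spark-tmp,/data3/spark-tmp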

RE: How many disks for spark_local_dirs?

2016-04-18 Thread Luca Guerra
Hi Mich, I have only 32 cores; I have tested with 2 GB of memory per worker to force spills to disk. My application had 12 cores and 3 cores per executor. Thank you very much. Luca From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, April 15, 2016 18:56 To: Luca Guerra Cc

Spark 1.6.0 - token renew failure

2016-04-13 Thread Luca Rea
Hi, I'm testing the Livy server with Hue 3.9 and Spark 1.6.0 inside a kerberized cluster (HDP 2.4). When I run the command /usr/java/jdk1.7.0_71//bin/java -Dhdp.version=2.4.0.0-169 -cp /usr/hdp/2.4.0.0-169/spark/conf/:/usr/hdp/2.4.0.0-169/spark/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0

Re: Help with collect() in Spark Streaming

2015-09-12 Thread Luca
> it up, you still need to copy all of the data to a single node. Is there > something which forces you to only write from a single node? > > > On Friday, September 11, 2015, Luca wrote: > >> Hi, >> thanks for answering. >> >> With the *coalesce() *trans

Re: Help with collect() in Spark Streaming

2015-09-11 Thread Luca
Hi, thanks for answering. With the *coalesce()* transformation a single worker is in charge of writing to HDFS, but I noticed that the single write operation usually takes too much time, slowing down the whole computation (this is particularly true when 'unified' is made of several partitions). Be

Re: Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread Luca
Thank you! :) 2015-08-10 19:58 GMT+02:00 Cody Koeninger : > There's no long-running receiver pushing blocks of messages, so > blockInterval isn't relevant. > > Batch interval is what matters. > > On Mon, Aug 10, 2015 at 12:52 PM, allonsy wrote: > >> Hi everyone, >> >> I recently started using th

[no subject]

2015-02-18 Thread Luca Puggini

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Luca Puggini
Thanks a lot! Can I ask why this code generates a uniform distribution? If dist is N(0,1) data should be N(-1, 2). Let me know. Thanks, Luca 2015-02-07 3:00 GMT+00:00 Burak Yavuz : > Hi, > > You can do the following: > ``` > import org.apache.spark.mllib.linalg.distributed.Row

matrix of random variables with spark.

2015-02-06 Thread Luca Puggini
Hi all, this is my first email to this mailing list and I hope I am not doing anything wrong. I am currently trying to define a distributed matrix with n rows and k columns where each element is randomly sampled from a uniform distribution. How can I do that? It would also be nice if you can
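
A minimal sketch of one way to do this with the built-in random data generators (assumes a SparkContext sc; n and k are illustrative):

    from pyspark.mllib.random import RandomRDDs
    from pyspark.mllib.linalg.distributed import RowMatrix

    n, k = 1000, 10
    rows = RandomRDDs.uniformVectorRDD(sc, numRows=n, numCols=k)   # entries ~ U(0, 1)
    mat = RowMatrix(rows)
    print(mat.numRows(), mat.numCols())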