RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Oh, Tzahi, I misread the metrics in the first reply. It's indeed about reads, not writes.

Re: Spark performance over S3

2021-04-07 Thread Tzahi File
Hi Hariharan, Thanks for your reply. In both cases we are writing the data to S3. The difference is that in the first case we read the data from S3 and in the second we read from HDFS. We are using the ListObjectsV2 API in S3A. The S3 bucket and t
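
For reference, a minimal sketch of selecting the S3A list API version and reading Parquet over s3a:// (the fs.s3a.list.version key is a hadoop-aws option, not something stated in this thread, and the bucket/path are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("S3AReadSketch")
      // hadoop-aws option: 2 = ListObjectsV2 (the default), 1 = the older list API
      .config("spark.hadoop.fs.s3a.list.version", "2")
      .getOrCreate()

    // Hypothetical bucket and prefix, read through the s3a:// connector
    val df = spark.read.parquet("s3a://my-bucket/events/dt=2021-04-07/")
    println(df.count())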

Re: Spark performance over S3

2021-04-07 Thread Vladimir Prus
A VPC endpoint can also make a major difference in costs. Without it, access to S3 incurs data transfer costs and NAT costs, and these can be large.

Re: Spark performance over S3

2021-04-07 Thread Hariharan
Hi Tzahi, Comparing the first two cases:
- reads the parquet files from S3 and also writes to S3: it takes 22 min
- reads the parquet files from S3 and writes to its local HDFS: it takes the same amount of time (±22 min)
It looks like most of the time is being spent in reading, and the time s
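
A rough sketch of reproducing that comparison, timing the S3 read separately from the HDFS write (paths and app name are hypothetical; this is an illustration, not the poster's actual job):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ReadVsWriteTiming").getOrCreate()

    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
      result
    }

    // count() forces the otherwise-lazy S3 scan so the read can be timed on its own
    val df = timed("S3 read") {
      val d = spark.read.parquet("s3a://my-bucket/input/")
      d.count()
      d
    }
    // note: the write re-scans the input, since the DataFrame is not cached here
    timed("HDFS write") {
      df.write.mode("overwrite").parquet("hdfs:///tmp/output/")
    }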

RE: Spark performance over S3

2021-04-06 Thread Boris Litvak
… to compare this with EMRFS performance … I know it requires you to put in some work. Boris

Re: Spark performance over S3

2021-04-06 Thread Gourav Sengupta
Hi Tzahi, that is a huge cost. So that I can understand the question before answering it: 1. what is the Spark version that you are using? 2. what is the SQL code that you are using to read and write? There are several other questions that are pertinent, but the above will be a great starting poi

Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Mich Talebzadeh
LOL. I think we should try the crystal ball to answer this question. Dr Mich Talebzadeh

Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Jörn Franke
Do you mind sharing what your software does? What is the input data size? What is the Spark version and APIs used? How many nodes? What is the input data format? Is compression used? > On 21 Sep 2016, at 13:37, Trinadh Kaja wrote: > > Hi all, > > how to increase Spark performance? I am using

Re: Spark performance testing

2016-07-08 Thread Mich Talebzadeh
Hi Andrew, I suggest that you narrow down your scope for performance testing: use the same setup and make incremental changes, keeping everything else the same. Spark itself can run in local, standalone, yarn-client and yarn-cluster modes, so really you need to target a particular setup of run

Re: Spark performance testing

2016-07-08 Thread Andrew Ehrlich
Yea, I'm looking for any personal experiences people have had with tools like these.

Re: Spark performance testing

2016-07-08 Thread charles li
Hi, Andrew, I found lots of material by googling "spark performance test":
- https://github.com/databricks/spark-perf
- https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
- http://people.cs.vt.edu/~butt

Re: spark performance non-linear response

2015-10-07 Thread Jonathan Coveney
I've noticed this as well and am curious if there is anything more people can say. My theory is that it is just communication overhead. If you only have a couple of gigabytes (a tiny dataset), then splitting that onto 50 nodes means you'll have a ton of tiny partitions all finishing very quickly, a

Re: spark performance non-linear response

2015-10-07 Thread Sean Owen
OK, next question then is: if this is wall-clock time for the whole process, then, I wonder if you are just measuring the time taken by the longest single task. I'd expect the time taken by the longest straggler task to follow a distribution like this. That is, how balanced are the partitions? Are
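
One quick way to answer the balance question: count records per partition and look for outliers (a sketch using the RDD API; the input path is a stand-in for the actual dataset):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("PartitionBalance"))
    val rdd = sc.textFile("hdfs:///data/input") // stand-in for the real RDD

    // Records per partition; a few very large partitions would explain stragglers
    val sizes = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size))).collect()
    sizes.sortBy(-_._2).take(10).foreach { case (i, n) =>
      println(s"partition $i: $n records")
    }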

Re: spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Additional missing relevant information: I'm running a transformation, there are no shuffles occurring, and at the end I'm performing a lookup of 4 partitions on the driver. On 10/7/15 11:26 AM, Yadid Ayzenberg wrote: Hi All, I'm using Spark 1.4.1 to analyze a largish data set (several Gig

Re: spark performance - executor computing time

2015-09-17 Thread Adrian Tanase
… Is this repeatable? Do you always get one or two executors that are 6 times as slow? It could be that some of your tasks have more work to do (maybe you are filtering some records out?). If it's always one p

Re: spark performance - executor computing time

2015-09-16 Thread Robin East
Is this repeatable? Do you always get one or two executors that are 6 times as slow? It could be that some of your tasks have more work to do (maybe you are filtering some records out?). If it's always one particular worker node, is there something about the machine configuration (e.g. CPU speed) t

RE: Spark performance

2015-07-13 Thread Mohammed Guller
… Mohammed. Michael Segel wrote: Not necessarily. It depends on the use case and what you intend to do with the data. 4-6

Re: Spark performance

2015-07-12 Thread Michael Segel
… David Mitchell wrote: > You can certainly query over 4 TB of data with Spark. However, you will get > an answer in minu

Re: Spark performance

2015-07-12 Thread santoshv98
Ravi, Spark (or, for that matter, Big Data solutions like Hive) is suited for large analytical loads, where "scaling up" starts to pale in comparison to "scaling out" with regard to performance, versatility (types of data) and cost. Without going into the details of MSSQL architecture, there is

Re: Spark performance

2015-07-11 Thread Jörn Franke
Honestly, you are addressing this wrongly: you do not seem to have a business case for changing, so why do you want to switch? On Sat, 11 Jul 2015 at 3:28, Mohammed Guller wrote: > Hi Ravi, > > First, neither Spark nor Spark SQL is a database. Both are compute > engines, which need to be pa

Re: Spark performance

2015-07-11 Thread Jörn Franke
On Sat, 11 Jul 2015 at 14:53, Roman Sokolov wrote: > Hello. Had the same question. What if I need to store 4-6 TB and do > queries? Can't find any clue in the documentation. > On 11.07.2015 03:28, "Mohammed Guller" wrote: > >> Hi Ravi, >> >> First, neither Spark nor Spark SQL is a database. Bo

RE: Spark performance

2015-07-11 Thread Mohammed Guller
… David Mitchell wrote: You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return

Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return responses in milliseconds. Analytic databases tend to operate on large data sets, and retu

RE: Spark performance

2015-07-11 Thread Roman Sokolov
Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation. On 11.07.2015 03:28, "Mohammed Guller" wrote: > Hi Ravi, > > First, neither Spark nor Spark SQL is a database. Both are compute > engines, which need to be paired with a storage sy

Re: Spark performance

2015-07-11 Thread Jörn Franke
What is your business case for the move? On Fri, 10 Jul 2015 at 12:49, Ravisankar Mani wrote: > Hi everyone, > > I have planned to move from MSSQL Server to Spark. I am using around 50,000 > to 100,000 records. > Spark performance is slow when compared to MSSQL Server. > > What is the best da

RE: Spark performance

2015-07-10 Thread Mohammed Guller
Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don't need Spark. An RDBMS will

Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
It'll help to see the code, or at least to understand what transformations you're using. Also, you have 15 nodes but are not using all of them, which means you may be losing data locality. You can see this in the Spark job UI if any tasks do not have node- or process-local locality.

Re: Spark performance in cluster mode using yarn

2015-05-14 Thread Sachin Singh
Hi Ayan, I am asking experts about the general scenario given the info/configuration, not about specifics. The Java code is nothing more than getting a Hive context and running a select query; there is no serialization or anything else complex. I kept it straightforward, about 10 lines of code. Group, please suggest any ideas. Regards, Sachin

Re: Spark performance in cluster mode using yarn

2015-05-14 Thread ayan guha
With this information it is hard to predict. What's the performance you are getting? What's your desired performance? Maybe you can post your code and experts can suggest improvements? On 14 May 2015 15:02, "sachin Singh" wrote: > Hi Friends, > can someone please give an idea: ideally, what shoul

Re: Spark Performance on Yarn

2015-04-22 Thread Neelesh Salian
Does it still hit the memory limit for the container? An expensive transformation?

Re: Spark Performance on Yarn

2015-04-22 Thread Ted Yu
In the master branch, overhead is now 10%. That would be 500 MB. FYI

Re: Spark Performance on Yarn

2015-04-22 Thread nsalian
+1 to executor-memory to 5g. Do check the overhead space for both the driver and the executor as per Wilfred's suggestion. Typically, 384 MB should suffice.

Re: Spark Performance on Yarn

2015-04-21 Thread hnahak
Try --executor-memory 5g, because you have 8 GB RAM in each machine.

Re: Spark Performance on Yarn

2015-04-20 Thread Peng Cheng
I got exactly the same problem, except that I'm running on a standalone master. Can you tell me the counterpart parameter on a standalone master for increasing the same memory overhead?

Re: Spark Performance on Yarn

2015-02-23 Thread Lee Bierman
Thanks for the suggestions. I removed the "persist" call from the program. Having done so, I started it with: spark-submit --class com.xxx.analytics.spark.AnalyticsJob --master yarn /tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/ This takes all the defaults and only runs 2 execut
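
For illustration, the programmatic equivalent of giving the job explicit resources instead of the defaults (values are placeholders, not a sizing recommendation; on YARN the same can be passed as --num-executors, --executor-cores and --executor-memory flags to spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder sizing for a small cluster; tune to the actual nodes
    val conf = new SparkConf()
      .setAppName("AnalyticsJob")
      .set("spark.executor.instances", "4") // --num-executors
      .set("spark.executor.cores", "2")     // --executor-cores
      .set("spark.executor.memory", "5g")   // --executor-memory
    val sc = new SparkContext(conf)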

Re: Spark performance tuning

2015-02-22 Thread Akhil Das
You can simply follow the tuning guide: http://spark.apache.org/docs/1.2.0/tuning.html Thanks. Best Regards

RE: Spark performance tuning

2015-02-21 Thread java8964
Can someone share some ideas about how to tune the GC time? Thanks. From the original message (Fri, 20 Feb 2015): Hi, I am new to Spark, and I am trying to test Spark SQL performance vs Hive. I set up a standalo
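
A minimal sketch of a common first step for GC tuning: turn on GC logging for the executors so the time can be attributed to minor vs full collections (spark.executor.extraJavaOptions is a standard Spark setting; the HotSpot flags shown are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // GC logs land in the executor stderr/stdout, visible from the web UI
    val conf = new SparkConf()
      .setAppName("GcLoggingSketch")
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)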

Re: Spark Performance on Yarn

2015-02-21 Thread Davies Liu
How many executors do you have per machine? It would be helpful if you could list all the configs. Could you also try to run it without persist? Caching can hurt more than help if you don't have enough memory.

Re: Spark Performance on Yarn

2015-02-20 Thread Lee Bierman
Thanks for the suggestions. I'm experimenting with different values for spark.yarn.executor.memoryOverhead and explicitly giving the executors more memory, but still have not found the happy medium that gets it to finish in a proper time frame. Is my cluster massively undersized at 5 boxes with 8 GB RAM and 2 CPUs each? Trying to fi

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
That's all correct. -Sandy

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy, I appreciate your clear explanation. Let me try again. It's the best way to confirm I understand. spark.executor.memory + spark.yarn.executor.memoryOverhead = the memory for which YARN will create a JVM. spark.executor.memory = the memory I can actually use in my JVM application = part of it

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Hi Kelvin, spark.executor.memory controls the size of the executor heaps. spark.yarn.executor.memoryOverhead is the amount of memory to request from YARN beyond the heap size. This accounts for the fact that JVMs use some non-heap memory. The Spark heap is divided into spark.storage.memoryFract
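
A worked sketch of that arithmetic, using the 10% overhead figure Ted Yu quotes above (defaults vary by Spark version, so treat the numbers as illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Executor heap:                      spark.executor.memory = 5g
    // Off-heap headroom YARN adds on top: ~10% of 5g = 512 MB (Ted's ~500 MB)
    // => YARN container size per executor ~= 5g + 512m
    val conf = new SparkConf()
      .setAppName("MemorySizingSketch")
      .set("spark.executor.memory", "5g")
      .set("spark.yarn.executor.memoryOverhead", "512") // value in MB in this era of Spark
    val sc = new SparkContext(conf)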

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy, I am also doing memory tuning on YARN. Just want to confirm, is it correct to say: spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I can actually use in my JVM application? If it is not, what is the correct relationship? Any other variables or config parameters i

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room between the executor heap sizes and the amount of memory requested for them from YARN. -Sandy

Re: Spark Performance on Yarn

2015-02-20 Thread lbierman
A bit more context on this issue, from the container logs on the executor. Given my cluster specs above, what would be appropriate parameters to pass in: --num-executors --num-cores --executor-memory? I had tried it with --executor-memory 2500MB. 2015-02-20 06:50:09,056 WARN org.apache.hadoop.ya

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Are you specifying the executor memory, cores, or number of executors anywhere? If not, you won't be taking advantage of the full resources on the cluster. -Sandy

Re: Spark Performance on Yarn

2015-02-20 Thread Sean Owen
None of this really points to the problem. These indicate that workers died, but not why. I'd first go locate executor logs that reveal more about what's happening. It sounds like a harder type of failure, like a JVM crash, running out of file handles, or GC thrashing.

Re: Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello, We were comparing the performance of some of our production Hive queries between Hive and Spark. We compared Hive 0.13 + Hadoop 1.2.1 against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark. We tried a very simple query, select count(*) from T where col
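
For context, a minimal sketch of how such a count runs on the Spark 1.1 side (the predicate is hypothetical, since the query is truncated above):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveVsSparkCount"))
    val hiveContext = new HiveContext(sc)

    // Same style of simple aggregate the thread benchmarks against Hive
    hiveContext.sql("SELECT COUNT(*) FROM T WHERE col = 'some_value'")
      .collect()
      .foreach(println)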

Re: Spark performance optimization examples

2014-11-24 Thread Akhil Das
Here are the tuning guidelines, if you haven't seen them already: http://spark.apache.org/docs/latest/tuning.html You could try the following to get it loaded:
- Use Kryo serialization
- Enable RDD compression
- Set the storage level to
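
A sketch combining those three suggestions (standard Spark config keys; the serialized storage level is an assumption, since the last item is cut off above):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("TuningSketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo
      .set("spark.rdd.compress", "true")                                     // RDD compression
    val sc = new SparkContext(conf)

    // Serialized in-memory storage trades some CPU for a much smaller footprint
    val data = sc.textFile("hdfs:///data/input").persist(StorageLevel.MEMORY_ONLY_SER)
    println(data.count())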

Re: Spark performance optimization

2014-02-24 Thread Roshan Nair
Hi, We use sequence files as input as well. Spark creates a task for each part* file by default. We use RDD.coalesce (set to number of cores or 2*number of cores). This helps when there are many more part* files than the number of cores and each part* file is relatively small. Coalesce doesn't act
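
A sketch of that pattern (the core count is hard-coded for illustration; coalesce without shuffle=true merges partitions locally instead of redistributing records):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // Writable converters for sequenceFile

    val sc = new SparkContext(new SparkConf().setAppName("CoalesceSketch"))

    val numCores = 16 // illustrative: total cores available to the job
    // Many small part* files -> one tiny task each; coalesce to ~cores or 2x cores
    val input = sc.sequenceFile[String, String]("hdfs:///data/parts/").coalesce(2 * numCores)
    println(input.count())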

Re: Spark performance optimization

2014-02-24 Thread Andrew Ash
Have you tried using a standalone Spark cluster vs a YARN one? I get the impression that standalone responses are faster (the JVMs are already all running), but I haven't done any rigorous testing (and have only used standalone so far). On Mon, Feb 24, 2014 at 10:43 PM, polkosity wrote: > As ment