Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread jalafate
ere blocking execution (e.g. Hazelcast). Could that be your case as well? Regards -- If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/Identif

Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread Julaiti Alafate
Hi Mitch, I think it is normal. The network utilization will be high while some shuffling process is happening. After that, the network utilization should come down, while each slave node does the computation on the partitions assigned to it. At least that is my understanding. Best, Jula

Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread davidkl
your case as well? Regards -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Identify-the-performance-bottleneck-from-hardware-prospective-tp21684p21927.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Akhil Das
It would be good if you can share the piece of code that you are using, so people can suggest how to optimize it further. Also, since you have 20GB of memory and ~30GB of data, you can try doing a rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) or .persist(StorageLevel.

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Julaiti Alafate
The raw data is ~30 GB. It consists of 250 million sentences. The total length of the documents (i.e. the sum of the lengths of all sentences) is 11 billion. I also ran a simple algorithm to roughly count the maximum number of word pairs by summing up d * (d - 1) over all sentences, where d is the
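The upper-bound estimate described above (summing d * (d - 1) over all sentences) can be sketched in plain Python; the sentence lengths below are made up for illustration, not taken from the actual dataset:

```python
# Sketch: upper bound on the number of ordered word pairs,
# computed as sum of d * (d - 1) over all sentences,
# where d is the length of a sentence in words.
def pair_count_upper_bound(sentence_lengths):
    """Sum d * (d - 1) over all sentence lengths d."""
    return sum(d * (d - 1) for d in sentence_lengths)

lengths = [3, 5, 2]  # hypothetical sentence lengths, in words
print(pair_count_upper_bound(lengths))  # 3*2 + 5*4 + 2*1 = 28
```

At 250 million sentences this sum gets very large very quickly, which is consistent with the scale concern raised in the thread.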

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Arush Kharbanda
Hi How big is your dataset? Thanks Arush On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate wrote: > Thank you very much for your reply! > > My task is to count the number of word pairs in a document. If w1 and w2 > occur together in one sentence, the number of occurrence of word pair (w1, > w2)

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Julaiti Alafate
Thank you very much for your reply! My task is to count the number of word pairs in a document. If w1 and w2 occur together in one sentence, the occurrence count of the word pair (w1, w2) increases by 1. So the computational part of this algorithm is simply a two-level for-loop. Since the cluster is moni
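The original Spark code is not shown in the thread, but the two-level for-loop described above can be sketched in plain Python; the toy sentences and the choice to treat (w1, w2) and (w2, w1) as the same pair are illustrative assumptions:

```python
from collections import Counter

def count_word_pairs(sentences):
    """Count, for each word pair (w1, w2), how many times the two words
    co-occur in the same sentence. Pairs are normalized by sorting, so
    (w1, w2) and (w2, w1) are counted as the same pair."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # the two-level for-loop over distinct positions in the sentence
        for i in range(len(words)):
            for j in range(i + 1, len(words)):
                pair = tuple(sorted((words[i], words[j])))
                counts[pair] += 1
    return counts

sentences = ["spark is fast", "spark is scalable"]  # toy input
print(count_word_pairs(sentences)[("is", "spark")])  # prints 2
```

In a real Spark job this per-sentence pair generation would typically live inside a flatMap, followed by a reduceByKey to aggregate counts across partitions.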

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Akhil Das
What application are you running? Here are a few things: - You will hit a bottleneck on CPU if you are doing some complex computation (like parsing JSON, etc.) - You will hit a bottleneck on memory if the data/objects used in the program are large (like playing with HashMaps etc. inside your ma

Identify the performance bottleneck from hardware prospective

2015-02-16 Thread jalafate
hardware perspective? Thanks! Julaiti -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Identify-the-performance-bottleneck-from-hardware-prospective-tp21684.html Sent from the Apache Spark User List mailing list a

Identify the performance bottleneck from hardware prospective

2015-02-16 Thread Julaiti Alafate
Hi there, I am trying to scale up the data size that my application is handling. This application is running on a cluster with 16 slave nodes. Each slave node has 60GB of memory. It is running in standalone mode. The data comes from HDFS, which is also on the same local network. In order to have an under