Re: flatMap output on disk / flatMap memory overhead

2015-08-01 Thread Puneet Kapoor
Hi Octavian, just out of curiosity, did you try persisting your RDD in a serialized format, "MEMORY_AND_DISK_SER" or "MEMORY_ONLY_SER"? I.e. changing your "rdd.persist(MEMORY_AND_DISK)" to "rdd.persist(MEMORY_ONLY_SER)". Regards On Wed, Jun 10, 2015 at 7:27 AM, Imran Rashid wrote: > I agree
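A minimal sketch of the suggested change, assuming a Spark 1.x Scala application (the RDD contents below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object PersistSerialized {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("persist-ser"))
        val rdd = sc.parallelize(1 to 1000000).flatMap(i => Seq(i, i * 2))
        // Serialized levels store one compact byte buffer per partition instead
        // of deserialized Java objects: more CPU on access, far less heap pressure.
        rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
        println(rdd.count())
        sc.stop()
      }
    }

For object-heavy RDDs, serialized storage often shrinks the cached footprint severalfold, at the cost of deserialization CPU on each access.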

Re: Spark SQL DataFrame: Nullable column and filtering

2015-08-01 Thread Martin Senne
Dear all, after some fiddling I have arrived at this solution: /** * Customized left outer join on common column. */ def leftOuterJoinWithRemovalOfEqualColumn(leftDF: DataFrame, rightDF: DataFrame, commonColumnName: String): DataFrame = { val joinedDF = leftDF.as('left).join(rightDF.as('right
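Since the message is truncated above, here is one way such a helper could be completed (a sketch under that assumption, not Martin's exact code):

    import org.apache.spark.sql.DataFrame

    def leftOuterJoinWithRemovalOfEqualColumn(leftDF: DataFrame,
                                              rightDF: DataFrame,
                                              commonColumnName: String): DataFrame = {
      // Left outer join on the common column, keeping both sides for now.
      val joined = leftDF.join(rightDF,
        leftDF(commonColumnName) === rightDF(commonColumnName), "left_outer")
      // Project all left columns plus every right column except the duplicate.
      val kept = leftDF.columns.map(leftDF(_)) ++
        rightDF.columns.filter(_ != commonColumnName).map(rightDF(_))
      joined.select(kept: _*)
    }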

About memory leak in spark 1.4.1

2015-08-01 Thread Sea
Hi all, I upgraded Spark to 1.4.1 and many applications failed... I find the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it keeps increasing as time goes on; finally it exceeds the max limit of the server and the worker dies. An
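The thread's eventual fix is not shown in this digest; two knobs commonly discussed for off-heap growth of this kind are sketched below (an assumption, not the confirmed resolution, and the values are examples only):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Extra non-heap memory (MB) that YARN allows each executor.
      .set("spark.yarn.executor.memoryOverhead", "1024")
      // Keep Netty shuffle buffers on-heap instead of in direct memory.
      .set("spark.shuffle.io.preferDirectBufs", "false")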

No event logs in yarn-cluster mode

2015-08-01 Thread Akmal Abbasov
Hi, I am trying to configure a history server for the application. When I run locally (./run-example SparkPi), the event logs are being created and I can start the history server. But when I try ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster file:///opt/hadoop/s

Re: Does anyone have experience with using Hadoop InputFormats?

2015-08-01 Thread Antsy.Rao
Sent from my iPad On 2014-9-24, at 8:13 AM, Steve Lewis wrote: > When I experimented with an InputFormat I had used in Hadoop for a > long time, I found > 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated > class, not org.apache.hadoop.mapreduce.lib.inp
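For reference, Spark exposes both Hadoop APIs, so an InputFormat is not required to extend the deprecated mapred class; a sketch (the HDFS paths are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    object InputFormatExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("input-formats"))
        // Old "mapred" API InputFormats go through hadoopFile ...
        val oldApi = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/in")
        // ... while new "mapreduce" API InputFormats go through newAPIHadoopFile.
        val newApi = sc.newAPIHadoopFile[LongWritable, Text,
          org.apache.hadoop.mapreduce.lib.input.TextInputFormat]("hdfs:///data/in")
        println(oldApi.count() + newApi.count())
        sc.stop()
      }
    }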

Re: No event logs in yarn-cluster mode

2015-08-01 Thread Andrew Or
Hi Akmal, It might be on HDFS, since you provided a scheme-less path /opt/spark/spark-events to `spark.eventLog.dir`, which resolves against the default filesystem. -Andrew 2015-08-01 9:25 GMT-07:00 Akmal Abbasov : > Hi, I am trying to configure a history server for the application. > When I run locally (./run-example SparkPi), the event logs a
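A sketch of the unambiguous form (paths are examples): a fully qualified URI removes the HDFS-vs-local question, since a scheme-less value resolves against the default filesystem, typically HDFS on a YARN cluster.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "file:///opt/spark/spark-events") // local FS
      // or: .set("spark.eventLog.dir", "hdfs:///opt/spark/spark-events")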

Re: No event logs in yarn-cluster mode

2015-08-01 Thread Marcelo Vanzin
On Sat, Aug 1, 2015 at 9:25 AM, Akmal Abbasov wrote: > When I run locally (./run-example SparkPi), the event logs are being > created, and I can start the history server. > But when I try > ./spark-submit --class org.apache.spark.examples.SparkPi --master > yarn-cluster file:///opt/hadoop/sp

Re: Spark Number of Partitions Recommendations

2015-08-01 Thread Ruslan Dautkhanov
You should also take into account the amount of memory that you plan to use. It's advised not to give too much memory to each executor... otherwise GC overhead will go up. Btw, why prime numbers? -- Ruslan Dautkhanov On Wed, Jul 29, 2015 at 3:31 AM, ponkin wrote: > Hi Rahul, > > Where did you

Re: How does the # of tasks affect # of threads?

2015-08-01 Thread Fabrice Sznajderman
Hello, I am not an expert with Spark, but the error thrown by Spark seems to indicate that there is not enough memory for launching the job. By default, Spark allocates 1GB of memory; maybe you should increase it? Best regards Fabrice On Sat, Aug 1, 2015 at 10:51 PM, Connor Zanin wrote: > Hello, > > I am h

Re: How does the # of tasks affect # of threads?

2015-08-01 Thread Connor Zanin
1. I believe that the default memory (per executor) is 512m (from the documentation) 2. I have increased the memory used by Spark on workers in my launch script when submitting the job (--executor-memory 124g) 3. The job completes successfully; it is the "road bumps" in the middle I am conce
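For reference, the same knobs can be set in code instead of on the spark-submit command line; a sketch with illustrative values, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g") // same setting as --executor-memory
      .set("spark.executor.cores", "4")   // tasks run as threads inside this JVM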

TCP/IP speedup

2015-08-01 Thread Simon Edelhaus
Hi All! How important would be a significant performance improvement to TCP/IP itself, in terms of overall job performance improvement? Which part would be most significantly accelerated? Would it be HDFS? -- ttfn Simon Edelhaus California 2015

Re: TCP/IP speedup

2015-08-01 Thread Mark Hamstra
https://spark-summit.org/2015/events/making-sense-of-spark-performance/ On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote: > Hi All! > > How important would be a significant performance improvement to TCP/IP > itself, in terms of > overall job performance improvement? Which part would be most

Re: TCP/IP speedup

2015-08-01 Thread Simon Edelhaus
Hmm... 2% huh. -- ttfn Simon Edelhaus California 2015 On Sat, Aug 1, 2015 at 3:45 PM, Mark Hamstra wrote: > https://spark-summit.org/2015/events/making-sense-of-spark-performance/ > > On Sat, Aug 1, 2015 at 3:24 PM, Simon Edelhaus wrote: > >> Hi All! >> >> How important would be a signifi

Re: TCP/IP speedup

2015-08-01 Thread Ruslan Dautkhanov
If your network is bandwidth-bound, setting jumbo frames (MTU 9000) may increase bandwidth by up to ~20%. http://docs.hortonworks.com/HDP2Alpha/index.htm#Hardware_Recommendations_for_Hadoop.htm "Enabling Jumbo Frames across the cluster improves bandwidth" If the Spark workload is not network

Re: Spark Number of Partitions Recommendations

2015-08-01 Thread Понькин Алексей
Yes, I forgot to mention that I chose a prime number as the modulo for the hash function because my keys are usually strings, and Spark calculates the particular partition using the key hash (see HashPartitioner.scala). So, to avoid a big number of collisions (when many keys land in a few partitions) it is common to use
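A small sketch of the point, mirroring what HashPartitioner.scala computes (nonNegativeMod(key.hashCode, numPartitions)); the keys below are made up:

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(101) // 101 is prime
    val keys = Seq("user_10", "user_20", "user_30", "user_40")
    // With a composite partition count, hash codes sharing a factor with the
    // modulus cluster into few partitions; a prime modulus spreads them out.
    keys.foreach(k => println(s"$k -> ${partitioner.getPartition(k)}"))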