Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Praveen Devarao
Thanks Reynold for explaining why sortByKey invokes a job. When you say "DataFrame/Dataset does not have this issue", is it right to assume you are referring to Spark 2.0, or does the Spark 1.6 DataFrame already have this built in? Thanking You --

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Reynold Xin
Usually no - but sortByKey does, because it needs the range boundaries to be built in order to define the RDD. It is a long-standing problem that's unfortunately very difficult to solve without breaking the RDD API. In DataFrame/Dataset we don't have this issue, though. On Sun, Apr 24, 2016 at 10:54 P
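To make Reynold's point concrete: a range-partitioned sort must know the partition split points before any partition can be built, and computing those split points requires an eager sampling pass over the data (hence the extra job). Below is a minimal plain-Python sketch of that idea, not Spark's actual `RangePartitioner` code; all names here are illustrative.

```python
import bisect
import random

def compute_boundaries(sample, num_partitions):
    """Derive num_partitions - 1 split points from a sample of keys.
    This is the step that forces an eager pass over the data:
    the boundaries must exist before any output partition can be built."""
    s = sorted(sample)
    step = len(s) / num_partitions
    return [s[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, boundaries):
    """Route a key to its range partition via binary search."""
    return bisect.bisect_right(boundaries, key)

random.seed(0)
keys = [random.randint(0, 999) for _ in range(10_000)]
boundaries = compute_boundaries(random.sample(keys, 100), 4)

parts = [0, 0, 0, 0]
for k in keys:
    parts[partition_for(k, boundaries)] += 1
# Each partition now holds a contiguous key range, so sorting each
# partition locally and concatenating them yields a globally sorted result.
```

Plain transformations (`map`, `filter`, `flatMap`) need no such global information, which is why they stay lazy while `sortByKey` does not.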

Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Praveen Devarao
Hi, I have a streaming program with the block as below [ref: https://github.com/agsachin/streamingBenchmark/blob/master/spark-benchmarks/src/main/scala/TwitterStreaming.scala ]:
val lines = messages.map(_._2)
val hashTags = lines.flatMap(status => status.split(" ").filter(_.startsWit
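The `map`/`flatMap` calls in the snippet above are lazy: they describe a computation but do not run it until an action consumes the result. A plain-Python analogy using generators (illustrative only, not the Spark API; `log` is a hypothetical helper to observe when work happens):

```python
log = []

def mapped(it, f):
    """Lazy map: records work only when an element is actually pulled."""
    for x in it:
        log.append("work")
        yield f(x)

messages = [("k1", "hello #spark"), ("k2", "hi #flink")]
lines = mapped(iter(messages), lambda kv: kv[1])   # nothing runs yet
hash_tags = (w for line in lines
             for w in line.split(" ") if w.startswith("#"))

assert log == []            # lazy: no work before consumption
result = list(hash_tags)    # the "action": forces evaluation
```

This mirrors the thread's answer: ordinary transformations build a plan, and only an action (or a special case like `sortByKey`) triggers a job.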

Spark SQL job with large SQL syntax failed with OutOfMemory error and "grows beyond 64 KB" warning

2016-04-24 Thread FangFang Chen
Hi all, With a large SQL command, the job failed with the following error. Please give your suggestions on how to resolve it. Thanks. SQL file size: 676 KB. Log: 16/04/25 10:55:00 WARN TaskSetManager: Lost task 84.0 in stage 0.0 (TID 6, BJHC-HADOOP-HERA-17493.jd.local): java.util.concurrent.ExecutionException

Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
Maybe this is due to the config spark.scheduler.minRegisteredResourcesRatio; you can try setting it to 1 to see the behavior. // Submit tasks only after (registered resources / total expected resources) // is equal to at least this value, which is a double between 0 and 1. var minRegisteredRatio = math.m
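The scheduler gate Jeff quotes can be sketched as follows (plain Python, illustrative only; the function name is hypothetical, and the actual Spark default for this ratio depends on the cluster manager):

```python
def sufficient_resources(registered, total_expected, min_ratio=0.0):
    """Mirror of the quoted check: tasks are submitted only once
    registered / total_expected reaches min_ratio."""
    min_ratio = max(0.0, min(1.0, min_ratio))  # clamp to [0, 1]
    return registered >= min_ratio * total_expected

# With a low ratio, scheduling can start before all executors have
# registered, which can concentrate early tasks on the first workers;
# a ratio of 1.0 waits until every expected executor is present.
assert sufficient_resources(2, 8, 0.25) is True
assert sufficient_resources(2, 8, 1.0) is False
assert sufficient_resources(8, 8, 1.0) is True
```

In practice this is set at submit time, e.g. `--conf spark.scheduler.minRegisteredResourcesRatio=1.0`.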

Re: executor delay in Spark

2016-04-24 Thread Mike Hynes
Could you change numPartitions to {16, 32, 64} and run your program for each to see how many partitions are allocated to each worker? Let's see if you experience an all-or-nothing imbalance that way; if so, my guess is that something else is odd in your program logic or Spark runtime environment, but

net.razorvine.pickle.PickleException in Pyspark

2016-04-24 Thread Caique Marques
Hello, everyone! I'm trying to implement association rules in Python. I managed to implement mining of frequent itemsets, and it works as expected (example can be seen here). Now, my challen
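While the reported error is a PySpark serialization issue, the underlying computation the poster is after (deriving association rules with confidence from frequent-itemset supports) can be sketched in plain Python. This is an illustrative sketch, not the MLlib API; the support counts below are made-up example data.

```python
from itertools import combinations

# Hypothetical frequent-itemset support counts (itemset -> count)
supports = {
    frozenset(["bread"]): 6,
    frozenset(["milk"]): 5,
    frozenset(["bread", "milk"]): 4,
}

def rules(supports, min_confidence=0.5):
    """Emit (antecedent, consequent, confidence) triples for each
    frequent itemset, keeping rules above the confidence threshold."""
    out = []
    for itemset, count in supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / supports[antecedent]  # P(consequent | antecedent)
                if conf >= min_confidence:
                    out.append((antecedent, itemset - antecedent, conf))
    return out

result = rules(supports)
# e.g. {milk} -> {bread} has confidence 4/5 = 0.8
```

Each rule's confidence is support(itemset) / support(antecedent), so only the precomputed counts are needed, no second pass over the transactions.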