Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Guanhua Yan
warn you that groupByKey() is not a recommended operation if you can avoid it, as it has non-obvious performance issues when running with serious data. On Sat, Jul 12, 2014 at 12:20 PM, Guanhua Yan wrote: > Hi: > > I have trouble understanding the default partitioner (hash) in Spark. Sup

Confused by groupByKey() and the default partitioner

2014-07-12 Thread Guanhua Yan
Hi: I have trouble understanding the default partitioner (hash) in Spark. Suppose that an RDD with two partitions is created as follows: x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2) Does Spark partition x based on the hash of the key (e.g., "a", "b", "c") by default? (1) Assumi
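A rough sketch of the distinction this question turns on, written in plain Python rather than against Spark itself. To my understanding, `sc.parallelize` slices the input list into contiguous chunks by position and does not look at the keys, while the hash partitioner only comes into play in shuffle operations such as `groupByKey()`. The helper names `slice_by_position` and `hash_partition` are hypothetical, and `zlib.crc32` is only a stand-in for Spark's own hash function, so the partition indices Spark picks may differ.

```python
import zlib


def slice_by_position(data, num_partitions):
    """Split a list into contiguous chunks by position, ignoring keys."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by hash(key) mod N."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # crc32 is an assumed, deterministic stand-in for Spark's hash function.
        idx = zlib.crc32(key.encode("utf-8")) % num_partitions
        parts[idx].append((key, value))
    return parts


data = [("a", 1), ("b", 4), ("a", 10), ("c", 7)]

# Position-based slicing: the two "a" records end up in different partitions.
by_position = slice_by_position(data, 2)

# Hash-based placement: all records sharing a key end up in the same partition.
by_hash = hash_partition(data, 2)
```

In the position-based result the key "a" appears in both partitions, which is why a shuffle is needed before values can be grouped per key.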

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Guanhua Yan
13/14 12:10 AM, "Xiangrui Meng" wrote: >You have a long lineage that causes the StackOverflow error. Try >rdd.checkPoint() and rdd.count() for every 20~30 iterations. >checkPoint can cut the lineage. -Xiangrui > >On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan wrote: >
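The suggestion above could look roughly like the following sketch. It assumes the RDD method is spelled `checkpoint()` in the Python API and that a checkpoint directory has already been configured with `sc.setCheckpointDir(...)`. The function name `iterate_with_checkpoints`, the `step` callback, and the default interval of 25 are hypothetical choices for illustration, not anything from the original thread.

```python
def iterate_with_checkpoints(rdd, step, num_iterations, every=25):
    """Apply `step` repeatedly, truncating the lineage every `every` iterations.

    `rdd` is expected to provide checkpoint() and count(), as a PySpark RDD does.
    `step` takes an RDD and returns the RDD for the next iteration.
    """
    for i in range(1, num_iterations + 1):
        rdd = step(rdd)
        if i % every == 0:
            # Mark the RDD for checkpointing; nothing is written yet.
            rdd.checkpoint()
            # Run an action so the checkpoint is actually materialized,
            # which is what cuts the accumulated lineage.
            rdd.count()
    return rdd
```

With PySpark this would presumably be called with a real RDD and a `step` such as `lambda r: r.map(f)`, where `f` is a placeholder for the per-iteration update.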

java.lang.StackOverflowError when calling count()

2014-05-12 Thread Guanhua Yan
Dear Sparkers: I am using the Python API of Spark 0.9.0 to implement an iterative algorithm. I got the errors shown at the end of this email. They seem to be caused by a Java stack overflow. The same error has been reproduced on a Mac desktop and a Linux workstation, both running the sa

Re: Python Spark on YARN

2014-04-29 Thread Guanhua Yan
, at 9:51 AM, Guanhua Yan wrote: > Hi all: > > Is it possible to develop Spark programs in Python and run them on YARN? From > the Python SparkContext class, it doesn't seem to have such an option. > > Thank you, > - Guanhua > > ===========

Python Spark on YARN

2014-04-29 Thread Guanhua Yan
Hi all: Is it possible to develop Spark programs in Python and run them on YARN? From the Python SparkContext class, it doesn't seem to have such an option. Thank you, - Guanhua === Guanhua Yan, Ph.D. Information Sciences Group (CCS-3) Los Alamos National La