I should warn you that groupByKey() is not a recommended
operation if you can avoid it, as it has non-obvious performance issues when
running on large datasets.
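
For illustration, a minimal sketch (using the same toy pairs as the question
below) contrasting groupByKey with reduceByKey, which pre-aggregates values on
each partition before the shuffle:

from pyspark import SparkContext

sc = SparkContext("local", "groupByKey-vs-reduceByKey")
pairs = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)

# groupByKey ships every (key, value) pair across the shuffle and then
# materializes the full list of values for each key.
grouped_sums = pairs.groupByKey().mapValues(lambda vals: sum(vals))

# reduceByKey combines values map-side first, so far less data is shuffled.
reduced_sums = pairs.reduceByKey(lambda a, b: a + b)

print(grouped_sums.collect())  # e.g. [('a', 11), ('b', 4), ('c', 7)]
print(reduced_sums.collect())  # same per-key sums, cheaper shuffle
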
On Sat, Jul 12, 2014 at 12:20 PM, Guanhua Yan wrote:
> Hi:
>
> I have trouble understanding the default partitioner (hash) in Spark.
> Suppose that an RDD with two partitions is created as follows:
> x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)
Hi:
I have trouble understanding the default partitioner (hash) in Spark.
Suppose that an RDD with two partitions is created as follows:
x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)
Does Spark partition x based on the hash of the key (e.g., "a", "b", "c") by
default?
(1) Assuming
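
For illustration, a small sketch that makes the partition placement visible:
glom() returns each partition's contents as a list, and partitionBy(2)
explicitly redistributes the pairs by hash of the key, whereas parallelize
itself just slices the input list by position:

from pyspark import SparkContext

sc = SparkContext("local", "partitioner-check")
x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)

# parallelize splits the list into contiguous slices, one per partition.
print(x.glom().collect())  # e.g. [[('a', 1), ('b', 4)], [('a', 10), ('c', 7)]]

# partitionBy hashes the key, so both ('a', ...) pairs land in the same partition.
y = x.partitionBy(2)
print(y.glom().collect())
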
On 5/13/14 12:10 AM, "Xiangrui Meng" wrote:
>You have a long lineage that causes the StackOverflow error. Try
>rdd.checkpoint() and rdd.count() every 20-30 iterations.
>checkpoint() can cut the lineage. -Xiangrui
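
A minimal sketch of that pattern, with a placeholder step() standing in for one
iteration of the algorithm and an illustrative checkpoint directory and interval:

from pyspark import SparkContext

sc = SparkContext("local", "iterative-checkpoint")
sc.setCheckpointDir("/tmp/spark-checkpoints")  # required before checkpoint()

rdd = sc.parallelize(range(1000), 4)

def step(r):
    # placeholder for one iteration of the actual algorithm
    return r.map(lambda v: v + 1)

for i in range(200):
    rdd = step(rdd)
    if (i + 1) % 25 == 0:
        rdd.checkpoint()  # marks the RDD so its lineage can be truncated
        rdd.count()       # forces the checkpoint to be written now
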
>
>On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan wrote:
>
Dear Sparkers:
I am using Python Spark version 0.9.0 to implement an iterative
algorithm. I got some errors, shown at the end of this email, which seem
to be due to a Java StackOverflowError. The same error has been
reproduced on a Mac desktop and a Linux workstation, both running the same
, at 9:51 AM, Guanhua Yan wrote:
> Hi all:
>
> Is it possible to develop Spark programs in Python and run them on YARN? From
> the Python SparkContext class, it doesn't seem to have such an option.
>
> Thank you,
> - Guanhua
>
Hi all:
Is it possible to develop Spark programs in Python and run them on YARN?
From the Python SparkContext class, it doesn't seem to have such an option.
Thank you,
- Guanhua
===
Guanhua Yan, Ph.D.
Information Sciences Group (CCS-3)
Los Alamos National Laboratory