Hi, the code is a few hundred lines of Python. I can try to compose a
minimal example as soon as I find the time, though. Any ideas until
then?
Would you mind posting the code?
On 2 Jun 2015 00:53, "Karlson" wrote:
Hi,
In all (pyspark) Spark jobs that become somewhat more involved, I am
experiencing the issue that some stages take a very long time to
complete, and sometimes don't complete at all. This clearly correlates with the
size of my input data. Looking at the stage details for one such stage,
I am wonderi
Alright, that doesn't seem to have made it into the Python API yet.
On 2015-05-22 15:12, Silvio Fiorito wrote:
This has been added for 1.4.0:
https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, "Karlson" wrote:
Hi,
wouldn't df.rdd.partitionBy() return a new RDD
On 2015-05-22 12:55, ayan guha wrote:
DataFrame is an abstraction over an RDD, so you should be able to do
df.rdd.partitionBy. However, as far as I know, equijoins already optimize
partitioning. You may want to look at the explain plans more carefully and
materialise interim joins.
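For illustration, a rough sketch of that suggestion in PySpark. The column name col1, the partition count, other_df, and sql_context are placeholders taken from the context of this thread, not actual code from it:

    # Hypothetical sketch: repartition the underlying RDD by the join key,
    # then rebuild a DataFrame from it (col1 / other_df are illustrative).
    keyed = df.rdd.map(lambda row: (row.col1, row)).partitionBy(200)
    df_repart = sql_context.createDataFrame(keyed.map(lambda kv: kv[1]))

    # Per the advice above, inspect the physical plan of the join for shuffles.
    df_repart.join(other_df, df_repart.col1 == other_df.col1).explain()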
On 22 May 2015 19:03, "Karlson" wrote:
Hi,
is there any way to control how Dataframes are partitioned? I'm doing
lots of joins and am seeing very large shuffle reads and writes in the
Spark UI. With PairRDDs you can control how the data is partitioned
across nodes with partitionBy. There is no such method on Dataframes
however. Ca
That works, thank you!
On 2015-05-22 03:15, Davies Liu wrote:
Could you try specifying PYSPARK_PYTHON as the path of the Python
interpreter in your virtualenv? For example:
PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py
On Mon, Apr 20, 2015 at 12:51 AM, Karlson wrote:
Hi all,
I am running
es. However, you can add an alias as follows:
from pyspark.sql.functions import *
df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1"))
On Tue, Apr 21, 2015 at 8:10 AM, Karlson wrote:
Sorry, my code actually was
df_one = df.select('co
ayan guha wrote:
your code should be
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
Your current code is generating a tuple, and of course df_one and df_two
are different, so the join is yielding a cartesian product.
Best
Ayan
On Wed, Apr 22, 201
Hi,
can anyone confirm (and if so elaborate on) the following problem?
When I join two DataFrames that originate from the same source
DataFrame, the resulting DF will explode to a huge number of rows. A
quick example:
I load a DataFrame with n rows from disk:
df = sql_context.parquetFil
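A minimal sketch of the reported behaviour. The parquet path and column names are placeholders; the join condition follows the code further up the thread:

    # Both frames come from the same source DataFrame (placeholder path).
    df = sql_context.parquetFile("/path/to/data.parquet")   # n rows
    df_one = df.select('col1', 'col2')
    df_two = df.select('col1', 'col3')

    # Because both column references resolve to the same source column,
    # the condition is effectively always true and the result grows
    # towards n*n rows instead of roughly n.
    joined = df_one.join(df_two, df_one.col1 == df_two.col1)
    print(joined.count())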
Hi all,
I am running the Python process that communicates with Spark in a
virtualenv. Is there any way I can make sure that the Python processes
of the workers are also started in a virtualenv? Currently I am getting
ImportErrors when the worker tries to unpickle stuff that is not
installed s
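Davies' spark-submit suggestion further up the thread is the documented route; as a sketch, the same environment variable can also be set from inside the driver script. The path is illustrative, the virtualenv must exist at that path on every worker node, and whether this is picked up depends on how the job is launched:

    import os

    # Illustrative: point the workers at the virtualenv's interpreter before
    # the SparkContext is created; the path must be valid on all worker nodes.
    os.environ["PYSPARK_PYTHON"] = "/path/to/env/bin/python"

    from pyspark import SparkContext
    sc = SparkContext(appName="venv-example")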
Hi all,
what would happen if I save an RDD via saveAsParquetFile to the same path
that the RDD is originally read from? Is that a safe thing to do in PySpark?
Thanks!
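No reply is shown in this thread. As a conservative sketch (paths and the transformation are made up), writing to a separate location avoids overwriting files that the lazily evaluated read may still depend on:

    # Illustrative: read from one directory, write the derived data to another,
    # rather than saving back into the directory being read from.
    df = sql_context.parquetFile("/data/events")
    result = df.filter(df.col1 > 0)               # placeholder transformation
    result.saveAsParquetFile("/data/events_out")  # not "/data/events"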
(2, v))?
As I understand, this would preserve the original partitioning.
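In case it helps, a small sketch of the distinction being discussed. It assumes a PySpark version that tracks the partitioner on the Python side at all, which is exactly what is in question in the quoted messages below; the data is just an example:

    # Tagging the values with mapValues keeps the existing partitioner,
    # while a plain map() drops it.
    rdd = sc.parallelize([(i, i * i) for i in range(100)]).partitionBy(4)

    tagged_values = rdd.mapValues(lambda v: (2, v))        # partitioner kept
    tagged_map = rdd.map(lambda kv: (kv[0], (2, kv[1])))   # partitioner dropped

    print(tagged_values.partitioner is not None)   # True
    print(tagged_map.partitioner is not None)      # False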
On 2015-02-13 12:43, Karlson wrote:
Does that mean partitioning does not work in Python? Or does this only
affect joining?
On 2015-02-12 19:27, Davies Liu wrote:
The feature works as expected in Scala/Java, but is not yet implemented in Python.
org.apache.spark.OneToOneDependency@7bc172ec)
(d3 ShuffledRDD[12] at groupByKey at <console>:12, org.apache.spark.ShuffleDependency@d794984)
(MappedRDD[11] at map at <console>:12, org.apache.spark.OneToOneDependency@15c98005)
On Thu, Feb 12, 2015 at 10:05 AM, Karlson wrote:
Hi Imran,
thanks for your quick reply.
Actua
Hi,
I believe that partitionBy will use the same (default) partitioner on
both RDDs.
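A sketch of the assumption being discussed, with example data; whether the Python side actually exploits the co-partitioning is the open question in this thread:

    # Partition both pair RDDs with the same partition count and the default
    # hash partitioner, then join them.
    a = sc.parallelize([(i, "a%d" % i) for i in range(1000)]).partitionBy(8)
    b = sc.parallelize([(i, "b%d" % i) for i in range(1000)]).partitionBy(8)

    joined = a.join(b)
    print(joined.getNumPartitions())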
On 2015-02-12 17:12, Sean Owen wrote:
Doesn't this require that both RDDs have the same partitioner?
On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote:
Hi Karlson,
I think your assumptions are correct
the resulting RDD (I am using
the Python API). There is, however, foreachPartition(). What is the line
joinedRdd.getPartitions.foreach{println}
supposed to do?
Thank you,
Karlson
PS: Sorry for sending this twice, I accidentally did not reply to the
mailing list first.
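For reference, a rough PySpark counterpart to that Scala line (joined_rdd here stands in for the joined RDD from the thread): it prints the number of elements per partition rather than the Partition objects themselves.

    # Count elements per partition of the joined RDD (joined_rdd is hypothetical).
    def part_sizes(index, iterator):
        yield (index, sum(1 for _ in iterator))

    print(joined_rdd.mapPartitionsWithIndex(part_sizes).collect())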
On 2015-02-12 16:48
about 50MB.
What's the explanation for that behaviour? Where am I wrong with my
assumption?
Thanks in advance,
Karlson