Re: Spark stages very slow to complete

2015-06-02 Thread Karlson
Hi, the code is some hundreds of lines of Python. I can try to compose a minimal example as soon as I find the time, though. Any ideas until then? Would you mind posting the code? On 2 Jun 2015 00:53, "Karlson" wrote: Hi, in all (pyspark) Spark jobs that become somewhat more involved…

Spark stages very slow to complete

2015-06-01 Thread Karlson
Hi, in all (pyspark) Spark jobs that become somewhat more involved, I am experiencing the issue that some stages take a very long time to complete, and sometimes don't complete at all. This clearly correlates with the size of my input data. Looking at the stage details for one such stage, I am wondering…

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
Alright, that doesn't seem to have made it into the Python API yet. On 2015-05-22 15:12, Silvio Fiorito wrote: This was added in 1.4.0: https://github.com/apache/spark/pull/5762 On 5/22/15, 8:48 AM, "Karlson" wrote: Hi, wouldn't df.rdd.partitionBy() return a new RDD…
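A sketch of what that change looks like once available, assuming a Spark version whose Python API includes DataFrame.repartition(numPartitions) (referenced above as arriving with 1.4.0); names and data are illustrative:

    # Hedged sketch, assuming DataFrame.repartition(numPartitions) exists
    # in the Python API (per the 1.4.0 reference in this thread).
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="repartition-demo")
    sql_context = SQLContext(sc)
    df = sql_context.createDataFrame([(i, i % 3) for i in range(100)],
                                     ["col1", "col2"])
    df8 = df.repartition(8)               # new DataFrame with 8 partitions
    print(df8.rdd.getNumPartitions())     # 8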

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
…On 2015-05-22 12:55, ayan guha wrote: DataFrame is an abstraction of RDD, so you should be able to do df.rdd.partitionBy. However, as far as I know, equijoins already optimize partitioning. You may want to look at the explain plans more carefully and materialise interim joins. On 22 May 2015 19:03, …

Partitioning of Dataframes

2015-05-22 Thread Karlson
Hi, is there any way to control how Dataframes are partitioned? I'm doing lots of joins and am seeing very large shuffle reads and writes in the Spark UI. With PairRDDs you can control how the data is partitioned across nodes with partitionBy. There is no such method on Dataframes, however. Can…
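For contrast, a minimal sketch of the PairRDD-level control the question refers to, assuming the pyspark shell's SparkContext sc and illustrative data; partitionBy hash-partitions by key:

    # PairRDD partitioning (assumed illustrative data).
    pairs = sc.parallelize([(i % 10, i) for i in range(1000)])
    partitioned = pairs.partitionBy(8)     # hash-partition by key, 8 parts
    print(partitioned.getNumPartitions())  # 8

    # Hypothetical workaround for a DataFrame df with a key column col1:
    # key the rows, then partition the resulting pair RDD.
    # df.rdd.map(lambda row: (row.col1, row)).partitionBy(8)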

Re: [pyspark] Starting workers in a virtualenv

2015-05-21 Thread Karlson
That works, thank you! On 2015-05-22 03:15, Davies Liu wrote: Could you try specifying PYSPARK_PYTHON as the path to the Python in your virtualenv, for example: PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py On Mon, Apr 20, 2015 at 12:51 AM, Karlson wrote: Hi all, I am running…
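The confirmed fix above sets the variable on the command line; a related sketch, under the assumption (not verified in this thread) that setting PYSPARK_PYTHON from inside the driver script before the SparkContext is created has the same effect, with a hypothetical virtualenv path:

    import os

    # Hypothetical path; PYSPARK_PYTHON selects the Python binary that the
    # worker processes are launched with.
    os.environ["PYSPARK_PYTHON"] = "/path/to/env/bin/python"

    from pyspark import SparkContext
    sc = SparkContext(appName="virtualenv-workers")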

Re: Join on DataFrames from the same source (Pyspark)

2015-04-22 Thread Karlson
…es. However, you can add an alias as follows: from pyspark.sql.functions import * df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1")) On Tue, Apr 21, 2015 at 8:10 AM, Karlson wrote: Sorry, my code actually was df_one = df.select('co…
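Expanded into a self-contained sketch (data and column names assumed, pyspark shell's sc given), the alias trick from this reply:

    from pyspark.sql import SQLContext
    from pyspark.sql.functions import col

    sql_context = SQLContext(sc)
    df = sql_context.createDataFrame([(1, "x", "y"), (2, "p", "q")],
                                     ["col1", "col2", "col3"])

    # Aliasing gives each side of the self-join its own name, so the join
    # condition can refer to the two lineages unambiguously.
    joined = df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1"))
    joined.show()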

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Karlson
…ayan guha wrote: your code should be df_one = df.select('col1', 'col2') df_two = df.select('col1', 'col3') Your current code is generating a tuple, and of course df_1 and df_2 are different, so the join yields a cartesian product. Best, Ayan On Wed, Apr 22, 2015…

Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Karlson
Hi, can anyone confirm (and if so elaborate on) the following problem? When I join two DataFrames that originate from the same source DataFrame, the resulting DF will explode to a huge number of rows. A quick example: I load a DataFrame with n rows from disk: df = sql_context.parquetFile(…
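A hypothetical minimal repro of the reported blow-up, with the file path and column names assumed:

    # Assumed setup: a parquet file with columns col1, col2, col3 and n rows.
    df = sql_context.parquetFile("/path/to/data.parquet")
    df_one = df.select("col1", "col2")
    df_two = df.select("col1", "col3")
    # Expected: on the order of n matching rows; reported in this thread:
    # the result explodes far beyond n, as if it were a cartesian product.
    joined = df_one.join(df_two, df_one["col1"] == df_two["col1"])
    print(joined.count())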

[pyspark] Starting workers in a virtualenv

2015-04-20 Thread Karlson
Hi all, I am running the Python process that communicates with Spark in a virtualenv. Is there any way I can make sure that the Python processes of the workers are also started in a virtualenv? Currently I am getting ImportErrors when the worker tries to unpickle stuff that is not installed s…

Save and read parquet from the same path

2015-03-04 Thread Karlson
Hi all, what would happen if I saved an RDD via saveAsParquetFile to the same path that the RDD was originally read from? Is that a safe thing to do in Pyspark? Thanks!
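The thread leaves the question open; a cautious sketch, assuming (not confirmed here) that the lazy read makes overwriting the input path mid-job unsafe, is to write to a sibling path and swap afterwards:

    # Assumption: the parquet read is lazy, so writing over the input path
    # while the job runs risks clobbering its own source. Write elsewhere.
    df = sql_context.parquetFile("/data/table")
    result = df.filter(df["col1"] > 0)            # hypothetical transformation
    result.saveAsParquetFile("/data/table_new")   # distinct output path
    # Swap /data/table_new into place with filesystem tooling afterwards.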

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
…(2, v))? As I understand it, this would preserve the original partitioning. On 2015-02-13 12:43, Karlson wrote: Does that mean partitioning does not work in Python? Or does this only affect joining? On 2015-02-12 19:27, Davies Liu wrote: The feature works as expected in Scala/Java, but is not implemented…
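The fragment quoted above looks like a Scala mapValues call; a pyspark sketch of the idea, with Davies Liu's caveat from this thread that the Python API did not implement partitioning-aware behaviour at the time:

    # mapValues leaves keys untouched, so the partitioner can be preserved.
    pairs = sc.parallelize([("a", 1), ("b", 2)]).partitionBy(4)
    wrapped = pairs.mapValues(lambda v: (2, v))
    print(wrapped.getNumPartitions())   # still 4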

Re: Shuffle on joining two RDDs

2015-02-13 Thread Karlson
…org.apache.spark.OneToOneDependency@7bc172ec) (d3 ShuffledRDD[12] at groupByKey at <console>:12, org.apache.spark.ShuffleDependency@d794984) (MappedRDD[11] at map at <console>:12, org.apache.spark.OneToOneDependency@15c98005) On Thu, Feb 12, 2015 at 10:05 AM, Karlson wrote: Hi Imran, thanks for your quick reply. Actually…

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi, I believe that partitionBy will use the same (default) partitioner on both RDDs. On 2015-02-12 17:12, Sean Owen wrote: Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote: Hi Karlson, I think your assumptions are correct…
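A sketch of the co-partitioning setup under discussion, with assumed data; using the same partition count on both sides means both get the same default hash partitioner:

    rdd_a = sc.parallelize([(i, "a") for i in range(100)]).partitionBy(8)
    rdd_b = sc.parallelize([(i, "b") for i in range(100)]).partitionBy(8)
    # With matching partitioners the join should avoid a full shuffle;
    # the caveat elsewhere in this thread is that the Python API did not
    # honour this at the time.
    joined = rdd_a.join(rdd_b)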

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
…the resulting RDD (I am using the Python API). There is, however, foreachPartition(). What is the line joinedRdd.getPartitions.foreach{println} supposed to do? Thank you, Karlson PS: Sorry for sending this twice; I accidentally did not reply to the mailing list first. On 2015-02-12 16:48…
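The Scala one-liner quoted here (joinedRdd.getPartitions.foreach{println}) has no direct pyspark equivalent; a rough counterpart, assuming a pair RDD named joined as in the surrounding discussion:

    # Partition count, then rows per partition, from the Python side.
    print(joined.getNumPartitions())
    sizes = joined.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
    print(sizes)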

Shuffle on joining two RDDs

2015-02-12 Thread Karlson
…about 50MB. What's the explanation for that behaviour? Where am I wrong with my assumption? Thanks in advance, Karlson