dataframe left joins are not working as expected in pyspark

2015-06-26 Thread Axel Dahl
I have the following code: from pyspark import SQLContext d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}] d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'alice', 'country': 'ire', 'colou
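
The snippet is cut off above; what follows is a minimal, hedged reconstruction of that setup using the modern SparkSession API in place of the 1.x SQLContext. The second colour value in d2 and the join condition are assumptions added for illustration, since the original message is truncated.

# Hedged reconstruction of the post's setup (SparkSession instead of the
# 1.x SQLContext; the second colour value is a guess).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("left-join-example").getOrCreate()

d1 = [{'name': 'bob',   'country': 'usa', 'age': 1},
      {'name': 'alice', 'country': 'jpn', 'age': 2},
      {'name': 'carol', 'country': 'ire', 'age': 3}]
d2 = [{'name': 'bob',   'country': 'usa', 'colour': 'red'},
      {'name': 'alice', 'country': 'ire', 'colour': 'blue'}]   # 'blue' is hypothetical

df1 = spark.createDataFrame(d1)
df2 = spark.createDataFrame(d2)

# Joining on an expression keeps both copies of 'name' and 'country', so the
# result contains duplicate column names that are ambiguous to select from.
joined = df1.join(df2,
                  (df1.name == df2.name) & (df1.country == df2.country),
                  'left')
joined.show()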

Re: dataframe left joins are not working as expected in pyspark

2015-06-26 Thread Axel Dahl
still feels like a bug to have to create unique names before a join. On Fri, Jun 26, 2015 at 9:51 PM, ayan guha wrote: > You can declare the schema with unique names before creation of df. > On 27 Jun 2015 13:01, "Axel Dahl" wrote: > >> >> I have the following
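
A sketch of the workaround being discussed, reusing df1 and df2 from the reconstructed snippet above: rename the overlapping columns on one side before joining, or, in later Spark releases, join on a list of column names so each join key is kept only once.

# Option 1: give the right-hand side unique names before joining.
df2_renamed = (df2.withColumnRenamed('name', 'name2')
                  .withColumnRenamed('country', 'country2'))
joined = df1.join(df2_renamed,
                  (df1.name == df2_renamed.name2) &
                  (df1.country == df2_renamed.country2),
                  'left')

# Option 2 (later Spark releases): join on a list of column names, which
# keeps a single copy of each join key in the result.
joined2 = df1.join(df2, ['name', 'country'], 'left')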

Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Axel Dahl
vs 1.4? > > Nick > On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl > wrote: > >> still feels like a bug to have to create unique names before a join. >> >> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha > > wrote: >> >>> You can declare the schema with unique

Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Axel Dahl
For example, I recently stumbled upon this >> issue <https://issues.apache.org/jira/browse/SPARK-8670> which was >> specific to 1.4. >> >> On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl >> wrote: >> >>> I've only tested on 1.4, but imagine 1.3 is th

is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-06-29 Thread Axel Dahl
In pyspark, when I convert from rdds to dataframes it looks like the rdd is being materialized/collected/repartitioned before it's converted to a dataframe. Just wondering if there are any guidelines for doing this conversion and whether it's best to do it early to get the performance benefits of da
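
One point worth noting here (a hedged sketch, not the thread's eventual answer): when createDataFrame has to infer a schema from a Python RDD it evaluates rows up front, which accounts for part of the materialization the post describes; supplying an explicit schema avoids that pass.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
rdd = spark.sparkContext.parallelize([("bob", 1), ("alice", 2)])

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# With an explicit schema, createDataFrame does not need to evaluate the RDD
# to infer column types before building the DataFrame.
df = spark.createDataFrame(rdd, schema)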

how do I execute a job on a single worker node in standalone mode

2015-08-17 Thread Axel Dahl
I have a 4-node cluster and have been playing around with the num-executors, executor-memory and executor-cores parameters. I set the following: --executor-memory=10G --num-executors=1 --executor-cores=8 But when I run the job, I see that each worker is running one executor which has 2 cores and
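
For context (a hedged sketch, not the reply Andrew Or gives later in the thread): in standalone mode the YARN-style --num-executors flag does not pin a job to one executor; the standalone scheduler is usually steered with spark.executor.cores and spark.cores.max, and setting the two equal yields a single executor.

from pyspark.sql import SparkSession

# Hypothetical configuration for pinning a job to one 8-core executor on a
# standalone cluster; the master URL and sizes mirror values from the thread.
spark = (SparkSession.builder
         .appName("single-executor-job")
         .master("spark://master.domain.com:7077")
         .config("spark.executor.memory", "10g")
         .config("spark.executor.cores", "8")
         .config("spark.cores.max", "8")   # total cores == cores per executor -> one executor
         .getOrCreate())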

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Axel Dahl
then there would >> be only 1 core on some executor and you'll get what you want >> >> on the other hand, if you application needs all cores of your cluster and >> only some specific job should run on single executor there are few methods >> to achieve this >>
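
The quoted reply is truncated before it lists those methods; one common approach (an assumption, not necessarily what the reply goes on to describe) is to coalesce the data to a single partition so the stage in question runs as one task on one executor.

# Assuming df1 from the earlier sketches: coalesce(1) collapses it to a
# single partition, so downstream narrow operations run as one task.
single = df1.coalesce(1)
single.write.mode("overwrite").csv("/tmp/one-task-output")   # path is hypothetical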

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Axel Dahl
, but only executes everything on 1 node, it looks like it's not grabbing the extra nodes. On Wed, Aug 19, 2015 at 8:43 AM, Axel Dahl wrote: > That worked great, thanks Andrew. > > On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or wrote: > >> Hi Axel, >> >> You can

spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Axel Dahl
in my spark-defaults.conf I have: spark.files file1.zip, file2.py spark.master spark://master.domain.com:7077 If I execute: bin/pyspark I can see it adding the files correctly. However, if I execute bin/spark-submit test.py, where test.py relies on file1.zip, I get an
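
A hedged workaround sketch (not necessarily what the thread settles on): the dependencies can be attached programmatically from inside test.py, which does not depend on spark-defaults.conf being honoured by spark-submit.

# Inside test.py: add the dependencies at runtime instead of relying on the
# spark.files entry in spark-defaults.conf.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()
spark.sparkContext.addPyFile("file1.zip")   # makes the zip importable on executors
spark.sparkContext.addFile("file2.py")      # ships the file to executors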

Re: spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Axel Dahl
bug, could you create a JIRA for it? > > On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl wrote: > > in my spark-defaults.conf I have: > > spark.files file1.zip, file2.py > > spark.master spark://master.domain.com:7077 > > > > If I execute: > &

Re: spark-submit not using conf/spark-defaults.conf

2015-09-03 Thread Axel Dahl
logged it here: https://issues.apache.org/jira/browse/SPARK-10436 On Thu, Sep 3, 2015 at 10:32 AM, Davies Liu wrote: > I think it's a missing feature. > > On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl wrote: > > So a bit more investigation, shows that: > > >

performance when checking if data frame is empty or not

2015-09-08 Thread Axel Dahl
I have a join that fails when one of the data frames is empty. To avoid this I am hoping to check whether the dataframe is empty before the join. The question is: what's the most performant way to do that? Should I do df.count() or df.first() or something else? Thanks in advance, -Axel
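
A sketch of the usual options: df.count() scans the entire DataFrame, whereas fetching at most one row can stop as soon as any row is produced, so a take(1)-style check is generally the cheaper test (recent Spark releases also provide DataFrame.isEmpty()).

def is_empty(df):
    # take(1) needs to produce at most one row, so it is usually much cheaper
    # than a full count() on a large DataFrame.
    return len(df.take(1)) == 0

# Usage (df1 as in the earlier sketches):
if is_empty(df1):
    print("left side is empty, skipping the join")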