I have the following code:

from pyspark import SQLContext

d1 = [{'name': 'bob', 'country': 'usa', 'age': 1},
      {'name': 'alice', 'country': 'jpn', 'age': 2},
      {'name': 'carol', 'country': 'ire', 'age': 3}]
d2 = [{'name': 'bob', 'country': 'usa', 'colour': 'red'},
      {'name': 'alice', 'country': 'ire', 'colou
still feels like a bug to have to create unique names before a join.
On Fri, Jun 26, 2015 at 9:51 PM, ayan guha wrote:
> You can declare the schema with unique names before creation of df.
> On 27 Jun 2015 13:01, "Axel Dahl" wrote:
>
>>
>> I have the following
vs 1.4?
>
> Nick
> On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl wrote:
>
>> still feels like a bug to have to create unique names before a join.
>>
>> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha wrote:
>>
>>> You can declare the schema with unique
>> For example, I recently stumbled upon this
>> issue <https://issues.apache.org/jira/browse/SPARK-8670> which was
>> specific to 1.4.
>>
>> On Sat, Jun 27, 2015 at 12:28 PM, Axel Dahl wrote:
>>
>>> I've only tested on 1.4, but imagine 1.3 is the same.
In pyspark, when I convert from RDDs to DataFrames, it looks like the RDD is
being materialized/collected/repartitioned before it's converted to a
DataFrame.
Just wondering if there are any guidelines for doing this conversion, and
whether it's best to do it early to get the performance benefits of
DataFrames.
I have a 4 node cluster and have been playing around with the num-executors
parameters, executor-memory and executor-cores
I set the following:
--executor-memory=10G
--num-executors=1
--executor-cores=8
But when I run the job, I see that each worker is running one executor
which has 2 cores and
>> then there would
>> be only 1 core on some executor and you'll get what you want
>>
>> on the other hand, if your application needs all cores of your cluster and
>> only some specific job should run on single executor there are few methods
>> to achieve this
>>
, but only executes
everything on 1 node; it looks like it's not grabbing the extra nodes.
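One possible explanation, assuming the cluster above runs Spark standalone: --num-executors is a YARN-only flag, and the standalone scheduler spreads executors across workers by default. A hypothetical submit line that instead caps the total cores granted to the app:

```shell
# Sketch for a Spark standalone cluster; --num-executors only applies on YARN.
# --total-executor-cores caps the cores granted across the whole application.
bin/spark-submit \
  --master spark://master.domain.com:7077 \
  --executor-memory 10G \
  --total-executor-cores 8 \
  test.py
```

Setting spark.deploy.spreadOut=false on the standalone master is another knob: it packs executors onto as few workers as possible instead of spreading them out.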
On Wed, Aug 19, 2015 at 8:43 AM, Axel Dahl wrote:
> That worked great, thanks Andrew.
>
> On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or wrote:
>
>> Hi Axel,
>>
>> You can
in my spark-defaults.conf I have:
spark.files file1.zip, file2.py
spark.master spark://master.domain.com:7077
If I execute:
bin/pyspark
I can see it adding the files correctly.
However if I execute
bin/spark-submit test.py
where test.py relies on file1.zip, I get an error.
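A hedged workaround while spark-submit ignores the spark.files default: pass the dependencies explicitly on the command line (the file names are the ones from the config above; --py-files is a standard spark-submit flag):

```shell
# Workaround sketch: hand the Python dependencies to spark-submit directly
# instead of relying on spark.files from spark-defaults.conf.
bin/spark-submit \
  --master spark://master.domain.com:7077 \
  --py-files file1.zip,file2.py \
  test.py
```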
bug, could you create a JIRA for it?
>
> On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl wrote:
> > in my spark-defaults.conf I have:
> > spark.files file1.zip, file2.py
> > spark.master spark://master.domain.com:7077
> >
> > If I execute:
> >
logged it here:
https://issues.apache.org/jira/browse/SPARK-10436
On Thu, Sep 3, 2015 at 10:32 AM, Davies Liu wrote:
> I think it's a missing feature.
>
> On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl wrote:
> > So a bit more investigation shows that:
> >
>
I have a join that fails when one of the DataFrames is empty.
To avoid this, I am hoping to check whether the DataFrame is empty before
the join.
The question is: what's the most performant way to do that?
Should I do df.count() or df.first() or something else?
Thanks in advance,
-Axel