>> to read about 120 thousand avro files into a single data frame.
>>
>> Is your patch part of a pull request from the master branch in github?
>>
>> Thanks,
>> Prasad.
>>
>> From: Anders Arpteg
>> Date: Thursday, October 22, 2015 at 10:37 AM
Thanks, Koert.
Regards,
Prasad.
From: Koert Kuipers
Date: Thursday, December 17, 2015 at 1:06 PM
To: Prasad Ravilla
Cc: Anders Arpteg, user
Subject: Re: Large number of conf broadcasts
https://github.com/databricks/spark-avro/pull/95
From: Anders Arpteg
Date: Thursday, October 22, 2015 at 10:37 AM
To: Koert Kuipers
Cc: user
Subject: Re: Large number of conf broadcasts
Yes, seems unnecessary. I actually tried patching the com.databricks.spark.avro
reader to only broadcast once per dataset, instead of every single
file/partition. It seems to work just as fine, and there are significantly
fewer broadcasts, and I'm not seeing out of memory issues any more.
Nice Koert, let's hope it gets merged soon.
/Anders
On Fri, Oct 23, 2015 at 6:32 PM Koert Kuipers wrote:
> https://github.com/databricks/spark-avro/pull/95
>
> On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote:
>
>> oh no wonder... it undoes the glob (i was reading from /some/path/*),
>> creates a hadoopRdd for every path, and then creates a union of them using
>> UnionRDD.
https://github.com/databricks/spark-avro/pull/95
On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers wrote:
> oh no wonder... it undoes the glob (i was reading from /some/path/*),
> creates a hadoopRdd for every path, and then creates a union of them using
> UnionRDD.
>
> that's not what i want... no need to do union. AvroInputFormat already has
> the ability to handle globs (or multiple paths comma separated) very
> efficiently.
oh no wonder... it undoes the glob (i was reading from /some/path/*),
creates a hadoopRdd for every path, and then creates a union of them using
UnionRDD.
that's not what i want... no need to do union. AvroInputFormat already has
the ability to handle globs (or multiple paths comma separated) very
efficiently.
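[Editorial note: the approach Koert describes can be sketched in code. This is a sketch only, assuming a Spark 1.5-era `SparkContext` named `sc` and the avro / avro-mapred jars on the classpath; the exact call site inside spark-avro may differ. The point is that `AvroInputFormat` accepts globs and comma-separated paths itself, so a single `hadoopFile` call covers every matching file instead of building one RDD per expanded path and unioning them.]

```scala
// Sketch only: assumes a live SparkContext `sc` (Spark 1.5 era) and the
// avro / avro-mapred jars on the classpath.
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

// One hadoopFile call for the whole glob: AvroInputFormat expands
// "/some/path/*" itself, so there is a single RDD (and a single job
// configuration broadcast) rather than a UnionRDD of per-path RDDs.
val records = sc.hadoopFile(
  "/some/path/*",
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable])

// The behavior Koert observed amounts instead to something like:
//   paths.map(p => sc.hadoopFile(p, ...)).reduce(_ union _)
// which creates one RDD, and one conf broadcast, per path.
```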
Yes, seems unnecessary. I actually tried patching the
com.databricks.spark.avro reader to only broadcast once per dataset,
instead of every single file/partition. It seems to work just as fine, and
there are significantly fewer broadcasts, and I'm not seeing out of memory
issues any more. Strange that mo
i am seeing the same thing. it's going completely crazy creating broadcasts
for the last 15 mins or so. killing it...
On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg wrote:
> Hi,
>
> Running spark 1.5.0 in yarn-client mode, and am curious why there are so
> many broadcasts being done when loading