Nice Koert, let's hope it gets merged soon.

/Anders

On Fri, Oct 23, 2015 at 6:32 PM Koert Kuipers <ko...@tresata.com> wrote:

> https://github.com/databricks/spark-avro/pull/95
>
> On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Oh, no wonder... it undoes the glob (I was reading from /some/path/*),
>> creates a HadoopRDD for every path, and then creates a union of them
>> using UnionRDD.
>>
>> That's not what I want... there is no need to do a union. AvroInputFormat
>> already has the ability to handle globs (or multiple comma-separated
>> paths) very efficiently. AvroRelation should just pass the paths
>> (comma-separated) in a single call.
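>>
>> Something like this is what I have in mind (an untested sketch; the
>> variable names and paths are made up, but sc.hadoopFile and avro's
>> AvroInputFormat are the real APIs, and FileInputFormat splits the path
>> string on commas):
>>
>> import org.apache.avro.generic.GenericRecord
>> import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
>> import org.apache.hadoop.io.NullWritable
>>
>> // one hadoopFile call for all paths (a glob or a comma-separated
>> // list), so the hadoop conf is broadcast once and no UnionRDD is needed
>> val paths = "/some/path/*"  // or "/some/path/a,/some/path/b"
>> val records = sc.hadoopFile(
>>   paths,
>>   classOf[AvroInputFormat[GenericRecord]],
>>   classOf[AvroWrapper[GenericRecord]],
>>   classOf[NullWritable])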
>>
>>
>>
>>
>> On Thu, Oct 22, 2015 at 1:37 PM, Anders Arpteg <arp...@spotify.com>
>> wrote:
>>
>>> Yes, it seems unnecessary. I actually tried patching the
>>> com.databricks.spark.avro reader to only broadcast once per dataset,
>>> instead of once per file/partition. It seems to work just as well: there
>>> are significantly fewer broadcasts, and I am not seeing out-of-memory
>>> issues any more. Strange that more people do not react to this, since
>>> the per-file broadcasting seems completely unnecessary...
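>>>
>>> The shape of the patch is roughly this (an untested sketch, not the
>>> actual diff; HadoopRDD is Spark's @DeveloperApi class, and
>>> SerializableConfiguration is a private[spark] wrapper, so this only
>>> works from inside Spark or a patched spark-avro):
>>>
>>> import org.apache.avro.generic.GenericRecord
>>> import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
>>> import org.apache.hadoop.io.NullWritable
>>> import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
>>> import org.apache.spark.rdd.HadoopRDD
>>> import org.apache.spark.util.SerializableConfiguration
>>>
>>> // broadcast the hadoop conf a single time per dataset...
>>> val confBc = sc.broadcast(
>>>   new SerializableConfiguration(sc.hadoopConfiguration))
>>>
>>> // ...and let every per-file RDD share that one broadcast, instead of
>>> // calling sc.hadoopFile (which re-broadcasts the conf) once per file
>>> val rdds = paths.map { p =>
>>>   new HadoopRDD[AvroWrapper[GenericRecord], NullWritable](
>>>     sc,
>>>     confBc,
>>>     Some((jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, p)),
>>>     classOf[AvroInputFormat[GenericRecord]],
>>>     classOf[AvroWrapper[GenericRecord]],
>>>     classOf[NullWritable],
>>>     sc.defaultMinPartitions)
>>> }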
>>>
>>> Best,
>>> Anders
>>>
>>>
>>> On Thu, Oct 22, 2015 at 7:03 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> I am seeing the same thing. It has gone completely crazy creating
>>>> broadcasts for the last 15 mins or so. Killing it...
>>>>
>>>> On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg <arp...@spotify.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am running Spark 1.5.0 in yarn-client mode, and am curious why
>>>>> there are so many broadcasts being done when loading datasets with a
>>>>> large number of partitions/files. I have datasets with thousands of
>>>>> partitions, i.e. hdfs files in the avro folder, and sometimes load
>>>>> hundreds of these large datasets. I believe I have located the
>>>>> broadcast to line SparkContext.scala:1006. It seems to just broadcast
>>>>> the hadoop configuration, and I don't see why it should be necessary
>>>>> to broadcast that for EVERY file. Wouldn't it be possible to reuse
>>>>> the same broadcast configuration? It is hardly the case that the
>>>>> configuration would be different between each file in a single
>>>>> dataset. It seems to waste lots of memory and to persist
>>>>> unnecessarily to disk (see the log below).
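>>>>>
>>>>> To make it concrete, the loading seems to boil down to something like
>>>>> this hypothetical loop (the names are made up; the point is that
>>>>> every sc.hadoopFile call re-broadcasts the configuration):
>>>>>
>>>>> import org.apache.avro.generic.GenericRecord
>>>>> import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
>>>>> import org.apache.hadoop.io.NullWritable
>>>>>
>>>>> // hypothetical: one hadoopFile call per file, and each call
>>>>> // broadcasts the hadoop configuration again (the broadcast at
>>>>> // SparkContext.scala:1006), so thousands of files mean thousands
>>>>> // of ~23 KB broadcasts per dataset
>>>>> val perFile = files.map { f =>  // files: Seq[String], made up
>>>>>   sc.hadoopFile(
>>>>>     f,
>>>>>     classOf[AvroInputFormat[GenericRecord]],
>>>>>     classOf[AvroWrapper[GenericRecord]],
>>>>>     classOf[NullWritable])
>>>>> }
>>>>> val dataset = sc.union(perFile)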
>>>>>
>>>>> Thanks,
>>>>> Anders
>>>>>
>>>>> 15/09/24 17:11:11 INFO BlockManager: Writing block
>>>>> broadcast_1871_piece0 to disk
>>>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added
>>>>> broadcast_1871_piece0 on disk on 10.254.35.24:49428 (size: 23.1 KB)
>>>>> 15/09/24 17:11:11 INFO MemoryStore: Block broadcast_4803_piece0 stored
>>>>> as bytes in memory (estimated size 23.1 KB, free 2.4 KB)
>>>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_4803_piece0
>>>>> in memory on 10.254.35.24:49428 (size: 23.1 KB, free: 464.0 MB)
>>>>> 15/09/24 17:11:11 INFO SpotifySparkContext: Created broadcast 4803
>>>>> from hadoopFile at AvroRelation.scala:121
>>>>> 15/09/24 17:11:11 WARN MemoryStore: Failed to reserve initial memory
>>>>> threshold of 1024.0 KB for computing block broadcast_4804 in memory
>>>>> .
>>>>> 15/09/24 17:11:11 WARN MemoryStore: Not enough space to cache
>>>>> broadcast_4804 in memory! (computed 496.0 B so far)
>>>>> 15/09/24 17:11:11 INFO MemoryStore: Memory use = 530.3 MB (blocks) +
>>>>> 0.0 B (scratch space shared across 0 tasks(s)) = 530.3 MB. Storage
>>>>> limit = 530.3 MB.
>>>>> 15/09/24 17:11:11 WARN MemoryStore: Persisting block broadcast_4804 to
>>>>> disk instead.
>>>>> 15/09/24 17:11:11 INFO MemoryStore: ensureFreeSpace(23703) called with
>>>>> curMem=556036460, maxMem=556038881
>>>>> 15/09/24 17:11:11 INFO MemoryStore: 1 blocks selected for dropping
>>>>> 15/09/24 17:11:11 INFO BlockManager: Dropping block
>>>>> broadcast_1872_piece0 from memory
>>>>> 15/09/24 17:11:11 INFO BlockManager: Writing block
>>>>> broadcast_1872_piece0 to disk
>>>>>
>>>>>
>>>>
>>>>
>>
>