https://github.com/databricks/spark-avro/pull/95
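
For anyone skimming the thread, here is a toy model of the idea Anders describes in the quoted message: broadcast the configuration once per dataset rather than once per file. It is only a sketch and is not the spark-avro or Spark code. `CountingContext`, `load_per_file`, `load_shared`, the file names and the configuration key are all hypothetical stand-ins made up for illustration.

```python
class CountingContext:
    """Hypothetical stand-in for a SparkContext; it only counts broadcasts."""

    def __init__(self):
        self.broadcast_count = 0

    def broadcast(self, value):
        # A real broadcast would ship `value` to the executors.
        # Here we only record that a broadcast was requested.
        self.broadcast_count += 1
        return value


def load_per_file(ctx, conf, paths):
    # Pattern reported in the thread: the same configuration is
    # broadcast again for every file in the dataset.
    return [(path, ctx.broadcast(conf)) for path in paths]


def load_shared(ctx, conf, paths):
    # Pattern suggested in the thread: broadcast once per dataset
    # and let every file reuse the same handle.
    shared_conf = ctx.broadcast(conf)
    return [(path, shared_conf) for path in paths]


# Made-up dataset of 1000 files and a made-up configuration.
paths = ["part-%05d.avro" % i for i in range(1000)]
conf = {"example.key": "example-value"}

per_file_ctx = CountingContext()
load_per_file(per_file_ctx, conf, paths)

shared_ctx = CountingContext()
load_shared(shared_ctx, conf, paths)

print(per_file_ctx.broadcast_count, shared_ctx.broadcast_count)  # 1000 1
```

With one broadcast per file, the number of broadcast blocks grows with the number of files, which matches the steadily increasing broadcast IDs in the quoted log.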

On Thu, Dec 17, 2015 at 3:35 PM, Prasad Ravilla <pras...@slalom.com> wrote:

> Hi Anders,
>
> I am running into the same issue as you. I am trying to read about 120
> thousand Avro files into a single data frame.
>
> Is your patch part of a pull request against the master branch on GitHub?
>
> Thanks,
> Prasad.
>
> From: Anders Arpteg
> Date: Thursday, October 22, 2015 at 10:37 AM
> To: Koert Kuipers
> Cc: user
> Subject: Re: Large number of conf broadcasts
>
> Yes, it seems unnecessary. I actually tried patching the
> com.databricks.spark.avro reader to broadcast only once per dataset
> instead of once for every file/partition. It seems to work just as well:
> there are significantly fewer broadcasts, and I am no longer seeing
> out-of-memory issues. Strange that more people have not reacted to this,
> since the broadcasting seems completely unnecessary...
>
> Best,
> Anders
>
> On Thu, Oct 22, 2015 at 7:03 PM Koert Kuipers <ko...@tresata.com> wrote:
>
>> I am seeing the same thing. It has gone completely crazy creating
>> broadcasts for the last 15 minutes or so. Killing it...
>>
>> On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg <arp...@spotify.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am running Spark 1.5.0 in yarn-client mode and am curious why there are
>>> so many broadcasts being done when loading datasets with a large number of
>>> partitions/files. We have datasets with thousands of partitions, i.e. HDFS
>>> files in the Avro folder, and sometimes load hundreds of these large
>>> datasets. I believe I have located the broadcast at line
>>> SparkContext.scala:1006. It seems to just broadcast the Hadoop
>>> configuration, and I don't see why it should be necessary to broadcast that
>>> for EVERY file. Wouldn't it be possible to reuse the same broadcast
>>> configuration? It is hardly the case that the configuration would differ
>>> between the files in a single dataset. This seems to waste lots of memory
>>> and forces blocks to be persisted to disk unnecessarily (see the log below).
>>>
>>> Thanks,
>>> Anders
>>>
>>> 15/09/24 17:11:11 INFO BlockManager: Writing block broadcast_1871_piece0
>>> to disk
>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_1871_piece0 on
>>> disk on 10.254.35.24:49428 (size: 23.1 KB)
>>> 15/09/24 17:11:11 INFO MemoryStore: Block broadcast_4803_piece0 stored
>>> as bytes in memory (estimated size 23.1 KB, free 2.4 KB)
>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_4803_piece0 in
>>> memory on 10.254.35.24:49428 (size: 23.1 KB, free: 464.0 MB)
>>> 15/09/24 17:11:11 INFO SpotifySparkContext: Created broadcast 4803 from
>>> hadoopFile at AvroRelation.scala:121
>>> 15/09/24 17:11:11 WARN MemoryStore: Failed to reserve initial memory
>>> threshold of 1024.0 KB for computing block broadcast_4804 in memory.
>>> 15/09/24 17:11:11 WARN MemoryStore: Not enough space to cache
>>> broadcast_4804 in memory! (computed 496.0 B so far)
>>> 15/09/24 17:11:11 INFO MemoryStore: Memory use = 530.3 MB (blocks) + 0.0
>>> B (scratch space shared across 0 tasks(s)) = 530.3 MB. Storage
>>> limit = 530.3 MB.
>>> 15/09/24 17:11:11 WARN MemoryStore: Persisting block broadcast_4804 to
>>> disk instead.
>>> 15/09/24 17:11:11 INFO MemoryStore: ensureFreeSpace(23703) called with
>>> curMem=556036460, maxMem=556038881
>>> 15/09/24 17:11:11 INFO MemoryStore: 1 blocks selected for dropping
>>> 15/09/24 17:11:11 INFO BlockManager: Dropping block
>>> broadcast_1872_piece0 from memory
>>> 15/09/24 17:11:11 INFO BlockManager: Writing block broadcast_1872_piece0
>>> to disk
>>>
>>>
>>
>>
