Nice Koert, let's hope it gets merged soon. /Anders
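[Editor's note: a minimal sketch of the single-HadoopRDD approach Koert describes below. This is illustrative, not the actual PR diff; readAvro is a hypothetical helper, and it assumes Spark 1.x's sc.hadoopFile together with Avro's mapred AvroInputFormat, whose FileInputFormat base class accepts a comma-separated list of input paths.]

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // One hadoopFile call for ALL paths: FileInputFormat accepts a
    // comma-separated path list, so a single HadoopRDD (and a single
    // broadcast of the Hadoop configuration) covers the whole dataset,
    // with no UnionRDD of per-path RDDs.
    def readAvro(sc: SparkContext, paths: Seq[String])
        : RDD[(AvroWrapper[GenericRecord], NullWritable)] =
      sc.hadoopFile(
        paths.mkString(","),  // e.g. the expanded glob /some/path/*
        classOf[AvroInputFormat[GenericRecord]],
        classOf[AvroWrapper[GenericRecord]],
        classOf[NullWritable])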
On Fri, Oct 23, 2015 at 6:32 PM Koert Kuipers <ko...@tresata.com> wrote:

> https://github.com/databricks/spark-avro/pull/95
>
> On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Oh, no wonder... it undoes the glob (I was reading from /some/path/*),
>> creates a hadoopRdd for every path, and then creates a union of them
>> using UnionRDD.
>>
>> That's not what I want... there is no need for a union. AvroInputFormat
>> already has the ability to handle globs (or multiple comma-separated
>> paths) very efficiently. AvroRelation should just pass the paths, comma
>> separated.
>>
>> On Thu, Oct 22, 2015 at 1:37 PM, Anders Arpteg <arp...@spotify.com> wrote:
>>
>>> Yes, that seems unnecessary. I actually tried patching the
>>> com.databricks.spark.avro reader to only broadcast once per dataset,
>>> instead of once per file/partition. It seems to work just as well, there
>>> are significantly fewer broadcasts, and I am not seeing out-of-memory
>>> issues any more. Strange that more people don't react to this, since the
>>> broadcasting seems completely unnecessary...
>>>
>>> Best,
>>> Anders
>>>
>>> On Thu, Oct 22, 2015 at 7:03 PM Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> I am seeing the same thing. It has gone completely crazy creating
>>>> broadcasts for the last 15 mins or so. Killing it...
>>>>
>>>> On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg <arp...@spotify.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Running Spark 1.5.0 in yarn-client mode, and I am curious why so many
>>>>> broadcasts are created when loading datasets with a large number of
>>>>> partitions/files. I have datasets with thousands of partitions, i.e.
>>>>> HDFS files in the Avro folder, and sometimes load hundreds of these
>>>>> large datasets. I believe I have traced the broadcast to
>>>>> SparkContext.scala:1006. It seems to just broadcast the Hadoop
>>>>> configuration, and I don't see why it should be necessary to broadcast
>>>>> that for EVERY file. Wouldn't it be possible to reuse the same
>>>>> broadcast configuration? It's hardly the case that the configuration
>>>>> would differ between files in a single dataset. This seems to waste
>>>>> lots of memory and causes unnecessary persisting to disk (see log
>>>>> below).
>>>>>
>>>>> Thanks,
>>>>> Anders
>>>>>
>>>>> 15/09/24 17:11:11 INFO BlockManager: Writing block broadcast_1871_piece0 to disk
>>>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_1871_piece0 on disk on 10.254.35.24:49428 (size: 23.1 KB)
>>>>> 15/09/24 17:11:11 INFO MemoryStore: Block broadcast_4803_piece0 stored as bytes in memory (estimated size 23.1 KB, free 2.4 KB)
>>>>> 15/09/24 17:11:11 INFO BlockManagerInfo: Added broadcast_4803_piece0 in memory on 10.254.35.24:49428 (size: 23.1 KB, free: 464.0 MB)
>>>>> 15/09/24 17:11:11 INFO SpotifySparkContext: Created broadcast 4803 from hadoopFile at AvroRelation.scala:121
>>>>> 15/09/24 17:11:11 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_4804 in memory.
>>>>> 15/09/24 17:11:11 WARN MemoryStore: Not enough space to cache broadcast_4804 in memory! (computed 496.0 B so far)
>>>>> 15/09/24 17:11:11 INFO MemoryStore: Memory use = 530.3 MB (blocks) + 0.0 B (scratch space shared across 0 tasks(s)) = 530.3 MB. Storage limit = 530.3 MB.
>>>>> 15/09/24 17:11:11 WARN MemoryStore: Persisting block broadcast_4804 to disk instead.
>>>>> 15/09/24 17:11:11 INFO MemoryStore: ensureFreeSpace(23703) called with curMem=556036460, maxMem=556038881
>>>>> 15/09/24 17:11:11 INFO MemoryStore: 1 blocks selected for dropping
>>>>> 15/09/24 17:11:11 INFO BlockManager: Dropping block broadcast_1872_piece0 from memory
>>>>> 15/09/24 17:11:11 INFO BlockManager: Writing block broadcast_1872_piece0 to disk
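[Editor's note: a rough sketch of the one-broadcast-per-dataset idea Anders describes above. SerializableConf and broadcastOnce are hypothetical names for illustration; the real fix would live inside AvroRelation/HadoopRDD rather than in user code, and Spark itself carries a similar private wrapper for this purpose.]

    import java.io.{ObjectInputStream, ObjectOutputStream}

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Configuration is not java-serializable, so wrap it with custom
    // serialization that delegates to its Writable read/write methods.
    class SerializableConf(@transient var conf: Configuration)
        extends Serializable {
      private def writeObject(out: ObjectOutputStream): Unit = {
        out.defaultWriteObject()
        conf.write(out)            // Configuration implements Writable
      }
      private def readObject(in: ObjectInputStream): Unit = {
        in.defaultReadObject()
        conf = new Configuration(false)
        conf.readFields(in)
      }
    }

    // Broadcast the configuration ONCE for the whole dataset; every
    // per-file read then reuses this single broadcast instead of
    // creating its own (one broadcast per file, as in the log above).
    def broadcastOnce(sc: SparkContext): Broadcast[SerializableConf] =
      sc.broadcast(new SerializableConf(sc.hadoopConfiguration))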