Hello,

I have a streaming use case where I plan to keep a dataset broadcast and
cached on each executor. Every micro batch creates a DataFrame from the
incoming RDD and joins it against that dataset.
The code below performs the broadcast operation for every RDD. Is there a
way to broadcast it just once?

Alternative approaches are also welcome.

    // Needed for broadcast() and for rdd.toDF()
    import org.apache.spark.sql.functions.broadcast
    import sqlContext.implicits._

    val DF1 = sqlContext.read.format("json").schema(schema1).load(file1)

    // Static metadata: built once and cached on the executors
    val metaDF = sqlContext.read.format("json").schema(schema1).load(file2)
      .join(DF1, "id")
    metaDF.cache()

    val lines = streamingcontext.textFileStream(path)

    lines.foreachRDD { rdd =>
      val recordDF = rdd.flatMap(r => Record(r)).toDF()
      // broadcast() here is re-evaluated on every micro batch
      val joinedDF = recordDF.join(broadcast(metaDF), "id")

      joinedDF.rdd.foreachPartition { partition =>
        partition.foreach { row =>
          ...
          ...
        }
      }
    }

    streamingcontext.start()
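
One alternative I have been considering: collect the metadata to the driver
once and broadcast it explicitly with sparkContext.broadcast, then do a
map-side lookup instead of a DataFrame join on every batch. A rough sketch,
assuming the metadata fits in driver memory (metaMap and the Record.id field
are illustrative names, not from the code above):

    // Broadcast the metadata exactly once, up front
    val metaMap = metaDF.rdd
      .map(row => row.getAs[String]("id") -> row)
      .collectAsMap()
    val metaBC = streamingcontext.sparkContext.broadcast(metaMap)

    lines.foreachRDD { rdd =>
      rdd.flatMap(r => Record(r)).foreachPartition { partition =>
        val meta = metaBC.value  // already resident on the executor
        partition.foreach { record =>
          meta.get(record.id).foreach { metaRow =>
            // combine the streaming record with its metadata row
          }
        }
      }
    }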

On a similar note, if metaDF is too big to broadcast, can I partition it
(df.repartition($"col")) by the join key and partition each streaming RDD
the same way? That way I could avoid shuffling metaDF on every batch.
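
Concretely, something like the sketch below is what I have in mind; whether
Spark's planner will actually keep the cached side in place and only shuffle
the small streaming side is exactly the part I am unsure about:

    // Partition the metadata once by the join key and cache that layout
    val partitionedMeta = metaDF.repartition($"id").cache()

    lines.foreachRDD { rdd =>
      val recordDF = rdd.flatMap(r => Record(r)).toDF()
      // Repartition each batch on the same key before joining
      val joinedDF = recordDF.repartition($"id").join(partitionedMeta, "id")
      ...
    }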

Let me know your thoughts.

Thanks
