I'm seeing low performance while parsing JSON data. My cluster setup is
Spark 1.2.0 with 10 nodes, each having 15 GB of memory and 4 cores.
I tried both scala.util.parsing.json.JSON and FasterXML's Jackson
parser.
This is basically what I do:
//Approach 1: Jackson with the Scala module
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}

val jsonStream = myDStream.map { x =>
  // Note: this builds a new ObjectMapper for every record
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  mapper.readValue[Map[String, Any]](x)
}
jsonStream.count().print()
//Approach 2: Scala's built-in JSON parser
import scala.util.parsing.json.JSON

val jsonStream2 = myDStream.map(
  JSON.parseFull(_).get.asInstanceOf[scala.collection.immutable.Map[String, Any]])
jsonStream2.count().print()
It takes around 15-20 seconds to process/parse the 35k JSON documents
(containing nested documents and arrays) that I put into the stream.
Is there a better approach/parser to process them faster? I also tried it
with mapPartitions, but it did not make any difference.
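
For reference, the mapPartitions variant I tried looked roughly like this,
creating one mapper per partition instead of one per record (jsonStream3 is
just a placeholder name):

//mapPartitions variant (sketch): reuse one mapper per partition
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}

val jsonStream3 = myDStream.mapPartitions { records =>
  // One mapper instance shared by all records in this partition
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  records.map(record => mapper.readValue[Map[String, Any]](record))
}
jsonStream3.count().print()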
Thanks
Best Regards