That's why the XMLInputFormat suggested by Akhil is a good idea. It should give
you the full XML object as one record (as opposed to an XML record spread
across multiple line records in textFileStream). Then you could convert
each record into JSON, thereby making it a JSON RDD. Then you can save it.
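For what it's worth, here is a rough sketch of that idea in Scala. It assumes Mahout's XmlInputFormat (any InputFormat that emits one complete XML element per record would do), and the directory, tag names and JSON fields are only placeholders:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.xml.XML
// Assumption: Mahout's tag-splitting input format (package differs across Mahout versions).
import org.apache.mahout.text.wikipedia.XmlInputFormat

object XmlStreamToJson {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("xml-to-json"), Seconds(30))

    // Tell the input format which tag delimits one record
    // (key names follow Mahout's XmlInputFormat; placeholder tag).
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("xmlinput.start", "<record>")
    hadoopConf.set("xmlinput.end", "</record>")

    // Each value is one full <record>...</record> element, not a single line.
    val xmlRecords = ssc
      .fileStream[LongWritable, Text, XmlInputFormat]("hdfs:///input/xml")
      .map(_._2.toString)

    // Convert every XML record into a small JSON document (placeholder fields).
    val jsonRecords = xmlRecords.map { recordXml =>
      val elem = XML.loadString(recordXml)
      val id   = (elem \ "id").text
      val name = (elem \ "name").text
      s"""{"id":"$id","name":"$name"}"""
    }

    jsonRecords.saveAsTextFiles("hdfs:///output/json")

    ssc.start()
    ssc.awaitTermination()
  }
}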
textFileStream and the default fileStream recognize the compressed xml (.xml.gz) files.
Each line in the xml file is an element in RDD[String].
The whole RDD is then converted to proper xml-format data and stored in a *Scala variable*.
- I believe storing huge data in a *Scala variable* is inefficient.
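To make the concern concrete, that approach looks roughly like the sketch below (directory and batch interval are placeholders, and it assumes one file per batch). The collect() into a single driver-side value is the part that does not scale:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.xml.XML

object GzXmlViaTextFileStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("gz-xml-lines"), Seconds(30))

    // textFileStream decompresses .xml.gz transparently, but every line of the
    // file becomes a separate element of the resulting RDD[String].
    val lines = ssc.textFileStream("hdfs:///input/xmlgz")

    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Reassembling the document pulls the whole RDD back to the driver and
        // holds it in one Scala variable; this is the inefficiency noted above.
        val wholeXml = rdd.collect().mkString("\n")
        val doc = XML.loadString(wholeXml)
        println(s"root element: ${doc.label}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}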
One approach would be: if you are using fileStream, you can access the
individual filenames from the partitions, and with that filename you can
apply your uncompression/parsing logic and get it done.
Like:

UnionPartition upp = (UnionPartition) ds.values().getPartitions()[i];
NewHadoopPartition ...
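If the internal partition classes feel too fragile, here is an alternative sketch of the same idea (get the file name, then apply your own uncompression/parsing logic per file) using the public binaryFiles API; the path and the fields that are kept are placeholders:

import java.util.zip.GZIPInputStream
import org.apache.spark.{SparkConf, SparkContext}
import scala.xml.{Elem, XML}

object GzXmlByFilename {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("gz-xml-by-filename"))

    // binaryFiles yields (fileName, stream) pairs, so the decompression and
    // parsing logic can be chosen per file based on its name/extension.
    val parsed = sc.binaryFiles("hdfs:///input/*.xml.gz").map { case (fileName, stream) =>
      val in = new GZIPInputStream(stream.open())   // custom uncompression logic
      try {
        val doc: Elem = XML.load(in)                // custom parsing logic
        (fileName, doc.label)                       // keep e.g. the root element name
      } finally {
        in.close()
      }
    }

    parsed.collect().foreach(println)
    sc.stop()
  }
}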
Hi All,
Processing streaming JSON files with Spark features (Spark Streaming and
Spark SQL) is very efficient and works like a charm.
Below is the code snippet to process JSON files.
windowDStream.foreachRDD(IncomingFiles => {
  val IncomingFilesTable = sqlContext.jsonRDD(IncomingFiles)
  ...
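For completeness, a self-contained sketch of that pattern against the Spark 1.x APIs (jsonRDD, registerTempTable); the directory, window sizes and the query are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingJsonToSql {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("streaming-json"), Seconds(30))
    val sqlContext = new SQLContext(ssc.sparkContext)

    // One JSON document per line in each new file dropped into the directory.
    val lines = ssc.textFileStream("hdfs:///input/json")
    val windowDStream = lines.window(Seconds(60), Seconds(30))

    windowDStream.foreachRDD { IncomingFiles =>
      if (!IncomingFiles.isEmpty()) {
        // Infer a schema from the JSON records and query them with Spark SQL.
        val IncomingFilesTable = sqlContext.jsonRDD(IncomingFiles)
        IncomingFilesTable.registerTempTable("incoming")
        sqlContext.sql("SELECT COUNT(*) FROM incoming").collect().foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}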