Re: Spark Streaming with compressed xml files

2015-03-16 Thread Tathagata Das
That's why the XMLInputFormat suggested by Akhil is a good idea. It should give you the full XML object as one record (as opposed to an XML record spread across multiple line records in textFileStream). Then you could convert each record into JSON, thereby making it a JSON RDD. Then you can save it as a …
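A minimal sketch of the record-to-JSON step, assuming each DStream element is already one whole XML record (as an XmlInputFormat would deliver it). The regex-based converter below is a hypothetical stand-in for a real XML/JSON library, and the Spark wiring in the comments is illustrative only:

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlToJson {
    // Hypothetical helper: turn one flat XML record, e.g.
    // <person><name>Ann</name><age>31</age></person>, into a minimal JSON string.
    // A real job should use a proper XML parser and JSON library instead of a regex.
    static String xmlRecordToJson(String record) {
        Pattern field = Pattern.compile("<(\\w+)>([^<]*)</\\1>");
        Matcher m = field.matcher(record);
        StringJoiner json = new StringJoiner(",", "{", "}");
        while (m.find()) {
            json.add("\"" + m.group(1) + "\":\"" + m.group(2) + "\"");
        }
        return json.toString();
    }

    public static void main(String[] args) {
        // Assumed Spark Streaming wiring (not runnable here; XmlInputFormat is a
        // custom/third-party InputFormat, e.g. the one that ships with Mahout):
        //   JavaPairInputDStream<LongWritable, Text> records =
        //       jssc.fileStream(dir, LongWritable.class, Text.class, XmlInputFormat.class);
        //   records.map(t -> xmlRecordToJson(t._2.toString())) ...
        System.out.println(xmlRecordToJson("<person><name>Ann</name><age>31</age></person>"));
        // prints {"name":"Ann","age":"31"}
    }
}
```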

Re: Spark Streaming with compressed xml files

2015-03-16 Thread Vijay Innamuri
textFileStream and the default fileStream recognize the compressed XML (.xml.gz) files. Each line in the XML file becomes an element in an RDD[String]. Then the whole RDD is converted to proper XML data and stored in a *Scala variable*. - I believe storing huge data in a *Scala variable* is inefficient …
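One way to avoid pulling everything into a driver-side variable is to reassemble the line-split records on the executors, mapPartitions-style, and keep the result distributed. A sketch under that assumption (the tag name and Spark calls in the comments are hypothetical, and it assumes no record spans two partitions):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RecordGrouper {
    // Glue line-split XML back into whole <record>...</record> strings.
    // In Spark this would run once per partition on the executors, instead of
    // collect()-ing the RDD into a single driver-side variable.
    static List<String> groupRecords(Iterator<String> lines, String endTag) {
        List<String> records = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        while (lines.hasNext()) {
            String line = lines.next();
            buf.append(line);
            if (line.contains(endTag)) {       // record complete: emit and reset
                records.add(buf.toString());
                buf.setLength(0);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed Spark usage (illustrative, not runnable here):
        //   rdd.mapPartitions(it -> groupRecords(it, "</record>").iterator())
        //      .saveAsTextFile("hdfs://...");   // stays distributed, no collect()
        List<String> in = List.of("<record>", "<x>1</x>", "</record>",
                                  "<record>", "<x>2</x>", "</record>");
        System.out.println(groupRecords(in.iterator(), "</record>"));
    }
}
```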

Re: Spark Streaming with compressed xml files

2015-03-15 Thread Akhil Das
One approach would be: if you are using fileStream, you can access the individual filenames from the partitions, and with that filename you can apply your uncompression/parsing logic. Like: UnionPartition upp = (UnionPartition) ds.values().getPartitions()[i]; NewHadoo…
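Once the .xml.gz filename for a partition is known, the uncompression step itself is straightforward with the JDK's GZIPInputStream. A self-contained sketch (the per-partition filename lookup via the partition internals is left as a comment, since it depends on Spark's private classes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class Gunzip {
    // Inflate the bytes of one .xml.gz file to its XML text.
    // The bytes would be read from the filename recovered per partition
    // (e.g. via the UnionPartition/NewHadoopPartition cast shown above).
    static String gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        // Round-trip demo: compress a tiny XML snippet, then inflate it back.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("<a>hi</a>".getBytes("UTF-8"));
        }
        System.out.println(gunzip(bos.toByteArray()));   // prints <a>hi</a>
    }
}
```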