I am trying to convert terabytes of JSON log files into Parquet files,
but I need to clean the data a little first.
I ended up doing the following:
val txt = sc.textFile(inpath).coalesce(800)
val json = (for {
  line <- txt
  JObject(child) = parse(line)
  child2 = (for {
    JField(name, value) <- child
    _ <- patt(name) // filter fields with invalid names
  } yield JField(name.toLowerCase, value))
} yield compact(render(JObject(child2))))
sqx.jsonRDD(json, 5e-2).saveAsParquetFile(outpath)
One glaring inefficiency is that after parsing and cleaning the data I
reserialize it by calling compact(render(JObject(child2))), only to pass
the text to jsonRDD to be parsed again. However, I see no way to turn an
RDD of json4s objects directly into a SchemaRDD without turning it back
into text first.
Is there any way to do this?
I am also open to other suggestions for speeding up the above code;
it is very slow in its current form.
I would also like to make jsonFile drop invalid JSON records rather than
failing the entire job. Is that possible?
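To make the last point concrete, here is a minimal sketch of the behaviour I am after, written against plain Scala iterators rather than Spark. It does not change what jsonFile itself does; it only shows bad records being dropped at the parsing step. The names DropInvalid and parseValid are hypothetical, and the parser is passed in as a parameter (in the job above it would presumably be the json4s parse call).

```scala
import scala.util.Try

object DropInvalid {
  // Apply `parse` to every line and keep only the successful results.
  // Lines for which `parse` throws a non-fatal exception are silently dropped.
  // `parse` is a stand-in for a real JSON parser (an assumption, not a Spark API).
  def parseValid[A](lines: Iterator[String])(parse: String => A): Iterator[A] =
    lines.flatMap(line => Try(parse(line)).toOption)
}
```

For example, with String.toInt standing in for the parser, DropInvalid.parseValid(Iterator("1", "oops", "3"))(_.toInt).toList keeps 1 and 3 and drops "oops". I assume the same flatMap over Try(...).toOption could be applied to txt before the cleaning step, but I have not verified how it performs at this scale.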
Thanks,
Daniel