Hi All, I have a data set where each record is serialized using JSON, and I'm interested in using SchemaRDDs to work with the data. Unfortunately I've hit a snag, since some fields in the data are maps and lists, and are not guaranteed to be populated for each record. This seems to cause inferSchema to throw an error:
Produces an error:

    srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
                                              {'foo':'boom', 'baz':[1,2,3]}]))

Works fine:

    srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
                                              {'foo':'boom', 'baz':[]}]))

To be fair, inferSchema says it "peeks at the first row", so a possible work-around would be to make sure the type of any collection can be determined from the first record (a rough sketch is in the P.S. below). However, I don't believe items in an RDD are guaranteed to stay in any particular order, so this approach seems somewhat brittle.

Does anybody know a robust solution to this problem in PySpark? I'm running the 1.0.1 release.

-Brad
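
P.S. For reference, here is a rough sketch of the work-around I was considering: explicitly pull one fully populated record to the front before calling inferSchema. The filter predicate on 'baz' is just illustrative for this toy data set; I haven't verified how well this generalizes to maps or to multiple optional fields.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="infer-schema-workaround")
    sqlCtx = SQLContext(sc)

    records = sc.parallelize([{'foo': 'bar',  'baz': []},
                              {'foo': 'boom', 'baz': [1, 2, 3]}])

    # Grab one record whose 'baz' actually contains elements, so its element
    # type can be inferred, and prepend it to the RDD before inferSchema runs.
    sample = records.filter(lambda r: len(r['baz']) > 0).take(1)
    srdd = sqlCtx.inferSchema(sc.parallelize(sample).union(records))

Of course this duplicates the sampled record in the resulting SchemaRDD, which is part of why it feels brittle to me.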