Hi All,

I have a data set where each record is serialized as JSON, and I'm
interested in using SchemaRDDs to work with the data.  Unfortunately I've
hit a snag: some fields in the data are maps and lists, and they are not
guaranteed to be populated for each record.  This seems to cause
inferSchema to throw an error:
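(In the snippets below, sqlCtx is assumed to be an ordinary SQLContext
created from the SparkContext, i.e.

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

so the only difference between the two calls is the order of the records.)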

Produces an error (the first record has an empty list):
srdd = sqlCtx.inferSchema(sc.parallelize([{'foo': 'bar', 'baz': []},
                                          {'foo': 'boom', 'baz': [1, 2, 3]}]))

Works fine (the first record has a populated list):
srdd = sqlCtx.inferSchema(sc.parallelize([{'foo': 'bar', 'baz': [1, 2, 3]},
                                          {'foo': 'boom', 'baz': []}]))

To be fair, inferSchema's documentation says it "peeks at the first row",
so a possible workaround would be to make sure the element type of any
collection can be determined from the first record (see the rough sketch
after this paragraph).  However, I don't believe that items in an RDD are
guaranteed to stay in any particular order, so this approach seems
somewhat brittle.
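
Here is a minimal sketch of that workaround, assuming the data starts out
as a local Python list and the same sqlCtx/sc as above;
has_populated_collections is just a throwaway helper, not part of the Spark
API.  Once the data already lives in an RDD there is no cheap way to
control which record inferSchema peeks at, which is exactly the brittleness
I mean:

# Sketch only: move a record whose collections are all populated to the
# front of a local list before parallelizing, so inferSchema's peek at the
# first row can determine the element types.
records = [{'foo': 'bar', 'baz': []},
           {'foo': 'boom', 'baz': [1, 2, 3]}]

def has_populated_collections(rec):
    # True when every list/dict value in the record is non-empty
    return all(v for v in rec.values() if isinstance(v, (list, dict)))

representative = next(r for r in records if has_populated_collections(r))
reordered = [representative] + [r for r in records if r is not representative]

srdd = sqlCtx.inferSchema(sc.parallelize(reordered))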

Does anybody know a robust solution to this problem in PySpark?  I'm
running the 1.0.1 release.

-Brad
