> Will this be able to handle projection pushdown if a given job doesn't utilize all the columns in the schema? Or should people have a per-job schema?

As currently written, we will do a little bit of extra work to pull out fields that aren't needed. I think it would be pretty straightforward to add a rule to the optimizer that prunes the schema passed to the JsonToStruct expression when there is another Project operator present. (Until then, you can get the same effect by hand; see the first sketch below.)

> I’m not a spark guru, but I would have hoped that DataSets and DataFrames were more dynamic.

We are dynamic in that all of these decisions can be made at runtime, and you can even look at the data when making them. We do, however, need to know the schema before any single query begins executing, so that we can give good analysis error messages and generate efficient bytecode in our code generation.

> You should be doing schema inference. JSON includes the schema with each record and you should take advantage of it. I guess the only issue is that DataSets / DataFrames have static schemas and structures. Then if your first record doesn’t include all of the columns you will have a problem.

I agree that for ad-hoc use cases we should make it easy to infer the schema. I would also argue that for a production pipeline you need the ability to specify it manually to avoid surprises. There are several tricky cases here. You bring up the fact that the first record might be missing fields, but in many data sets there are fields that are only present in 1 out of 100,000s of records. Even if all fields are present, sometimes it can be very expensive to get even the first record (say you are reading from an expensive query coming from the JDBC data source). Another issue is that inference means you need to read some data before the user explicitly starts the query. Historically, cases where we do this have been pretty confusing to users of Spark (think: the surprise job that finds partition boundaries for RDD.sort). So, I think we should add inference, but it should be in addition to the API proposed in this PR (see the second sketch below).
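
On the projection-pushdown point: until an optimizer rule prunes the schema automatically, a caller can pass a narrower, per-query schema to `from_json` so the expression never materializes the unused fields. A minimal sketch (the column and field names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("from_json-pruning").getOrCreate()
import spark.implicits._

// Hypothetical input: a column of JSON strings with three fields.
val df = Seq("""{"a": 1, "b": "x", "c": 3.0}""").toDF("json")

val fullSchema = new StructType()
  .add("a", IntegerType)
  .add("b", StringType)
  .add("c", DoubleType)

// Per-query pruned schema: only the field this job actually uses.
val prunedSchema = new StructType().add("a", IntegerType)

// With the full schema, b and c are parsed and then dropped by the Project.
df.select(from_json($"json", fullSchema).getField("a").as("a")).show()

// With the pruned schema, the JSON parser skips b and c entirely.
df.select(from_json($"json", prunedSchema).getField("a").as("a")).show()
```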
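
And on inference vs. an explicit schema, a rough sketch of how the two modes could coexist. This uses `spark.read.json` on a `Dataset[String]` (available in newer Spark versions) purely to illustrate the trade-off; the names and the sample size here are illustrative, not part of the proposed API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("from_json-inference").getOrCreate()
import spark.implicits._

val df = Seq(
  """{"a": 1, "b": "x"}""",
  """{"a": 2}"""              // `b` is missing here; a rare field could be absent from the sample too
).toDF("json")

// Ad-hoc: infer the schema from a sample of the data. Note this kicks off a
// Spark job before the user's query runs -- the "surprise job" issue above.
val inferredSchema = spark.read.json(df.select($"json").as[String].limit(1000)).schema

// Production: declare the schema up front so rare or missing fields can't
// change the result type between runs.
val declaredSchema = new StructType()
  .add("a", LongType)
  .add("b", StringType)

df.select(from_json($"json", inferredSchema).as("parsed")).printSchema()
df.select(from_json($"json", declaredSchema).as("parsed")).printSchema()
```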