In 1.4, you can use the "struct" function to create a struct. For example, you can explicitly select out the "version" column and then create a new struct named "settings".
The current semantics of select closely follow those of SQL in relational databases, which are well understood and well defined. I wouldn't add any magic to select for nested data, because that area is not well researched or defined, and we might run into conflicting scenarios.

On Sat, May 9, 2015 at 1:10 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Take a look:
>
> >>> df = sqlContext.jsonRDD(sc.parallelize(['{"settings": {"os": "OS X", "version": "10.10"}}']))
> >>> df.printSchema()
> root
>  |-- settings: struct (nullable = true)
>  |    |-- os: string (nullable = true)
>  |    |-- version: string (nullable = true)
>
> >>> # Now I want to "drop" the version column by
> >>> # selecting everything else.
> >>> # I want to preserve the schema otherwise.
> >>> # That means `os` should stay nested under
> >>> # `settings`.
>
> >>> df.select('settings.os').printSchema()
> root
>  |-- os: string (nullable = true)
>
> >>> df.select('settings', 'settings.os').printSchema()
> root
>  |-- settings: struct (nullable = true)
>  |    |-- os: string (nullable = true)
>  |    |-- version: string (nullable = true)
>  |-- os: string (nullable = true)
>
> >>> df.select(df['settings.os'].alias('settings.os')).printSchema()
> root
>  |-- settings.os: string (nullable = true)
>
> In all cases, selecting a nested field loses the original nesting of that
> field.
>
> What I want is to select settings.os and get back a DataFrame with the
> following schema:
>
> root
>  |-- settings: struct (nullable = true)
>  |    |-- os: string (nullable = true)
>
> In other words, I want to preserve the fact that os is nested under
> settings.
>
> I’m doing this as a work-around for the fact that PySpark does not
> currently support dropping columns.
>
> Until direct support for such a feature lands as part of SPARK-7509
> <https://issues.apache.org/jira/browse/SPARK-7509>, selecting all columns
> but the ones you want to drop seems way better than directly manipulating
> the schema (which is the hackier and way more complex alternative for
> rolling your own “drop” logic).
>
> And you want that process to preserve the schema as much as possible, which
> I assume is how a native “drop column” method would work.
>
> Is it possible though? Or do we have to do direct schema manipulation?
>
> Nick