Re: PySpark DataFrame: Preserving nesting when selecting a nested field

Reynold Xin Mon, 11 May 2015 01:46:27 -0700

In 1.4, you can use the "struct" function to create a struct: e.g., you can
explicitly select out the "version" column, and then wrap it in a new struct
named "settings".


The current semantics of select closely follow those of SQL in relational
databases, which are well understood and well defined. I wouldn't add any
magic to select for nested data, because that area is not very well
researched or defined, and we might get into conflicting scenarios.

On Sat, May 9, 2015 at 1:10 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Take a look:
>
> >>> df = sqlContext.jsonRDD(sc.parallelize(['{"settings": {"os": "OS X", "version": "10.10"}}']))
> >>> df.printSchema()
> root
>  |-- settings: struct (nullable = true)
>  |    |-- os: string (nullable = true)
>  |    |-- version: string (nullable = true)
> >>> # Now I want to "drop" the version column by
> >>> # selecting everything else.
> >>> # I want to preserve the schema otherwise.
> >>> # That means `os` should stay nested under
> >>> # `settings`.
> >>> df.select('settings.os').printSchema()
> root
>  |-- os: string (nullable = true)
> >>> df.select('settings', 'settings.os').printSchema()
> root
>  |-- settings: struct (nullable = true)
>  |    |-- os: string (nullable = true)
>  |    |-- version: string (nullable = true)
>  |-- os: string (nullable = true)
> >>> df.select(df['settings.os'].alias('settings.os')).printSchema()
> root
>  |-- settings.os: string (nullable = true)
>
> In all cases, selecting a nested field loses the original nesting of that
> field.
>
> What I want is to select settings.os and get back a DataFrame with the
> following schema:
>
> root
>  |-- settings: struct (nullable = true)
>  |    |-- os: string (nullable = true)
>
> In other words, I want to preserve the fact that os is nested under
> settings.
> I’m doing this as a workaround for the fact that PySpark does not
> currently support dropping columns.
>
> Until direct support for such a feature lands as part of SPARK-7509
> <https://issues.apache.org/jira/browse/SPARK-7509>, selecting all columns
> but the ones you want to drop seems way better than directly manipulating
> the schema (which is the hackier and way more complex alternative for
> rolling your own “drop” logic).
>
> And you want that process to preserve the schema as much as possible, which
> I assume is how a native “drop column” method would work.
>
> Is it possible though? Or do we have to do direct schema manipulation?
>
> Nick