Hi all,

In PySpark, a DataFrame column can be referenced using either df["abcd"]
(__getitem__) or df.abcd (__getattr__). There is a discussion on
SPARK-7035 about compatibility issues with the __getattr__ approach, and
I want to collect more input on this.

Basically, if we introduce a new method to DataFrame in the future, it
may break user code that uses the same attribute name to reference a
column, or silently change its behavior. For example, if we add name()
to DataFrame in the next release, all existing code using `df.name` to
reference a column called "name" will break, because `df.name` will
resolve to the method and __getattr__ will never be called. If we add
`name` as a property instead of a method, all existing code using
`df.name` may still work but with a different meaning:
`df.select(df.name)` no longer selects the column called "name" but the
column whose name is the value of `df.name`.
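
To make the failure mode concrete, here is a minimal sketch in plain
Python. ToyFrame, ToyFrameV2 and Column are hypothetical stand-ins, not
the real PySpark classes, and the name() method is the hypothetical
addition from the example above. The sketch only relies on the fact that
Python calls __getattr__ when normal attribute lookup fails.

```python
class Column:
    """Hypothetical stand-in for a PySpark Column."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return "Column(%r)" % self.name


class ToyFrame:
    """Hypothetical stand-in for DataFrame, not the real PySpark class."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getitem__(self, key):
        # df["abcd"]: always a column lookup, never shadowed by methods.
        if key not in self._columns:
            raise KeyError(key)
        return Column(key)

    def __getattr__(self, key):
        # df.abcd: reached only when normal attribute lookup fails.
        if key.startswith("_"):
            raise AttributeError(key)
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


class ToyFrameV2(ToyFrame):
    """The same class after a hypothetical release adds name()."""

    def name(self):
        return "my_table"


old = ToyFrame(["name", "age"])
new = ToyFrameV2(["name", "age"])

print(old.name)     # a Column: __getattr__ handled the lookup
print(new.name)     # a bound method: the column is now shadowed
print(new["name"])  # still a Column: __getitem__ is unaffected
```

User code written against ToyFrame that passes `df.name` where a column
is expected would receive a bound method after the upgrade, while
`df["name"]` keeps working.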

There are several proposed solutions:

1. Keep both df.abcd and df["abcd"], and encourage users to use the
latter, which is future-proof. This is the current solution in master
(https://github.com/apache/spark/pull/5971). But I think users may
still be unaware of the compatibility issue and prefer `df.abcd` to
`df["abcd"]` because the former can be auto-completed.
2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
JIRA page: "I actually dragged my feet on the __getattr__ issue for
several months back in the day, then finally added it (and tab
completion in IPython with __dir__), and immediately noticed a huge
quality-of-life improvement when using pandas for actual (esp.
interactive) work."
3. Replace df.abcd with df.abcd_ (with a suffix "_"). Both df.abcd_ and
df["abcd"] would be future-proof, and df.abcd_ could be
auto-completed. The obvious tradeoff is the extra "_" appearing in
the code.
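
As a rough illustration of option 3, the sketch below resolves only
attribute names that end in "_" to columns and advertises them through
__dir__ so that tab completion can find them. SuffixFrame and its name()
method are hypothetical, columns are represented by plain strings to
keep the example short, and this is only one way the idea could be
implemented, not a proposed patch. It assumes DataFrame methods never
end with "_".

```python
class SuffixFrame:
    """Hypothetical stand-in for DataFrame using the "_" suffix rule."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getitem__(self, key):
        if key not in self._columns:
            raise KeyError(key)
        return key  # a plain string stands in for a Column object

    def __getattr__(self, attr):
        # Reached only when normal attribute lookup fails.
        if attr.startswith("_"):
            raise AttributeError(attr)
        # Only names of the form "<column>_" map to columns.
        if attr.endswith("_") and attr[:-1] in self._columns:
            return self[attr[:-1]]
        raise AttributeError(attr)

    def __dir__(self):
        # Expose "<column>_" names so interactive completion lists them.
        suffixed = {c + "_" for c in self._columns}
        return sorted(set(super().__dir__()) | suffixed)

    def name(self):
        # A method added later no longer collides with the "name" column.
        return "my_table"


df = SuffixFrame(["name", "age"])
print(df.name_)           # the "name" column
print(df.name())          # the method, unaffected by the column
print("age_" in dir(df))  # completion candidates include "age_"
```

With this rule, adding a method such as name() cannot shadow `df.name_`,
as long as the assumption about method names holds.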

My preference is 3 > 1 > 2. Your input would be greatly appreciated. Thanks!

Best,
Xiangrui
