In a sense, field names in Arrow schemas are "just data". Whether such data is valid depends a great deal on the use case -- for example, pandas supports duplicate column names (to its own hardship, admittedly) while most SQL systems do not.
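As a concrete illustration of the hardship mentioned above (a minimal sketch, not from the original thread): when a pandas DataFrame carries duplicate labels, selecting a column by name no longer yields a single Series.

```python
import pandas as pd

# pandas tolerates duplicate column labels
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# Selecting "a" is ambiguous, so pandas returns BOTH columns
# as a DataFrame rather than a single Series
sub = df["a"]
print(type(sub).__name__)  # DataFrame
print(sub.shape)           # (2, 2)
```

Every downstream consumer that assumes "one label, one column" has to special-case this.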
Sadly, duplicate column names do sometimes occur in CSV files; if we disallow duplicates, then ingesting such data into Arrow format forces a resolution of the issue at every ingestion point. That could scatter data validation logic across many data producers rather than keeping it centralized in the business logic of particular data processing engines (which may or may not have a problem with duplicates).

My view on these things generally is that the Arrow format and metadata should be as agnostic as possible to the semantics. Interested to see what others think, though.

- Wes

On Tue, Jan 30, 2018 at 3:10 PM, Phillip Cloud <cpcl...@gmail.com> wrote:
> I'm working on ARROW-1974
> <https://issues.apache.org/jira/browse/ARROW-1974> right
> now, and it's turning out to be quite complex due to both Arrow and Parquet
> allowing duplicate columns. Apparently you can also write duplicate column
> names to Parquet by way of Spark.
>
> In my opinion, allowing duplicate columns leads to a lot of unnecessary
> complexity. Pandas allows this, and there are lots of hacks and heuristics
> to make it work. For example, if I ask for the "a" column in a Parquet
> file, which one do I mean?
>
> I'm not convinced there are use cases that justify the additional
> complexity, but I am definitely willing to be convinced.
>
> Are there any use cases that justify the additional complexity?
>
> If not, I propose that we disallow them in the Arrow spec and implement
> this behavior in all supported languages.
>
> -Phillip