In a sense, field names in Arrow schemas are "just data". Whether such data is valid depends a great deal on the use case -- for example, pandas supports duplicate column names (to its own hardship, admittedly) while most SQL systems do not.
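As a concrete illustration of the hardship mentioned above (a minimal sketch, not from the original thread): when a pandas DataFrame carries duplicate labels, selecting a column by name no longer yields a single Series.

```python
import pandas as pd

# pandas tolerates duplicate column labels
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# Selecting "a" is ambiguous, so pandas returns BOTH columns
# as a DataFrame rather than a single Series
sub = df["a"]
print(type(sub).__name__)  # DataFrame
print(sub.shape)           # (2, 2)
```

Every downstream consumer that assumes "one label, one column" has to special-case this.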
Sadly, duplicate column names do sometimes occur in CSV files; if we disallow duplicates, then ingesting such data into Arrow format forces a resolution of the issue at every ingestion point. That could scatter data validation logic across many data producers rather than keeping it centralized in the business logic of particular data processing engines (which may or may not have a problem with duplicates).

My view on these things generally is that the Arrow format and metadata should be as agnostic as possible to the semantics. Interested to see what others think, though.

- Wes

On Tue, Jan 30, 2018 at 3:10 PM, Phillip Cloud <cpcl...@gmail.com> wrote:
> I'm working on ARROW-1974
> <https://issues.apache.org/jira/browse/ARROW-1974> right
> now, and it's turning out to be quite complex due to both Arrow and Parquet
> allowing duplicate columns. Apparently you can also write duplicate column
> names to Parquet by way of Spark.
>
> In my opinion, allowing duplicate columns leads to a lot of unnecessary
> complexity. Pandas allows this, and there are lots of hacks and heuristics
> to make it work. For example, if I ask for the "a" column in a Parquet
> file, which one do I mean?
>
> I'm not convinced there are use cases that justify the additional
> complexity, but I am definitely willing to be convinced.
>
> Are there any use cases that justify the additional complexity?
>
> If not, I propose that we disallow them in the Arrow spec and implement
> this behavior in all supported languages.
>
> -Phillip