RE: [CSV] Strategies to handle duplicate headers

Seth Falco Tue, 20 Jun 2023 17:12:17 -0700

I don't have a strong enough opinion to conclude what's best.

Giving it more thought, I think the interface approach I proposed is overcomplicated tbh. I can't imagine needing another duplicate header mode after this.

However, I could imagine situations where we define DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our normalization strategy. For example, dots in the headers might break ingestion of the data in a third-party system. An interface could resolve this, but I guess in such a scenario, they can also just opt for another mode and normalize the headers themselves to bypass ours.

With that in mind, appending the enum does make sense. I'd still be wary about making it default behavior anytime soon, unless there's evidence that deduplication is really what users expect.
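To make the deduplication idea concrete, here is a minimal sketch in plain Java. It is not Commons CSV code: the class name HeaderDedupSketch, the method deduplicate, and the separator parameter are all hypothetical, and the "name, name.1, name.2" numbering is only an assumption modelled on the Pandas behavior Bruno describes below. It needs Java 9+ for List.of.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HeaderDedupSketch {

    /**
     * Hypothetical helper: keeps the first occurrence of each header as-is and
     * renames later duplicates to header + separator + counter.
     */
    static List<String> deduplicate(List<String> headers, String separator) {
        Set<String> used = new HashSet<>();
        Map<String, Integer> counters = new HashMap<>();
        List<String> result = new ArrayList<>(headers.size());
        for (String header : headers) {
            String candidate = header;
            // Keep bumping the counter until the generated name is unused,
            // so a literal "A.1" column in the input cannot collide.
            while (!used.add(candidate)) {
                int next = counters.merge(header, 1, Integer::sum);
                candidate = header + separator + next;
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(deduplicate(List.of("A", "A", "B", "A"), ".")); // [A, A.1, B, A.2]
        System.out.println(deduplicate(List.of("A", "A"), "_"));           // [A, A_1]
    }
}
```

The separator is a parameter here only to illustrate the "user isn't satisfied with our normalization" case; a user who picks another mode could run something like this on the raw header row themselves.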

Something to consider though: we allow configuring the delimiter. I think parsing would be fine, but it might introduce edge cases for printing if the delimiter and normalization strategy overlap. For example, "A,A" becomes "A.1,A.2", but if the delimiter is ".", the printed header row is effectively "A.1.A.2". We'll need test cases for that.
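A rough illustration of that clash, again in plain Java rather than through the Commons CSV printer (whose quoting may well handle this case; I haven't checked). The class and method names are hypothetical, and the join is deliberately naive and unquoted.

```java
import java.util.List;
import java.util.regex.Pattern;

public class DelimiterClashSketch {

    /** Naive, unquoted join of a header row; for illustration only. */
    static String joinUnquoted(List<String> headers, String delimiter) {
        return String.join(delimiter, headers);
    }

    /** Counts the fields a naive reader would see when splitting on the delimiter. */
    static int countFields(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1).length;
    }

    public static void main(String[] args) {
        // Two headers after a dot-based deduplication of "A,A".
        List<String> deduplicated = List.of("A.1", "A.2");

        String withComma = joinUnquoted(deduplicated, ",");
        System.out.println(withComma + " -> " + countFields(withComma, ",") + " fields"); // A.1,A.2 -> 2 fields

        String withDot = joinUnquoted(deduplicated, ".");
        System.out.println(withDot + " -> " + countFields(withDot, ".") + " fields");     // A.1.A.2 -> 4 fields
    }
}
```

A test case along these lines would assert that printing and then re-parsing a deduplicated header row with "." as the delimiter still yields two columns.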

PS: Sorry if this message goes through twice. Looked to me that the email didn't go through the first time.


On 2023/06/20 21:28:16 Gary Gregory wrote:

> That's clever. So we could implement a new enum value
> DuplicateHeaderMode.DEDUPLICATE...
>
> Gary
>
> On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita <[email protected]> wrote:
>
> > Hi,
> >
> > > Bruno says:
> > > "With Pandas it automatically deduplicates the column names. Maybe
> > > that's a feature that we could have in Commons CSV too?"
> > >
> > > What does that mean and actually do? Say I have column A with row 1
> > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > when I ask for column A row 1?
> > >
> >
> > When you ask for column A, you get the first column A with row 1 value of
> > "X". Then Pandas renames the other A column as "A.1". If you want to access
> > rows in the second A column, then you will use "A.1" as index.
> >
> > This is useful when you work with CSV's with many headers so that you still
> > have a valid name to use as index to access data, instead of having to rely
> > on the column index, for instance (or if you are using other libraries that
> > work with the column names, etc.)
> >
> > > As a first cut whatever we do could/should maintain the existing
> > > behavior. We can change the default later by popular demand.
> > >
> >
> > +1
> >
> > Cheers
> >
> > Bruno
> >
> > On Tue, 20 Jun 2023 at 13:39, Gary Gregory <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > This thread is a follow-up to
> > > https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258
> > >
> > > Bruno says:
> > > "With Pandas it automatically deduplicates the column names. Maybe
> > > that's a feature that we could have in Commons CSV too?"
> > >
> > > What does that mean and actually do? Say I have column A with row 1
> > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > when I ask for column A row 1?
> > >
> > > Seth says:
> > > "HeaderStrategy Interface
> > > Contains two functions:
> > >
> > > #normalizeHeaders(headings) - With given heading, output a list that
> > > fits with whatever the strategy is going for.
> > > #get(record, header) - Fetch value(s) based on given column name."
> > >
> > > I would see perhaps two interfaces so that lambdas might be used more
> > > simply. Maybe, needs an example.
> > >
> > > "I'm also wary that this may screw up existing projects that depend on
> > > allowing/disallowing duplicates. i.e. want to allow duplicates and
> > > handle things through indexes / iteration, so this didn't cause a
> > > problem for them and want to preserve header names, and so don't need
> > > the headers deduplicated."
> > >
> > > As a first cut whatever we do could/should maintain the existing
> > > behavior. We can change the default later by popular demand.
> > >
> > > Gary
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
> >
>
--
GitHub: https://github.com/SethFalco

Fediverse <https://en.wikipedia.org/wiki/Fediverse>: @[email protected] <https://fosstodon.org/@sethi>

LinkedIn: https://www.linkedin.com/in/sethfalco/

