Re: [CSV] Strategies to handle duplicate headers

Gary Gregory Wed, 21 Jun 2023 06:12:38 -0700

So it is starting to sound like we need to either:

- add a "duplicate header postfix string" to CSVFormat, or
- deprecate the duplicate header mode in favor of a duplicate header strategy
  that holds a duplicate header mode plus a duplicate header postfix string
  and some functional interface for custom processing...
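
For illustration only, here is a minimal Java sketch of what the second
option could look like. Everything in it is hypothetical:
DuplicateHeaderSketch, HeaderRenamer, postfix(...) and deduplicate(...) are
names invented for this example and are not part of Commons CSV. It pairs a
postfix string with a functional interface for custom renaming, and it skips
any generated name that already appears elsewhere in the header, which is
the clash David describes below.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names exist in Commons CSV. */
public final class DuplicateHeaderSketch {

    /** Hypothetical hook: builds a new name for the n-th repeat of a header. */
    @FunctionalInterface
    public interface HeaderRenamer {
        String rename(String header, int occurrence);
    }

    /** Renamer that appends a postfix string plus the counter, e.g. "A" -> "A.1". */
    public static HeaderRenamer postfix(String postfix) {
        return (header, occurrence) -> header + postfix + occurrence;
    }

    /**
     * Returns the headers with repeats renamed. The renamer must return a
     * different name for each occurrence number, otherwise the loop below
     * cannot terminate.
     */
    public static List<String> deduplicate(List<String> headers, HeaderRenamer renamer) {
        // Names a rename must avoid: all original names plus names generated so far.
        Set<String> taken = new HashSet<>(headers);
        // Original names that have already been emitted once.
        Set<String> seen = new HashSet<>();
        // Last counter used per repeated header, so numbering continues from there.
        Map<String, Integer> counters = new HashMap<>();
        List<String> result = new ArrayList<>(headers.size());
        for (String header : headers) {
            if (seen.add(header)) {
                result.add(header); // first occurrence keeps its name
                continue;
            }
            int n = counters.getOrDefault(header, 0);
            String candidate;
            do {
                n++;
                candidate = renamer.rename(header, n);
            } while (!taken.add(candidate)); // name already in use: try the next counter
            counters.put(header, n);
            result.add(candidate);
        }
        return result;
    }
}
```

With postfix("."), the header A, A, A.1, C, C, C.1 comes out as
A, A.2, A.1, C, C.2, C.1. Passing a lambda such as (h, n) -> h + "_" + n
instead would switch to underscores without touching the mode.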


Gary

On Wed, Jun 21, 2023, 08:00 David Dellsperger <[email protected]>
wrote:

> I've always had a big concern with this kind of behavior, because what
> happens if the "new column" already exists but later in the header? It
> seems like python/pandas deals with this by incrementing AGAIN, so they
> read the header and THEN decide what to do with the values for duplicates
> (makes sense).  The following CSV
> A, A, A.1, C, C, C.1
> 1, 2, 3, 4, 5, 6
>
> would lead to the headers of A, A.2, A.1, C, C.2, C.1 in python/pandas.
>
> I assume appending '.1' has fewer clashes than just appending '1' at the
> end and might be why pandas chose that path.  The idea is that you want a
> strategy with as few clashes as possible when it comes to extending the
> names.
>
> David
>
> On Tue, Jun 20, 2023 at 11:24 PM Bruno Kinoshita <[email protected]> wrote:
>
> > Hi,
> >
> >
> > > However, I could imagine situations where we define
> > > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
> > > normalization strategy. For example, dots in the headers break ingesting
> > > the data in a third-party system. An interface could resolve this, but I
> > > guess in such a scenario, they can also just opt for another mode and
> > > normalize it themselves to bypass ours.
> >
> >
> >  Good point. I think the only advantage of using dots is following the same
> > pattern used in Python+Pandas, and also in the R base functions.
> >
> > # This is in R
> >
> > > read.csv('/tmp/1.csv')
> >   A A.1 B B.1
> > 1 1   2 3   4
> > 2 a   b c   d
> > >
> >
> > However, there are other R libraries that use underscore too (I think
> > tidyverse does so). So users may have to normalize it themselves already
> > when using different libraries in R.
> >
> > So I think we can use underscore or any other strategy to deduplicate
> > column names, and allowing users to customize how names are repaired sounds
> > good too, as long as we can find a good API for that.
> >
> > > With that in mind, appending the enum does make sense. I'd still be wary
> > > about making it default behavior anytime soon, unless there's evidence that
> > > deduplication is really what users expect.
> > >
> > +1
> >
> > > Something to consider though. We allow configuring the delimiter. I think
> > > parsing would be fine, but it might introduce edge-cases for printing if
> > > the delimiter and normalization strategy overlap. For example, "A,A"
> > > becomes "A.1,A.2" but the delimiter is ".", effectively making it
> > > "A.1.A.2". We'll need test cases for that.
> > >
> >
> > I don't know if wrapping the column names with quotes would help in this
> > case (i.e. "A.1"."A.2"), but definitely a good scenario for a test case,
> > +1.
> >
> > -Bruno
> >
> > On Wed, 21 Jun 2023 at 02:12, Seth Falco <[email protected]> wrote:
> >
> > > I don't have a strong enough opinion to conclude what's best.
> > >
> > > Giving it more thought, I think the interface approach I proposed is
> > > overcomplicated tbh. I can't imagine needing another duplicate header mode
> > > after this.
> > >
> > > However, I could imagine situations where we define
> > > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
> > > normalization strategy. For example, dots in the headers break ingesting
> > > the data in a third-party system. An interface could resolve this, but I
> > > guess in such a scenario, they can also just opt for another mode and
> > > normalize it themselves to bypass ours.
> > >
> > > With that in mind, appending the enum does make sense. I'd still be wary
> > > about making it default behavior anytime soon, unless there's evidence that
> > > deduplication is really what users expect.
> > >
> > > Something to consider though. We allow configuring the delimiter. I think
> > > parsing would be fine, but it might introduce edge-cases for printing if
> > > the delimiter and normalization strategy overlap. For example, "A,A"
> > > becomes "A.1,A.2" but the delimiter is ".", effectively making it
> > > "A.1.A.2". We'll need test cases for that.
> > >
> > > PS: Sorry if this message goes through twice. Looked to me that the email
> > > didn't go through the first time.
> > >
> > > On 2023/06/20 21:28:16 Gary Gregory wrote:
> > > > That's clever. So we could implement a new enum value
> > > > DuplicateHeaderMode.DEDUPLICATE...
> > > >
> > > > Gary
> > > >
> > > > On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > Bruno says:
> > > > > > "With Pandas it automatically deduplicates the column names. Maybe
> > > > > > that's a feature that we could have in Commons CSV too?"
> > > > > >
> > > > > > What does that mean and actually do? Say I have column A with row 1
> > > > > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > > > > when I ask for column A row 1?
> > > > > >
> > > > >
> > > > > When you ask for column A, you get the first column A with row 1 value of
> > > > > "X". Then Pandas renames the other A column as "A.1". If you want to access
> > > > > rows in the second A column, then you will use "A.1" as index.
> > > > >
> > > > > This is useful when you work with CSVs with many headers so that you still
> > > > > have a valid name to use as index to access data, instead of having to rely
> > > > > on the column index, for instance (or if you are using other libraries that
> > > > > work with the column names, etc.)
> > > > >
> > > > > > As a first cut whatever we do could/should maintain the existing
> > > > > > behavior. We can change the default later by popular demand.
> > > > > >
> > > > >
> > > > > +1
> > > > >
> > > > > Cheers
> > > > >
> > > > > Bruno
> > > > >
> > > > > On Tue, 20 Jun 2023 at 13:39, Gary Gregory <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > This thread is a follow-up to
> > > > > > https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258
> > > > > >
> > > > > > Bruno says:
> > > > > > "With Pandas it automatically deduplicates the column names. Maybe
> > > > > > that's a feature that we could have in Commons CSV too?"
> > > > > >
> > > > > > What does that mean and actually do? Say I have column A with row 1
> > > > > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > > > > when I ask for column A row 1?
> > > > > >
> > > > > > Seth says:
> > > > > > "HeaderStrategy Interface
> > > > > > Contains two functions:
> > > > > >
> > > > > > #normalizeHeaders(headings) - With given heading, output a list that
> > > > > > fits with whatever the strategy is going for.
> > > > > > #get(record, header) - Fetch value(s) based on given column name."
> > > > > >
> > > > > > I would see perhaps two interfaces so that lambdas might be used more
> > > > > > simply. Maybe, needs an example.
> > > > > >
> > > > > > "I'm also wary that this may screw up existing projects that depend on
> > > > > > allowing/disallowing duplicates. i.e. want to allow duplicates and
> > > > > > handle things through indexes / iteration, so this didn't cause a
> > > > > > problem for them and want to preserve header names, and so don't need
> > > > > > the headers deduplicated."
> > > > > >
> > > > > > As a first cut whatever we do could/should maintain the existing
> > > > > > behavior. We can change the default later by popular demand.
> > > > > >
> > > > > > Gary
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > --
> > > GitHub: https://github.com/SethFalco
> > > Fediverse <https://en.wikipedia.org/wiki/Fediverse>: @
> > [email protected]
> > > <https://fosstodon.org/@sethi>
> > > LinkedIn: https://www.linkedin.com/in/sethfalco/
> > >
> >
>
