Re: [CSV] Strategies to handle duplicate headers

David Dellsperger Wed, 21 Jun 2023 05:00:38 -0700

I've always had a big concern with this kind of behavior: what happens
if the "new column" name already exists later in the header? It seems
python/pandas deals with this by incrementing AGAIN, so it reads the
whole header and THEN decides how to rename the duplicates (makes
sense). The following CSV

A, A, A.1, C, C, C.1
1, 2, 3, 4, 5, 6

would lead to the headers A, A.2, A.1, C, C.2, C.1 in python/pandas.

I assume appending '.1' clashes with existing names less often than just
appending '1', which might be why pandas chose that path. The idea is
that you want a strategy that produces as few clashes as possible when
extending the names.
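
As a rough illustration of the "read the whole header first, then rename"
idea, here is a minimal Python sketch. It is not pandas' actual
implementation: the function name dedupe_headers and its sep parameter are
made up for this example. It reserves every name found in the header up
front, then gives each repeat the first suffix that is still free:

```python
def dedupe_headers(headers, sep="."):
    """Return a copy of `headers` in which every name is unique.

    Hypothetical helper for illustration only (not a pandas or
    Commons CSV API). All names present in the original header are
    reserved first, so a generated name can never collide with a
    column that appears later in the header.
    """
    taken = set(headers)   # names that must not be generated
    seen = set()           # names already emitted
    next_suffix = {}       # per-name counter for the next suffix to try
    result = []
    for name in headers:
        if name not in seen:
            # First occurrence keeps its original name.
            seen.add(name)
            result.append(name)
            continue
        # Repeat: increment until the candidate is free.
        n = next_suffix.get(name, 1)
        candidate = f"{name}{sep}{n}"
        while candidate in taken:
            n += 1
            candidate = f"{name}{sep}{n}"
        next_suffix[name] = n + 1
        taken.add(candidate)
        seen.add(candidate)
        result.append(candidate)
    return result


print(dedupe_headers(["A", "A", "A.1", "C", "C", "C.1"]))
# ['A', 'A.2', 'A.1', 'C', 'C.2', 'C.1']
```

Passing a different sep (for example "_") only changes how the new names
are built; the clash check stays the same, so the separator becomes a
matter of what downstream systems accept.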

David

On Tue, Jun 20, 2023 at 11:24 PM Bruno Kinoshita <ki...@apache.org> wrote:

> Hi,
>
>
> > However, I could imagine situations where we define
> > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
> > normalization strategy. For example, dots in the headers breaks ingesting
> > the data in a third-party system. An interface could resolve this, but I
> > guess in such a scenario, they can also just opt for another mode and
> > normalize it themselves to bypass ours.
>
>
>  Good point. I think the only advantage of using dots is following the same
> pattern used in Python+Pandas, and also in the R base functions.
>
> # This is in R
>
> > read.csv('/tmp/1.csv')
>   A A.1 B B.1
> 1 1   2 3   4
> 2 a   b c   d
> >
>
> However, there are other R libraries that use underscore too (I think
> tidyverse does so). So users may have to normalize it themselves already
> when using different libraries in R.
>
> So I think we can use underscore or any other strategy to deduplicate
> column names, and allowing users to customize how names are repaired sounds
> good too, as long as we can find a good API for that.
>
> > With that in mind, appending the enum does make sense. I'd still be wary
> > about making it default behavior anytime soon, unless there's evidence that
> > deduplication is really what users expect.
> >
> +1
>
> > Something to consider though. We allow configuring the delimiter. I think
> > parsing would be fine, but it might introduce edge-cases for printing if
> > the delimiter and normalization strategy overlap. For example, "A,A"
> > becomes "A.1,A.2" but the delimiter is ".", effectively making it
> > "A.1.A.2". We'll need test cases for that.
> >
>
> I don't know if wrapping the column names in quotes would help in this
> case (i.e. "A.1"."A.2"), but it's definitely a good scenario for a test case,
> +1.
>
> -Bruno
>
> On Wed, 21 Jun 2023 at 02:12, Seth Falco <s...@falco.fun.invalid> wrote:
>
> > I don't have a strong enough opinion to conclude what's best.
> >
> > Giving it more thought, I think the interface approach I proposed is
> > overcomplicated tbh. I can't imagine needing another duplicate header mode
> > after this.
> >
> > However, I could imagine situations where we define
> > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
> > normalization strategy. For example, dots in the headers breaks ingesting
> > the data in a third-party system. An interface could resolve this, but I
> > guess in such a scenario, they can also just opt for another mode and
> > normalize it themselves to bypass ours.
> >
> > With that in mind, appending the enum does make sense. I'd still be wary
> > about making it default behavior anytime soon, unless there's evidence that
> > deduplication is really what users expect.
> >
> > Something to consider though. We allow configuring the delimiter. I think
> > parsing would be fine, but it might introduce edge-cases for printing if
> > the delimiter and normalization strategy overlap. For example, "A,A"
> > becomes "A.1,A.2" but the delimiter is ".", effectively making it
> > "A.1.A.2". We'll need test cases for that.
> >
> > PS: Sorry if this message goes through twice. Looked to me that the email
> > didn't go through the first time.
> >
> > On 2023/06/20 21:28:16 Gary Gregory wrote:
> > > That's clever. So we could implement a new enum value
> > > DuplicateHeaderMode.DEDUPLICATE...
> > >
> > > Gary
> > >
> > > > On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita <ki...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > > > Bruno says:
> > > > > "With Pandas it automatically deduplicates the column names. Maybe
> > > > > that's a feature that we could have in Commons CSV too?"
> > > > >
> > > > > What does that mean and actually do? Say I have column A with row 1
> > > > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > > > when I ask for column A row 1?
> > > > >
> > > >
> > > > > When you ask for column A, you get the first column A with row 1 value of
> > > > > "X". Then Pandas renames the other A column as "A.1". If you want to access
> > > > > rows in the second A column, then you will use "A.1" as index.
> > > > >
> > > > > This is useful when you work with CSVs with many headers so that you still
> > > > > have a valid name to use as index to access data, instead of having to rely
> > > > > on the column index, for instance (or if you are using other libraries that
> > > > > work with the column names, etc.)
> > > >
> > > > > > As a first cut whatever we do could/should maintain the existing
> > > > > behavior. We can change the default later by popular demand.
> > > > >
> > > >
> > > > +1
> > > >
> > > > Cheers
> > > >
> > > > Bruno
> > > >
> > > > > On Tue, 20 Jun 2023 at 13:39, Gary Gregory <ga...@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This thread is a follow-up to
> > > > >
> > > > > > https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258
> > > > >
> > > > > Bruno says:
> > > > > "With Pandas it automatically deduplicates the column names. Maybe
> > > > > that's a feature that we could have in Commons CSV too?"
> > > > >
> > > > > What does that mean and actually do? Say I have column A with row 1
> > > > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > > > when I ask for column A row 1?
> > > > >
> > > > > Seth says:
> > > > > "HeaderStrategy Interface
> > > > > Contains two functions:
> > > > >
> > > > > > #normalizeHeaders(headings) - With given heading, output a list that
> > > > > > fits with whatever the strategy is going for.
> > > > > #get(record, header) - Fetch value(s) based on given column name."
> > > > >
> > > > > > I would see perhaps two interfaces so that lambdas might be used more
> > > > > > simply. Maybe, needs an example.
> > > > >
> > > > > "I'm also wary that this may screw up existing projects that depend
> > on
> > > > > allowing/disallowing duplicates. i.e. want to allow duplicates and
> > > > > handle things through indexes / iteration, so this didn't cause a
> > > > > problem for them and want to preserve header names, and so don't
> need
> > > > > the headers deduplicated."
> > > > >
> > > > > As a first cut whatever we do could/should maintain the existing
> > > > > behavior. We can change the default later by popular demand.
> > > > >
> > > > > Gary
> > > > >
> > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > > For additional commands, e-mail: dev-h...@commons.apache.org
> > > > >
> > > > >
> > > >
> > >
> > --
> > GitHub: https://github.com/SethFalco
> > Fediverse <https://en.wikipedia.org/wiki/Fediverse>: @se...@fosstodon.org
> > <https://fosstodon.org/@sethi>
> > LinkedIn: https://www.linkedin.com/in/sethfalco/
> >
>
