Re: [CSV] Strategies to handle duplicate headers

Gary Gregory Wed, 21 Jun 2023 06:18:08 -0700

Well, maybe we should not have a postfix string method, that assumes a lot.
A default implementation of a function to convert all header names sounds
better.
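
A minimal sketch of what such a function might look like, assuming the
converter is simply a UnaryOperator over the list of header names. Every
name here (HeaderNames, IDENTITY, DEDUPLICATE, deduplicate) is hypothetical
and not part of the Commons CSV API. The default skips any suffix that
already appears elsewhere in the header, in the spirit of the pandas
behavior David describes below:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch only: none of these names exist in Commons CSV.
public final class HeaderNames {

    private HeaderNames() {
    }

    /** Leaves header names untouched, i.e. keeps the existing behavior. */
    public static final UnaryOperator<List<String>> IDENTITY = names -> names;

    /** Assumed default: de-duplicate with "." as the separator. */
    public static final UnaryOperator<List<String>> DEDUPLICATE = deduplicate(".");

    /** Builds a converter that renames repeated names to name + separator + N. */
    public static UnaryOperator<List<String>> deduplicate(String separator) {
        return names -> {
            // All names of the original header, so a generated name never
            // collides with a column that appears later (e.g. "A.1").
            Set<String> taken = new HashSet<>(names);
            Set<String> seen = new HashSet<>();
            Map<String, Integer> nextSuffix = new HashMap<>();
            List<String> result = new ArrayList<>(names.size());
            for (String name : names) {
                String candidate = name;
                if (!seen.add(name)) {
                    // Repeated name: try suffixes until one is free.
                    int n = nextSuffix.getOrDefault(name, 1);
                    do {
                        candidate = name + separator + n++;
                    } while (taken.contains(candidate));
                    nextSuffix.put(name, n);
                    taken.add(candidate);
                }
                result.add(candidate);
            }
            return result;
        };
    }

    public static void main(String[] args) {
        List<String> header = List.of("A", "A", "A.1", "C", "C", "C.1");
        System.out.println(DEDUPLICATE.apply(header));      // [A, A.2, A.1, C, C.2, C.1]
        System.out.println(deduplicate("_").apply(List.of("A", "A"))); // [A, A_1]
    }
}
```

Because the converter is an ordinary function, a user who cannot have dots
in header names could pass deduplicate("_") or a lambda of their own
instead of relying on a fixed postfix string.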


Gary

On Wed, Jun 21, 2023, 09:11 Gary Gregory <garydgreg...@gmail.com> wrote:

> So it is starting to sound like we need either to add to CSVFormat:
>
> - "duplicate header postfix string", or
> - deprecate duplicate header mode in favor of a duplicate header strategy
> which holds a duplicate header mode plus a duplicate header postfix string
> and some functional interface for custom processing...
>
> Gary
>
> On Wed, Jun 21, 2023, 08:00 David Dellsperger <david.dellsper...@gmail.com>
> wrote:
>
>> I've always had a big concern with this kind of behavior, because what
>> happens if the "new column" already exists but later in the header? It
>> seems like python/pandas deals with this by incrementing AGAIN, so they
>> read the header and THEN decide what to do with the values for duplicates
>> (makes sense).  The following CSV
>> A, A, A.1, C, C, C.1
>> 1, 2, 3, 4, 5, 6
>>
>> would lead to the headers of A, A.2, A.1, C, C.2, C.1 in python/pandas.
>>
>> I assume appending '.1' has fewer clashes than just appending '1' at the
>> end, which might be why pandas chose that path.  The idea is that you want
>> a strategy that clashes as little as possible when extending the names.
>>
>> David
>>
>> On Tue, Jun 20, 2023 at 11:24 PM Bruno Kinoshita <ki...@apache.org>
>> wrote:
>>
>> > Hi,
>> >
>> >
>> > > However, I could imagine situations where we define
>> > > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
>> > > normalization strategy. For example, dots in the headers break
>> > > ingesting the data in a third-party system. An interface could resolve
>> > > this, but I guess in such a scenario, they can also just opt for
>> > > another mode and normalize it themselves to bypass ours.
>> >
>> >
>> >  Good point. I think the only advantage of using dots is following the
>> > same pattern used in Python+Pandas, and also in the R base functions.
>> >
>> > # This is in R
>> >
>> > > read.csv('/tmp/1.csv')
>> >   A A.1 B B.1
>> > 1 1   2 3   4
>> > 2 a   b c   d
>> > >
>> >
>> > However, there are other R libraries that use underscore too (I think
>> > tidyverse does so). So users may have to normalize it themselves already
>> > when using different libraries in R.
>> >
>> > So I think we can use underscore or any other strategy to deduplicate
>> > column names, and allowing users to customize how names are repaired
>> > sounds good too, as long as we can find a good API for that.
>> >
>> > > With that in mind, appending the enum does make sense. I'd still be wary
>> > > about making it default behavior anytime soon, unless there's evidence
>> > > that deduplication is really what users expect.
>> > >
>> > +1
>> >
>> > > Something to consider though. We allow configuring the delimiter. I
>> > > think parsing would be fine, but it might introduce edge-cases for
>> > > printing if the delimiter and normalization strategy overlap. For
>> > > example, "A,A" becomes "A.1,A.2" but the delimiter is ".", effectively
>> > > making it "A.1.A.2". We'll need test cases for that.
>> > >
>> >
>> > I don't know if wrapping the column names with quotes would help in this
>> > case (i.e. "A.1"."A.2"), but definitely a good scenario for a test case,
>> > +1.
>> >
>> > -Bruno
>> >
>> > On Wed, 21 Jun 2023 at 02:12, Seth Falco <s...@falco.fun.invalid>
>> > wrote:
>> >
>> > > I don't have a strong enough opinion to conclude what's best.
>> > >
>> > > Giving it more thought, I think the interface approach I proposed is
>> > > overcomplicated tbh. I can't imagine needing another duplicate header
>> > > mode after this.
>> > >
>> > > However, I could imagine situations where we define
>> > > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
>> > > normalization strategy. For example, dots in the headers break
>> > > ingesting the data in a third-party system. An interface could resolve
>> > > this, but I guess in such a scenario, they can also just opt for
>> > > another mode and normalize it themselves to bypass ours.
>> > >
>> > > With that in mind, appending the enum does make sense. I'd still be
>> > > wary about making it default behavior anytime soon, unless there's
>> > > evidence that deduplication is really what users expect.
>> > >
>> > > Something to consider though. We allow configuring the delimiter. I
>> > > think parsing would be fine, but it might introduce edge-cases for
>> > > printing if the delimiter and normalization strategy overlap. For
>> > > example, "A,A" becomes "A.1,A.2" but the delimiter is ".", effectively
>> > > making it "A.1.A.2". We'll need test cases for that.
>> > >
>> > > PS: Sorry if this message goes through twice. It looked to me like the
>> > > email didn't go through the first time.
>> > >
>> > > On 2023/06/20 21:28:16 Gary Gregory wrote:
>> > > > That's clever. So we could implement a new enum value
>> > > > DuplicateHeaderMode.DEDUPLICATE...
>> > > >
>> > > > Gary
>> > > >
>> > > > On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita <ki...@apache.org> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > > Bruno says:
>> > > > > > "With Pandas it automatically deduplicates the column names.
>> > > > > > Maybe that's a feature that we could have in Commons CSV too?"
>> > > > > >
>> > > > > > What does that mean and actually do? Say I have column A with
>> > > > > > row 1 value of "X" and 2nd column A with row 1 value of 2. What
>> > > > > > do I get when I ask for column A row 1?
>> > > > > >
>> > > > >
>> > > > > When you ask for column A, you get the first column A with row 1
>> > > > > value of "X". Then Pandas renames the other A column as "A.1". If
>> > > > > you want to access rows in the second A column, then you will use
>> > > > > "A.1" as index.
>> > > > >
>> > > > > This is useful when you work with CSVs with many headers so that
>> > > > > you still have a valid name to use as index to access data, instead
>> > > > > of having to rely on the column index, for instance (or if you are
>> > > > > using other libraries that work with the column names, etc.)
>> > > > >
>> > > > > > As a first cut whatever we do could/should maintain the existing
>> > > > > > behavior. We can change the default later by popular demand.
>> > > > > >
>> > > > >
>> > > > > +1
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > Bruno
>> > > > >
>> > > > > On Tue, 20 Jun 2023 at 13:39, Gary Gregory <ga...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi All,
>> > > > > >
>> > > > > > This thread is a follow-up to
>> > > > > > https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258
>> > > > > >
>> > > > > > Bruno says:
>> > > > > > "With Pandas it automatically deduplicates the column names.
>> > > > > > Maybe that's a feature that we could have in Commons CSV too?"
>> > > > > >
>> > > > > > What does that mean and actually do? Say I have column A with
>> > > > > > row 1 value of "X" and 2nd column A with row 1 value of 2. What
>> > > > > > do I get when I ask for column A row 1?
>> > > > > >
>> > > > > > Seth says:
>> > > > > > "HeaderStrategy Interface
>> > > > > > Contains two functions:
>> > > > > >
>> > > > > > #normalizeHeaders(headings) - With given headings, output a list
>> > > > > > that fits with whatever the strategy is going for.
>> > > > > > #get(record, header) - Fetch value(s) based on given column name."
>> > > > > >
>> > > > > > I would see perhaps two interfaces so that lambdas might be used
>> > > > > > more simply. Maybe, needs an example.
>> > > > > >
>> > > > > > "I'm also wary that this may screw up existing projects that
>> > > > > > depend on allowing/disallowing duplicates. i.e. want to allow
>> > > > > > duplicates and handle things through indexes / iteration, so this
>> > > > > > didn't cause a problem for them and want to preserve header names,
>> > > > > > and so don't need the headers deduplicated."
>> > > > > >
>> > > > > > As a first cut whatever we do could/should maintain the existing
>> > > > > > behavior. We can change the default later by popular demand.
>> > > > > >
>> > > > > > Gary
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > --
>> > > GitHub: https://github.com/SethFalco
>> > > Fediverse <https://en.wikipedia.org/wiki/Fediverse>: @se...@fosstodon.org
>> > > <https://fosstodon.org/@sethi>
>> > > LinkedIn: https://www.linkedin.com/in/sethfalco/
>> > >
>> >
>>
>
