So it is starting to sound like we need to either:

- add a "duplicate header postfix string" to CSVFormat, or
- deprecate the duplicate header mode in favor of a duplicate header
  strategy, which holds a duplicate header mode plus a duplicate header
  postfix string and some functional interface for custom processing...
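
To make the second option a bit more concrete, here is a rough sketch of what a renaming strategy with a pluggable functional interface could look like. Everything in it is hypothetical: HeaderDedupSketch, DuplicateHeaderRenamer and deduplicate are names made up for illustration and are not part of Commons CSV. It assumes the pandas-style behavior described further down the thread, where the whole header is read first so that a generated name never clashes with a column that appears later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: none of these types exist in Commons CSV.
final class HeaderDedupSketch {

    /** Hypothetical hook: builds a candidate name for the n-th duplicate of a header (n starts at 1). */
    @FunctionalInterface
    interface DuplicateHeaderRenamer {
        String rename(String header, int occurrence);
    }

    /**
     * Renames repeated headers so that every returned name is unique.
     * The first occurrence of a name is kept as-is; later occurrences are
     * renamed, skipping any candidate that already exists in the header.
     */
    static List<String> deduplicate(List<String> headers, DuplicateHeaderRenamer renamer) {
        // Reserve every raw name up front so a generated name cannot
        // collide with a column that only shows up later in the header.
        Set<String> taken = new HashSet<>(headers);
        Set<String> emitted = new HashSet<>();
        Map<String, Integer> counters = new HashMap<>();
        List<String> result = new ArrayList<>(headers.size());

        for (String header : headers) {
            if (emitted.add(header)) {
                // First time this name is emitted: keep it unchanged.
                result.add(header);
                continue;
            }
            // Duplicate: keep incrementing until the candidate is free.
            int n = counters.getOrDefault(header, 0);
            String candidate;
            do {
                n++;
                candidate = renamer.rename(header, n);
            } while (taken.contains(candidate));
            counters.put(header, n);
            taken.add(candidate);
            emitted.add(candidate);
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("A", "A", "A.1", "C", "C", "C.1");
        // A "postfix string" of "." expressed as a lambda.
        System.out.println(deduplicate(headers, (h, n) -> h + "." + n));
        // prints [A, A.2, A.1, C, C.2, C.1]
    }
}
```

A fixed postfix string is then just one lambda, for example (h, n) -> h + "_" + n for underscores, while anyone who needs something more elaborate can supply their own implementation.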

Gary

On Wed, Jun 21, 2023, 08:00 David Dellsperger <david.dellsper...@gmail.com> wrote:

> I've always had a big concern with this kind of behavior, because what
> happens if the "new column" already exists but later in the header? It
> seems like python/pandas deals with this by incrementing AGAIN, so they
> read the header and THEN decide what to do with the values for
> duplicates (make sense). The following CSV
>
> A, A, A.1, C, C, C.1
> 1, 2, 3, 4, 5, 6
>
> would lead to the headers of A, A.2, A.1, C, C.2, C.1 in python/pandas.
>
> I assume appending '.1' has fewer clashes than just appending '1' at the
> end and might be why pandas chose that path. Idea would be you want a
> strategy that would have as little clash as possible when it comes to
> extending the names
>
> David
>
> On Tue, Jun 20, 2023 at 11:24 PM Bruno Kinoshita <ki...@apache.org> wrote:
>
> > Hi,
> >
> > > However, I could imagine situations where we define
> > > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
> > > normalization strategy. For example, dots in the headers breaks
> > > ingesting the data in a third-party system. An interface could
> > > resolve this, but I guess in such a scenario, they can also just opt
> > > for another mode and normalize it themselves to bypass ours.
> >
> > Good point. I think the only advantage of using dots is following the
> > same pattern used in Python+Pandas, and also in the R base functions.
> >
> > # This is in R
> >
> > > read.csv('/tmp/1.csv')
> >   A A.1 B B.1
> > 1 1   2 3   4
> > 2 a   b c   d
> >
> > However, there are other R libraries that use underscore too (I think
> > tidyverse does so). So users may have to normalize it themselves
> > already when using different libraries in R.
> >
> > So I think we can use underscore or any other strategy to deduplicate
> > column names, and allowing users to customize how names are repaired
> > sounds good too, as long as we can find a good API for that.
> >
> > > With that in mind, appending the enum does make sense. I'd still be
> > > wary about making it default behavior anytime soon, unless there's
> > > evidence that deduplication is really what users expect.
> >
> > +1
> >
> > > Something to consider though. We allow configuring the delimiter. I
> > > think parsing would be fine, but it might introduce edge-cases for
> > > printing if the delimiter and normalization strategy overlap. For
> > > example, "A,A" becomes "A.1,A.2" but the delimiter is ".",
> > > effectively making it "A.1.A.2". We'll need test cases for that.
> >
> > I don't know if wrapping the column names with quotes would help in
> > this case (i.e. "A.1"."A.2"), but definitely a good scenario for a
> > test case, +1.
> >
> > -Bruno
> >
> > On Wed, 21 Jun 2023 at 02:12, Seth Falco <s...@falco.fun.invalid> wrote:
> >
> > > I don't have a strong enough opinion to conclude what's best.
> > >
> > > Giving it more thought, I think the interface approach I proposed is
> > > overcomplicated tbh. I can't imagine needing another duplicate
> > > header mode after this.
> > >
> > > However, I could imagine situations where we define
> > > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
> > > normalization strategy. For example, dots in the headers breaks
> > > ingesting the data in a third-party system. An interface could
> > > resolve this, but I guess in such a scenario, they can also just opt
> > > for another mode and normalize it themselves to bypass ours.
> > >
> > > With that in mind, appending the enum does make sense. I'd still be
> > > wary about making it default behavior anytime soon, unless there's
> > > evidence that deduplication is really what users expect.
> > >
> > > Something to consider though. We allow configuring the delimiter. I
> > > think parsing would be fine, but it might introduce edge-cases for
> > > printing if the delimiter and normalization strategy overlap.
> > > For example, "A,A" becomes "A.1,A.2" but the delimiter is ".",
> > > effectively making it "A.1.A.2". We'll need test cases for that.
> > >
> > > PS: Sorry if this message goes through twice. Looked to me that the
> > > email didn't go through the first time.
> > >
> > > On 2023/06/20 21:28:16 Gary Gregory wrote:
> > > > That's clever. So we could implement a new enum value
> > > > DuplicateHeaderMode.DEDUPLICATE...
> > > >
> > > > Gary
> > > >
> > > > On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita <ki...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > Bruno says:
> > > > > > "With Pandas it automatically deduplicates the column names.
> > > > > > Maybe that's a feature that we could have in Commons CSV too?"
> > > > > >
> > > > > > What does that mean and actually do? Say I have column A with
> > > > > > row 1 value of "X" and 2nd column A with row 1 value of 2.
> > > > > > What do I get when I ask for column A row 1?
> > > > >
> > > > > When you ask for column A, you get the first column A with row 1
> > > > > value of "X". Then Pandas renames the other A column as "A.1".
> > > > > If you want to access rows in the second A column, then you will
> > > > > use "A.1" as index.
> > > > >
> > > > > This is useful when you work with CSV's with many headers so
> > > > > that you still have a valid name to use as index to access data,
> > > > > instead of having to rely on the column index, for instance (or
> > > > > if you are using other libraries that work with the column
> > > > > names, etc.)
> > > > >
> > > > > > As a first cut whatever we do could/should maintain the
> > > > > > existing behavior. We can change the default later by popular
> > > > > > demand.
> > > > >
> > > > > +1
> > > > >
> > > > > Cheers
> > > > >
> > > > > Bruno
> > > > >
> > > > > On Tue, 20 Jun 2023 at 13:39, Gary Gregory <ga...@gmail.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > This thread is a follow-up to
> > > > > > https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258
> > > > > >
> > > > > > Bruno says:
> > > > > > "With Pandas it automatically deduplicates the column names.
> > > > > > Maybe that's a feature that we could have in Commons CSV too?"
> > > > > >
> > > > > > What does that mean and actually do? Say I have column A with
> > > > > > row 1 value of "X" and 2nd column A with row 1 value of 2.
> > > > > > What do I get when I ask for column A row 1?
> > > > > >
> > > > > > Seth says:
> > > > > > "HeaderStrategy Interface
> > > > > > Contains two functions:
> > > > > >
> > > > > > #normalizeHeaders(headings) - With given heading, output a
> > > > > > list that fits with whatever the strategy is going for.
> > > > > > #get(record, header) - Fetch value(s) based on given column
> > > > > > name."
> > > > > >
> > > > > > I would see perhaps two interfaces so that lambdas might be
> > > > > > used more simply. Maybe, needs an example.
> > > > > >
> > > > > > "I'm also wary that this may screw up existing projects that
> > > > > > depend on allowing/disallowing duplicates. i.e. want to allow
> > > > > > duplicates and handle things through indexes / iteration, so
> > > > > > this didn't cause a problem for them and want to preserve
> > > > > > header names, and so don't need the headers deduplicated."
> > > > > >
> > > > > > As a first cut whatever we do could/should maintain the
> > > > > > existing behavior. We can change the default later by popular
> > > > > > demand.
> > > > > >
> > > > > > Gary
> > >
> > > --
> > > GitHub: https://github.com/SethFalco
> > > Fediverse <https://en.wikipedia.org/wiki/Fediverse>: @ se...@fosstodon.org
> > > <https://fosstodon.org/@sethi>
> > > LinkedIn: https://www.linkedin.com/in/sethfalco/