Well, maybe we should not have a postfix string method; that assumes a lot. A default implementation of a function to convert all header names sounds better.
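
A minimal sketch of such a function, in plain Java with no Commons CSV types. The names HeaderNames and deduplicating are hypothetical, not an existing or agreed API. The renaming rule is an assumption taken from David's description of pandas further down: the first occurrence keeps its name, and a later duplicate gets the smallest numeric suffix that does not clash with any other header.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical helper for illustration only; not part of Commons CSV. */
final class HeaderNames {

    /**
     * Returns a function that maps a whole header row to unique names.
     * A repeated name becomes name + separator + n, where n is the smallest
     * positive number that clashes neither with a header in the input nor
     * with a name already produced.
     */
    static UnaryOperator<List<String>> deduplicating(final String separator) {
        return headers -> {
            final Set<String> taken = new HashSet<>(headers); // every name in the input
            final Set<String> emitted = new HashSet<>();      // names produced so far
            final List<String> result = new ArrayList<>(headers.size());
            for (final String name : headers) {
                String candidate = name;
                if (emitted.contains(candidate)) {
                    int n = 1;
                    do {
                        candidate = name + separator + n++;
                    } while (taken.contains(candidate) || emitted.contains(candidate));
                }
                emitted.add(candidate);
                result.add(candidate);
            }
            return result;
        };
    }

    public static void main(final String[] args) {
        // David's example header row from this thread.
        System.out.println(deduplicating(".").apply(List.of("A", "A", "A.1", "C", "C", "C.1")));
    }
}
```

With "." as the separator, David's header A, A, A.1, C, C, C.1 comes out as A, A.2, A.1, C, C.2, C.1. Because the whole thing is just a function from header list to header list, a caller who cannot have dots could pass "_" or supply a completely different function.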

Gary

On Wed, Jun 21, 2023, 09:11 Gary Gregory <garydgreg...@gmail.com> wrote:

> So it is starting to sound like we need either to add to CSVFormat:
>
> - "duplicate header postfix string", or
> - deprecate duplicate header mode in favor of a duplicate header strategy
>   which holds a duplicate header mode plus a duplicate header postfix
>   string and some functional interface for custom processing...
>
> Gary
>
> On Wed, Jun 21, 2023, 08:00 David Dellsperger
> <david.dellsper...@gmail.com> wrote:
>
>> I've always had a big concern with this kind of behavior, because what
>> happens if the "new column" already exists but later in the header? It
>> seems like python/pandas deals with this by incrementing AGAIN, so they
>> read the header and THEN decide what to do with the values for
>> duplicates (makes sense). The following CSV
>>
>> A, A, A.1, C, C, C.1
>> 1, 2, 3, 4, 5, 6
>>
>> would lead to the headers of A, A.2, A.1, C, C.2, C.1 in python/pandas.
>>
>> I assume appending '.1' has fewer clashes than just appending '1' at the
>> end and might be why pandas chose that path. Idea would be you want a
>> strategy that would have as little clash as possible when it comes to
>> extending the names
>>
>> David
>>
>> On Tue, Jun 20, 2023 at 11:24 PM Bruno Kinoshita <ki...@apache.org>
>> wrote:
>>
>> > Hi,
>> >
>> > > However, I could imagine situations where we define
>> > > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
>> > > normalization strategy. For example, dots in the headers break
>> > > ingesting the data in a third-party system. An interface could
>> > > resolve this, but I guess in such a scenario, they can also just opt
>> > > for another mode and normalize it themselves to bypass ours.
>> >
>> > Good point. I think the only advantage of using dots is following the
>> > same pattern used in Python+Pandas, and also in the R base functions.
>> >
>> > # This is in R
>> >
>> > > read.csv('/tmp/1.csv')
>> >   A A.1 B B.1
>> > 1 1   2 3   4
>> > 2 a   b c   d
>> > >
>> >
>> > However, there are other R libraries that use underscore too (I think
>> > tidyverse does so). So users may have to normalize it themselves
>> > already when using different libraries in R.
>> >
>> > So I think we can use underscore or any other strategy to deduplicate
>> > column names, and allowing users to customize how names are repaired
>> > sounds good too, as long as we can find a good API for that.
>> >
>> > > With that in mind, appending the enum does make sense. I'd still be
>> > > wary about making it default behavior anytime soon, unless there's
>> > > evidence that deduplication is really what users expect.
>> >
>> > +1
>> >
>> > > Something to consider though. We allow configuring the delimiter. I
>> > > think parsing would be fine, but it might introduce edge-cases for
>> > > printing if the delimiter and normalization strategy overlap. For
>> > > example, "A,A" becomes "A.1,A.2" but the delimiter is ".",
>> > > effectively making it "A.1.A.2". We'll need test cases for that.
>> >
>> > I don't know if wrapping the column names with quotes would help in
>> > this case (i.e. "A.1"."A.2"), but definitely a good scenario for a
>> > test case, +1.
>> >
>> > -Bruno
>> >
>> > On Wed, 21 Jun 2023 at 02:12, Seth Falco <s...@falco.fun.invalid>
>> > wrote:
>> >
>> > > I don't have a strong enough opinion to conclude what's best.
>> > >
>> > > Giving it more thought, I think the interface approach I proposed is
>> > > overcomplicated tbh. I can't imagine needing another duplicate
>> > > header mode after this.
>> > >
>> > > However, I could imagine situations where we define
>> > > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
>> > > normalization strategy. For example, dots in the headers break
>> > > ingesting the data in a third-party system.
>> > > An interface could resolve this, but I guess in such a scenario,
>> > > they can also just opt for another mode and normalize it themselves
>> > > to bypass ours.
>> > >
>> > > With that in mind, appending the enum does make sense. I'd still be
>> > > wary about making it default behavior anytime soon, unless there's
>> > > evidence that deduplication is really what users expect.
>> > >
>> > > Something to consider though. We allow configuring the delimiter. I
>> > > think parsing would be fine, but it might introduce edge-cases for
>> > > printing if the delimiter and normalization strategy overlap. For
>> > > example, "A,A" becomes "A.1,A.2" but the delimiter is ".",
>> > > effectively making it "A.1.A.2". We'll need test cases for that.
>> > >
>> > > PS: Sorry if this message goes through twice. Looked to me that the
>> > > email didn't go through the first time.
>> > >
>> > > On 2023/06/20 21:28:16 Gary Gregory wrote:
>> > > > That's clever. So we could implement a new enum value
>> > > > DuplicateHeaderMode.DEDUPLICATE...
>> > > >
>> > > > Gary
>> > > >
>> > > > On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita <ki...@apache.org>
>> > > > wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > > Bruno says:
>> > > > > > "With Pandas it automatically deduplicates the column names.
>> > > > > > Maybe that's a feature that we could have in Commons CSV too?"
>> > > > > >
>> > > > > > What does that mean and actually do? Say I have column A with
>> > > > > > row 1 value of "X" and 2nd column A with row 1 value of 2.
>> > > > > > What do I get when I ask for column A row 1?
>> > > > >
>> > > > > When you ask for column A, you get the first column A with row 1
>> > > > > value of "X". Then Pandas renames the other A column as "A.1".
>> > > > > If you want to access rows in the second A column, then you will
>> > > > > use "A.1" as index.
>> > > > >
>> > > > > This is useful when you work with CSV's with many headers so that
>> > > > > you still have a valid name to use as index to access data,
>> > > > > instead of having to rely on the column index, for instance (or
>> > > > > if you are using other libraries that work with the column
>> > > > > names, etc.)
>> > > > >
>> > > > > > As a first cut whatever we do could/should maintain the
>> > > > > > existing behavior. We can change the default later by popular
>> > > > > > demand.
>> > > > >
>> > > > > +1
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > Bruno
>> > > > >
>> > > > > On Tue, 20 Jun 2023 at 13:39, Gary Gregory <ga...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi All,
>> > > > > >
>> > > > > > This thread is a follow-up to
>> > > > > > https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258
>> > > > > >
>> > > > > > Bruno says:
>> > > > > > "With Pandas it automatically deduplicates the column names.
>> > > > > > Maybe that's a feature that we could have in Commons CSV too?"
>> > > > > >
>> > > > > > What does that mean and actually do? Say I have column A with
>> > > > > > row 1 value of "X" and 2nd column A with row 1 value of 2.
>> > > > > > What do I get when I ask for column A row 1?
>> > > > > >
>> > > > > > Seth says:
>> > > > > > "HeaderStrategy Interface
>> > > > > > Contains two functions:
>> > > > > >
>> > > > > > #normalizeHeaders(headings) - With given heading, output a list
>> > > > > > that fits with whatever the strategy is going for.
>> > > > > > #get(record, header) - Fetch value(s) based on given column
>> > > > > > name."
>> > > > > >
>> > > > > > I would see perhaps two interfaces so that lambdas might be
>> > > > > > used more simply. Maybe, needs an example.
>> > > > > >
>> > > > > > "I'm also wary that this may screw up existing projects that
>> > > > > > depend on allowing/disallowing duplicates. i.e.
>> > > > > > want to allow duplicates and handle things through indexes /
>> > > > > > iteration, so this didn't cause a problem for them and want to
>> > > > > > preserve header names, and so don't need the headers
>> > > > > > deduplicated."
>> > > > > >
>> > > > > > As a first cut whatever we do could/should maintain the
>> > > > > > existing behavior. We can change the default later by popular
>> > > > > > demand.
>> > > > > >
>> > > > > > Gary
>> > > > > >
>> > > > > > ---------------------------------------------------------------------
>> > > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> > > > > > For additional commands, e-mail: dev-h...@commons.apache.org
>> > >
>> > > --
>> > > GitHub: https://github.com/SethFalco
>> > > Fediverse <https://en.wikipedia.org/wiki/Fediverse>: @
>> > > se...@fosstodon.org <https://fosstodon.org/@sethi>
>> > > LinkedIn: https://www.linkedin.com/in/sethfalco/