Thank you for your excellent investigation, Alex, and for finding and fixing the missing Test annotations. Overall, we have 98% coverage.
Taking stock, I think that this is where we are, we have main 3 issues: 1) Compatibility of duplicate header behavior from version to version 2) Validation bug with empty headers 3) Writing and parsing behavior mismatch One at a time: 1) Compatibility of duplicate header behavior from version to version - We have behavior [A] from 1.0 to 1.6. - We have behavior [B] from 1.7 to 1.9. - We are discussing if 1.10.0 should have behavior [A] or [B] I would vote for keeping behavior B for simplicity and keeping with the principle of least surprise: Maintaining the current behavior is better than flip-flopping. 2) Validation bug with empty headers Fixed in git master commit 289ffa16275c518722c6cda913bfca8a6e1a1411; please do verify the tests. I've added 3 test methods per Alex's failing examples below. 3) Writing and parsing behavior mismatch I've updated the documentation in git master as Alex suggested. I will cancel the RC1 vote in a separate message with a [CANCEL] subject prefix. Gary [A] From Alex: > From my investigation the behaviour from 1.0 to 1.6 was to throw on > duplicates, but ignore empty duplicates (so allowing pad columns in a > CSV). [B] From Alex: > The allowDuplicates flag was added in 1.7 to allow duplicates. DEFAULT > behaviour was true! So the behavioural compatibility was broken in > release 1.7. Note there is also an allowMissingColumnNames flag. Logic > was: > > - Store column headers as we go > - For each new column header, check in the existing headers > - If present: > -- If not empty then throw if allowDuplicateHeaderNames=true > -- If empty then throw if allowMissingColumnNames=false > > There is some logic failure here. It would allow the first missing > column name for example; only the second empty missing column name > would trigger an exception. On 2022/10/21 16:33:23 Alex Herbert wrote: > From my investigation the behaviour from 1.0 to 1.6 was to throw on > duplicates, but ignore empty duplicates (so allowing pad columns in a > CSV). > > The allowDuplicates flag was added in 1.7 to allow duplicates. DEFAULT > behaviour was true! So the behavioural compatibility was broken in > release 1.7. Note there is also an allowMissingColumnNames flag. Logic > was: > > - Store column headers as we go > - For each new column header, check in the existing headers > - If present: > -- If not empty then throw if allowDuplicateHeaderNames=true > -- If empty then throw if allowMissingColumnNames=false > > There is some logic failure here. It would allow the first missing > column name for example; only the second empty missing column name > would trigger an exception. > > The current master has updated this logic somewhat but I think the > concept is still the same. There are tests to show that the default > enum value is ALLOW_ALL. This is the same behaviour as 1.7, but not > 1.0 - 1.6. > > Setting the boolean flag to true -> ALLOW_ALL. > Setting the boolean flag to false -> ALLOW_EMPTY. > > So I think the default settings are correct, and the use of the old > flag flips the behaviour correctly. But when investigating I found > another issue. > > I looked at the CSVFormat tests and found two fixtures missing the > @Test annotation: > > testDuplicateHeaderElementsTrue() > testDuplicateHeaderElementsTrue_Deprecated() > > I added these to master. > > The tests for duplicate headers use the builder and setHeader("A", > "A"). So there are no tests for empty headers. Trying this throws an > exception in CSVFormat.validate for all of these: > > CSVFormat.DEFAULT.builder().setAllowDuplicateHeaderNames(false).setHeader("A", > "", "B", "").build(); > CSVFormat.DEFAULT.builder().setDuplicateHeaderMode(DuplicateHeaderMode.ALLOW_EMPTY).setHeader("A", > "", "B", "").build(); > CSVFormat.DEFAULT.builder().setAllowDuplicateHeaderNames(false) > .setAllowMissingColumnNames(true).setHeader("A", "", "B", "").build(); > > So it seems the logic when building a CSVFormat with explicit headers > is different from when parsing a format in CSVParser.createHeaders(). > That method has a lot of logic to set up the header map. If a > CSVFormat is created with an explicit header the behaviour is > different with respect to duplicate headers, and > allowMissingColumnNames which is not even considered in > CSVFormat.validate(). > > This is the current behaviour in master for the parser: > > @Test > public void testEmptyColumnsAndDuplicateHeadersNotAllowed() throws Exception { > assertThrows(IllegalArgumentException.class, > () -> CSVParser.parse("a,,c,\n1,2,3,4\nw,x,y,z", > CSVFormat.DEFAULT.withHeader().withAllowDuplicateHeaderNames(false))); > } > > @Test > public void > testEmptyColumnsAndDuplicateHeadersNotAllowedWithMissingColumnsAllowed() > throws Exception { > try (CSVParser parser = CSVParser.parse("a,,c,\n1,2,3,4\nw,x,y,z", > > CSVFormat.DEFAULT.withHeader().withAllowDuplicateHeaderNames(false).withAllowMissingColumnNames(true))) > { > // noop > } > } > > So the duplicate headers cannot be considered on its own, it requires > use of the allow missing column names flag. > > Also, if a CSVFormat has an explicit header then the CSVParser will > not construct a header Map<String, Integer> to allow column names to > look up column indices. > > I am not sure of the best path to fix this. I would document the > CSVParser header map as not being created if the header is not read > from the CSV input (current behaviour). Then maybe change the logic in > CSVFormat.validate() to respect the ALLOW_ALL, ALLOW_EMPTY, ALLOW_NONE > duplicates choice, possibly also considering allowEmptyColumnNames > which defaults to false. Ideally we should get the same exceptions > when building a CSVFormat with explicitly set headers, and parsing a > CSV file and getting the header from the file. Currently I do not > think that is true. > > Alex > > On Fri, 21 Oct 2022 at 15:54, Gary Gregory <garydgreg...@gmail.com> wrote: > > > > Ok, I don't see any conflict anymore but the PR title does not match the > > code: "CSVFormat.duplicateHeaderMode requires default DISALLOW" > > > > I am ok with the changes as they stand now but we have to resolve if the > > default should be changed. > > > > Gary > > > > On Fri, Oct 21, 2022, 09:44 <sma...@outlook.de> wrote: > > > > > Hi Gary, > > > > > > the PR did not have conflicts 30 minutes ago. So we must have encountered > > > a race condition between you updating master and me rebasing and > > > force-pushing my pr. > > > I've done that again just now. There are no conflicts. > > > At the same time I see parts of my PR have dribbled into master already, > > > so if you rather implement the changes yourself, please go ahead. I am > > > fine > > > with closing my pr. > > > > > > regards, > > > Markus > > > > > > > > > From: Gary Gregory <garydgreg...@gmail.com> > > > Sent: Friday, October 21, 2022 15:17 > > > To: Commons Developers List <dev@commons.apache.org> > > > Subject: Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1 > > > > > > Hi Markus, > > > > > > The PR has conflicts and does not test its changes. > > > > > > I'd like feedback from the community on whether or not git master as it is > > > now is OK for the duplicate header behavior or if changing the default is > > > needed and if doing so is just ping-pongonging from one incompatibility to > > > another, based on previous versions. > > > > > > TY! > > > > > > Gary > > > > > > On Fri, Oct 21, 2022, 08:52 <sma...@outlook.de> wrote: > > > > > > > I've removed serialization stuff from the PR and rebased it to master > > > > https://github.com/apache/commons-csv/pull/276 > > > > > > > > kind regards, > > > > Markus > > > > > > > > From: Gary D. Gregory <ggreg...@apache.org> > > > > Sent: Friday, October 21, 2022 14:18 > > > > To: dev@commons.apache.org <dev@commons.apache.org> > > > > Subject: Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1 > > > > > > > > On 2022/10/20 22:56:05 Alex Herbert wrote: > > > > > On Thu, 20 Oct 2022 at 23:43, Alex Herbert <alex.d.herb...@gmail.com> > > > > wrote: > > > > > > > > > > > > I did not have time to track through whether this behaviour changed > > > > > > after the initial implementation of the flag. I would think not as > > > the > > > > > > original behaviour is from 1.0. This would map to: > > > > > > > > > > > > true -> ALLOW_ALL > > > > > > false -> ALLOW_EMPTY > > > > > > new -> DISALLOW > > > > > > > > > > > > Which is what we currently have in 1.10.0 RC1. Thus the PR #276 [7] > > > to > > > > > > change the use of the flag to 'false -> DISALLOW' is not maintaining > > > > > > behavioural compatibility (to 1.7, or back to 1.0). > > > > > > > > > > PS. I just verified that PR 276 changes the DuplicateHeaderMode value > > > > > for allowDuplicates=false and does not change any tests. > > > > > > > > > > So the test suite is currently not enforcing behavioural > > > > > compatibility. This seems like a glaring hole in the tests and should > > > > > be addressed to prevent regressions. > > > > > > > > In git master, there are many tests for senders of > > > > withAllowDuplicateHeaderNames(boolean) and > > > > setAllowDuplicateHeaderNames(boolean). > > > > > > > > I filled in the coverage in testGetAllowDuplicateHeaderNames() for > > > > getAllowDuplicateHeaderNames() to reflect the _current_ behavior, right > > > or > > > > wrong. > > > > > > > > Let's reassess now git master now please. If the PR matters still, it > > > > should be rebased on git master. > > > > > > > > Gary > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org