This VOTE is canceled. I will roll an RC2 if we agree that git master is OK.
Gary On 2022/10/22 12:56:40 "Gary D. Gregory" wrote: > Thank you for your excellent investigation, Alex, and for finding and fixing > the missing Test annotations. Overall, we have 98% coverage. > > Taking stock, I think that this is where we are, we have main 3 issues: > > 1) Compatibility of duplicate header behavior from version to version > 2) Validation bug with empty headers > 3) Writing and parsing behavior mismatch > > One at a time: > > 1) Compatibility of duplicate header behavior from version to version > > - We have behavior [A] from 1.0 to 1.6. > - We have behavior [B] from 1.7 to 1.9. > - We are discussing if 1.10.0 should have behavior [A] or [B] > > I would vote for keeping behavior B for simplicity and keeping with the > principle of least surprise: Maintaining the current behavior is better than > flip-flopping. > > 2) Validation bug with empty headers > > Fixed in git master commit 289ffa16275c518722c6cda913bfca8a6e1a1411; please > do verify the tests. I've added 3 test methods per Alex's failing examples > below. > > 3) Writing and parsing behavior mismatch > > I've updated the documentation in git master as Alex suggested. > > I will cancel the RC1 vote in a separate message with a [CANCEL] subject > prefix. > > Gary > > [A] From Alex: > > From my investigation the behaviour from 1.0 to 1.6 was to throw on > > duplicates, but ignore empty duplicates (so allowing pad columns in a > > CSV). > > [B] From Alex: > > The allowDuplicates flag was added in 1.7 to allow duplicates. DEFAULT > > behaviour was true! So the behavioural compatibility was broken in > > release 1.7. Note there is also an allowMissingColumnNames flag. Logic > > was: > > > > - Store column headers as we go > > - For each new column header, check in the existing headers > > - If present: > > -- If not empty then throw if allowDuplicateHeaderNames=true > > -- If empty then throw if allowMissingColumnNames=false > > > > There is some logic failure here. It would allow the first missing > > column name for example; only the second empty missing column name > > would trigger an exception. > > > On 2022/10/21 16:33:23 Alex Herbert wrote: > > From my investigation the behaviour from 1.0 to 1.6 was to throw on > > duplicates, but ignore empty duplicates (so allowing pad columns in a > > CSV). > > > > The allowDuplicates flag was added in 1.7 to allow duplicates. DEFAULT > > behaviour was true! So the behavioural compatibility was broken in > > release 1.7. Note there is also an allowMissingColumnNames flag. Logic > > was: > > > > - Store column headers as we go > > - For each new column header, check in the existing headers > > - If present: > > -- If not empty then throw if allowDuplicateHeaderNames=true > > -- If empty then throw if allowMissingColumnNames=false > > > > There is some logic failure here. It would allow the first missing > > column name for example; only the second empty missing column name > > would trigger an exception. > > > > The current master has updated this logic somewhat but I think the > > concept is still the same. There are tests to show that the default > > enum value is ALLOW_ALL. This is the same behaviour as 1.7, but not > > 1.0 - 1.6. > > > > Setting the boolean flag to true -> ALLOW_ALL. > > Setting the boolean flag to false -> ALLOW_EMPTY. > > > > So I think the default settings are correct, and the use of the old > > flag flips the behaviour correctly. But when investigating I found > > another issue. > > > > I looked at the CSVFormat tests and found two fixtures missing the > > @Test annotation: > > > > testDuplicateHeaderElementsTrue() > > testDuplicateHeaderElementsTrue_Deprecated() > > > > I added these to master. > > > > The tests for duplicate headers use the builder and setHeader("A", > > "A"). So there are no tests for empty headers. Trying this throws an > > exception in CSVFormat.validate for all of these: > > > > CSVFormat.DEFAULT.builder().setAllowDuplicateHeaderNames(false).setHeader("A", > > "", "B", "").build(); > > CSVFormat.DEFAULT.builder().setDuplicateHeaderMode(DuplicateHeaderMode.ALLOW_EMPTY).setHeader("A", > > "", "B", "").build(); > > CSVFormat.DEFAULT.builder().setAllowDuplicateHeaderNames(false) > > .setAllowMissingColumnNames(true).setHeader("A", "", "B", "").build(); > > > > So it seems the logic when building a CSVFormat with explicit headers > > is different from when parsing a format in CSVParser.createHeaders(). > > That method has a lot of logic to set up the header map. If a > > CSVFormat is created with an explicit header the behaviour is > > different with respect to duplicate headers, and > > allowMissingColumnNames which is not even considered in > > CSVFormat.validate(). > > > > This is the current behaviour in master for the parser: > > > > @Test > > public void testEmptyColumnsAndDuplicateHeadersNotAllowed() throws > > Exception { > > assertThrows(IllegalArgumentException.class, > > () -> CSVParser.parse("a,,c,\n1,2,3,4\nw,x,y,z", > > CSVFormat.DEFAULT.withHeader().withAllowDuplicateHeaderNames(false))); > > } > > > > @Test > > public void > > testEmptyColumnsAndDuplicateHeadersNotAllowedWithMissingColumnsAllowed() > > throws Exception { > > try (CSVParser parser = CSVParser.parse("a,,c,\n1,2,3,4\nw,x,y,z", > > > > CSVFormat.DEFAULT.withHeader().withAllowDuplicateHeaderNames(false).withAllowMissingColumnNames(true))) > > { > > // noop > > } > > } > > > > So the duplicate headers cannot be considered on its own, it requires > > use of the allow missing column names flag. > > > > Also, if a CSVFormat has an explicit header then the CSVParser will > > not construct a header Map<String, Integer> to allow column names to > > look up column indices. > > > > I am not sure of the best path to fix this. I would document the > > CSVParser header map as not being created if the header is not read > > from the CSV input (current behaviour). Then maybe change the logic in > > CSVFormat.validate() to respect the ALLOW_ALL, ALLOW_EMPTY, ALLOW_NONE > > duplicates choice, possibly also considering allowEmptyColumnNames > > which defaults to false. Ideally we should get the same exceptions > > when building a CSVFormat with explicitly set headers, and parsing a > > CSV file and getting the header from the file. Currently I do not > > think that is true. > > > > Alex > > > > On Fri, 21 Oct 2022 at 15:54, Gary Gregory <garydgreg...@gmail.com> wrote: > > > > > > Ok, I don't see any conflict anymore but the PR title does not match the > > > code: "CSVFormat.duplicateHeaderMode requires default DISALLOW" > > > > > > I am ok with the changes as they stand now but we have to resolve if the > > > default should be changed. > > > > > > Gary > > > > > > On Fri, Oct 21, 2022, 09:44 <sma...@outlook.de> wrote: > > > > > > > Hi Gary, > > > > > > > > the PR did not have conflicts 30 minutes ago. So we must have > > > > encountered > > > > a race condition between you updating master and me rebasing and > > > > force-pushing my pr. > > > > I've done that again just now. There are no conflicts. > > > > At the same time I see parts of my PR have dribbled into master already, > > > > so if you rather implement the changes yourself, please go ahead. I am > > > > fine > > > > with closing my pr. > > > > > > > > regards, > > > > Markus > > > > > > > > > > > > From: Gary Gregory <garydgreg...@gmail.com> > > > > Sent: Friday, October 21, 2022 15:17 > > > > To: Commons Developers List <dev@commons.apache.org> > > > > Subject: Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1 > > > > > > > > Hi Markus, > > > > > > > > The PR has conflicts and does not test its changes. > > > > > > > > I'd like feedback from the community on whether or not git master as it > > > > is > > > > now is OK for the duplicate header behavior or if changing the default > > > > is > > > > needed and if doing so is just ping-pongonging from one incompatibility > > > > to > > > > another, based on previous versions. > > > > > > > > TY! > > > > > > > > Gary > > > > > > > > On Fri, Oct 21, 2022, 08:52 <sma...@outlook.de> wrote: > > > > > > > > > I've removed serialization stuff from the PR and rebased it to master > > > > > https://github.com/apache/commons-csv/pull/276 > > > > > > > > > > kind regards, > > > > > Markus > > > > > > > > > > From: Gary D. Gregory <ggreg...@apache.org> > > > > > Sent: Friday, October 21, 2022 14:18 > > > > > To: dev@commons.apache.org <dev@commons.apache.org> > > > > > Subject: Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1 > > > > > > > > > > On 2022/10/20 22:56:05 Alex Herbert wrote: > > > > > > On Thu, 20 Oct 2022 at 23:43, Alex Herbert > > > > > > <alex.d.herb...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > I did not have time to track through whether this behaviour > > > > > > > changed > > > > > > > after the initial implementation of the flag. I would think not as > > > > the > > > > > > > original behaviour is from 1.0. This would map to: > > > > > > > > > > > > > > true -> ALLOW_ALL > > > > > > > false -> ALLOW_EMPTY > > > > > > > new -> DISALLOW > > > > > > > > > > > > > > Which is what we currently have in 1.10.0 RC1. Thus the PR #276 > > > > > > > [7] > > > > to > > > > > > > change the use of the flag to 'false -> DISALLOW' is not > > > > > > > maintaining > > > > > > > behavioural compatibility (to 1.7, or back to 1.0). > > > > > > > > > > > > PS. I just verified that PR 276 changes the DuplicateHeaderMode > > > > > > value > > > > > > for allowDuplicates=false and does not change any tests. > > > > > > > > > > > > So the test suite is currently not enforcing behavioural > > > > > > compatibility. This seems like a glaring hole in the tests and > > > > > > should > > > > > > be addressed to prevent regressions. > > > > > > > > > > In git master, there are many tests for senders of > > > > > withAllowDuplicateHeaderNames(boolean) and > > > > > setAllowDuplicateHeaderNames(boolean). > > > > > > > > > > I filled in the coverage in testGetAllowDuplicateHeaderNames() for > > > > > getAllowDuplicateHeaderNames() to reflect the _current_ behavior, > > > > > right > > > > or > > > > > wrong. > > > > > > > > > > Let's reassess now git master now please. If the PR matters still, it > > > > > should be rebased on git master. > > > > > > > > > > Gary > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > > > --------------------------------------------------------------------- > > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > For additional commands, e-mail: dev-h...@commons.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org