Thank you for your excellent investigation, Alex, and for finding and fixing 
the missing Test annotations. Overall, we have 98% coverage.

Taking stock, I think that this is where we are, we have main 3 issues:

1) Compatibility of duplicate header behavior from version to version
2) Validation bug with empty headers
3) Writing and parsing behavior mismatch

One at a time:

1) Compatibility of duplicate header behavior from version to version

- We have behavior [A] from 1.0 to 1.6.
- We have behavior [B] from 1.7 to 1.9.
- We are discussing if 1.10.0 should have behavior [A] or [B]

I would vote for keeping behavior B for simplicity and keeping with the 
principle of least surprise: Maintaining the current behavior is better than 
flip-flopping.

2) Validation bug with empty headers

Fixed in git master commit 289ffa16275c518722c6cda913bfca8a6e1a1411; please do 
verify the tests. I've added 3 test methods per Alex's failing examples below.

3) Writing and parsing behavior mismatch

I've updated the documentation in git master as Alex suggested.

I will cancel the RC1 vote in a separate message with a [CANCEL] subject prefix.

Gary

[A] From Alex:
> From my investigation the behaviour from 1.0 to 1.6 was to throw on
> duplicates, but ignore empty duplicates (so allowing pad columns in a
> CSV).

[B] From Alex:
> The allowDuplicates flag was added in 1.7 to allow duplicates. DEFAULT
> behaviour was true! So the behavioural compatibility was broken in
> release 1.7. Note there is also an allowMissingColumnNames flag. Logic
> was:
> 
> - Store column headers as we go
> - For each new column header, check in the existing headers
> - If present:
> -- If not empty then throw if allowDuplicateHeaderNames=true
> -- If empty then throw if allowMissingColumnNames=false
> 
> There is some logic failure here. It would allow the first missing
> column name for example; only the second empty missing column name
> would trigger an exception.


On 2022/10/21 16:33:23 Alex Herbert wrote:
> From my investigation the behaviour from 1.0 to 1.6 was to throw on
> duplicates, but ignore empty duplicates (so allowing pad columns in a
> CSV).
> 
> The allowDuplicates flag was added in 1.7 to allow duplicates. DEFAULT
> behaviour was true! So the behavioural compatibility was broken in
> release 1.7. Note there is also an allowMissingColumnNames flag. Logic
> was:
> 
> - Store column headers as we go
> - For each new column header, check in the existing headers
> - If present:
> -- If not empty then throw if allowDuplicateHeaderNames=true
> -- If empty then throw if allowMissingColumnNames=false
> 
> There is some logic failure here. It would allow the first missing
> column name for example; only the second empty missing column name
> would trigger an exception.
> 
> The current master has updated this logic somewhat but I think the
> concept is still the same. There are tests to show that the default
> enum value is ALLOW_ALL. This is the same behaviour as 1.7, but not
> 1.0 - 1.6.
> 
> Setting the boolean flag to true -> ALLOW_ALL.
> Setting the boolean flag to false -> ALLOW_EMPTY.
> 
> So I think the default settings are correct, and the use of the old
> flag flips the behaviour correctly. But when investigating I found
> another issue.
> 
> I looked at the CSVFormat tests and found two fixtures missing the
> @Test annotation:
> 
> testDuplicateHeaderElementsTrue()
> testDuplicateHeaderElementsTrue_Deprecated()
> 
> I added these to master.
> 
> The tests for duplicate headers use the builder and setHeader("A",
> "A"). So there are no tests for empty headers. Trying this throws an
> exception in CSVFormat.validate for all of these:
> 
> CSVFormat.DEFAULT.builder().setAllowDuplicateHeaderNames(false).setHeader("A",
> "", "B", "").build();
> CSVFormat.DEFAULT.builder().setDuplicateHeaderMode(DuplicateHeaderMode.ALLOW_EMPTY).setHeader("A",
> "", "B", "").build();
> CSVFormat.DEFAULT.builder().setAllowDuplicateHeaderNames(false)
>     .setAllowMissingColumnNames(true).setHeader("A", "", "B", "").build();
> 
> So it seems the logic when building a CSVFormat with explicit headers
> is different from when parsing a format in CSVParser.createHeaders().
> That method has a lot of logic to set up the header map. If a
> CSVFormat is created with an explicit header the behaviour is
> different with respect to duplicate headers, and
> allowMissingColumnNames which is not even considered in
> CSVFormat.validate().
> 
> This is the current behaviour in master for the parser:
> 
> @Test
> public void testEmptyColumnsAndDuplicateHeadersNotAllowed() throws Exception {
>     assertThrows(IllegalArgumentException.class,
>         () -> CSVParser.parse("a,,c,\n1,2,3,4\nw,x,y,z",
> CSVFormat.DEFAULT.withHeader().withAllowDuplicateHeaderNames(false)));
> }
> 
> @Test
> public void 
> testEmptyColumnsAndDuplicateHeadersNotAllowedWithMissingColumnsAllowed()
> throws Exception {
>     try (CSVParser parser = CSVParser.parse("a,,c,\n1,2,3,4\nw,x,y,z",
>             
> CSVFormat.DEFAULT.withHeader().withAllowDuplicateHeaderNames(false).withAllowMissingColumnNames(true)))
> {
>         // noop
>     }
> }
> 
> So the duplicate headers cannot be considered on its own, it requires
> use of the allow missing column names flag.
> 
> Also, if a CSVFormat has an explicit header then the CSVParser will
> not construct a header Map<String, Integer> to allow column names to
> look up column indices.
> 
> I am not sure of the best path to fix this. I would document the
> CSVParser header map as not being created if the header is not read
> from the CSV input (current behaviour). Then maybe change the logic in
> CSVFormat.validate() to respect the ALLOW_ALL, ALLOW_EMPTY, ALLOW_NONE
> duplicates choice, possibly also considering allowEmptyColumnNames
> which defaults to false. Ideally we should get the same exceptions
> when building a CSVFormat with explicitly set headers, and parsing a
> CSV file and getting the header from the file. Currently I do not
> think that is true.
> 
> Alex
> 
> On Fri, 21 Oct 2022 at 15:54, Gary Gregory <garydgreg...@gmail.com> wrote:
> >
> > Ok, I don't see any conflict anymore but the PR title does not match the
> > code: "CSVFormat.duplicateHeaderMode requires default DISALLOW"
> >
> > I am ok with the changes as they stand now but we have to resolve if the
> > default should be changed.
> >
> > Gary
> >
> > On Fri, Oct 21, 2022, 09:44 <sma...@outlook.de> wrote:
> >
> > > Hi Gary,
> > >
> > > the PR did not have conflicts 30 minutes ago. So we must have encountered
> > > a race condition between you updating master and me rebasing and
> > > force-pushing my pr.
> > > I've done that again just now. There are no conflicts.
> > > At the same time I see parts of my PR have dribbled into master already,
> > > so if you rather implement the changes yourself, please go ahead. I am 
> > > fine
> > > with closing my pr.
> > >
> > > regards,
> > > Markus
> > >
> > >
> > > From: Gary Gregory <garydgreg...@gmail.com>
> > > Sent: Friday, October 21, 2022 15:17
> > > To: Commons Developers List <dev@commons.apache.org>
> > > Subject: Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1
> > >
> > > Hi Markus,
> > >
> > > The PR has conflicts and does not test its changes.
> > >
> > > I'd like feedback from the community on whether or not git master as it is
> > > now is OK for the duplicate header behavior or if changing the default is
> > > needed and if doing so is just ping-pongonging from one incompatibility to
> > > another, based on previous versions.
> > >
> > > TY!
> > >
> > > Gary
> > >
> > > On Fri, Oct 21, 2022, 08:52 <sma...@outlook.de> wrote:
> > >
> > > > I've removed serialization stuff from the PR and rebased it to master
> > > > https://github.com/apache/commons-csv/pull/276
> > > >
> > > > kind regards,
> > > > Markus
> > > >
> > > > From: Gary D. Gregory <ggreg...@apache.org>
> > > > Sent: Friday, October 21, 2022 14:18
> > > > To: dev@commons.apache.org <dev@commons.apache.org>
> > > > Subject: Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1
> > > >
> > > > On 2022/10/20 22:56:05 Alex Herbert wrote:
> > > > > On Thu, 20 Oct 2022 at 23:43, Alex Herbert <alex.d.herb...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > I did not have time to track through whether this behaviour changed
> > > > > > after the initial implementation of the flag. I would think not as
> > > the
> > > > > > original behaviour is from 1.0. This would map to:
> > > > > >
> > > > > > true -> ALLOW_ALL
> > > > > > false -> ALLOW_EMPTY
> > > > > > new -> DISALLOW
> > > > > >
> > > > > > Which is what we currently have in 1.10.0 RC1. Thus the PR #276 [7]
> > > to
> > > > > > change the use of the flag to 'false -> DISALLOW' is not maintaining
> > > > > > behavioural compatibility (to 1.7, or back to 1.0).
> > > > >
> > > > > PS. I just verified that PR 276 changes the DuplicateHeaderMode value
> > > > > for allowDuplicates=false and does not change any tests.
> > > > >
> > > > > So the test suite is currently not enforcing behavioural
> > > > > compatibility. This seems like a glaring hole in the tests and should
> > > > > be addressed to prevent regressions.
> > > >
> > > > In git master, there are many tests for senders of
> > > > withAllowDuplicateHeaderNames(boolean) and
> > > > setAllowDuplicateHeaderNames(boolean).
> > > >
> > > > I filled in the coverage in testGetAllowDuplicateHeaderNames() for
> > > > getAllowDuplicateHeaderNames() to reflect the _current_ behavior, right
> > > or
> > > > wrong.
> > > >
> > > > Let's reassess now git master now please. If the PR matters still, it
> > > > should be rebased on git master.
> > > >
> > > > Gary
> > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > > For additional commands, e-mail: dev-h...@commons.apache.org
> > > > >
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > For additional commands, e-mail: dev-h...@commons.apache.org
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > For additional commands, e-mail: dev-h...@commons.apache.org
> > > >
> > > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: dev-h...@commons.apache.org
> > >
> > >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to