This VOTE is canceled. I will roll an RC2 if we agree that git master is OK.

Gary

On 2022/10/22 12:56:40 "Gary D. Gregory" wrote:
> Thank you for your excellent investigation, Alex, and for finding and fixing 
> the missing Test annotations. Overall, we have 98% coverage.
> 
> Taking stock, I think that this is where we are, we have main 3 issues:
> 
> 1) Compatibility of duplicate header behavior from version to version
> 2) Validation bug with empty headers
> 3) Writing and parsing behavior mismatch
> 
> One at a time:
> 
> 1) Compatibility of duplicate header behavior from version to version
> 
> - We have behavior [A] from 1.0 to 1.6.
> - We have behavior [B] from 1.7 to 1.9.
> - We are discussing if 1.10.0 should have behavior [A] or [B]
> 
> I would vote for keeping behavior B for simplicity and keeping with the 
> principle of least surprise: Maintaining the current behavior is better than 
> flip-flopping.
> 
> 2) Validation bug with empty headers
> 
> Fixed in git master commit 289ffa16275c518722c6cda913bfca8a6e1a1411; please 
> do verify the tests. I've added 3 test methods per Alex's failing examples 
> below.
> 
> 3) Writing and parsing behavior mismatch
> 
> I've updated the documentation in git master as Alex suggested.
> 
> I will cancel the RC1 vote in a separate message with a [CANCEL] subject 
> prefix.
> 
> Gary
> 
> [A] From Alex:
> > From my investigation the behaviour from 1.0 to 1.6 was to throw on
> > duplicates, but ignore empty duplicates (so allowing pad columns in a
> > CSV).
> 
> [B] From Alex:
> > The allowDuplicates flag was added in 1.7 to allow duplicates. DEFAULT
> > behaviour was true! So the behavioural compatibility was broken in
> > release 1.7. Note there is also an allowMissingColumnNames flag. Logic
> > was:
> > 
> > - Store column headers as we go
> > - For each new column header, check in the existing headers
> > - If present:
> > -- If not empty then throw if allowDuplicateHeaderNames=true
> > -- If empty then throw if allowMissingColumnNames=false
> > 
> > There is some logic failure here. It would allow the first missing
> > column name for example; only the second empty missing column name
> > would trigger an exception.
> 
> 
> On 2022/10/21 16:33:23 Alex Herbert wrote:
> > From my investigation the behaviour from 1.0 to 1.6 was to throw on
> > duplicates, but ignore empty duplicates (so allowing pad columns in a
> > CSV).
> > 
> > The allowDuplicates flag was added in 1.7 to allow duplicates. DEFAULT
> > behaviour was true! So the behavioural compatibility was broken in
> > release 1.7. Note there is also an allowMissingColumnNames flag. Logic
> > was:
> > 
> > - Store column headers as we go
> > - For each new column header, check in the existing headers
> > - If present:
> > -- If not empty then throw if allowDuplicateHeaderNames=true
> > -- If empty then throw if allowMissingColumnNames=false
> > 
> > There is some logic failure here. It would allow the first missing
> > column name for example; only the second empty missing column name
> > would trigger an exception.
> > 
> > The current master has updated this logic somewhat but I think the
> > concept is still the same. There are tests to show that the default
> > enum value is ALLOW_ALL. This is the same behaviour as 1.7, but not
> > 1.0 - 1.6.
> > 
> > Setting the boolean flag to true -> ALLOW_ALL.
> > Setting the boolean flag to false -> ALLOW_EMPTY.
> > 
> > So I think the default settings are correct, and the use of the old
> > flag flips the behaviour correctly. But when investigating I found
> > another issue.
> > 
> > I looked at the CSVFormat tests and found two fixtures missing the
> > @Test annotation:
> > 
> > testDuplicateHeaderElementsTrue()
> > testDuplicateHeaderElementsTrue_Deprecated()
> > 
> > I added these to master.
> > 
> > The tests for duplicate headers use the builder and setHeader("A",
> > "A"). So there are no tests for empty headers. Trying this throws an
> > exception in CSVFormat.validate for all of these:
> > 
> > CSVFormat.DEFAULT.builder().setAllowDuplicateHeaderNames(false).setHeader("A",
> > "", "B", "").build();
> > CSVFormat.DEFAULT.builder().setDuplicateHeaderMode(DuplicateHeaderMode.ALLOW_EMPTY).setHeader("A",
> > "", "B", "").build();
> > CSVFormat.DEFAULT.builder().setAllowDuplicateHeaderNames(false)
> >     .setAllowMissingColumnNames(true).setHeader("A", "", "B", "").build();
> > 
> > So it seems the logic when building a CSVFormat with explicit headers
> > is different from when parsing a format in CSVParser.createHeaders().
> > That method has a lot of logic to set up the header map. If a
> > CSVFormat is created with an explicit header the behaviour is
> > different with respect to duplicate headers, and
> > allowMissingColumnNames which is not even considered in
> > CSVFormat.validate().
> > 
> > This is the current behaviour in master for the parser:
> > 
> > @Test
> > public void testEmptyColumnsAndDuplicateHeadersNotAllowed() throws 
> > Exception {
> >     assertThrows(IllegalArgumentException.class,
> >         () -> CSVParser.parse("a,,c,\n1,2,3,4\nw,x,y,z",
> > CSVFormat.DEFAULT.withHeader().withAllowDuplicateHeaderNames(false)));
> > }
> > 
> > @Test
> > public void 
> > testEmptyColumnsAndDuplicateHeadersNotAllowedWithMissingColumnsAllowed()
> > throws Exception {
> >     try (CSVParser parser = CSVParser.parse("a,,c,\n1,2,3,4\nw,x,y,z",
> >             
> > CSVFormat.DEFAULT.withHeader().withAllowDuplicateHeaderNames(false).withAllowMissingColumnNames(true)))
> > {
> >         // noop
> >     }
> > }
> > 
> > So the duplicate headers cannot be considered on its own, it requires
> > use of the allow missing column names flag.
> > 
> > Also, if a CSVFormat has an explicit header then the CSVParser will
> > not construct a header Map<String, Integer> to allow column names to
> > look up column indices.
> > 
> > I am not sure of the best path to fix this. I would document the
> > CSVParser header map as not being created if the header is not read
> > from the CSV input (current behaviour). Then maybe change the logic in
> > CSVFormat.validate() to respect the ALLOW_ALL, ALLOW_EMPTY, ALLOW_NONE
> > duplicates choice, possibly also considering allowEmptyColumnNames
> > which defaults to false. Ideally we should get the same exceptions
> > when building a CSVFormat with explicitly set headers, and parsing a
> > CSV file and getting the header from the file. Currently I do not
> > think that is true.
> > 
> > Alex
> > 
> > On Fri, 21 Oct 2022 at 15:54, Gary Gregory <garydgreg...@gmail.com> wrote:
> > >
> > > Ok, I don't see any conflict anymore but the PR title does not match the
> > > code: "CSVFormat.duplicateHeaderMode requires default DISALLOW"
> > >
> > > I am ok with the changes as they stand now but we have to resolve if the
> > > default should be changed.
> > >
> > > Gary
> > >
> > > On Fri, Oct 21, 2022, 09:44 <sma...@outlook.de> wrote:
> > >
> > > > Hi Gary,
> > > >
> > > > the PR did not have conflicts 30 minutes ago. So we must have 
> > > > encountered
> > > > a race condition between you updating master and me rebasing and
> > > > force-pushing my pr.
> > > > I've done that again just now. There are no conflicts.
> > > > At the same time I see parts of my PR have dribbled into master already,
> > > > so if you rather implement the changes yourself, please go ahead. I am 
> > > > fine
> > > > with closing my pr.
> > > >
> > > > regards,
> > > > Markus
> > > >
> > > >
> > > > From: Gary Gregory <garydgreg...@gmail.com>
> > > > Sent: Friday, October 21, 2022 15:17
> > > > To: Commons Developers List <dev@commons.apache.org>
> > > > Subject: Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1
> > > >
> > > > Hi Markus,
> > > >
> > > > The PR has conflicts and does not test its changes.
> > > >
> > > > I'd like feedback from the community on whether or not git master as it 
> > > > is
> > > > now is OK for the duplicate header behavior or if changing the default 
> > > > is
> > > > needed and if doing so is just ping-pongonging from one incompatibility 
> > > > to
> > > > another, based on previous versions.
> > > >
> > > > TY!
> > > >
> > > > Gary
> > > >
> > > > On Fri, Oct 21, 2022, 08:52 <sma...@outlook.de> wrote:
> > > >
> > > > > I've removed serialization stuff from the PR and rebased it to master
> > > > > https://github.com/apache/commons-csv/pull/276
> > > > >
> > > > > kind regards,
> > > > > Markus
> > > > >
> > > > > From: Gary D. Gregory <ggreg...@apache.org>
> > > > > Sent: Friday, October 21, 2022 14:18
> > > > > To: dev@commons.apache.org <dev@commons.apache.org>
> > > > > Subject: Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1
> > > > >
> > > > > On 2022/10/20 22:56:05 Alex Herbert wrote:
> > > > > > On Thu, 20 Oct 2022 at 23:43, Alex Herbert 
> > > > > > <alex.d.herb...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > I did not have time to track through whether this behaviour 
> > > > > > > changed
> > > > > > > after the initial implementation of the flag. I would think not as
> > > > the
> > > > > > > original behaviour is from 1.0. This would map to:
> > > > > > >
> > > > > > > true -> ALLOW_ALL
> > > > > > > false -> ALLOW_EMPTY
> > > > > > > new -> DISALLOW
> > > > > > >
> > > > > > > Which is what we currently have in 1.10.0 RC1. Thus the PR #276 
> > > > > > > [7]
> > > > to
> > > > > > > change the use of the flag to 'false -> DISALLOW' is not 
> > > > > > > maintaining
> > > > > > > behavioural compatibility (to 1.7, or back to 1.0).
> > > > > >
> > > > > > PS. I just verified that PR 276 changes the DuplicateHeaderMode 
> > > > > > value
> > > > > > for allowDuplicates=false and does not change any tests.
> > > > > >
> > > > > > So the test suite is currently not enforcing behavioural
> > > > > > compatibility. This seems like a glaring hole in the tests and 
> > > > > > should
> > > > > > be addressed to prevent regressions.
> > > > >
> > > > > In git master, there are many tests for senders of
> > > > > withAllowDuplicateHeaderNames(boolean) and
> > > > > setAllowDuplicateHeaderNames(boolean).
> > > > >
> > > > > I filled in the coverage in testGetAllowDuplicateHeaderNames() for
> > > > > getAllowDuplicateHeaderNames() to reflect the _current_ behavior, 
> > > > > right
> > > > or
> > > > > wrong.
> > > > >
> > > > > Let's reassess now git master now please. If the PR matters still, it
> > > > > should be rebased on git master.
> > > > >
> > > > > Gary
> > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > > > For additional commands, e-mail: dev-h...@commons.apache.org
> > > > > >
> > > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > > For additional commands, e-mail: dev-h...@commons.apache.org
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > > For additional commands, e-mail: dev-h...@commons.apache.org
> > > > >
> > > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > > For additional commands, e-mail: dev-h...@commons.apache.org
> > > >
> > > >
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> > 
> > 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to