[csv] validation of duplicate headers (was [VOTE] Release Apache Commons CSV 1.10.0 based on RC1)

Alex Herbert Sun, 23 Oct 2022 11:45:14 -0700

On Sun, 23 Oct 2022 at 14:09, Gary D. Gregory <ggreg...@apache.org> wrote:
>
> Hi All, Alex, more below:
>
> On 2022/10/22 21:23:13 Alex Herbert wrote:
> > On Sat, 22 Oct 2022 at 20:05, Gary D. Gregory <ggreg...@apache.org> wrote:
> > >
> > > Thank you for the new tests Alex!
> > >
> > > Here is one area that is easy to overlook: As Commons CSV has evolved, 
> > > _not all settings in CSVFormat_ apply to both writing and parsing. To 
> > > wit, the existing Javadoc for CSVFormat.getAllowMissingColumnNames(): 
> > > "Gets whether missing column names are allowed when parsing the header 
> > > line."
> > >
> > > I've updated the Javadoc for 
> > > CSVFormat.Builder.setAllowMissingColumnNames(boolean) to state this is a 
> > > parser setting.
> > >
> > > I've also updated the test since the commented-out test data is valid for 
> > > parsing.
> > >
> > > In git master now.
> >
> > OK, thanks for the explanation. If the CSVFormat ignores the
> > allowEmptyColumnsNames in the validate() method it should act as if
> > the setting is true, i.e. only the DuplicateHeaderMode should be used
> > to validate the header.
>
> That's how it is now, allowEmptyColumnsNames cannot be considered in the 
> validate method because a CSVFormat does not know if it is going to be used 
> for parsing or writing.
>
> So the test can be updated to stream the full
> > set of cases (reinstating those you commented out) and filter to those
> > where the flag is set to true, e.g. using:
>
> I already had added a new test data provider method to do that so, all I had 
> to do now is make sure that when I deleted the commented out test elements, 
> these were covered in the 2nd data provider called 
> duplicateHeaderParseOnlyData().
>
> >
> > ---
> >
> > static Stream<Arguments> duplicateHeaderAllowsMissingColumnsNamesData() {
> >     return duplicateHeaderData().filter(arg ->
> > Boolean.TRUE.equals(arg.get()[1]));
> > }
>
> I did not do this because it seemed (to me) clearer (less magical) to create 
> a duplicateHeaderParseOnlyData() and list explicitly test cases there. WDYT?


I prefer to keep all the parser test data in one method. For the
CSVFormat data, it can be reused and filtered based on the
allowMissingColumnNames flag. The CSVFormat 'should' behave like the
CSVParser when allowMissingColumnNames is true. It makes sense for the
CSVFormat test cases to be a subset of the CSVParser. Note this may
not always be the case. However with the current code this is mostly
true (see below) and it simplifies adding all the test cases.

I added some more test cases and have found a bug and some more issues.

Bugs

This errors:

CSVFormat:
Arguments.of(DuplicateHeaderMode.DISALLOW,    true,  new String[]
{"A", ""}, true),

There is no duplicate, so nothing should be triggered. But it triggers
on the first empty header. I have fixed this in git master to sanitise
any blank as the empty string "" before checking for duplicates. Thus
only a second duplicate empty name will throw. However the names are
only sanitised for validate(). The original names are stored as is,
thus the CSVFormat.getHeader() will return the original blank names,
for example using two different blanks as {" ", "     "}.

Issues:

In CSVParser using the isBlank method labels an empty header. But
headers are added to the map raw (without trim()). Thus this errors:

CSVParser:
Arguments.of(DuplicateHeaderMode.DISALLOW,    true,  new String[] {"
", "   "}, false),

When the blank strings are different lengths they are both identified
as empty. However they are checked in the duplicates map using
different strings, thus no duplicate is identified, and no exception
raised for a duplicate blank header. Thus at present the CSVFormat and
CSVParser have a different interpretation of what is a duplicate empty
header.

I think that CSVParser should be updated to sanitise all non-null
blank strings to the empty string with regard to duplicates. But this
has to be handled differently to the CSVFormat which keeps a throw
away Set<String> of headers. The CSVParser builds a map using the
current headers. To avoid changing the blank headers to "" would
require keeping a flag to mark an observation of a blank header. The
existing header map functionality can then be left unchanged. This
allows the headers from getHeaders() to contain different blank
strings after parsing.

At present the CSVFormat treats null names and blank names as empty.
So you cannot have an input of {"", null} with DISALLOW. But this is
allowed for CSVParser. That currently ignores null names from
duplicate checking. This is different behaviour from CSVFormat.

I have not made any more changes as I wanted to have a second opinion.
Presently all the test cases in the CSVDuplicateHeaderTest pass for
CSVFormat when allowEmptyColumnNames=true. The commented out cases
have a different result in CSVParser as it ignores duplicates of null
and blank.

I believe the CSVParser handling of null headers is simply because it
uses a Map<String, Integer> to map column names to a column index.
This will not work for null keys when the implementation is a TreeMap
(one choice for the map). So null keys are ignored from the duplicate
check. This leads to one further complication: the CSVParser respects
the CSVFormat.getIgnoreHeaderCase() and uses a TreeMap with an
ignoreCase comparator, otherwise a LinkedHashMap, to store the header
names. So {"A", "a"} can be duplicates.

I have not yet added the ignoreHeaderCase support to
CSVDuplicateHeaderTest. This would fail for CSVFormat. It can be
easily fixed by using a TreeSet with the ignoreCase comparator or else
a HashSet. But I do not support doing this as the ignoreHeaderCase
setting is only for parsing.

Summary:

1. Should CSVParser treat null and blank headers as the same when
checking for duplicates, i.e. all are considered an 'empty' name? This
is current CSVFormat behaviour.
2. Should CSVFormat respect ignoreHeaderCase when checking for
duplicates? This is current CSVParser behaviour.
3. Should blank column names be sanitised to the empty string ""? This
is not current behaviour but is the logical behaviour for checking
duplicates in CSVFormat.

IMO I think we can fix 1. Note that null, "" and "  " throws if
allowEmptyColumnNames is false in the parser. Thus the parser
currently considers all these as the same (empty) for that setting.
The implementation of duplicate checking is just not matching this,
and I believe this is due to using a TreeMap under some conditions
which cannot handle null.

For 2 it seems that ignoreHeaderCase is a setting for parsing. It
affects use of the header names, specifically the header map from the
parser. It does not apply to writing. So this may just require some
extra documentation in the CSVFormat properties and updates to
CSVDuplicateHeaderTest to appropriately test it. It would mean that
checking for duplicate header names can be stricter in the parser than
CSVFormat.

We leave 3 which allows the headers to pass through unchanged and they
can be retrieved using CSVFormat.getHeader() or
CSVParser.getHeaderNames(). The maintains backwards behavioural
compatibility. Note that this allows the parser to be created to allow
duplicate empty names and the resulting map has entries for different
empty name keys such as "" and " ". I would consider this an artifact
where these empty names should not be used by the caller who is
interested in the non-empty column names.

Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

[csv] validation of duplicate headers (was [VOTE] Release Apache Commons CSV 1.10.0 based on RC1)

Reply via email to