On Sun, 23 Oct 2022 at 14:09, Gary D. Gregory <ggreg...@apache.org> wrote: > > Hi All, Alex, more below: > > On 2022/10/22 21:23:13 Alex Herbert wrote: > > On Sat, 22 Oct 2022 at 20:05, Gary D. Gregory <ggreg...@apache.org> wrote: > > > > > > Thank you for the new tests Alex! > > > > > > Here is one area that is easy to overlook: As Commons CSV has evolved, > > > _not all settings in CSVFormat_ apply to both writing and parsing. To > > > wit, the existing Javadoc for CSVFormat.getAllowMissingColumnNames(): > > > "Gets whether missing column names are allowed when parsing the header > > > line." > > > > > > I've updated the Javadoc for > > > CSVFormat.Builder.setAllowMissingColumnNames(boolean) to state this is a > > > parser setting. > > > > > > I've also updated the test since the commented-out test data is valid for > > > parsing. > > > > > > In git master now. > > > > OK, thanks for the explanation. If the CSVFormat ignores the > > allowEmptyColumnsNames in the validate() method it should act as if > > the setting is true, i.e. only the DuplicateHeaderMode should be used > > to validate the header. > > That's how it is now, allowEmptyColumnsNames cannot be considered in the > validate method because a CSVFormat does not know if it is going to be used > for parsing or writing. > > So the test can be updated to stream the full > > set of cases (reinstating those you commented out) and filter to those > > where the flag is set to true, e.g. using: > > I already had added a new test data provider method to do that so, all I had > to do now is make sure that when I deleted the commented out test elements, > these were covered in the 2nd data provider called > duplicateHeaderParseOnlyData(). > > > > > --- > > > > static Stream<Arguments> duplicateHeaderAllowsMissingColumnsNamesData() { > > return duplicateHeaderData().filter(arg -> > > Boolean.TRUE.equals(arg.get()[1])); > > } > > I did not do this because it seemed (to me) clearer (less magical) to create > a duplicateHeaderParseOnlyData() and list explicitly test cases there. WDYT?
I prefer to keep all the parser test data in one method. For the CSVFormat data, it can be reused and filtered based on the allowMissingColumnNames flag. The CSVFormat 'should' behave like the CSVParser when allowMissingColumnNames is true. It makes sense for the CSVFormat test cases to be a subset of the CSVParser. Note this may not always be the case. However with the current code this is mostly true (see below) and it simplifies adding all the test cases. I added some more test cases and have found a bug and some more issues. Bugs This errors: CSVFormat: Arguments.of(DuplicateHeaderMode.DISALLOW, true, new String[] {"A", ""}, true), There is no duplicate, so nothing should be triggered. But it triggers on the first empty header. I have fixed this in git master to sanitise any blank as the empty string "" before checking for duplicates. Thus only a second duplicate empty name will throw. However the names are only sanitised for validate(). The original names are stored as is, thus the CSVFormat.getHeader() will return the original blank names, for example using two different blanks as {" ", " "}. Issues: In CSVParser using the isBlank method labels an empty header. But headers are added to the map raw (without trim()). Thus this errors: CSVParser: Arguments.of(DuplicateHeaderMode.DISALLOW, true, new String[] {" ", " "}, false), When the blank strings are different lengths they are both identified as empty. However they are checked in the duplicates map using different strings, thus no duplicate is identified, and no exception raised for a duplicate blank header. Thus at present the CSVFormat and CSVParser have a different interpretation of what is a duplicate empty header. I think that CSVParser should be updated to sanitise all non-null blank strings to the empty string with regard to duplicates. But this has to be handled differently to the CSVFormat which keeps a throw away Set<String> of headers. The CSVParser builds a map using the current headers. To avoid changing the blank headers to "" would require keeping a flag to mark an observation of a blank header. The existing header map functionality can then be left unchanged. This allows the headers from getHeaders() to contain different blank strings after parsing. At present the CSVFormat treats null names and blank names as empty. So you cannot have an input of {"", null} with DISALLOW. But this is allowed for CSVParser. That currently ignores null names from duplicate checking. This is different behaviour from CSVFormat. I have not made any more changes as I wanted to have a second opinion. Presently all the test cases in the CSVDuplicateHeaderTest pass for CSVFormat when allowEmptyColumnNames=true. The commented out cases have a different result in CSVParser as it ignores duplicates of null and blank. I believe the CSVParser handling of null headers is simply because it uses a Map<String, Integer> to map column names to a column index. This will not work for null keys when the implementation is a TreeMap (one choice for the map). So null keys are ignored from the duplicate check. This leads to one further complication: the CSVParser respects the CSVFormat.getIgnoreHeaderCase() and uses a TreeMap with an ignoreCase comparator, otherwise a LinkedHashMap, to store the header names. So {"A", "a"} can be duplicates. I have not yet added the ignoreHeaderCase support to CSVDuplicateHeaderTest. This would fail for CSVFormat. It can be easily fixed by using a TreeSet with the ignoreCase comparator or else a HashSet. But I do not support doing this as the ignoreHeaderCase setting is only for parsing. Summary: 1. Should CSVParser treat null and blank headers as the same when checking for duplicates, i.e. all are considered an 'empty' name? This is current CSVFormat behaviour. 2. Should CSVFormat respect ignoreHeaderCase when checking for duplicates? This is current CSVParser behaviour. 3. Should blank column names be sanitised to the empty string ""? This is not current behaviour but is the logical behaviour for checking duplicates in CSVFormat. IMO I think we can fix 1. Note that null, "" and " " throws if allowEmptyColumnNames is false in the parser. Thus the parser currently considers all these as the same (empty) for that setting. The implementation of duplicate checking is just not matching this, and I believe this is due to using a TreeMap under some conditions which cannot handle null. For 2 it seems that ignoreHeaderCase is a setting for parsing. It affects use of the header names, specifically the header map from the parser. It does not apply to writing. So this may just require some extra documentation in the CSVFormat properties and updates to CSVDuplicateHeaderTest to appropriately test it. It would mean that checking for duplicate header names can be stricter in the parser than CSVFormat. We leave 3 which allows the headers to pass through unchanged and they can be retrieved using CSVFormat.getHeader() or CSVParser.getHeaderNames(). The maintains backwards behavioural compatibility. Note that this allows the parser to be created to allow duplicate empty names and the resulting map has entries for different empty name keys such as "" and " ". I would consider this an artifact where these empty names should not be used by the caller who is interested in the non-empty column names. Alex --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org