[CSV] Strategies to handle duplicate headers

Gary Gregory Tue, 20 Jun 2023 04:39:07 -0700

Hi All,

This thread is a follow-up to
https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258


Bruno says:
"With Pandas it automatically deduplicates the column names. Maybe
that's a feature that we could have in Commons CSV too?"

What does that mean, and what does it actually do? Say I have a column A
with a row 1 value of "X" and a second column A with a row 1 value of 2.
What do I get when I ask for column A, row 1?
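
To make the question concrete, here is a minimal sketch of one possible
reading of "deduplicates": repeated names get a numeric suffix, which is,
as far as I know, roughly what Pandas does when reading a CSV file
("A", "A.1", "A.2", ...). The class DedupHeaders and the method dedupe are
hypothetical names for illustration only; nothing like this exists in
Commons CSV today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DedupHeaders {

    // Hypothetical helper, not part of Commons CSV.
    // Renames repeated headers by appending ".1", ".2", ... so that every
    // name in the returned list is unique.
    static List<String> dedupe(List<String> headers) {
        Set<String> used = new HashSet<>();            // names already emitted
        Map<String, Integer> counts = new HashMap<>(); // last suffix tried per original name
        List<String> out = new ArrayList<>();
        for (String header : headers) {
            String candidate = header;
            int n = counts.getOrDefault(header, 0);
            // Keep bumping the suffix until the candidate name is unused.
            while (!used.add(candidate)) {
                n++;
                candidate = header + "." + n;
            }
            counts.put(header, n);
            out.add(candidate);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedupe(Arrays.asList("A", "B", "A", "A")));
        // prints [A, B, A.1, A.2]
    }
}
```

Under that reading, asking for column "A" in row 1 would give "X", and the
value 2 would only be reachable as "A.1". Whether that is the behavior we
want is exactly the open question.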

Seth says:
"HeaderStrategy Interface
Contains two functions:

#normalizeHeaders(headings) - With given heading, output a list that
fits with whatever the strategy is going for.
#get(record, header) - Fetch value(s) based on given column name."

I would perhaps see two interfaces, so that lambdas could be used more
simply. Maybe; this needs an example.
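
A rough sketch of what splitting the idea into two functional interfaces
could look like, so that each half can be supplied as a lambda. All names
here (HeaderNormalizer, HeaderValueGetter, Row, KEEP, ALL_MATCHES) are
hypothetical, and Row is only a stand-in for a parsed record, not
CSVRecord.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderStrategySketch {

    /** Hypothetical stand-in for a parsed record; not the Commons CSV CSVRecord. */
    static final class Row {
        final List<String> headers;
        final List<String> values;

        Row(List<String> headers, List<String> values) {
            this.headers = headers;
            this.values = values;
        }
    }

    /** Hypothetical: turns the raw header names into whatever the strategy wants. */
    @FunctionalInterface
    interface HeaderNormalizer {
        List<String> normalizeHeaders(List<String> headers);
    }

    /** Hypothetical: fetches the value(s) of a record for a given column name. */
    @FunctionalInterface
    interface HeaderValueGetter {
        List<String> get(Row row, String header);
    }

    // Leaves the headers untouched, duplicates included.
    static final HeaderNormalizer KEEP = headers -> headers;

    // Returns every value whose column name equals the requested header.
    static final HeaderValueGetter ALL_MATCHES = (row, header) -> {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < row.headers.size() && i < row.values.size(); i++) {
            if (row.headers.get(i).equals(header)) {
                found.add(row.values.get(i));
            }
        }
        return found;
    };

    public static void main(String[] args) {
        List<String> headers = KEEP.normalizeHeaders(Arrays.asList("A", "B", "A"));
        Row row = new Row(headers, Arrays.asList("X", "y", "2"));
        System.out.println(ALL_MATCHES.get(row, "A")); // prints [X, 2]
    }
}
```

With two single-method interfaces, a caller could mix and match, for
example keeping the header names as they are while choosing how lookups by
name resolve duplicates.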

"I'm also wary that this may screw up existing projects that depend on
allowing/disallowing duplicates. i.e. want to allow duplicates and
handle things through indexes / iteration, so this didn't cause a
problem for them and want to preserve header names, and so don't need
the headers deduplicated."

As a first cut, whatever we do should maintain the existing behavior.
We can change the default later by popular demand.

Gary

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
