Re: [CSV] Headers and the first record

Gary Gregory Wed, 31 Jul 2013 08:36:40 -0700

On Wed, Jul 31, 2013 at 11:14 AM, Mark Fortner <phidia...@gmail.com> wrote:


> I took a brief look at the API for CSV, and thought I would share a typical
> use case from the biotech industry.  We deal with a lot of instruments that
> produce a multiline header.  The header usually contains "experiment
> conditions".  You can think of this as metadata for the columnar data.  The
> experiment conditions usually contain things like the name of the scientist
> using the instrument, the time of day the experiment was run, and some
> instrument configuration settings.  Usually when we parse CSV files, we
> have to parse the header first, extract all relevant data, and then parse
> the rows of data.
>
> In addition to the experiment conditions header, there are also column
> headers.  The column headers can be multi-lined as well.  For example, you
> might have a column header whose first line contains chemical compound IDs
> or names, and the second line of the column header contains the
> concentrations for those compounds. The data values represent the percent
> inhibition at those concentrations. Like this:
>
> Erlotinib
> 1uM 10 uM 100 uM 1nM
> 0.01  0.001  0.0001 0.00001
> ...
>
> Since the position and types of header and body data vary, we typically use
>  parse configuration files that describe "what data can be found where".
>  The parse configuration varies not only per instrument but also per
> experimental protocol. So there are usually numerous configuration files in
> your typical lab.  The configuration files can also be stored in a
> database.  This is usually part of a file-watching web app.  It allows
> scientists to add support for new experiments or instruments without having
> to get a developer to write more code.
>
> In the API I saw support for hard-coded configurations via the CSVFormat
> object, but I didn't see any support for creating and using persistable
> configurations.  You may want to consider that as you move forward.
>

Thank you for taking the time to offer your point of view here.

CSVFormat implements Serializable, so you can use plain old Java
serialization. It's not human-readable, but it's something.
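
As a rough illustration of the plain-serialization route, the sketch below
round-trips a small Serializable object through a byte array. DemoFormat and
FormatStore are hypothetical stand-ins made up for this example, not Commons
CSV classes; with the real library the same save/load calls would be applied
to a CSVFormat instance, assuming it stays Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a Serializable format object; NOT the real CSVFormat.
final class DemoFormat implements Serializable {
    private static final long serialVersionUID = 1L;
    final char delimiter;
    final String[] header;

    DemoFormat(char delimiter, String... header) {
        this.delimiter = delimiter;
        this.header = header.clone();
    }
}

// Hypothetical helper: stores and restores any Serializable object.
final class FormatStore {
    // Serialize to a byte array; the bytes could equally go to a file or a database BLOB.
    static byte[] save(Serializable format) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(format);
        }
        return bytes.toByteArray();
    }

    // Deserialize the object previously written by save().
    static Object load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        DemoFormat original = new DemoFormat('\t', "Compound", "Concentration");
        DemoFormat restored = (DemoFormat) load(save(original));
        System.out.println(restored.delimiter == original.delimiter);
        System.out.println(String.join(",", restored.header));
    }
}
```

The stored bytes are opaque and tied to the class version, which is the main
drawback compared with a text format.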

If we moved to Java 6, we could annotate CSVFormat with JAXB so you can
have XML IO. Personally, I do not think we should do our own XML IO, so
JAXB is the best path IMO since it is built into Java 6.

What do you currently use to parse your CSV files?

Would Commons-CSV work for you as well? If not, why not?

Would you be willing to experiment with the current code?

Thank you,
Gary


> Hope this helps,
>
> Mark
>
>
>
> On Wed, Jul 31, 2013 at 6:36 AM, Gary Gregory <garydgreg...@gmail.com> wrote:
>
> > On Wed, Jul 31, 2013 at 8:58 AM, Gary Gregory <garydgreg...@gmail.com> wrote:
> >
> > > On Jul 31, 2013, at 3:38, Benedikt Ritter <brit...@apache.org> wrote:
> > >
> > > > 2013/7/31 Gary Gregory <garydgreg...@gmail.com>
> > > >
> > > >> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <ebo...@apache.org> wrote:
> > > >>
> > > >>> Le 30/07/2013 23:26, Gary Gregory a écrit :
> > > >>>> And another thing: internally, the header should be a Set<String>,
> > > >>>> not a String[]. I plan on fixing that later too.
> > > >>>
> > > >>> Why should it be a set? Is there an impact on the performance?
> > > >>
> > > >> Well, I did not finish my thought on that one, sorry about that. Please
> > > >> allow me to walk through my use cases. The issue is about the feature,
> > > >> not performance.
> > > >>
> > > >> At first glance, using a set avoids an inherent problem with any
> > > >> non-set data structure: defining duplicates. What does the following
> > > >> mean?
> > > >>
> > > >> withHeader("A", "B", "C", "A");
> > > >>
> > > >> It is a recipe for garbage results: record.get("A") returns what?
> > > >>
> > > >> Today, I added some CSVFormat validation code that checks for duplicate
> > > >> column names. If you build a format with withHeader("A", "B", "C", "A");
> > > >> you will get an IllegalStateException (ISE) when validate() is called.
> > > >>
> > > >> If we had withHeader(Set) and documented it as the 'main' way to specify
> > > >> column names, then we could say that withHeader(String...) is just a
> > > >> syntactical convenience and turn the String[] into a Set. But that will
> > > >> not work.
> > > >>
> > > >> The problem with a Java Set is that it is not ordered and the current
> > > >> implementation relies on the order of the String[]. But why? What the
> > > >> current implementation says is: ignore what the header line of the file
> > > >> is and use the given column names at the given positions. A perfectly
> > > >> good user story. So for withHeader("A", "B", "C"), "A" is column 0, "B"
> > > >> is column 1, and so on. Ok, that's one usage.
> > > >>
> > > >> Taking a step back, I want to talk about why the column name order
> > > >> should matter when you are calling withHeader(). I would like to be able
> > > >> to tell the parser that I want to use a Set of column names and have it
> > > >> figure out, based on the header line, the column indices. This is quite
> > > >> different from what we have now.
> > > >>
> > > >> A use case I have now is a CSV file with a lot of columns (~90) but I
> > > >> only care about a small subset of the columns (~10). I'd like to be able
> > > >> to say withHeader(Set) where the Set may be a subset of the actual
> > > >> column names in the header line. This is different from
> > > >> withHeader(String[]) because the names in the Set must match the names
> > > >> in the header record.
> > > >
> > > > I'm not sure if we should try to build all these different cases
> > > > (guessing headers, using the first record as headers, only using a
> > > > subset of the available headers) into one implementation.
> > > >
> > > > What you are talking about sounds more like a view or a projection of
> > > > the actual content being parsed.
> > > > Do we really need this for 1.0 or can it be postponed?
> > >
> > > This is a real scenario and a real need, not some imaginary
> > > complication ;)
> > >
> >
> > But I could work with the current framework and use withHeader(new
> > String[]{}) and let the parser find the headers. Then I can just do
> > record.get("A") with the columns I care about. It just feels a little more
> > mysterious.
> >
> > I think the only wrinkle left for me is that I want validation that the
> > columns I care about are there. Right now get(String) throws
> > IllegalArgumentException if you give it an unknown column, which will fail
> > fast enough on the first record.
> >
> > So I'll go down that road until the next speed bump...
> >
> > Gary
> >
> >
> > >
> > > Even if it is not implemented for 1.0, we should talk about how it
> > > should be done such that it fits in and does not cause API problems
> > > later. And if I can get it done by then, so much the better.
> > >
> > > Gary
> > >
> > > >
> > > >
> > > >>
> > > >> So I think it boils down to ignoring my comment about using a Set
> > > >> internally and adding a feature where I can tell the parser that I want
> > > >> to use a set of column names and not worry about the order, because the
> > > >> parser will match up the column names when it reads the header line.
> > > >>
> > > >> Gary
> > > >>
> > > >>
> > > >>>
> > > >>>
> > > >>> Emmanuel Bourg
> > > >>>
> > > >>>
> > > >>>
> > > >>> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > >>> For additional commands, e-mail: dev-h...@commons.apache.org
> > > >>
> > > >>
> > > >> --
> > > >> E-Mail: garydgreg...@gmail.com | ggreg...@apache.org
> > > >> Java Persistence with Hibernate, Second Edition<
> > > >> http://www.manning.com/bauer3/>
> > > >> JUnit in Action, Second Edition <http://www.manning.com/tahchiev/>
> > > >> Spring Batch in Action <http://www.manning.com/templier/>
> > > >> Blog: http://garygregory.wordpress.com
> > > >> Home: http://garygregory.com/
> > > >> Tweet! http://twitter.com/GaryGregory
> > > >
> > > >
> > > >
> > > > --
> > > > http://people.apache.org/~britter/
> > > > http://www.systemoutprintln.de/
> > > > http://twitter.com/BenediktRitter
> > > > http://github.com/britter
> > >
> >
> >
> >
> >
>
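
For reference, the header-matching idea discussed in the quoted thread (give
the parser a set of column names and let it work out the indices from the
header line) could be sketched roughly as follows. HeaderResolver and
resolve() are hypothetical names used only for illustration and are not part
of Commons CSV; the exception choices simply mirror the ones mentioned in the
thread (an ISE for duplicate names, IllegalArgumentException for an unknown
column).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Commons CSV class.
final class HeaderResolver {
    // Maps each wanted column name to its index in the header record.
    static Map<String, Integer> resolve(List<String> headerRecord, Set<String> wanted) {
        // Index every column in the header line, rejecting duplicate names.
        Map<String, Integer> allColumns = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.size(); i++) {
            if (allColumns.put(headerRecord.get(i), i) != null) {
                throw new IllegalStateException("Duplicate column name: " + headerRecord.get(i));
            }
        }
        // Keep only the wanted columns, failing fast if one is missing.
        Map<String, Integer> selected = new LinkedHashMap<>();
        for (String name : wanted) {
            Integer index = allColumns.get(name);
            if (index == null) {
                throw new IllegalArgumentException("Column not in header: " + name);
            }
            selected.put(name, index);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("A", "B", "C", "D");
        Set<String> wanted = new LinkedHashSet<>(Arrays.asList("D", "B"));
        System.out.println(resolve(header, wanted)); // prints {D=3, B=1}
    }
}
```

The order in which the wanted names are given does not matter here; only the
header line decides the indices, and a missing column is reported before any
data record is read.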



