[
https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Giuseppe Totaro updated TIKA-1580:
----------------------------------
Attachment: TIKA-1580.v03.Mattmann.Totaro.03262015.patch
Hi all, I uploaded a new patch
({{TIKA-1580.v03.Mattmann.Totaro.03262015.patch}}) that includes a parser for
ISA-Tab archives.
In particular, this patch adds a new {{ISArchiveParser}} Java class that
relies on the static methods of {{ISATabUtils}}, a utility class
that provides methods for parsing investigation, study, and assay files.
{{ISArchiveParser}} runs over study files. It starts from the given study file
and looks for the related investigation and assay files in the same directory.
Mimetype detection is also provided for investigation and assay files.
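To illustrate the lookup described above, here is a minimal sketch of how the sibling files could be found starting from a study file. It is not the code in the patch: the {{IsaTabSiblings}} class and its {{findByPrefix}} method are hypothetical names, and the sketch assumes the ISA-Tab naming convention in which investigation, study, and assay files start with {{i_}}, {{s_}}, and {{a_}} respectively and end in {{.txt}}.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper (not part of the TIKA-1580 patch): finds ISA-Tab files
 * that live in the same directory as a given study file.
 */
final class IsaTabSiblings {

    private IsaTabSiblings() {
        // static utility, no instances
    }

    /**
     * Returns the regular files next to studyFile whose names match
     * prefix + "*.txt", sorted by path. Assumed prefixes: "i_" for
     * investigation files and "a_" for assay files.
     */
    static List<Path> findByPrefix(Path studyFile, String prefix) throws IOException {
        // Resolve the directory that contains the study file.
        Path dir = studyFile.toAbsolutePath().getParent();
        List<Path> matches = new ArrayList<>();
        // The glob is matched against each entry's file name only.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, prefix + "*.txt")) {
            for (Path candidate : stream) {
                if (Files.isRegularFile(candidate)) {
                    matches.add(candidate);
                }
            }
        }
        // Directory iteration order is unspecified, so sort for stable results.
        Collections.sort(matches);
        return matches;
    }
}
```

Under these assumptions, {{IsaTabSiblings.findByPrefix(study, "i_")}} would return the investigation file and {{IsaTabSiblings.findByPrefix(study, "a_")}} the assay files located next to {{study}}.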
Thanks [~chrismattmann] for helping me with this.
> ISA-Tab parsers
> ---------------
>
> Key: TIKA-1580
> URL: https://issues.apache.org/jira/browse/TIKA-1580
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Labels: new-parser
> Fix For: 1.8
>
> Attachments: TIKA-1580.Mattmann.Totaro.032515.patch.txt,
> TIKA-1580.patch, TIKA-1580.v02.patch,
> TIKA-1580.v03.Mattmann.Totaro.03262015.patch
>
>
> We are going to add parsers for ISA-Tab data formats.
> ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help
> to manage an increasingly diverse set of life science, environmental, and
> biomedical experiments that employ one or a combination of technologies.
> The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular
> format. Therefore, the ISA-Tab data format includes three types of file:
> Investigation file ({{i_xxxx.txt}}), Study file ({{s_xxxx.txt}}), Assay file
> ({{a_xxxx.txt}}). These files are organized as a [top-down
> hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation
> file includes one or more Study files, and each Study file includes one or more
> Assay files.
> Essentially, the Investigation file contains high-level information about
> the related studies, so it provides only metadata about ISA-Tab files.
> More details on file format specification are [available
> online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf].
> The attached patch provides a preliminary version of the ISA-Tab parsers
> (there are three parsers, one for each ISA-Tab file type):
> * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts
> only metadata.
> * {{ISATabStudyParser.java}}: parses Study files.
> * {{ISATabAssayParser.java}}: parses Assay files.
> The most important planned improvements are:
> * Combine these three parsers in order to parse an ISArchive
> * Provide a better mapping of both study and assay data to XHTML. Currently,
> {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping
> function relying on [Apache Commons
> CSV|https://commons.apache.org/proper/commons-csv/].
> Thanks for supporting me on this work [~chrismattmann].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)