[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283883#comment-14283883 ]
Tim Allison commented on TIKA-1511:
-----------------------------------

Hi [~lfcnassif],

Based on your point about the tika-app's -z option and its FileEmbeddedDocumentExtractor, which just copies bytes from the InputStream to a file, I propose the following. I have a strong preference to treat each table as an embedded file, but if it isn't possible, it isn't possible.

So, the proposal for making use of classes that implement EmbeddedDocumentExtractor for each table:

A) If the EmbeddedDocumentExtractor is a parsing EmbeddedDocumentExtractor, the correct parser will be called, and it will grab a JDBC object from a wrapper/modification of TikaInputStream...it will not actually read the InputStream at all. The output will go into whatever handler is passed in.

B) If a client reads the bytes from the input stream, they'll get a UTF-8 encoded CSV InputStream, without BLOBs and CLOBs...the EmbeddedDocumentExtractor will be called for each individual BLOB and CLOB.

C) If a client uses the basic pattern of adding a Parser to the ParseContext, they'll get one big file with markup for the different <div> elements.

D) If a client uses the RecursiveParserWrapper (not recommended for large dbs!), there will be one metadata object for each table, and one metadata object for each BLOB and CLOB...in short, potentially a large number of embedded documents.

I'll mock up this plan and attach a patch if this sounds reasonable.

If this does work out, we might consider refactoring the PSTParser to treat individual emails in a similar way.

> Create a parser for SQLite3
> ---------------------------
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.6
> Reporter: Luis Filipe Nassif
> Fix For: 1.8
>
> Attachments: TIKA-1511v1.patch, TIKA-1511v2.patch, testSQLLite3b.db
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide
> range of applications.
> Opening the ticket to track it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)