[jira] [Commented] (TIKA-93) OCR support

2014-03-27 Thread Anurag Indu (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950218#comment-13950218 ] Anurag Indu commented on TIKA-93: Hello All, I tried to use Tesseract to extract all the ima

[jira] [Assigned] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-1010: Assignee: Tim Allison > Embedded documents in RTF are not extracted

Re: PDF parser (two more questions)

2014-03-27 Thread Jukka Zitting
Hi, On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari wrote: > 1. is the use of PDF2XHTML necessary? why is the pdf turned into an XHTML? > for the purpose of indexing, wouldn't just the text be enough? The XHTML output allows us to annotate the extracted text with structural information (like "t

Re: Parser.parse with file instead of stream

2014-03-27 Thread Stefano Fornari
That worked! Thanks. Ste On Thu, Mar 27, 2014 at 11:24 PM, Jukka Zitting wrote: > Hi, > > On Thu, Mar 27, 2014 at 6:07 PM, Stefano Fornari wrote: > > I am not sure tstream.hasFile() can ever be true, from my understanding of the code it can be only false. > > It's true if you call the

Re: Parser.parse with file instead of stream

2014-03-27 Thread Jukka Zitting
Hi, On Thu, Mar 27, 2014 at 6:07 PM, Stefano Fornari wrote: > I am not sure tstream.hasFile() can ever be true, from my understanding of > the code it can be only false. It's true if you call the parser like this: InputStream stream = TikaInputStream.get(file); try { parser.pars

PDF parser (two more questions)

2014-03-27 Thread Stefano Fornari
Hi, I have two more questions on PDFParser: 1. Is the use of PDF2XHTML necessary? Why is the PDF turned into XHTML? For the purpose of indexing, wouldn't just the text be enough? 2. I need to limit indexing to files whose size is below a certain threshold; I was wondering if

Parser.parse with file instead of stream

2014-03-27 Thread Stefano Fornari
Hi All, I am using Lucene in an embedded environment and I need to keep memory use under control. While investigating a problem with big PDF files (a few MB), I noticed that Parser.parse takes an InputStream as a parameter, but then PDFParser has the following code: TikaInputStream tstream = TikaInput

Re: How should video files with audio be handled by parsers?

2014-03-27 Thread Nick Burch
On Thu, 27 Mar 2014, Konstantin Gribov wrote: Some containers (like matroska/mkv) tag audio and subtitle streams with a language tag and a comment. From mplayer console output: [lavf] stream 0: video (h264), -vid 0 [lavf] stream 1: audio (aac), -aid 0, -alang rus, Rus BaibaKo.tv [lavf] stream

Re: How should video files with audio be handled by parsers?

2014-03-27 Thread Konstantin Gribov
Hello, Nick. Some containers (like matroska/mkv) tag audio and subtitle streams with a language tag and a comment. From mplayer console output: > [lavf] stream 0: video (h264), -vid 0 > [lavf] stream 1: audio (aac), -aid 0, -alang rus, Rus BaibaKo.tv > [lavf] stream 2: audio (ac3), -aid 1, -ala

How should video files with audio be handled by parsers?

2014-03-27 Thread Nick Burch
Hi All, does anyone know if we have a recommended way, or a plan for one, to handle video files with possibly multiple audio streams? Most of the multimedia container formats support video and zero or one audio streams, and a fair number support video and multiple audio streams. A few can actuall

[jira] [Commented] (TIKA-1112) Parsing for OGV file with invalid checksum

2014-03-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949463#comment-13949463 ] Nick Burch commented on TIKA-1112: The checksum warning is now fixed upstream, and should

[jira] [Commented] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries

2014-03-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949460#comment-13949460 ] Nick Burch commented on TIKA-1079: I think this might be the same problem as reported in