[jira] [Commented] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2014-06-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042428#comment-14042428 ] Tim Allison commented on TIKA-758: -- Y, my grand plan after TIKA-1302 is in place would be t

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044668#comment-14044668 ] Tim Allison commented on TIKA-1302: --- Agreed. If there's a grad student with some time on

[jira] [Commented] (TIKA-1332) Create "eval" code

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044682#comment-14044682 ] Tim Allison commented on TIKA-1332: --- To my mind, there are three families of things that

[jira] [Updated] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1300: -- Attachment: tika_1_6_ClassicsVsNonSeq.zip The attached shows the results of running Tika 1.6 trunk with

[jira] [Closed] (TIKA-1298) testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1298. - Resolution: Fixed Turned test back on in PDFParser test. Thank you [~tilman]! > testEmbeddedPDFEmbedding

[jira] [Commented] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2014-06-26 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044881#comment-14044881 ] Tim Allison commented on TIKA-1233: --- Hindsight and current eval methodology turn out to b

[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-27 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045881#comment-14045881 ] Tim Allison commented on TIKA-1300: --- [~tilman], [~tboehme] and [~msahyoun], thank you all

[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046972#comment-14046972 ] Tim Allison commented on TIKA-1300: --- Don't think so. I'd recommend the 1000 zips vs 1m fi

[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-30 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047556#comment-14047556 ] Tim Allison commented on TIKA-1300: --- [~tilman], I'm sorry for not responding to your earl

[jira] [Commented] (TIKA-1364) Issue in metadata extraction for xslm (Excel Macro 2007) file

2014-07-08 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054995#comment-14054995 ] Tim Allison commented on TIKA-1364: --- Are you getting the same problem if you only include

[jira] [Closed] (TIKA-1372) PDCheckbox NPE

2014-07-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1372. - Resolution: Fixed [~tilman], thank you for notifying us. Y, that was Tika's (well, my) fault. I fixed t

[jira] [Created] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs

2014-07-24 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1374: - Summary: Need to add code to look for OS-specific keys for embedded files within PDFs Key: TIKA-1374 URL: https://issues.apache.org/jira/browse/TIKA-1374 Project: Tika

[jira] [Updated] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs

2014-07-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1374: -- Description: Embedded files in PDFs can be found by the general all purpose key we currently use via P

[jira] [Created] (TIKA-1375) Decrease memory consumption when extracting images from PDFs

2014-07-24 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1375: - Summary: Decrease memory consumption when extracting images from PDFs Key: TIKA-1375 URL: https://issues.apache.org/jira/browse/TIKA-1375 Project: Tika Issue Type

[jira] [Created] (TIKA-1376) Improve embedded file name extraction in PDFParser

2014-07-24 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1376: - Summary: Improve embedded file name extraction in PDFParser Key: TIKA-1376 URL: https://issues.apache.org/jira/browse/TIKA-1376 Project: Tika Issue Type: Improveme

[jira] [Commented] (TIKA-1375) Decrease memory consumption when extracting images from PDFs

2014-07-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074306#comment-14074306 ] Tim Allison commented on TIKA-1375: --- I ran four versions of Tika against a random selecti

[jira] [Comment Edited] (TIKA-1375) Decrease memory consumption when extracting images from PDFs

2014-07-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074306#comment-14074306 ] Tim Allison edited comment on TIKA-1375 at 7/25/14 11:43 AM: - I

[jira] [Comment Edited] (TIKA-1375) Decrease memory consumption when extracting images from PDFs

2014-07-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074306#comment-14074306 ] Tim Allison edited comment on TIKA-1375 at 7/25/14 11:49 AM: - I

[jira] [Closed] (TIKA-1375) Decrease memory consumption when extracting images from PDFs

2014-07-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1375. - Resolution: Fixed fixed r1613395. Thank you, [~tilman] and [~lehmi] for your work on PDFBOX-2101 and advi

[jira] [Closed] (TIKA-1376) Improve embedded file name extraction in PDFParser

2014-07-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1376. - Resolution: Fixed r1613444 > Improve embedded file name extraction in PDFParser > ---

[jira] [Closed] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs

2014-07-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1374. - Resolution: Fixed r1613501. > Need to add code to look for OS-specific keys for embedded files within PDF

[jira] [Commented] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082488#comment-14082488 ] Tim Allison commented on TIKA-1380: --- A test in CLI needs a small change: {noformat} testE

[jira] [Updated] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1380: -- Attachment: TIKA-1380b.patch This includes Nick's patch, the small change in CLITest and a small change

[jira] [Commented] (TIKA-1372) PDCheckbox NPE

2014-08-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082662#comment-14082662 ] Tim Allison commented on TIKA-1372: --- Looked at the stacktrace a bit more closely. This w

[jira] [Commented] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084537#comment-14084537 ] Tim Allison commented on TIKA-1380: --- Thank you, Nick! I just noticed that we should bump

[jira] [Commented] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084583#comment-14084583 ] Tim Allison commented on TIKA-1380: --- Added simple tests for comments in xls and xlsx: r16

[jira] [Commented] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084817#comment-14084817 ] Tim Allison commented on TIKA-1380: --- [~gagravarr] and all, would you have an objection to

[jira] [Commented] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084820#comment-14084820 ] Tim Allison commented on TIKA-1380: --- Also, [~gagravarr], should we bump codec to 1.9 to s

[jira] [Updated] (TIKA-1317) Tika does not read text from Headers, Cover Pages, and SDT components of DOCX documents

2014-08-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1317: -- Attachment: TIKA-1317.patch If there are no objections, I'll commit this to the 1.6 branch and trunk sh

[jira] [Closed] (TIKA-1317) Tika does not read text from Headers, Cover Pages, and SDT components of DOCX documents

2014-08-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1317. - Resolution: Fixed Fix Version/s: 1.6 Committed in trunk: r1615667 Committed in 1.6 branch: r1615675

[jira] [Comment Edited] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084820#comment-14084820 ] Tim Allison edited comment on TIKA-1380 at 8/4/14 9:05 PM: --- Also,

[jira] [Commented] (TIKA-1275) Upgrade Commons compress to 1.8.1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086123#comment-14086123 ] Tim Allison commented on TIKA-1275: --- I just tested on trunk, and all tests pass once we a

[jira] [Commented] (TIKA-1275) Upgrade Commons compress to 1.8.1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086143#comment-14086143 ] Tim Allison commented on TIKA-1275: --- Sounds good. I added a tukaani.version param next t

[jira] [Reopened] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1380: --- The 1.6 branch and trunk are failing one test on Windows. testExtract(org.apache.tika.cli.TikaCLITest): F

[jira] [Commented] (TIKA-1275) Upgrade Commons compress to 1.8.1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086162#comment-14086162 ] Tim Allison commented on TIKA-1275: --- Got it. Thank you. > Upgrade Commons compress to 1

[jira] [Commented] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086189#comment-14086189 ] Tim Allison commented on TIKA-1380: --- Something along these lines: {noformat} if (type ==

[jira] [Commented] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086213#comment-14086213 ] Tim Allison commented on TIKA-1380: --- In POI 3.10-final, this particular attachment threw

[jira] [Resolved] (TIKA-1275) Upgrade Commons compress to 1.8.1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1275. --- Resolution: Fixed Fix Version/s: 1.6 upgraded in 1.6 branch: 1615923 in trunk: 1615926 > Upgra

[jira] [Updated] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1380: -- Attachment: TIKA-1380_nullOLELabel.patch Proposed modification. > Upgrade to Apache POI 3.11 beta 1 > -

[jira] [Resolved] (TIKA-1380) Upgrade to Apache POI 3.11 beta 1

2014-08-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1380. --- Resolution: Fixed patch applied in: 1.6 branch: r1615970 trunk: r1615980 > Upgrade to Apache POI 3.11

[jira] [Updated] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser

2014-08-11 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1329: -- Attachment: test_recursive_embedded.docx TIKA-1329v2.patch Got this error on review boar

[jira] [Updated] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser

2014-08-11 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1329: -- Description: Jukka and Nick have a great demo of parsing metadata recursively on the [wiki|http://wiki.

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-08-14 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097278#comment-14097278 ] Tim Allison commented on TIKA-1396: --- In 1.5, Tika only extracts "attachments" from pdfs.

[jira] [Updated] (TIKA-1396) Embedded images in PDF documents

2014-08-14 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1396: -- Component/s: (was: cli) parser > Embedded images in PDF documents > ---

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-08-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099306#comment-14099306 ] Tim Allison commented on TIKA-1396: --- Latest app build is available [here|https://builds.

[jira] [Resolved] (TIKA-1396) Embedded images in PDF documents

2014-08-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1396. --- Resolution: Fixed Fix Version/s: 1.6 Feature available in 1.6. > Embedded images in PDF docume

[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2014-09-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119745#comment-14119745 ] Tim Allison commented on TIKA-1330: --- Looks like ballpark estimate on time for processing

[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2014-09-04 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121454#comment-14121454 ] Tim Allison commented on TIKA-1330: --- Started documentation on the [wiki|https://wiki.apac

[jira] [Updated] (TIKA-1232) Add PDF version to PDFParser output

2014-09-08 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1232: -- Fix Version/s: 1.6 > Add PDF version to PDFParser output > --- > >

[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-09-10 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128398#comment-14128398 ] Tim Allison commented on TIKA-1268: --- These should do it, no? Either with svn commandline

[jira] [Comment Edited] (TIKA-1268) Extract images from PDF documents

2014-09-10 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128398#comment-14128398 ] Tim Allison edited comment on TIKA-1268 at 9/10/14 12:13 PM: - T

[jira] [Commented] (TIKA-1414) How to extract embedded images from PDFs?

2014-09-12 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131386#comment-14131386 ] Tim Allison commented on TIKA-1414: --- From TIKA-1396: bq. As a hack, you can also change

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133832#comment-14133832 ] Tim Allison commented on TIKA-1396: --- I just tested the tika 1.6 app jar on "testPDF_child

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133841#comment-14133841 ] Tim Allison commented on TIKA-1396: --- Now that we are using PDFBox 1.8.6, we might conside

[jira] [Commented] (TIKA-1414) How to extract embedded images from PDFs?

2014-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133851#comment-14133851 ] Tim Allison commented on TIKA-1414: --- [~tpalsulich], any interest in adding an example for

[jira] [Commented] (TIKA-1414) How to extract embedded images from PDFs?

2014-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134700#comment-14134700 ] Tim Allison commented on TIKA-1414: --- [~tpalsulich], great. Thank you! It might make sen

[jira] [Commented] (TIKA-1414) How to extract embedded images from PDFs?

2014-09-15 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134702#comment-14134702 ] Tim Allison commented on TIKA-1414: --- [~damiano], I'll close this out shortly unless I hea

[jira] [Created] (TIKA-1418) Add TikaConfigDumperExample to example package

2014-09-18 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1418: - Summary: Add TikaConfigDumperExample to example package Key: TIKA-1418 URL: https://issues.apache.org/jira/browse/TIKA-1418 Project: Tika Issue Type: New Feature

[jira] [Updated] (TIKA-1418) Add TikaConfigDumperExample to example package

2014-09-18 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1418: -- Attachment: TikaConfigDumper.patch > Add TikaConfigDumperExample to example package > ---

[jira] [Updated] (TIKA-1418) Add TikaConfigDumperExample to example package

2014-09-19 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1418: -- Attachment: tika-config-SNAPSHOT-1.7_20140919.xml For posterity, this is what the tika-config file looks

[jira] [Resolved] (TIKA-1418) Add TikaConfigDumperExample to example package

2014-09-19 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1418. --- Resolution: Fixed Fix Version/s: 1.7 Added the example and test. I also added the --config= opt

[jira] [Created] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1419: - Summary: Upgrade to PDFBox 1.8.7 Key: TIKA-1419 URL: https://issues.apache.org/jira/browse/TIKA-1419 Project: Tika Issue Type: Improvement Reporter: Ti

[jira] [Resolved] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser

2014-09-19 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1329. --- Resolution: Fixed r1626300 > Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataPars

[jira] [Updated] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1419: -- Attachment: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv > Upgrade to PDFBox 1.8.7 >

[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143588#comment-14143588 ] Tim Allison commented on TIKA-1419: --- I just finished the run on 50,000 random pdfs from g

[jira] [Created] (TIKA-1424) Clear PDFont's resources after each file to prevent memory leak

2014-09-22 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1424: - Summary: Clear PDFont's resources after each file to prevent memory leak Key: TIKA-1424 URL: https://issues.apache.org/jira/browse/TIKA-1424 Project: Tika Issue T

[jira] [Comment Edited] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143588#comment-14143588 ] Tim Allison edited comment on TIKA-1419 at 9/23/14 1:23 AM: I j

[jira] [Created] (TIKA-1426) Let's allow users to specify a tika config file on the commandline for tika-app and tika-server

2014-09-22 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1426: - Summary: Let's allow users to specify a tika config file on the commandline for tika-app and tika-server Key: TIKA-1426 URL: https://issues.apache.org/jira/browse/TIKA-1426

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145369#comment-14145369 ] Tim Allison commented on TIKA-1396: --- Thank you for attaching a test file! I'll take a lo

[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145381#comment-14145381 ] Tim Allison commented on TIKA-1419: --- Yes, absolutely. I'm sorry for appearing to be (wel

[jira] [Reopened] (TIKA-1396) Embedded images in PDF documents

2014-09-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1396: --- > Embedded images in PDF documents > > > Key: TIKA-1396 >

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145420#comment-14145420 ] Tim Allison commented on TIKA-1396: --- When I run your file through a modified version of a

[jira] [Comment Edited] (TIKA-1396) Embedded images in PDF documents

2014-09-23 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145420#comment-14145420 ] Tim Allison edited comment on TIKA-1396 at 9/23/14 8:53 PM: Whe

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146230#comment-14146230 ] Tim Allison commented on TIKA-1396: --- Ah, ok. Y, pls open another issue. I should also a

[jira] [Closed] (TIKA-1396) Embedded images in PDF documents

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1396. - Resolution: Not a Problem > Embedded images in PDF documents > > >

[jira] [Resolved] (TIKA-1297) Images not being extracted from PDFs

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1297. --- Resolution: Fixed Fix Version/s: 1.6 > Images not being extracted from PDFs > --

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146283#comment-14146283 ] Tim Allison commented on TIKA-1422: --- While work is going on to get the TesseractOCRParser

[jira] [Resolved] (TIKA-1424) Clear PDFont's resources after each file to prevent memory leak

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1424. --- Resolution: Fixed r1627304 > Clear PDFont's resources after each file to prevent memory leak > ---

[jira] [Resolved] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1419. --- Resolution: Fixed r1627308 > Upgrade to PDFBox 1.8.7 > --- > > Key

[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146294#comment-14146294 ] Tim Allison commented on TIKA-1419: --- Happy to help (and again my apologies for the post-h

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146537#comment-14146537 ] Tim Allison commented on TIKA-1422: --- Sorry, user error. Needed to force update. Thank y

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-24 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146588#comment-14146588 ] Tim Allison commented on TIKA-1396: --- Y, I can think of a few options. We still need to a

[jira] [Updated] (TIKA-1330) Add robust tika-batch code

2014-09-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1330: -- Attachment: TIKA-1330v1-patch.zip This is the first version of tika-batch. Much cleanup remains. This f

[jira] [Comment Edited] (TIKA-1330) Add robust tika-batch code

2014-09-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121454#comment-14121454 ] Tim Allison edited comment on TIKA-1330 at 9/25/14 4:18 PM: Sta

[jira] [Commented] (TIKA-1330) Add robust tika-batch code

2014-09-25 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147922#comment-14147922 ] Tim Allison commented on TIKA-1330: --- [~tilman], I leave it as an exercise to implement a

[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-29 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151614#comment-14151614 ] Tim Allison commented on TIKA-1419: --- Thank you! Let me know when I should run 1.8.8 v. 1

[jira] [Created] (TIKA-1433) Extract documents embedded within annotations in PDFs

2014-09-29 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1433: - Summary: Extract documents embedded within annotations in PDFs Key: TIKA-1433 URL: https://issues.apache.org/jira/browse/TIKA-1433 Project: Tika Issue Type: New Fe

[jira] [Resolved] (TIKA-1433) Extract documents embedded within annotations in PDFs

2014-09-29 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1433. --- Resolution: Fixed r1628350 > Extract documents embedded within annotations in PDFs > -

[jira] [Resolved] (TIKA-1414) How to extract embedded images from PDFs?

2014-09-29 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1414. --- Resolution: Not a Problem > How to extract embedded images from PDFs? > ---

[jira] [Resolved] (TIKA-1427) PDF Images don't appear in structured view

2014-09-29 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1427. --- Resolution: Fixed r1628354. Let me know if the markup is sufficient for your needs. > PDF Images don'

[jira] [Reopened] (TIKA-1427) PDF Images don't appear in structured view

2014-09-30 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1427: --- Assignee: Tim Allison Will modify to behave exactly as msoffice PDF Images don't appear in structure

[jira] [Resolved] (TIKA-1427) PDF Images don't appear in structured view

2014-10-01 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1427. --- Resolution: Fixed r1628707. Made inline image tags equivalent to those created by Word parser. Let me

[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158170#comment-14158170 ] Tim Allison commented on TIKA-1427: --- We're currently iterating through the images once we

[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158223#comment-14158223 ] Tim Allison commented on TIKA-1427: --- On at least one test doc, I'm getting correct behavi

[jira] [Comment Edited] (TIKA-1427) PDF Images don't appear in structured view

2014-10-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158223#comment-14158223 ] Tim Allison edited comment on TIKA-1427 at 10/3/14 5:33 PM: On

[jira] [Comment Edited] (TIKA-1437) encoding issue in AutoDetectReader

2014-10-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160312#comment-14160312 ] Tim Allison edited comment on TIKA-1437 at 10/6/14 2:04 PM: No

[jira] [Commented] (TIKA-1437) encoding issue in AutoDetectReader

2014-10-06 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160312#comment-14160312 ] Tim Allison commented on TIKA-1437: --- No encoding detector will be perfect. Are you sur

[jira] [Commented] (TIKA-1439) PDF embeded with document can not parse.

2014-10-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165065#comment-14165065 ] Tim Allison commented on TIKA-1439: --- Hi [~sunxingzhe359], Thanks to your post with test

[jira] [Updated] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-10-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1419: -- Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.zip [~tilman], sorry for my delay. This contrasts Tika 1.7 trunk

[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165592#comment-14165592 ] Tim Allison commented on TIKA-1427: --- Hmmm...I'm not able to grab the wmf embedded image f
