[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119 ] Tilman Hausherr commented on TIKA-1300: --- My impression was that the NSP had better re

[jira] [Comment Edited] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119 ] Tilman Hausherr edited comment on TIKA-1300 at 6/26/14 9:08 PM: -

[jira] [Comment Edited] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119 ] Tilman Hausherr edited comment on TIKA-1300 at 6/27/14 6:18 AM: -

[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-27 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046288#comment-14046288 ] Tilman Hausherr commented on TIKA-1300: --- I'm not doing much with text extraction, but

[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-27 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046728#comment-14046728 ] Tilman Hausherr commented on TIKA-1300: --- I had a look at most of the files. This resu

[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-28 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046896#comment-14046896 ] Tilman Hausherr commented on TIKA-1300: --- [~talli...@mitre.org] are there any "rules"

[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-29 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047095#comment-14047095 ] Tilman Hausherr commented on TIKA-1300: --- {quote} Make sure to delete handful of infec

[jira] [Created] (TIKA-1372) PDCheckbox NPE

2014-07-22 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created TIKA-1372: - Summary: PDCheckbox NPE Key: TIKA-1372 URL: https://issues.apache.org/jira/browse/TIKA-1372 Project: Tika Issue Type: Bug Reporter: Tilman Haus

[jira] [Commented] (TIKA-1372) PDCheckbox NPE

2014-07-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070847#comment-14070847 ] Tilman Hausherr commented on TIKA-1372: --- IMHO the cause is TIKA not doing some null c

[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-23 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145061#comment-14145061 ] Tilman Hausherr commented on TIKA-1419: --- Thanks for making these tests. Would it be p

[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-23 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145399#comment-14145399 ] Tilman Hausherr commented on TIKA-1419: --- Maybe you could create a project for GSoC201

[jira] [Updated] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-27 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1419: -- Attachment: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx Here's an excel file, on the new co

[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-29 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152855#comment-14152855 ] Tilman Hausherr commented on TIKA-1419: --- Compare PDFBox's trunk against 1.8.x periodi

[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-09 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165652#comment-14165652 ] Tilman Hausherr commented on TIKA-1427: --- The first image ("Im1") is painted with "q 4

[jira] [Updated] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-10-09 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1419: -- Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx Thank you [~talli...@apache.org], here's the result

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-10 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167194#comment-14167194 ] Tilman Hausherr commented on TIKA-1442: --- Do you want the junk list in some format? Ju

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-15 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172978#comment-14172978 ] Tilman Hausherr commented on TIKA-1442: --- files that have only junk as text with AR:

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-15 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx > Upgrade to PDFBox 1.8.8 > ---

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: (was: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx) > Upgrade to PDFBox 1.8.8 > -

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173983#comment-14173983 ] Tilman Hausherr commented on TIKA-1442: --- After some more research, I was able to deco

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx > Upgrade to PDFBox 1.8.8 > ---

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180302#comment-14180302 ] Tilman Hausherr commented on TIKA-1442: --- {quote} and recommend other statistics that

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180440#comment-14180440 ] Tilman Hausherr commented on TIKA-1442: --- What's also missing this time is the token co

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180302#comment-14180302 ] Tilman Hausherr edited comment on TIKA-1442 at 10/22/14 8:06 PM:

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180446#comment-14180446 ] Tilman Hausherr commented on TIKA-1442: --- Sorry, ignore my text re: 1st line only. It'

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180469#comment-14180469 ] Tilman Hausherr commented on TIKA-1442: --- {quote} Should I add token count? {quote} Y

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180636#comment-14180636 ] Tilman Hausherr commented on TIKA-1442: --- Which are the top10words? I ask because 554/

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180687#comment-14180687 ] Tilman Hausherr commented on TIKA-1442: --- Or does the top10words mean how many stop wo

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181779#comment-14181779 ] Tilman Hausherr commented on TIKA-1442: --- Thanks! I'm slowly starting, and here's the

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181779#comment-14181779 ] Tilman Hausherr edited comment on TIKA-1442 at 10/23/14 7:31 PM:

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181813#comment-14181813 ] Tilman Hausherr commented on TIKA-1442: --- The directory structure isn't a problem for

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip I'm done now; the result is two new issues, PDFBOX-2

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182047#comment-14182047 ] Tilman Hausherr commented on TIKA-1442: --- A few files have less meta data than before:

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173983#comment-14173983 ] Tilman Hausherr edited comment on TIKA-1442 at 10/24/14 11:02 AM: ---

[jira] [Commented] (TIKA-1467) pdf:encrypted:false with encrypted pdf

2014-11-07 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202456#comment-14202456 ] Tilman Hausherr commented on TIKA-1467: --- The old and the new parser have different ap

[jira] [Comment Edited] (TIKA-1467) pdf:encrypted:false with encrypted pdf

2014-11-07 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202456#comment-14202456 ] Tilman Hausherr edited comment on TIKA-1467 at 11/7/14 10:22 PM:

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225008#comment-14225008 ] Tilman Hausherr commented on TIKA-1442: --- Thanks Tim! 892848.pdf and 892859.pdf shoul

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225008#comment-14225008 ] Tilman Hausherr edited comment on TIKA-1442 at 11/25/14 8:38 PM:

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: PDFBox_1_8_6VPDFBox_1_8_8-b145.zip > Upgrade to PDFBox 1.8.8 > --

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225283#comment-14225283 ] Tilman Hausherr commented on TIKA-1442: --- [~talli...@apache.org] I'm really wondering

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225008#comment-14225008 ] Tilman Hausherr edited comment on TIKA-1442 at 11/25/14 10:08 PM: ---

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225008#comment-14225008 ] Tilman Hausherr edited comment on TIKA-1442 at 11/25/14 11:08 PM: ---

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225867#comment-14225867 ] Tilman Hausherr commented on TIKA-1442: --- Ok, will do. About the seq vs. nonSeq test:

[jira] [Created] (TIKA-1489) PDF Text extraction without permission

2014-11-25 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created TIKA-1489: - Summary: PDF Text extraction without permission Key: TIKA-1489 URL: https://issues.apache.org/jira/browse/TIKA-1489 Project: Tika Issue Type: Bug Affec

[jira] [Commented] (TIKA-1489) PDF Text extraction without permission

2014-11-26 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226500#comment-14226500 ] Tilman Hausherr commented on TIKA-1489: --- No, permissions are connected to encryption.

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-29 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx Here's my evaluation of the test. I was

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-30 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx > Upgrade to PDFBox 1.8.8 > ---

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-30 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: (was: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx) > Upgrade to PDFBox 1.8.8 >

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-30 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228968#comment-14228968 ] Tilman Hausherr edited comment on TIKA-1442 at 11/30/14 10:49 PM: ---

[jira] [Commented] (TIKA-1489) PDF Text extraction without permission

2014-12-01 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230193#comment-14230193 ] Tilman Hausherr commented on TIKA-1489: --- [~talli...@mitre.org] I can't tell you what

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-01 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230589#comment-14230589 ] Tilman Hausherr commented on TIKA-1442: --- Weird thing in the 1.8.6 vs 1.8.8 test: acco

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-01 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230589#comment-14230589 ] Tilman Hausherr edited comment on TIKA-1442 at 12/1/14 10:44 PM:

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-01 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230589#comment-14230589 ] Tilman Hausherr edited comment on TIKA-1442 at 12/1/14 10:49 PM:

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-02 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx Thanks... one problem in both exce

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-03 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx I've now looked at the 1.8.6 vs 1.8.8 fil

[jira] [Commented] (TIKA-1548) System property added while catching exception on parsing PDF encrypted doc

2015-02-11 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316723#comment-14316723 ] Tilman Hausherr commented on TIKA-1548: --- Sorry, no. We're not setting that one. It is

[jira] [Commented] (TIKA-1038) Parsing PDF with StackOverlowError

2015-03-04 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347377#comment-14347377 ] Tilman Hausherr commented on TIKA-1038: --- [~talli...@mitre.org] are you watching this o

[jira] [Comment Edited] (TIKA-1038) Parsing PDF with StackOverlowError

2015-03-04 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347377#comment-14347377 ] Tilman Hausherr edited comment on TIKA-1038 at 3/4/15 6:59 PM: --

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-15 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362365#comment-14362365 ] Tilman Hausherr commented on TIKA-1575: --- {code} b) might be actual modest regressions

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-15 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362406#comment-14362406 ] Tilman Hausherr commented on TIKA-1575: --- [~talli...@apache.org] please repeat the who

[jira] [Commented] (TIKA-1174) Invalid characters in filtered PDF output

2015-03-15 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362552#comment-14362552 ] Tilman Hausherr commented on TIKA-1174: --- Can't comment, I'm not that good with font i

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-16 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363061#comment-14363061 ] Tilman Hausherr commented on TIKA-1575: --- Yes! > Upgrade to PDFBox 1.8.9 when availab

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364710#comment-14364710 ] Tilman Hausherr commented on TIKA-1575: --- Could you attach the TIKA output you get wit

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365524#comment-14365524 ] Tilman Hausherr commented on TIKA-1575: --- I can't understand how you get the extracted

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365807#comment-14365807 ] Tilman Hausherr commented on TIKA-1575: --- Can't tell, I don't know much about the stru

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365829#comment-14365829 ] Tilman Hausherr commented on TIKA-1575: --- Thanks. Re: OCR, you should know that there

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-19 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368686#comment-14368686 ] Tilman Hausherr commented on TIKA-1575: --- With the pure ExtractText, all is identical.

[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-19 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368687#comment-14368687 ] Tilman Hausherr commented on TIKA-1575: --- With the pure ExtractText, all is identical.

[jira] [Issue Comment Deleted] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-19 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1575: -- Comment: was deleted (was: With the pure ExtractText, all is identical. Could you attach the file

[jira] [Commented] (TIKA-1588) Upgrade to PDFBox 1.8.10 when available

2015-07-15 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628890#comment-14628890 ] Tilman Hausherr commented on TIKA-1588: --- The weird thing is that I can't find any dif

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-18 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632429#comment-14632429 ] Tilman Hausherr commented on TIKA-1678: --- I think this is two bytes. I.e. a 0x0 and a

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-18 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632432#comment-14632432 ] Tilman Hausherr commented on TIKA-1678: --- I get correct output for the non-XMP stuff w

[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-19 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632429#comment-14632429 ] Tilman Hausherr edited comment on TIKA-1678 at 7/19/15 11:21 AM:

[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-19 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632429#comment-14632429 ] Tilman Hausherr edited comment on TIKA-1678 at 7/19/15 11:22 AM:

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633687#comment-14633687 ] Tilman Hausherr commented on TIKA-1678: --- sure: {code} public class Tika1678 extends B

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633722#comment-14633722 ] Tilman Hausherr commented on TIKA-1678: --- Yes, such a string check would be useful. Or

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634045#comment-14634045 ] Tilman Hausherr commented on TIKA-1678: --- Likely a bug. I tried calling getTitle afte

[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634045#comment-14634045 ] Tilman Hausherr edited comment on TIKA-1678 at 7/20/15 8:41 PM: -

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634065#comment-14634065 ] Tilman Hausherr commented on TIKA-1678: --- Yes please do and attach the file. It's late

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-22 Thread Tilman Hausherr (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637232#comment-14637232 ] Tilman Hausherr commented on TIKA-1678: --- API has changed again. This code works: {cod

[jira] [Commented] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-11 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845566#comment-17845566 ] Tilman Hausherr commented on TIKA-4254: --- Why would we ever run the test twice in the

[jira] [Comment Edited] (TIKA-4254) The test `TestMimeTypes#testJavaRegex` is not idempotent, as it passes in the first run and fails in repeated runs in the same environment.

2024-05-12 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845590#comment-17845590 ] Tilman Hausherr edited comment on TIKA-4254 at 5/12/24 9:40 AM:

[jira] [Updated] (TIKA-1907) Big Pdf parsing to text - Out of memory

2024-05-15 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1907: -- Fix Version/s: 3.0.0 > Big Pdf parsing to text - Out of memory > ---

[jira] [Commented] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851598#comment-17851598 ] Tilman Hausherr commented on TIKA-4267: --- The current version is 2.9.2, please retry

[jira] [Comment Edited] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-03 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851598#comment-17851598 ] Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:07 PM:

[jira] [Comment Edited] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851598#comment-17851598 ] Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:06 PM:

[jira] [Updated] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4267: -- Affects Version/s: 1.28.4 > Not getting correct mimet type for few file extensions. example :csv

[jira] [Updated] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-03 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4267: -- Summary: Not getting correct mime type for a few file extensions. example: csv (was: Not gettin

[jira] [Closed] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-10 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed TIKA-4267. - Resolution: Invalid Closing for now, please comment and/or reopen if needed. > Not getting correc

[jira] [Updated] (TIKA-4270) wrong skew angle in tika-parser-ocr-module

2024-06-20 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4270: -- Description: We use tika to extract text from different sources, including images with text tha

[jira] [Commented] (TIKA-4251) [DISCUSS] move to cosium's git-code-format-maven-plugin with google-java-format

2024-06-24 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859718#comment-17859718 ] Tilman Hausherr commented on TIKA-4251: --- I'm wondering if this means lots of changes

[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-06-30 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861075#comment-17861075 ] Tilman Hausherr commented on TIKA-4181: --- Is this {code:xml} 3.24.0 3.24.0 {c

[jira] [Comment Edited] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-07-01 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861075#comment-17861075 ] Tilman Hausherr edited comment on TIKA-4181 at 7/1/24 7:02 AM: -

[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-07-02 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861363#comment-17861363 ] Tilman Hausherr commented on TIKA-4181: --- As a first step I've updated protobuf to cu

[jira] [Commented] (TIKA-4181) Tika Grpc Server using Tika Pipes

2024-07-02 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861555#comment-17861555 ] Tilman Hausherr commented on TIKA-4181: --- PR 1849 has now succeeded. > Tika Grpc Ser

[jira] [Created] (TIKA-4274) Improve ExtractReaderException

2024-07-07 Thread Tilman Hausherr (Jira)
Tilman Hausherr created TIKA-4274: - Summary: Improve ExtractReaderException Key: TIKA-4274 URL: https://issues.apache.org/jira/browse/TIKA-4274 Project: Tika Issue Type: Improvement

[jira] [Commented] (TIKA-4274) Improve ExtractReaderException

2024-07-07 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863552#comment-17863552 ] Tilman Hausherr commented on TIKA-4274: --- new output: {noformat} INFO [pool-3-thread

[jira] [Resolved] (TIKA-4274) Improve ExtractReaderException

2024-07-07 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved TIKA-4274. --- Resolution: Fixed > Improve ExtractReaderException > -- > >

[jira] [Commented] (TIKA-4276) Tika fails to detect damaged pdf

2024-07-10 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864670#comment-17864670 ] Tilman Hausherr commented on TIKA-4276: --- Your file starts with "1 0 obj" instead of

[jira] [Updated] (TIKA-4276) Tika fails to detect damaged pdf

2024-07-10 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4276: -- Description: We use Tika to check file type and extension. However, with some damaged pdf files
