tsdb extraction

2018-03-28 Thread Oleg Tikhonov
Hi guys, I am wondering if we have a parser that can deal with time series, like InfluxDB or Prometheus? If you know of any such work in progress, that would be good too. Thanks in advance, Oleg

[jira] [Commented] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418040#comment-16418040 ] Hudson commented on TIKA-2618: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #12 (See

[jira] [Commented] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418034#comment-16418034 ] Hudson commented on TIKA-2618: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1461 (See [h

RE: 1.18 pre rc regression tests

2018-03-28 Thread Allison, Timothy B.
Still waiting for reports... We've had quite a few files go from application/x-123 to image/x-tga via TIKA-2527. I think this is expected because they all appear to be embedded files, with file names that end in .tga. But I wanted to confirm this is expected. There's also one example of: appli

[jira] [Commented] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417998#comment-16417998 ] Hudson commented on TIKA-2618: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #224 (S

[jira] [Resolved] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2618. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 > LabelRecord and LabelSSTRecord

[jira] [Created] (TIKA-2618) LabelRecord and LabelSSTRecord text can be overwritten in xls

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2618: - Summary: LabelRecord and LabelSSTRecord text can be overwritten in xls Key: TIKA-2618 URL: https://issues.apache.org/jira/browse/TIKA-2618 Project: Tika Issue Typ

[jira] [Commented] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417900#comment-16417900 ] Hudson commented on TIKA-2617: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #11 (See

[jira] [Commented] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417898#comment-16417898 ] Hudson commented on TIKA-2614: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #11 (See

[jira] [Commented] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417899#comment-16417899 ] Hudson commented on TIKA-2616: -- SUCCESS: Integrated in Jenkins build tika-branch-1x #11 (See

[jira] [Commented] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417822#comment-16417822 ] Hudson commented on TIKA-2617: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #223 (S

[jira] [Commented] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417770#comment-16417770 ] Hudson commented on TIKA-2614: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1460 (See [h

[jira] [Commented] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417771#comment-16417771 ] Hudson commented on TIKA-2616: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1460 (See [h

[jira] [Commented] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417772#comment-16417772 ] Hudson commented on TIKA-2617: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1460 (See [h

[jira] [Resolved] (TIKA-2579) Update to PDFBox 2.0.9 when available

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2579. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 > Update to PDFBox 2.0.9 when ava

[jira] [Resolved] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2607. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 > Exchange levigo-jbig2-imageio w

[jira] [Resolved] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2617. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 > Ignore NPOIFS IOOBE in PPT atta

[jira] [Resolved] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2616. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 > message/news now incorrectly id

[jira] [Resolved] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2614. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 > RFC822 treats non-multipart as

[jira] [Commented] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417708#comment-16417708 ] Hudson commented on TIKA-2616: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #222 (S

[jira] [Commented] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417707#comment-16417707 ] Hudson commented on TIKA-2614: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #222 (S

[jira] [Commented] (TIKA-2569) Grouped Text boxes in .ppt

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417703#comment-16417703 ] Tim Allison commented on TIKA-2569: --- Whoa! This added a huge amount of newly extracted t

[jira] [Commented] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417688#comment-16417688 ] Tim Allison commented on TIKA-2617: --- e.g. govdocs1/206/206668.ppt and govdocs1/164/164761

[jira] [Created] (TIKA-2617) Ignore NPOIFS IOOBE in PPT attachments

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2617: - Summary: Ignore NPOIFS IOOBE in PPT attachments Key: TIKA-2617 URL: https://issues.apache.org/jira/browse/TIKA-2617 Project: Tika Issue Type: Task Repo

[jira] [Commented] (TIKA-2579) Update to PDFBox 2.0.9 when available

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417577#comment-16417577 ] Hudson commented on TIKA-2579: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #221 (S

[jira] [Commented] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417578#comment-16417578 ] Hudson commented on TIKA-2607: -- UNSTABLE: Integrated in Jenkins build tika-2.x-windows #221 (S

Re: message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Chris Mattmann
+1 From: Nick Burch Reply-To: "dev@tika.apache.org" Date: Wednesday, March 28, 2018 at 8:01 AM To: "dev@tika.apache.org" Subject: Re: message/news; charset=windows-1252 -> message/rfc822 On Wed, 28 Mar 2018, Allison, Timothy B. wrote: With the new mime patterns, we've gotten quite

[jira] [Resolved] (TIKA-2615) mbox incorrectly identified as RFC822

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2615. --- Resolution: Duplicate Oops...these are {{message/news}}...duplicate issue... I think. > mbox incorrect

[jira] [Created] (TIKA-2616) message/news identified as rfc822

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2616: - Summary: message/news identified as rfc822 Key: TIKA-2616 URL: https://issues.apache.org/jira/browse/TIKA-2616 Project: Tika Issue Type: Task Reporter:

[jira] [Updated] (TIKA-2616) message/news now incorrectly identified as rfc822

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2616: -- Summary: message/news now incorrectly identified as rfc822 (was: message/news identified as rfc822) > m

[jira] [Commented] (TIKA-2579) Update to PDFBox 2.0.9 when available

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417510#comment-16417510 ] Hudson commented on TIKA-2579: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1459 (See [h

[jira] [Commented] (TIKA-2607) Exchange levigo-jbig2-imageio with pdfbox-jbig2-imageio:3.0.0

2018-03-28 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417511#comment-16417511 ] Hudson commented on TIKA-2607: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1459 (See [h

[jira] [Created] (TIKA-2615) mbox incorrectly identified as RFC822

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2615: - Summary: mbox incorrectly identified as RFC822 Key: TIKA-2615 URL: https://issues.apache.org/jira/browse/TIKA-2615 Project: Tika Issue Type: Task Repor

Re: message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Nick Burch
On Wed, 28 Mar 2018, Allison, Timothy B. wrote: With the new mime patterns, we've gotten quite a few changes of message/news being identified as message/rfc822. An example is: http://162.242.228.174/docs/commoncrawl2/DA/DALFSFPD6FX4GGZ6EEJQA6RABA7OXIF5

1.18 pre rc regression tests

2018-03-28 Thread Allison, Timothy B.
All, I've run the initial regression tests. The corpus size is now big enough that I have to migrate the H2 tables to postgres before writing the reports. I'll post the reports as soon as they're finally ready, but I'm starting to go through some results now. Cheers, Tim

message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Allison, Timothy B.
All, With the new mime patterns, we've gotten quite a few changes of message/news being identified as message/rfc822. An example is: http://162.242.228.174/docs/commoncrawl2/DA/DALFSFPD6FX4GGZ6EEJQA6RABA7OXIF5 We sh

[jira] [Updated] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2614: -- Attachment: TIKA-2614-from-common-crawl.txt > RFC822 treats non-multipart as attachment > ---

[jira] [Created] (TIKA-2614) RFC822 treats non-multipart as attachment

2018-03-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2614: - Summary: RFC822 treats non-multipart as attachment Key: TIKA-2614 URL: https://issues.apache.org/jira/browse/TIKA-2614 Project: Tika Issue Type: Task R