[jira] [Created] (TIKA-3488) Security issue XXE in TIKA due to JDOM

2021-07-21 Thread Arvind Jagtap (Jira)
Arvind Jagtap created TIKA-3488: --- Summary: Security issue XXE in TIKA due to JDOM Key: TIKA-3488 URL: https://issues.apache.org/jira/browse/TIKA-3488 Project: Tika Issue Type: Bug Com

[jira] [Created] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Sebastian Nagel (Jira)
Sebastian Nagel created TIKA-3489: - Summary: Robots.txt files frequently identified as message/rfc822 Key: TIKA-3489 URL: https://issues.apache.org/jira/browse/TIKA-3489 Project: Tika Issue T

[jira] [Updated] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated TIKA-3489: -- Affects Version/s: 2.0.0 > Robots.txt files frequently identified as message/rfc822 > --

[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384905#comment-17384905 ] Tim Allison commented on TIKA-3489: --- Should we try to detect robots.txt files as their o

[jira] [Commented] (TIKA-3153) Text File identified as message/rfc822

2021-07-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384913#comment-17384913 ] Sebastian Nagel commented on TIKA-3153: --- Wasn't this already resolved in 1.25? {nof

[jira] [Commented] (TIKA-2443) Plain text file identified as rfc822 and which can cause StackOverflowError

2021-07-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384915#comment-17384915 ] Sebastian Nagel commented on TIKA-2443: --- Looks like this was already resolved in 1.2

[jira] [Resolved] (TIKA-2443) Plain text file identified as rfc822 and which can cause StackOverflowError

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2443. --- Fix Version/s: 1.25 Resolution: Fixed Thank you [~snagel]! > Plain text file identified as rfc

[jira] [Resolved] (TIKA-3153) Text File identified as message/rfc822

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-3153. --- Fix Version/s: 1.25 Resolution: Fixed > Text File identified as message/rfc822 > --

[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384931#comment-17384931 ] Tim Allison commented on TIKA-3489: --- [~nick], any recommendations? {{text/x-robots}} sub

[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384966#comment-17384966 ] Tim Allison commented on TIKA-3489: --- I added mime detection for robots.txt in {{main}} w

Interesting PDF on stackoverflow

2021-07-21 Thread Tim Allison
https://stackoverflow.com/questions/68402058/tika-isnt-reading-pdf-properly Not sure there's much we should do on the Tika side. How hard would it be to add an "extract only text that is on the page" feature? Best, Tim

[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384992#comment-17384992 ] Sebastian Nagel commented on TIKA-3489: --- The [robots.txt RFC draft|https://datatrack

Re: Interesting PDF on stackoverflow

2021-07-21 Thread Tilman Hausherr
Maybe this could be done with the ExtractTextByArea example. However, IIRC the coordinates are AWT-like (y 0 on top), so the PDF coordinates would have to be mapped to this somehow. Tilman On 21.07.2021 at 18:21, Tim Allison wrote: https://stackoverflow.com/questions/68402058/tika-isnt-r
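
The coordinate mapping mentioned in the reply above could look roughly like the following. This is only a sketch, not code from the thread or from PDFBox: the class and method names are hypothetical, and it assumes an unrotated page whose lower-left corner is at (0, 0), with the page height supplied by the caller. It flips a rectangle given in PDF coordinates (origin bottom-left, y pointing up) into a top-left-origin rectangle of the kind an area-based text stripper would expect.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, for illustration only.
final class PdfToAwtCoordinates {

    private PdfToAwtCoordinates() {
    }

    /**
     * Converts a rectangle from PDF coordinates (origin bottom-left, y up)
     * to AWT-like coordinates (origin top-left, y down).
     *
     * Assumes an unrotated page whose lower-left corner is at (0, 0).
     *
     * @param x          left edge, identical in both systems
     * @param yBottom    bottom edge of the rectangle in PDF coordinates
     * @param width      rectangle width
     * @param height     rectangle height
     * @param pageHeight height of the page in the same units
     */
    static Rectangle2D.Double toTopLeftOrigin(double x, double yBottom,
                                              double width, double height,
                                              double pageHeight) {
        // The top edge in PDF space is yBottom + height; measured from the
        // top of the page, that becomes pageHeight - (yBottom + height).
        double yTop = pageHeight - (yBottom + height);
        return new Rectangle2D.Double(x, yTop, width, height);
    }

    public static void main(String[] args) {
        // 792 is the height of a US Letter page in points (an example value).
        Rectangle2D.Double r = toTopLeftOrigin(72, 72, 200, 100, 792);
        System.out.println("x=" + r.getX() + " y=" + r.getY()
                + " w=" + r.getWidth() + " h=" + r.getHeight());
    }
}
```

Only the y value changes; x, width and height carry over unchanged. Rotated pages or a crop box that does not start at the origin would need further adjustment, which this sketch does not attempt.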

[jira] [Commented] (TIKA-3489) Robots.txt files frequently identified as message/rfc822

2021-07-21 Thread Hudson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385061#comment-17385061 ] Hudson commented on TIKA-3489: -- FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #2

[jira] [Created] (TIKA-3490) Fix serialization in opensearch emitter for embedded documents

2021-07-21 Thread Tim Allison (Jira)
Tim Allison created TIKA-3490: - Summary: Fix serialization in opensearch emitter for embedded documents Key: TIKA-3490 URL: https://issues.apache.org/jira/browse/TIKA-3490 Project: Tika Issue Ty

[jira] [Updated] (TIKA-3490) Fix serialization in opensearch emitter for embedded documents

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3490: -- Description: Serialization isn't working for embedded documents in the OpenSearch emitter. This fix is

[jira] [Updated] (TIKA-3483) Implement a network policy for Helm Chart

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3483: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Implement a network policy for Helm Char

[jira] [Updated] (TIKA-3454) Facilitate configuration of translation and transcription impls in tika-server/tika-docker/tika-helm

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3454: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Facilitate configuration of translation

[jira] [Updated] (TIKA-3452) java.nio.file.FileSystemException Read-only file system in 2.0.0-BETA tika-docker

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3452: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > java.nio.file.FileSystemException Read-o

[jira] [Updated] (TIKA-3400) Use equals for Object and String Comparison Instead of ==

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3400: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Use equals for Object and String Compari

[jira] [Updated] (TIKA-3404) Rearchitect GoogleTranslator to use https://github.com/googleapis/java-translate

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3404: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Rearchitect GoogleTranslator to use > h

[jira] [Updated] (TIKA-3003) Remove unused dependencies

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3003: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Remove unused dependencies > ---

[jira] [Updated] (TIKA-3348) Improve the workflow for extracting and returning images from PDFs and other containers using Tika Server..

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3348: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Improve the workflow for extracting and

[jira] [Updated] (TIKA-3420) Set tesseract ocr langauges as docker build args

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3420: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Set tesseract ocr langauges as docker bu

[jira] [Updated] (TIKA-2945) AutoDetectParser should skip the content type detection if Metadata already has it

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2945: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > AutoDetectParser should skip the content

[jira] [Updated] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3368: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Add Bill of Materials (BOM) artifact (Ti

[jira] [Updated] (TIKA-2758) Possible error charset detection

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2758: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Possible error charset detection > -

[jira] [Updated] (TIKA-3367) Add Bill of Materials (BOM) artifact

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3367: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Add Bill of Materials (BOM) artifact > -

[jira] [Updated] (TIKA-2796) Update GoogleTranslator to use google-cloud-translate Java API

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2796: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Update GoogleTranslator to use google-cl

[jira] [Updated] (TIKA-3270) Render non-text in PDFs for OCR

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3270: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Render non-text in PDFs for OCR > --

[jira] [Updated] (TIKA-3314) Treat soft hyphens like hyphens

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3314: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Treat soft hyphens like hyphens > --

[jira] [Updated] (TIKA-2623) get embedded resources in PDF/doc files

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2623: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > get embedded resources in PDF/doc files

[jira] [Updated] (TIKA-2794) Tika extracts text from pdf on MacBook, but not windows server.,

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2794: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Tika extracts text from pdf on MacBook,

[jira] [Updated] (TIKA-2346) Allow Office format parsers to exclude parsing shapes

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2346: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Allow Office format parsers to exclude p

[jira] [Updated] (TIKA-2946) Review how TikaConfig can avoid parsing XML itself

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2946: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Review how TikaConfig can avoid parsing

[jira] [Updated] (TIKA-2701) Text is not extracted properly from WMF files

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2701: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Text is not extracted properly from WMF

[jira] [Updated] (TIKA-2711) When parsing a UNIX text file apostrophes are rendered as ?

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2711: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > When parsing a UNIX text file apostrophe

[jira] [Updated] (TIKA-2720) A parser to output universal sentence encodings to text

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2720: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > A parser to output universal sentence en

[jira] [Updated] (TIKA-2492) Remove pdfdebugger from tika

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2492: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Remove pdfdebugger from tika > -

[jira] [Updated] (TIKA-2346) Allow Office format parsers to exclude parsing shapes

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2346: -- Fix Version/s: 2.0.1 > Allow Office format parsers to exclude parsing shapes > -

[jira] [Updated] (TIKA-2596) Make PDF2XHTML and AbstractPDF2XHTML public classes

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2596: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Make PDF2XHTML and AbstractPDF2XHTML pub

[jira] [Updated] (TIKA-2565) Upgrade edu.ucar dependencies to 4.6.11

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2565: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Upgrade edu.ucar dependencies to 4.6.11

[jira] [Updated] (TIKA-2312) [Mp3Parser] expose fields form ID3TagsAndAudio

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2312: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > [Mp3Parser] expose fields form ID3TagsAn

[jira] [Updated] (TIKA-2558) Add a new pid api to Tika

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2558: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Add a new pid api to Tika >

[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2071: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Tika 2.0 - DefaultParser and CompositeParser

[jira] [Updated] (TIKA-2340) Add explicit deps to tika-parsers which are currently used from transitive scope

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2340: -- Fix Version/s: 2.0.1 > Add explicit deps to tika-parsers which are currently used from transitive > sco

[jira] [Updated] (TIKA-2639) Update freedesktop.org shared-mime-info-spec hyperlink in MimeTypesReader.java

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2639: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Update freedesktop.org shared-mime-info-

[jira] [Updated] (TIKA-1988) Age Detection Tika Recogniser

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1988: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Age Detection Tika Recogniser >

[jira] [Updated] (TIKA-1988) Age Detection Tika Recogniser

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1988: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Age Detection Tika Recogniser > -

[jira] [Updated] (TIKA-2312) [Mp3Parser] expose fields form ID3TagsAndAudio

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2312: -- Fix Version/s: 2.0.1 > [Mp3Parser] expose fields form ID3TagsAndAudio > ---

[jira] [Updated] (TIKA-2542) Support in tika-server for getting plain text and metadata at the same time

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2542: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Support in tika-server for getting plain

[jira] [Updated] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1829: -- Fix Version/s: (was: 2.0.0) 2.0.1 > org.apache.tika.parser.ocr.TesseractOCRParser

[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1697: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Parser Implementation for AkomaNtoso Leg

[jira] [Updated] (TIKA-1953) tika-server NullPointerException while processing rtfs

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1953: -- Fix Version/s: (was: 2.0.0) 2.0.1 > tika-server NullPointerException while proces

[jira] [Updated] (TIKA-2369) Define a clean Recogniser interface: for objects from binary data; and for text classification

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2369: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Define a clean Recogniser interface: for

[jira] [Updated] (TIKA-3104) Detection of memgraph files exported from Xcode

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3104: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Detection of memgraph files exported fro

[jira] [Updated] (TIKA-1724) Create parser for .obo file format.

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1724: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Create parser for .obo file format. > ---

[jira] [Updated] (TIKA-2369) Define a clean Recogniser interface: for objects from binary data; and for text classification

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2369: -- Fix Version/s: 2.0.1 > Define a clean Recogniser interface: for objects from binary data; and for > tex

[jira] [Updated] (TIKA-1688) Tika Version in Metadata

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1688: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Tika Version in Metadata > --

[jira] [Updated] (TIKA-1808) Head section closed too eager

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1808: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Head section closed too eager > -

[jira] [Updated] (TIKA-1709) Tika Server doesn't handle multi-part attachments or form-encoded inputs

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1709: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Tika Server doesn't handle multi-part at

[jira] [Updated] (TIKA-1840) No way to link slide notes to slide in PPT output.

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1840: -- Fix Version/s: (was: 2.0.0) 2.0.1 > No way to link slide notes to slide in PPT ou

[jira] [Updated] (TIKA-1808) Head section closed too eager

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1808: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Head section closed too eager >

[jira] [Updated] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1829: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > org.apache.tika.parser.ocr.TesseractOCRP

[jira] [Updated] (TIKA-1709) Tika Server doesn't handle multi-part attachments or form-encoded inputs

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1709: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Tika Server doesn't handle multi-part attachm

[jira] [Updated] (TIKA-2071) Tika 2.0 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2071: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Tika 2.0 - DefaultParser and CompositePa

[jira] [Updated] (TIKA-1724) Create parser for .obo file format.

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1724: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Create parser for .obo file format. > --

[jira] [Updated] (TIKA-1953) tika-server NullPointerException while processing rtfs

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1953: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > tika-server NullPointerException while p

[jira] [Updated] (TIKA-1840) No way to link slide notes to slide in PPT output.

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1840: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > No way to link slide notes to slide in P

[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1705: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Update ASM dependency to 5.0.4 >

[jira] [Updated] (TIKA-1395) Create embedded image extraction example

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1395: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Create embedded image extraction example > --

[jira] [Updated] (TIKA-2340) Add explicit deps to tika-parsers which are currently used from transitive scope

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2340: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Add explicit deps to tika-parsers which

[jira] [Updated] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1454: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Extracting as HTML loses links in xlsx, ppt,

[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1640: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Make ExternalParser support aliases for

[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1609: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Leverage Google's LibPhonenumber for enh

[jira] [Updated] (TIKA-1688) Tika Version in Metadata

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1688: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Tika Version in Metadata > -

[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1607: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Introduce new arbitrary object key/value

[jira] [Updated] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1505: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > chmparser breaks down when extracting fr

[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1738: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > ForkClient does not always delete tempor

[jira] [Updated] (TIKA-1390) Create tika-example module

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1390: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Create tika-example module > ---

[jira] [Updated] (TIKA-1456) Visual Sentiment API parser

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1456: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Visual Sentiment API parser > --

[jira] [Updated] (TIKA-1598) Parser Implementation for Streaming Video

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1598: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Parser Implementation for Streaming Video > -

[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1674: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Add example to show how to extract embedded f

[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1417: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Create Extract Embedded Images from PDFs

[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1697: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Parser Implementation for AkomaNtoso Legal XM

[jira] [Updated] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1465: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Implement extraction of non-global varia

[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1609: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Leverage Google's LibPhonenumber for enhanced

[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1276: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Missing embedded dependencies in tika-bundle

[jira] [Updated] (TIKA-1952) Access Date is getting modified while capturing the MetaData information using AutoDetectParser

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1952: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Access Date is getting modified while ca

[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1738: -- Fix Version/s: (was: 2.0.0) 2.0.1 > ForkClient does not always delete temporary b

[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1366: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Update some of Tika Server services to s

[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1674: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Add example to show how to extract embed

[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1607: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Introduce new arbitrary object key/values dat

[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1705: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Update ASM dependency to 5.0.4 > ---

[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1640: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Make ExternalParser support aliases for key n

[jira] [Updated] (TIKA-1328) Translate Metadata and Content

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1328: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Translate Metadata and Content >

[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1616: -- Fix Version/s: (was: 2.0.0) 2.0.0-BETA > Tika Parser for GIBS Metadata >

[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1417: -- Fix Version/s: (was: 2.0.0) 2.0.1 > Create Extract Embedded Images from PDFs Exam

[jira] [Updated] (TIKA-1800) MediaType#parse does not decode escaped special characters

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1800: -- Fix Version/s: (was: 2.0.0) 2.0.1 > MediaType#parse does not decode escaped speci

[jira] [Updated] (TIKA-1577) NetCDF Data Extraction

2021-07-21 Thread Tim Allison (Jira)
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1577: -- Fix Version/s: (was: 2.0.0) 2.0.1 > NetCDF Data Extraction >
