[ 
https://issues.apache.org/jira/browse/TIKA-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michał Ruszkowski updated TIKA-3700:
------------------------------------
    Description: 
Hello,

Recently my team upgraded from Tika 1.x to 2.3 due to a vulnerability, and I 
noticed a problem with file type detection based on content.
 * we have a simple test that calls the method

{code:java}
tika.getDetector().detect(tikaInputStream, metadata);{code}
 * the file we create the inputStream from is placed inside _/test/resources_ 
and it is a *.docx*
 * the detector method DefaultZipContainerDetector.detect() returns 
application/x-tika-ooxml when we run mvn install
 * the same test was working with Tika 1.x
 * we have the _*tika-core*_ and _*tika-parsers-standard-package*_ 
dependencies in pom.xml

The strangest part is that the same test runs successfully via IntelliJ's 
'Run Test...' button.
 * I tried setting UTF-8 encoding in Maven's pom.xml, as well as passing 
-Dfile.encoding=UTF-8 during install, with no success.
 * I compared the content of the files in both cases (successful test and 
failed one) and they look almost the same; however, in one case the whitespace 
seems to be bigger. I don't know whether it makes a difference, but here is 
example content of the file that is properly detected:

{code:java}
�l�������:0Tɭ�"Э�p'䧘 ��tn��&� q(=X�� ��!.���,�_�WF�L8W()���u{code}
 
and here is the same line of content from the file that fails (notice the 
additional whitespace before 'q(='):
{code:java}
�l�������:0Tɭ�"Э�p'䧘 ��tn��&�  q(=X�� ��!.���,�_�WF�L8W()���u {code}
 * I just checked, and it works fine with Tika 2.2.1
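For what it's worth, application/x-tika-ooxml is the generic fallback Tika reports when it can tell the stream is an OOXML-flavoured zip but cannot pin down the concrete document type. A rough, stdlib-only sketch of that kind of entry-based container check (this is *not* Tika's actual implementation; the class name and marker entries are illustrative):

{code:java}
import java.io.ByteArrayInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipPeek {
    // Walk the zip entries and look for OOXML markers. If the bytes were
    // altered in transit (e.g. during the build), the walk can break early
    // and only the generic container type is recognized.
    static String detect(byte[] data) throws Exception {
        boolean contentTypes = false;
        boolean wordDoc = false;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry e;
            while ((e = zip.getNextEntry()) != null) {
                if (e.getName().equals("[Content_Types].xml")) {
                    contentTypes = true;
                }
                if (e.getName().startsWith("word/")) {
                    wordDoc = true;
                }
            }
        }
        if (contentTypes && wordDoc) {
            // concrete type: a Word document
            return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        }
        // OOXML-ish zip, but no recognizable document entries
        return contentTypes ? "application/x-tika-ooxml" : "application/zip";
    }
}
{code}

The point of the sketch: a single corrupted byte in the zip stream can hide the word/ entries from the walk, which would explain falling back from the specific .docx type to application/x-tika-ooxml.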

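The larger whitespace in the failing copy suggests the bytes are being rewritten during the Maven build rather than by Tika itself. One common cause (an assumption on my part, not confirmed from this report) is Maven resource filtering touching binary test resources; a typical workaround is to exclude binary extensions from filtering in pom.xml:

{code:xml}
<!-- Hypothetical pom.xml fragment: keep binary test resources out of filtering -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-resources-plugin</artifactId>
      <configuration>
        <nonFilteredFileExtensions>
          <nonFilteredFileExtension>docx</nonFilteredFileExtension>
        </nonFilteredFileExtensions>
      </configuration>
    </plugin>
  </plugins>
</build>
{code}

This would also fit the symptom that the test passes from IntelliJ (which reads the raw resource) but fails under mvn install (which copies, and possibly filters, it).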

> DefaultZipContainerDetector fails to recognize .docx file
> ---------------------------------------------------------
>
>                 Key: TIKA-3700
>                 URL: https://issues.apache.org/jira/browse/TIKA-3700
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.3.0
>         Environment: Ubuntu + mvn 3.6.3 + java 8
>            Reporter: Michał Ruszkowski
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
