Sandro Lackner created TIKA-4394:
------------------------------------

             Summary: DXF files with a comment in the first two lines are 
sometimes detected as text/plain
                 Key: TIKA-4394
                 URL: https://issues.apache.org/jira/browse/TIKA-4394
             Project: Tika
          Issue Type: Bug
    Affects Versions: 3.1.0, 2.9.3
            Reporter: Sandro Lackner
         Attachments: testDXF_long_top_comment-1.dxf, testDXF_no_comment.dxf, 
testDXF_original_comment.dxf, testDXF_short_top_comment.dxf, 
testDXF_shorter_top_comment.dxf

When working with DXF files I realized that Tika fails to detect the correct 
MIME type for some of them. Instead of "image/vnd.dxf;format=ascii", Tika 
detects them as "text/plain".

After some analysis I found that longer comments at the beginning of the 
DXF files cause this behaviour. Attached are my test files, of which 
only the "shorter" and "no_comment" ones are detected correctly.
The "original" DXF file contains the comment that led to this finding in the 
first place. We got it from a customer who uses the GEONIS framework, an ESRI 
ArcGIS Desktop extension, for DXF file creation. It seems to use the 
netDxf library for creation, which writes this comment in the first line.
{noformat}
Dxf file generated by netDxf https://netdxf.codeplex.com, Copyright(C) 
2009-2016 Daniel Carvajal, Licensed under LGPL
{noformat}

Looking through the Tika GitHub repository, I found that tika-mimetypes.xml 
contains a regular expression in the magic match for the MIME type 
"image/vnd.dxf;format=ascii" that limits the length of a comment at the 
beginning of the file to 64 characters.
This regex was added for TIKA-3550 with the following 
[commit|https://github.com/apache/tika/commit/e9f36cb7425f25a768eb58936505277ccbecfedf].

In my opinion the 64-character limit should be removed and the pattern 
replaced by a more fitting regular expression for ASCII DXF file detection.
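
To illustrate the effect, here is a minimal Python sketch. The two patterns 
are hypothetical stand-ins that I wrote for this example; they are NOT the 
expression from tika-mimetypes.xml, and the helper names are made up. The 
sketch assumes the comment is stored as a "999" group code line followed by 
the comment text, and only shows how a {0,64} bound on the comment line 
rejects the netDxf header while an unbounded variant accepts it.

```python
import re

# Hypothetical patterns for illustration only, not Tika's actual magic regex.
# BOUNDED allows one optional 999 comment of at most 64 characters.
BOUNDED = re.compile(
    rb"\A\s*(?:999\r?\n[^\r\n]{0,64}\r?\n)?\s*0\r?\n\s*SECTION\r?\n"
)
# UNBOUNDED allows any number of 999 comments of any length.
UNBOUNDED = re.compile(
    rb"\A\s*(?:999\r?\n[^\r\n]*\r?\n)*\s*0\r?\n\s*SECTION\r?\n"
)

# Comment text quoted in this issue (joined onto one line).
NETDXF_COMMENT = (
    b"Dxf file generated by netDxf https://netdxf.codeplex.com, "
    b"Copyright(C) 2009-2016 Daniel Carvajal, Licensed under LGPL"
)


def make_dxf_head(comment=None):
    """Build the first lines of an ASCII DXF file, optionally with a comment."""
    head = b"" if comment is None else b"999\n" + comment + b"\n"
    return head + b"  0\nSECTION\n  2\nHEADER\n"


def looks_like_ascii_dxf(data, pattern):
    """Return True if the start of `data` matches the given pattern."""
    return pattern.match(data) is not None


if __name__ == "__main__":
    samples = {
        "no_comment": make_dxf_head(),
        "short_comment": make_dxf_head(b"short comment"),
        "netdxf_comment": make_dxf_head(NETDXF_COMMENT),
    }
    for name, data in samples.items():
        print(
            name,
            "bounded:", looks_like_ascii_dxf(data, BOUNDED),
            "unbounded:", looks_like_ascii_dxf(data, UNBOUNDED),
        )
```

Any real replacement would still have to fit into Tika's magic matching, 
which this sketch does not model.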



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
