Sandro Lackner created TIKA-4394:
------------------------------------

             Summary: DXF files with a comment in the first two lines are sometimes detected as text/plain
                 Key: TIKA-4394
                 URL: https://issues.apache.org/jira/browse/TIKA-4394
             Project: Tika
          Issue Type: Bug
    Affects Versions: 3.1.0, 2.9.3
            Reporter: Sandro Lackner
         Attachments: testDXF_long_top_comment-1.dxf, testDXF_no_comment.dxf, testDXF_original_comment.dxf, testDXF_short_top_comment.dxf, testDXF_shorter_top_comment.dxf

While working with DXF files, I noticed that Tika fails to detect the correct MIME type for some of them. Instead of "image/vnd.dxf;format=ascii", Tika detects them as "text/plain". After some analysis, I found that longer comments at the beginning of a DXF file cause this behaviour.

Attached are my test files; of these, only the "shorter" and "no_comment" ones are detected correctly. The "original" DXF file contains the comment that led to this finding in the first place. We received it from a customer who creates DXF files with the GEONIS framework, which is an ESRI ArcGIS Desktop extension. GEONIS seems to use the netDxf framework for file creation, which writes this comment at the top of the file:

{noformat}
Dxf file generated by netDxf https://netdxf.codeplex.com, Copyright(C) 2009-2016 Daniel Carvajal, Licensed under LGPL
{noformat}

Looking through the Tika GitHub repository, I saw that tika-mimetypes.xml contains a regular expression in the magic match for the MIME type "image/vnd.dxf;format=ascii" that limits the length of a comment at the beginning of the file to 64 characters. This regex was added for TIKA-3550 in the following [commit|https://github.com/apache/tika/commit/e9f36cb7425f25a768eb58936505277ccbecfedf].

In my opinion, the 64-character limit should be removed and replaced by a more fitting regular expression for detecting ASCII DXF files.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
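
To illustrate the effect described in this report, here is a minimal Python sketch. The two patterns are hypothetical and are not copied from tika-mimetypes.xml; they only assume that a leading DXF comment is written as group code 999 followed by the comment text on the next line, and that the file then continues with "0" / "SECTION". The names BOUNDED, UNBOUNDED and dxf_head are made up for this example.

```python
import re

# Hypothetical patterns for illustration only; NOT the actual Tika regex.
# BOUNDED allows a single leading comment of at most 64 characters.
BOUNDED = re.compile(
    r"\A\s*999\r?\n[^\r\n]{0,64}\r?\n\s*0\r?\n\s*SECTION\r?\n"
)
# UNBOUNDED allows any number of leading comments of any length.
UNBOUNDED = re.compile(
    r"\A(?:\s*999\r?\n[^\r\n]*\r?\n)*\s*0\r?\n\s*SECTION\r?\n"
)

# The comment quoted in this report (longer than 64 characters).
NETDXF_COMMENT = (
    "Dxf file generated by netDxf https://netdxf.codeplex.com, "
    "Copyright(C) 2009-2016 Daniel Carvajal, Licensed under LGPL"
)


def dxf_head(comment):
    """Build the first lines of an ASCII DXF file with one leading comment."""
    return f"999\n{comment}\n  0\nSECTION\n  2\nHEADER\n"


short_head = dxf_head("short comment")
long_head = dxf_head(NETDXF_COMMENT)

for name, head in (("short", short_head), ("long", long_head)):
    print(
        name,
        "bounded:", bool(BOUNDED.match(head)),
        "unbounded:", bool(UNBOUNDED.match(head)),
    )
```

Under these assumptions, the bounded pattern accepts the short comment but rejects the netDxf comment, while the unbounded variant accepts both. Whether an unbounded match is suitable for Tika's magic detection is a separate question that this sketch does not answer.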