knoobie created TIKA-4431:
-----------------------------

             Summary: Mime Type Detection Error with File Naming containing Number Sign
                 Key: TIKA-4431
                 URL: https://issues.apache.org/jira/browse/TIKA-4431
             Project: Tika
          Issue Type: Bug
          Components: core
         Environment: {code:xml}
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>3.1.0</version>
</dependency>
{code}
            Reporter: knoobie

I noticed that adding a number sign / hashtag (#) to the file name changes the result of MIME type detection. For example, "Lorem-Ipsum.csv" is correctly detected as "text/csv", but "Lorem-Ipsum#123.csv" (with the same file content) is detected as "text/plain".

{code:java}
import static org.assertj.core.api.Assertions.assertThat;

import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.junit.jupiter.api.Test;

public class ApacheTikaTest {

  @Test
  void detect_normalFileName() {
    var tika = new Tika();
    var fileName = "Lorem-Ipsum.csv";
    var data = """
        Lorem;Ipsum;
        1 ;2 ;
        3 ;4 ;
        """;

    assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
        .isEqualTo("text/csv");
  }

  @Test
  void detect_FileNameWithHashtag() {
    var tika = new Tika();
    var fileName = "Lorem-Ipsum#123.csv";
    var data = """
        Lorem;Ipsum;
        1 ;2 ;
        3 ;4 ;
        """;

    assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
        // Fails with result: 'text/plain'
        .isEqualTo("text/csv");
  }
}
{code}
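
One possible explanation, offered as an assumption and not as a confirmed diagnosis of Tika's internals: if the file name is interpreted as a URI anywhere along the detection path, everything after the "#" becomes the URI fragment and the ".csv" extension is no longer part of the path. The sketch below shows only how the JDK's java.net.URI splits such a name; it does not call Tika, and the class name HashInFileName and its helper methods are hypothetical.

```java
import java.net.URI;

public class HashInFileName {

  // Path component that java.net.URI extracts from a bare file name.
  static String pathOf(String fileName) {
    return URI.create(fileName).getPath();
  }

  // Fragment component (the part after '#'), or null if there is none.
  static String fragmentOf(String fileName) {
    return URI.create(fileName).getFragment();
  }

  public static void main(String[] args) {
    // Without '#', the whole name is the path and the extension survives.
    System.out.println(pathOf("Lorem-Ipsum.csv"));          // Lorem-Ipsum.csv
    System.out.println(fragmentOf("Lorem-Ipsum.csv"));      // null

    // With '#', the path ends before the number sign, so the extension is gone.
    System.out.println(pathOf("Lorem-Ipsum#123.csv"));      // Lorem-Ipsum
    System.out.println(fragmentOf("Lorem-Ipsum#123.csv"));  // 123.csv

    // A percent-encoded number sign stays in the path and is decoded by getPath().
    System.out.println(pathOf("Lorem-Ipsum%23123.csv"));    // Lorem-Ipsum#123.csv
  }
}
```

If this is indeed what happens, passing the name with "#" percent-encoded as "%23" might serve as a workaround, but that is untested against tika-core 3.1.0.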