knoobie created TIKA-4431:
-----------------------------

             Summary: Mime Type Detection Error with File Naming containing 
Number Sign 
                 Key: TIKA-4431
                 URL: https://issues.apache.org/jira/browse/TIKA-4431
             Project: Tika
          Issue Type: Bug
          Components: core
         Environment: {code:xml}
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>3.1.0</version>
    </dependency>
{code}
            Reporter: knoobie


I noticed that changing the file name to include a number sign / hashtag (#) 
changes the mime type detection.

For example, "Lorem-Ipsum.csv" is correctly detected as "text/csv", but when 
"Lorem-Ipsum#123.csv" is given (with the same file content) the detected type 
is "text/plain".

 
{code:java}
import static org.assertj.core.api.Assertions.assertThat;

import java.nio.charset.StandardCharsets;
import org.apache.tika.Tika;
import org.junit.jupiter.api.Test;

public class ApacheTikaTest {

  @Test
  void detect_normalFileName() {
    var tika = new Tika();
    var fileName = "Lorem-Ipsum.csv";
    var data = """
      Lorem;Ipsum;
      1    ;2    ;
      3    ;4    ;
      """;

    assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
      .isEqualTo("text/csv");
  }

  @Test
  void detect_FileNameWithHashtag() {
    var tika = new Tika();
    var fileName = "Lorem-Ipsum#123.csv";
    var data = """
      Lorem;Ipsum;
      1    ;2    ;
      3    ;4    ;
      """;

    assertThat(tika.detect(data.getBytes(StandardCharsets.UTF_8), fileName))
      // Fails with result: 'text/plain'
      .isEqualTo("text/csv");
  }
}
{code}
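
One possible explanation, which is an assumption on my side and not verified against the Tika source, is that the file name is treated as a URI at some point during detection. With plain java.net.URI, everything after the "#" becomes the fragment, so the path no longer ends in ".csv". The sketch below only demonstrates that JDK behaviour; UriFragmentDemo and pathOf are hypothetical names used for illustration and are not part of Tika.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriFragmentDemo {

  // Returns the path component of the name when parsed as a URI,
  // or the unchanged name if it is not a syntactically valid URI.
  static String pathOf(String name) {
    try {
      String path = new URI(name).getPath();
      return path != null ? path : name;
    } catch (URISyntaxException e) {
      return name;
    }
  }

  public static void main(String[] args) {
    // No '#': the whole name is the path, the extension survives.
    System.out.println(pathOf("Lorem-Ipsum.csv"));      // Lorem-Ipsum.csv

    // With '#': "123.csv" becomes the URI fragment, the extension is lost.
    System.out.println(pathOf("Lorem-Ipsum#123.csv"));  // Lorem-Ipsum

    // Percent-encoded '#': getPath() decodes it, the extension survives.
    System.out.println(pathOf("Lorem-Ipsum%23123.csv")); // Lorem-Ipsum#123.csv
  }
}
```

If this is indeed what happens, passing the name with "#" encoded as "%23" might serve as a workaround, but I have not tested that against Tika.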



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
