[
https://issues.apache.org/jira/browse/SOLR-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Pugh resolved SOLR-6475.
-----------------------------
Resolution: Won't Fix
In Solr 10 we are leveraging either Tika Server (running in its own separate
server process) or maybe Tika Pipes (again, running in a separate JVM).
Please revalidate your issue against Solr 10 with one of those options, and if
it is still a present need, happy to work with you on a fix using the new
approach for Tika.
> SOLR-5517 broke the ExtractingRequestHandler / Tika content-type detection.
> ---------------------------------------------------------------------------
>
> Key: SOLR-6475
> URL: https://issues.apache.org/jira/browse/SOLR-6475
> Project: Solr
> Issue Type: Bug
> Components: contrib - Solr Cell (Tika extraction)
> Affects Versions: 4.7
> Reporter: Dominik Geelen
> Priority: Major
> Labels: Content-Type, Tika, difficulty-medium, impact-medium
>
> Hi,
> as discussed with "hoss" on IRC, I'm creating this issue about a problem we
> recently ran into:
> Our company uses Solr to index user-generated files for fulltext searching
> (PDFs, etc.) by using the ExtractingRequestHandler / Tika.
> Since we recently upgraded to Solr 4.9, the indexing process began to throw
> the following exception: "Must specify a Content-Type header with POST
> requests" (in solr/servlet/SolrRequestParsers.java, line 684 in the 4.9
> source).
> This behavior was introduced with SOLR-5517, but as the Solr wiki states,
> Tika needs the content-type to be empty or not present to trigger
> auto-detection of the content / MIME type.
> Since the two features block each other, but are both basically correct
> behavior, "hoss" suggested that Tika should be fixed to trigger the
> auto-detection on content-type "application/octet-stream" too, and I strongly
> agree with this proposal.
> *Test case:*
> Just use the example from the ExtractingRequestHandler wiki page:
> {noformat}
> curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" \
>   --data-binary @tutorial.html [-H 'Content-type:text/html']
> {noformat}
> but don't send the content-type, obviously. Alternatively, you could just use
> the "SimplePostTool (post.jar)" mentioned in the wiki, but I guess this would
> be broken now, too.
> *Proposed solution:*
> Fix the Tika content-type guessing so that it also triggers auto-detection
> on content-type "application/octet-stream".
> Thanks,
> Dominik
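
The proposal quoted above amounts to treating "application/octet-stream" the same way as a missing or empty content-type. A minimal sketch of that decision follows, written in Python for brevity. It is not Solr or Tika code: the helper name needs_auto_detection and the GENERIC_TYPES set are hypothetical and exist only for illustration.

```python
from typing import Optional

# Hypothetical set of declared types that carry no real information.
# Only "application/octet-stream" is named in the issue text.
GENERIC_TYPES = {"application/octet-stream"}


def needs_auto_detection(content_type: Optional[str]) -> bool:
    """Return True when the declared content-type should be ignored
    and the type detected from the content itself."""
    if content_type is None:
        return True
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "" or media_type in GENERIC_TYPES
```

With such a check, a client could satisfy the POST header requirement introduced by SOLR-5517 by sending "application/octet-stream" while still getting auto-detection, whereas an explicit type such as "text/html" would be used as declared.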
--
This message was sent by Atlassian Jira
(v8.20.10#820010)