[
https://issues.apache.org/jira/browse/SOLR-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841513#comment-13841513
]
Uwe Schindler edited comment on SOLR-5532 at 12/6/13 6:28 PM:
--------------------------------------------------------------
Validated this with the Catalina-based PANGAEA server (Oracle iPlanet
webserver):
{noformat}
VEGA:~ > curl -D - "http://ws.pangaea.de/oai/?verb=Identify"
HTTP/1.1 200 OK
Server: PANGAEA/1.0
Date: Fri, 06 Dec 2013 18:17:59 GMT
Content-type: text/xml;charset=UTF-8
Transfer-encoding: chunked
<?xml version="1.0" encoding="UTF-8"?><OAI-PMH
xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
...
{noformat}
And here is how the servlet sets the Content-Type:
{code:java}
resp.setContentType("text/xml; charset="+charset);
{code}
It looks like every Tomcat does this. The reason: the Content-Type header is
generally not passed to the output unparsed. The servlet container (Catalina)
does some extra parsing to detect the charset, so that when you call the
broken getWriter(), the writer has the correct charset. Most webservers also
normalize headers afterwards (they combine multiple headers into one and
also remove whitespace).
The correct way to handle this is:
- Use ContentStreamBase to extract the MIME type and the charset from the full
Content-Type string (MIME type != Content-Type, that's the fault here). We
already have the methods available, and they should also be available to SolrJ.
- Compare charset and MIME type with equalsIgnoreCase. But: the charset does
not actually need to be compared. The XML parser handles that afterwards, so
there is no need to enforce a specific charset in SolrJ. It should only
enforce the MIME type!
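To illustrate the two points above, here is a minimal sketch of such a check.
It is plain string handling and does not use ContentStreamBase; the class
ContentTypeCheck and its method names are made up for this example and are not
part of SolrJ. It also assumes simple parameters (no ';' inside quoted values).

```java
import java.util.Locale;

/** Hypothetical helper (not SolrJ API): splits a Content-Type header value. */
final class ContentTypeCheck {

    /** Returns the MIME type (everything before the first ';'), trimmed and lower-cased. */
    static String mimeTypeOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String mime = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
        return mime.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the value of the charset parameter, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the MIME type; the parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Compares only the MIME types, ignoring case, whitespace and all parameters. */
    static boolean sameMimeType(String expected, String actual) {
        String e = mimeTypeOf(expected);
        String a = mimeTypeOf(actual);
        return e != null && e.equalsIgnoreCase(a);
    }

    public static void main(String[] args) {
        String expected = "application/xml; charset=UTF-8";
        String actual = "application/xml;charset=UTF-8";
        System.out.println(expected.equals(actual));        // false: strict string equality
        System.out.println(sameMimeType(expected, actual)); // true: MIME types match
        System.out.println(charsetOf(actual));              // UTF-8
    }
}
```

With a check along these lines, the charset would only be extracted so it can be
passed on to the parser, never to reject a response.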
> SolrJ Content-Type validation is too strict, breaks on equivalent content
> types
> -------------------------------------------------------------------------------
>
> Key: SOLR-5532
> URL: https://issues.apache.org/jira/browse/SOLR-5532
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.6
> Environment: Windows 7, Java 1.7.0_45 (64bit), solr-solrj-4.6.0.jar
> Reporter: Jakob Furrer
> Assignee: Mark Miller
> Attachments: SOLR-5532-elyograg-eclipse-screenshot.png,
> SOLR-5532.patch
>
>
> Due to SOLR-3530, HttpSolrServer now does a string equality check between
> the "Content-Type" returned by the server and a getContentType() method
> declared by the ResponseParser, but string equality is too strict and
> can result in errors like this one reported by a user:
> ----
> I just upgraded my Solr instance and with it I also upgraded the solrj
> library in our custom application which sends diverse requests and queries to
> Solr.
> I use the "ping" method to determine whether Solr started correctly under the
> configured address. Since the upgrade the ping response results in an error:
> {code:xml}
> Cause: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> Expected content type application/xml; charset=UTF-8 but got
> application/xml;charset=UTF-8.
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int
> name="QTime">0</int><lst name="params"><str name="df">searchtext</str><str
> name="echoParams">all</str><str name="rows">10</str><str
> name="echoParams">all</str><str name="wt">xml</str><str
> name="version">2.2</str><str name="q">solrpingquery</str><str
> name="distrib">false</str></lst></lst><str name="status">OK</str>
> </response>
> {code}
> The Solr application itself works fine.
> Using an older version of the solrj library than solr-solrj-4.6.0.jar (e.g.
> solr-solrj-4.5.1.jar) in the custom application does not produce this error.
> The exception is thrown in a code block (_HttpSolrServer.java_, method
> _request(...)_, around line 140) which was introduced with version
> 4.6.0.
> Code to reproduce the error:
> {code}
> try {
>   HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8080/Solr/collection");
>   // this line is making all the difference
>   solrServer.setParser(new XMLResponseParser());
>   solrServer.ping();
> } catch (Exception e) {
>   e.printStackTrace();
> }
> {code}
> A global search for "charset=UTF-8" on the source code of solrj indicates
> that other functions besides "ping" might be affected as well, because there
> are several places where "application/xml; charset=UTF-8" is spelled without
> a space after the semicolon.