[ 
https://issues.apache.org/jira/browse/SOLR-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724862#comment-17724862
 ] 

Thiruvalluvan M. G. edited comment on SOLR-16810 at 5/22/23 9:59 AM:
---------------------------------------------------------------------

[~gus] We are not escaping {{#}}. But we encode control characters {{0x00}} 
to {{0x1f}} as {{#nn;}}, where {{nn}} is the decimal representation of 
the control character. The standard suggests {{&#nn;}}, but for whatever 
reason the Solr team decided to use {{#nn;}}. It is probably not intentional 
and just a bug. If we want to remain backward compatible, we have to continue 
to use the non-standard encoding. If we are okay with breaking compatibility, 
we can just fix this bug, assuming that it was not intentional. The secondary 
problem is that we don't do the reverse mapping on loading the XML. This must 
be fixed whichever path we choose.

I've explained it in the description of this ticket along with the ambiguity it 
creates. For example, {{"\u0014"}} and {{"#20;"}} both get encoded as 
{{"#20;"}}. If a schema had two fields with these names, then the XML has 
duplicate field names and cannot be parsed any more.
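
A minimal sketch of the collision, written in Python rather than Solr's Java and assuming the encoder behaves exactly as described above. {{encode_name}} is a hypothetical stand-in for illustration, not Solr's actual method:

```python
def encode_name(name: str) -> str:
    """Mimic the described behaviour: control characters 0x00-0x1f
    become '#nn;' (nn in decimal); everything else, including '#',
    is written unchanged."""
    return "".join(
        f"#{ord(ch)};" if ord(ch) < 0x20 else ch
        for ch in name
    )


control_name = "\u0014"  # one control character, decimal 20
literal_name = "#20;"    # four ordinary printable characters

# Two different field names end up with the same encoded form.
assert control_name != literal_name
assert encode_name(control_name) == encode_name(literal_name)
```

Because {{#}} itself passes through unescaped, nothing in the output tells the two names apart.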


was (Author: thiru_mg):
[~gus] We are not escaping {{#}}. But we encode control characters {{0x00}} 
to {{0x1f}} as {{#nn;}}. I've explained it in the description of this 
ticket along with the ambiguity it creates. For example {{"\u0014"}} and 
{{"#20;"}} both get encoded as {{"#20"}}. If a schema had two fields with 
these names, then the XML has duplicate field name and the XML cannot be parsed 
any more.

> Under certain situations Solr produces managed schema XML that cannot be 
> loaded
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-16810
>                 URL: https://issues.apache.org/jira/browse/SOLR-16810
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Schema and Analysis
>    Affects Versions: 9.2.1
>            Reporter: Thiruvalluvan M. G.
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> While persisting the {{ManagedIndexSchema}} as XML, non-printable characters 
> in field names get escaped as {{#nn;}}, where {{nn}} is the decimal 
> representation of the non-printable character. For example, if the field name 
> has the byte {{0x14}}, it gets escaped as {{#20;}}. This is 
> indistinguishable from the literal {{#20;}} in the field name. If we have two 
> fields - one with the non-printable character and the other with the literal 
> string, two fields get generated with the same name. Loading the resulting 
> XML, naturally, causes an exception. To fix this, any occurrence of a literal 
> {{#}} in the field name should be escaped, say as {{##}}.
> A second problem is that while escaping happens when generating XML, the 
> corresponding unescaping does not happen on loading it. This asymmetry should 
> be fixed as well.
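
The fix proposed in the description could look roughly like the following. This is an illustrative Python sketch, not Solr code: {{escape_name}} and {{unescape_name}} are hypothetical names, and {{##}} is simply the escape suggested above.

```python
import re


def escape_name(name: str) -> str:
    """Escape a literal '#' as '##' and control characters 0x00-0x1f
    as '#nn;' (nn in decimal)."""
    out = []
    for ch in name:
        if ch == "#":
            out.append("##")
        elif ord(ch) < 0x20:
            out.append(f"#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)


# Either an escaped '#' or a '#nn;' control-character escape.
# '##' is listed first so it wins when both could start at the same spot.
_TOKEN = re.compile(r"##|#([0-9]+);")


def unescape_name(text: str) -> str:
    """Reverse escape_name: '##' -> '#', '#nn;' -> chr(nn)."""
    def repl(match: re.Match) -> str:
        if match.group(0) == "##":
            return "#"
        return chr(int(match.group(1)))
    return _TOKEN.sub(repl, text)


# The two names from the description no longer collide and round-trip.
for name in ["\u0014", "#20;", "#\u0014", "##", "plain_field"]:
    assert unescape_name(escape_name(name)) == name
assert escape_name("\u0014") != escape_name("#20;")
```

With this scheme {{"\u0014"}} is written as {{#20;}} and the literal {{"#20;"}} as {{##20;}}, so the names stay distinct, and applying the reverse mapping on load restores both.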



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

