I'd be interested in having some sort of flag on the request that indicates
whether the incoming query was bad. I can't do a die here for legacy reasons.
jnap
On Sunday, August 2, 2015 9:39 AM, Bill Moseley <[email protected]> wrote:
BTW -- I wonder about the Catalyst behavior here.
On Sat, Aug 1, 2015 at 10:36 PM, Bill Moseley <[email protected]> wrote:
On Sat, Aug 1, 2015 at 6:31 AM, Stefan <[email protected]> wrote:
Hi, if a URL parameter contains a Unicode character (e.g.
www.example.com/?param=%D6lso%DF which stands for param=Ölsoße), the parameter
is not correctly parsed as Unicode.
One note here -- data over the wire must be encoded into octets. So, all
Unicode characters must be encoded and then decoded when received. (You can't
send "Unicode characters".) UTF-8 is used now (for obvious reasons).
http://tools.ietf.org/html/rfc3986.
You are specifying %D6 -- although the Unicode character is U+00D6, its UTF-8
octet sequence is 0xC3 0x96. See:
http://www.fileformat.info/info/unicode/char/00D6/index.htm
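
To make the difference between the code point and its wire encoding concrete, here is a minimal sketch using only the core Encode module. The helper name utf8_percent_encode is made up for this illustration; it is not part of Catalyst or of any URI module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (name invented for this example): encode a character
# string to UTF-8 octets, then percent-encode every octet that is not an
# RFC 3986 "unreserved" character.
sub utf8_percent_encode {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

# "Ölsoße" written with explicit code points U+00D6 and U+00DF
print utf8_percent_encode("\x{D6}lso\x{DF}e"), "\n";    # %C3%96lso%C3%9Fe
```

So the query string a UTF-8 decoder expects for this value is ?param=%C3%96lso%C3%9Fe, whereas %D6lso%DFe is the Latin-1 encoding of the same text.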
Unless otherwise instructed, Catalyst uses UTF-8 as the encoding for decoding
query parameters -- query parameters are decoded from UTF-8 octets to Perl
characters.
As your example showed, if you use invalid UTF-8 sequences then
Encode::decode() as used by Catalyst will replace those with the U+FFFD
substitution character "�".
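
The substitution can be reproduced outside Catalyst with plain Encode. This is only a sketch of the default decode() behavior (no CHECK argument), not a copy of what Catalyst does internally:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The octets you get after percent-decoding "%D6lso%DFe".
# These are Latin-1 bytes, not a valid UTF-8 sequence.
my $octets = "\xD6lso\xDFe";

# With no CHECK argument, decode() replaces each malformed sequence
# with U+FFFD instead of raising an error.
my $decoded = decode('UTF-8', $octets);

# Print code points rather than the string itself, to avoid
# "Wide character" warnings on an unencoded STDOUT.
print join(' ', map { sprintf 'U+%04X', ord } split //, $decoded), "\n";
```

The two high bytes come back as U+FFFD and the ASCII letters are untouched, which is the "�lso�e" shown above.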
This may or may not be what you want. Personally, I think it's not correct to
silently modify user input. You intended to pass "Ölsoße" but ended up with
"�lso�e" -- is that really the data you would want to process/store for the
request? Seems unlikely.
If "param" is supposed to be passed as textual, UTF-8-encoded octets, and it
isn't, then maybe returning a 400 is a better way of handling that. That
probably would have helped you see what is wrong in this case.
i.e. use "eval { decode( $default_query_encoding, $str, FB_CROAK | LEAVE_SRC );
}" to catch invalid data and return to the client the "$str" that failed and
why.
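
A slightly fuller sketch of that idea follows. The function strict_decode and its return convention are invented for this example, and how the error is turned into a 400 response is left to the application:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns ($text, undef) on success,
# or (undef, $error_message) if the octets are not valid in $encoding.
sub strict_decode {
    my ($encoding, $octets) = @_;
    my $text;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $octets unmodified so it can be reported back.
    my $ok = eval { $text = decode($encoding, $octets, FB_CROAK | LEAVE_SRC); 1 };
    return (undef, "cannot decode as $encoding: $@") unless $ok;
    return ($text, undef);
}

my ($value, $error) = strict_decode('UTF-8', "\xD6lso\xDFe");
print "400 Bad Request: $error" if defined $error;
```

Returning the error instead of dying inside the helper lets the caller decide whether to reject the request or just record that the input was bad.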
Of course, it is also possible that you have some query parameters that you
want decoded as UTF-8 and some that might represent something else (a raw
sequence of bytes), and want more manual control. In that case
$c->config->{do_not_decode_query} could be used to bypass the decoding. But
then, you must manually decode() yourself.
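
As a rough illustration of the manual route, assuming the raw parameters arrive as undecoded octets: the %raw hash, the parameter names "name" and "blob", and the %is_text lookup are all hypothetical stand-ins, not Catalyst API.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical raw query parameters, as they would look with decoding disabled
my %raw = (
    name => "\xC3\x96lso\xC3\x9Fe",    # UTF-8 encoded text
    blob => "\xD6\xDF",                # opaque bytes, not text
);

# Hypothetical list of parameters that should be treated as UTF-8 text
my %is_text = (name => 1);

my %params;
for my $key (keys %raw) {
    $params{$key} = $is_text{$key}
        ? decode('UTF-8', $raw{$key}, FB_CROAK | LEAVE_SRC)    # dies if invalid
        : $raw{$key};                                          # keep bytes as-is
}

printf "name has %d characters, blob has %d bytes\n",
    length $params{name}, length $params{blob};
```

In a real action the decode() call would be wrapped in an eval so that a bad "name" value can be reported rather than killing the request.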
--
Bill Moseley
[email protected]
_______________________________________________
List: [email protected]
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/[email protected]/
Dev site: http://dev.catalyst.perl.org/