Thanks for your long answer, Aristotle. Much appreciated.

I read it carefully. I also listened to the talk that Jed Lund suggested I listen to.

I think Ricardo Signes (in that talk) is right: to avoid confusion, it's best to decode byte strings into character strings as early as possible, as soon as we receive them. And to encode them to UTF-8 as late as possible, just before we send them out or store them. This way the programmer can be fairly certain that every variable they hold throughout the program contains a character string.
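In Perl that convention looks roughly like this (a minimal sketch using the core Encode module; the sample data is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode as early as possible: bytes arrive from the outside world...
my $bytes = "\xCE\xB1\xCE\xB2";           # the UTF-8 bytes of "αβ"
my $chars = decode('UTF-8', $bytes);      # now a character string

# ...work with character strings everywhere inside the program...
my $len = length $chars;                  # 2 characters, not 4 bytes

# Encode as late as possible: just before output or storage.
my $out = encode('UTF-8', $chars);        # back to bytes for the wire
```

The payoff is that `length`, regexes, `uc`, and friends all operate on characters in the middle of the program, and the byte-level concerns live only at the edges.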

I think I'm going to go with that convention, and let the programmer pass an optional parameter to XML::MyXML's methods to ask them to accept or produce XML documents as byte strings instead of character strings.
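A sketch of what I have in mind (the method name, flag name, and class here are placeholders, not XML::MyXML's settled API): render characters internally, and encode to UTF-8 only when the caller explicitly asks for bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical mock class, for illustration only.
package My::Node;
sub new { bless { chars => $_[1] }, $_[0] }

sub to_xml {
    my ($self, $opts) = @_;
    my $chars = $self->{chars};           # always built as characters
    return $opts->{bytes}                 # encode only on request
        ? encode('UTF-8', $chars)
        : $chars;
}

package main;
my $node  = My::Node->new("<item>\x{3B1}</item>");   # contains "α"
my $chars = $node->to_xml;                  # character string by default
my $bytes = $node->to_xml({ bytes => 1 });  # UTF-8 byte string on request
```

This keeps the default consistent with the decode-early/encode-late convention, while the optional flag covers users who really do want bytes.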

After all, it's possible for a webserver using Mojolicious or another framework to receive an XML document containing an <?xml version="1.0" encoding="UTF-8"?> declaration, but still, the web developer will be receiving a character string (containing that declaration).

I hope I'm not making a mistake here... I'm following Ricardo's convention, and it's a good thing to conform. My module's documentation will be simpler (I won't have to explain to the user the difference between XML fragments and XML documents), and I'm still giving the user the freedom to process byte strings if they choose.

Thanks for listening to me,

- Alexander


On 05/07/16 08:00, Aristotle Pagaltzis wrote:
* Alexander Karelas <ak...@zoo.gr> [2016-07-04 21:48]:
The same question applies to parsing: should the XML documents that
the module parses be byte strings or character strings?
An XML document must be bytes, because it specifies its encoding in
the <?xml?> at the top (even if only implicitly) and that makes no sense
any other way.

But an XML fragment must be characters because text in XML is Unicode
and fragments do not have an encoding.

But this gets a little metaphysical when you deal with concrete data
because the XML PI is optional. You can’t distinguish XML fragments from
XML documents just by looking at them.

It’s like a string that sticks to ASCII: is that bytes or characters?
The distinction is not in the data, it’s in programmer intent behind the
code that handles the data… but you have to keep that in mind to write
code that actually works correctly. (Which is to say we’re talking about
types. The type is not in the data. This is where an actual type system
helps – having one means you can express that concretely.)

So the I-don’t-believe-in-abstractions answer is… just allow the user to
get the data as both characters and bytes, and make them say which one.
For that case I would argue that the default ought to be bytes.

The more abstractionista answer would be if the user can ask for a node
to be rendered as an XML fragment; in that case, to get characters they
must ask for the document element rendered to a string, and if they ask
for the whole document they always get bytes.
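That second design could be sketched like this (class and method names invented for illustration, not real XML::MyXML methods): rendering the document element yields characters, while rendering the whole document always yields bytes, with the declaration prepended.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative mock class; names are made up for this sketch.
package My::Doc;
sub new { bless { root_chars => $_[1] }, $_[0] }

# The document element rendered to a string: characters, no declaration.
sub root_to_string { $_[0]{root_chars} }

# The whole document: always bytes, with the encoding declared up top.
sub doc_to_bytes {
    my ($self) = @_;
    my $chars = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
              . $self->{root_chars};
    return encode('UTF-8', $chars);
}

package main;
my $doc   = My::Doc->new("<greek>\x{3B1}\x{3B2}</greek>");
my $chars = $doc->root_to_string;   # character string, no declaration
my $bytes = $doc->doc_to_bytes;     # UTF-8 bytes, declaration included
```

The declaration then only ever appears next to data that actually has an encoding, which matches the document-versus-fragment distinction above.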

Regards,
