Hello,
It looks like XmlStreamReader is not correctly handling several encodings
in Commons IO 2.14.0 that previously worked in version 2.13.0.
Here's a self-contained snippet (Kotlin) that demonstrates the problem:
val xml = "Ç"
val stream = xml.byteInputStream(Charset.forName("437"))
The byte input stream does not carry any encoding information, so the
XmlStreamReader has to guess what encoding was used.
I'm surprised that it ever worked reliably.
On Tue, 3 Oct 2023 at 09:13, Laurence Gonsalves wrote:
>
> Hello,
>
> It looks like XmlStreamReader is not correctly handling sev
Feel free to provide a PR on GitHub where the unit test must fail if main
changes are not applied. You can also provide a PR that only contains a
unit test.
Gary
On Tue, Oct 3, 2023, 4:13 AM Laurence Gonsalves wrote:
> Hello,
>
> It looks like XmlStreamReader is not correctly handling several
On Tue, Oct 3, 2023 at 1:39 AM sebb wrote:
>
> The byte input stream does not carry any encoding information, so the
> XmlStreamReader has to guess what encoding was used.
Determining what encoding to use when reading XML from a byte stream
is the purpose of XmlStreamReader. From its documentatio
On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
wrote:
>
> On Tue, Oct 3, 2023 at 1:39 AM sebb wrote:
> >
> > The byte input stream does not carry any encoding information, so the
> > XmlStreamReader has to guess what encoding was used.
>
> Determining what encoding to use when reading XML from a
Just had another look at the class: in 2.13, the regex for matching
the encoding string was
Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
Pattern.MULTILINE);
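A minimal sketch of how the 2.13 pattern quoted just above treats a numeric encoding name. The pattern string and the MULTILINE flag are copied from this message. The class name, the helper method, and the sample XML declarations are hypothetical and are not taken from Commons IO.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OldEncodingPattern {
    // Pattern string as quoted in this thread for Commons IO 2.13.
    static final Pattern ENCODING_2_13 = Pattern.compile(
            "<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
            Pattern.MULTILINE);

    // Hypothetical helper: returns the declared encoding with the
    // surrounding quotes removed, or null if no declaration is found.
    static String declaredEncoding(String prolog) {
        Matcher m = ENCODING_2_13.matcher(prolog);
        if (!m.find()) {
            return null;
        }
        String quoted = m.group(1);
        return quoted.substring(1, quoted.length() - 1);
    }

    public static void main(String[] args) {
        // Sample declarations are made up for illustration.
        System.out.println(declaredEncoding("<?xml version=\"1.0\" encoding=\"437\"?>"));
        System.out.println(declaredEncoding("<?xml version='1.0' encoding='ISO_8859-1:1987'?>"));
        System.out.println(declaredEncoding("<?xml version=\"1.0\"?>"));
    }
}
```

Because this pattern accepts any characters between the quotes, it places no restriction on the first character of the name or on punctuation such as a colon.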
In 2.14, the pattern includes the following matching for the encoding:
"encoding\\s*=\\s*((?:\"[A-Za-z]([A
Although the pages I linked don't mention them, it turns out that
there is actually an alias '437', also many other numeric ones.
Indeed there are other aliases that start with a letter but otherwise
don't match the RE.
e.g. ISO_8859-1:1987
So it seems the updated RE is indeed too restrictive.
So
Thank you. My git bisect just found this change too. :-)
We are processing documents that we have no control over, and some may use
these numeric encodings, so we can't update the documents.
Looking at the XML spec
(https://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say...
Enc
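For illustration only, here is a small check of the encoding names mentioned in this thread against the EncName production from the XML 1.0 spec linked above, `[A-Za-z] ([A-Za-z0-9._] | '-')*`. The 2.14 pattern is cut off earlier in the thread, so this sketch does not reproduce it. It uses the spec production as a stand-in, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class EncNameCheck {
    // EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*   (XML 1.0 spec production)
    // This is an assumption-level stand-in, not the Commons IO 2.14 regex.
    static final Pattern ENC_NAME = Pattern.compile("[A-Za-z][A-Za-z0-9._-]*");

    // Returns true if the whole name conforms to the EncName production.
    static boolean isSpecEncName(String name) {
        return ENC_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Names taken from the thread, plus UTF-8 for comparison.
        for (String name : new String[] {"UTF-8", "437", "ISO_8859-1:1987"}) {
            System.out.println(name + " -> " + isSpecEncName(name));
        }
    }
}
```

Under this production, "437" fails because it starts with a digit, and "ISO_8859-1:1987" fails because of the colon, which is consistent with the observation that a strictly spec-based pattern rejects aliases found in real documents.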
On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves
wrote:
>
> Thank you. My git bisect just found this change too. :-)
>
> We are processing documents that we have no control over, and some may use
> these numeric encodings, so we can't update the documents.
>
> Looking at the XML spec
> (https://ww
Created https://issues.apache.org/jira/browse/IO-815
On Tue, 3 Oct 2023 at 21:49, sebb wrote:
>
> On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves
> wrote:
> >
> > Thank you. My git bisect just found this change too. :-)
> >
> > We are processing documents that we have no control over, and some m
On Tue, Oct 3, 2023 at 1:50 PM sebb wrote:
> > Given this inconsistency, and the fact that there are XML documents "in the
> > wild" that use these encoding names, would it be reasonable to relax the
> > regex
> > just enough so that it'll work with these other names and aliases?
>
> I would say
On Tue, 3 Oct 2023 at 22:17, Laurence Gonsalves
wrote:
>
> On Tue, Oct 3, 2023 at 1:50 PM sebb wrote:
> > > Given this inconsistency, and the fact that there are XML documents "in
> > > the
> > > wild" that use these encoding names, would it be reasonable to relax the
> > > regex
> > > just eno