[io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Laurence Gonsalves
Hello, It looks like XmlStreamReader is not correctly handling several encodings in Commons IO 2.14.0 that previously worked in version 2.13.0. Here's a self-contained snippet (Kotlin) that demonstrates the problem: val xml = "Ç" val stream = xml.byteInputStream(Charset.forName("437"))

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
The byte input stream does not carry any encoding information, so the XmlStreamReader has to guess what encoding was used. I'm surprised that it ever worked reliably. On Tue, 3 Oct 2023 at 09:13, Laurence Gonsalves wrote: > > Hello, > > It looks like XmlStreamReader is not correctly handling sev

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Gary Gregory
Feel free to provide a PR on GitHub where the unit test must fail if main changes are not applied. You can also provide a PR that only contains a unit test. Gary On Tue, Oct 3, 2023, 4:13 AM Laurence Gonsalves wrote: > Hello, > > It looks like XmlStreamReader is not correctly handling several

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Laurence Gonsalves
On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > The byte input stream does not carry any encoding information, so the > XmlStreamReader has to guess what encoding was used. Determining what encoding to use when reading XML from a byte stream is the purpose of XmlStreamReader. From its documentatio

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves wrote: > > On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > > > The byte input stream does not carry any encoding information, so the > > XmlStreamReader has to guess what encoding was used. > > Determining what encoding to use when reading XML from a

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
Just had another look at the class: in 2.13, the regex for matching the encoding string was Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))", Pattern.MULTILINE); In 2.14, the pattern includes the following matching for the encoding: "encoding\\s*=\\s*((?:\"[A-Za-z]([A
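
As an illustration of the 2.13 behaviour described above, here is a minimal Java sketch that applies the pattern quoted in this message to an XML prolog. The class name `PrologEncoding` and the helper `declaredEncoding` are hypothetical and are not part of Commons IO. The 2.14 pattern is not reproduced because it is cut off in this archive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrologEncoding {
    // Pattern as quoted in the message for Commons IO 2.13.
    static final Pattern ENCODING_2_13 = Pattern.compile(
            "<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
            Pattern.MULTILINE);

    // Hypothetical helper: returns the declared encoding without its quotes,
    // or null when the prolog has no encoding pseudo-attribute.
    static String declaredEncoding(String prolog) {
        Matcher m = ENCODING_2_13.matcher(prolog);
        if (!m.find()) {
            return null;
        }
        String quoted = m.group(1); // still wrapped in ' or "
        return quoted.substring(1, quoted.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(declaredEncoding("<?xml version=\"1.0\" encoding=\"437\"?>"));
        System.out.println(declaredEncoding("<?xml version='1.0' encoding='ISO_8859-1:1987'?>"));
        System.out.println(declaredEncoding("<?xml version=\"1.0\"?>"));
    }
}
```

Because this pattern accepts any non-empty quoted value, a numeric name such as 437 is extracted like any other.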

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
Although the pages I linked don't mention them, it turns out that there is actually an alias '437', as well as many other numeric ones. Indeed, there are other aliases that start with a letter but otherwise don't match the RE, e.g. ISO_8859-1:1987. So it seems the updated RE is indeed too restrictive. So

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Laurence Gonsalves
Thank you. My git bisect just found this change too. :-) We are processing documents that we have no control over, and some may use these numeric encodings, so we can't update the documents. Looking at the XML spec (https://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say... Enc
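
To make the mismatch concrete, the following Java sketch checks a few names against the EncName production from the linked XML 1.0 spec, `[A-Za-z] ([A-Za-z0-9._] | '-')*`, and separately asks the running JVM whether it recognises each name. The class `EncNameCheck` and the method `isSpecEncName` are hypothetical names used only for illustration. The regex here is written from the spec and is not claimed to be the exact pattern used in Commons IO 2.14. What `Charset.isSupported` reports depends on the JDK in use.

```java
import java.nio.charset.Charset;
import java.util.regex.Pattern;

public class EncNameCheck {
    // EncName per the XML 1.0 spec: a letter, then letters, digits, '.', '_' or '-'.
    static final Pattern ENC_NAME = Pattern.compile("[A-Za-z][A-Za-z0-9._-]*");

    // Hypothetical helper: true when the whole name matches the EncName production.
    static boolean isSpecEncName(String name) {
        return ENC_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "437", "ISO_8859-1:1987"}) {
            System.out.println(name
                    + " specEncName=" + isSpecEncName(name)
                    + " jvmSupported=" + Charset.isSupported(name));
        }
    }
}
```

A name such as 437 fails the spec production because it starts with a digit, and ISO_8859-1:1987 fails because of the colon, even though both are named in this thread as charset aliases.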

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves wrote: > > Thank you. My git bisect just found this change too. :-) > > We are processing documents that we have no control over, and some may use > these numeric encodings, so we can't update the documents. > > Looking at the XML spec > (https: //ww

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
Created https://issues.apache.org/jira/browse/IO-815 On Tue, 3 Oct 2023 at 21:49, sebb wrote: > > On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves > wrote: > > > > Thank you. My git bisect just found this change too. :-) > > > > We are processing documents that we have no control over, and some m

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Laurence Gonsalves
On Tue, Oct 3, 2023 at 1:50 PM sebb wrote: > > Given this inconsistency, and the fact that there are XML documents "in the > > wild" that use these encoding names, would it be reasonable to relax the > > regex > > just enough so that it'll work with these other names and aliases? > > I would say

Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
On Tue, 3 Oct 2023 at 22:17, Laurence Gonsalves wrote: > > On Tue, Oct 3, 2023 at 1:50 PM sebb wrote: > > > Given this inconsistency, and the fact that there are XML documents "in > > > the > > > wild" that use these encoding names, would it be reasonable to relax the > > > regex > > > just eno