So it turns out the main bug does not affect amd64 (or more generally, roughly
systems where (LONG_MAX>>2) >= INT_MAX).
There are at least two bugs in the tag comparison:
- Assumes number equality is the same as VALUE equality.
- Assumes hash equality is the same as string equality.
The first causes the bug: Small numbers are stored as (VALUE)((n<<1)|1);
numbers not representable as such are stored as a bignum. If you're fortunate
enough to be using amd64, the size of a VALUE (unsigned long) is sufficiently
larger than the size of a hash (int) that you can always represent it as a
"fixed" number and you never see this bug.
The second bug means sometimes <tag></othertag> will appear to be valid if tag
and othertag have the same hash and both hashes are "fixable". Searching for
hash collisions is not that difficult - the first example should work on amd64,
the second works on x86:
$ ruby -r hpricot -e "print Hpricot.XML('<afuaf></zhgaa>')"
<afuaf></zhgaa></afuaf>
$ ruby -r hpricot -e "print Hpricot.XML('<aaaayi></zkbaaa>')"
<aaaayi></aaaayi>
The patch below never bothers with the hash comparison because you have to do
the string comparison anyway. If you want to re-add the hash comparison as a
possible optimization (for e.g. the HTML case), the better way is to replace
INT2NUM with INT2FIX instead of doing a bignum comparison. (hpricot_scan.rl is
also patched, though you can just apply the same patch file to both.)
The other two bugs are still there:
- The "BogusETag" prints as </tag>, so it still produces potentially invalid
XML. It should probably print something more sensible (<tag hpricot="bogus
whatever"/>?).
- Invalid XML is still accepted in XML mode, while the correct thing to do is
fail.
There's another oddity in hpricot_scan.rl/hpricot_scan.c:
- The definition of H_ELE(N) for a cBogusETag does H_ELE_SET(ele, H_ELE_ATTR,
rb_str_new(raw, rawlen));. I'm not sure what this is for (ele.attr should be a
dict, for a start).
Anyone feel like submitting this upstream?
--- hpricot_scan.c.0 2009-07-10 13:48:13.000000000 +0100
+++ hpricot_scan.c.3 2009-07-10 14:19:16.000000000 +0100
@@ -326,11 +326,11 @@
// (see also: the search above for fixups)
//
name = INT2NUM(rb_str_hash(tag));
while (e != S->doc)
{
- if (H_ELE_GET(e, H_ELE_HASH) == name)
+ if (rb_str_cmp(H_ELE_GET(e, H_ELE_TAG),tag) == 0)
{
match = e;
break;
}
--- hpricot_scan.rl.0 2009-07-10 14:43:44.000000000 +0100
+++ hpricot_scan.rl.3 2009-07-10 14:44:07.000000000 +0100
@@ -359,11 +359,11 @@
// (see also: the search above for fixups)
//
name = INT2NUM(rb_str_hash(tag));
while (e != S->doc)
{
- if (H_ELE_GET(e, H_ELE_HASH) == name)
+ if (rb_str_cmp(H_ELE_GET(e, H_ELE_TAG),tag) == 0)
{
match = e;
break;
}
--
To UNSUBSCRIBE, email to [email protected]
with a subject of "unsubscribe". Trouble? Contact [email protected]