On Tue, Feb 17, 2009 at 4:26 PM, mike <mike...@gmail.com> wrote:
> i tried that kind of stuff - it did not seem to work.
>
> i will try again... if anyone has any ideas i.e. "use iconv to convert
> to A, then use DOM stuff, then use iconv to move it back to UTF8..."
> etc. i am all ears.

Nope. For example, this is the input text (Simplified Chinese;
apologies if your reader isn't UTF-8):

足以概括英特尔为此所付出的努力。谈及移动设备,英特尔公司自诩在该领域的创新犹如其户友好性设计及能效等一样出类拔萃。同时,英特尔也一直表示要帮助构建能够

Output is this:

&auml;&cedil;&#128;&aring;&#143;&yen;&ldquo;&egrave;&#139;&plusmn;&ccedil;&#137;&sup1;&aring;&deg;&#148;&ccedil;&#131;&shy;&egrave;&iexcl;&middot;&auml;&ordm;&#142;&ccedil;&sect;&raquo;&aring;&#138;&u
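
Reading the first three entities byte by byte: &auml; is character 228
(0xE4), &cedil; is 184 (0xB8), and &#128; is 0x80. E4 B8 80 is the UTF-8
encoding of a single character, U+4E00, so it looks as if each byte of the
UTF-8 input is being treated as a separate Latin-1 character and then
written out as an entity. A minimal sketch of that effect follows;
bytes_to_entities() is a hypothetical helper written only for this
illustration, not what the DOM extension does internally:

```php
<?php
// Hypothetical helper: treat every byte of the input as its own
// character and emit it as a numeric HTML entity.
function bytes_to_entities($bytes) {
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $out .= '&#' . ord($byte) . ';';
    }
    return $out;
}

// U+4E00 encoded as UTF-8 is the three bytes E4 B8 80.
$utf8 = "\xE4\xB8\x80";

echo bytes_to_entities($utf8), "\n"; // prints &#228;&#184;&#128;
```

228 and 184 are the numeric forms of &auml; and &cedil;, which matches the
start of the output above.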

What is funny is that I don't need to alter the actual content at all,
only the values of the "href" and "src" attributes, which are all
standard Latin-based URLs anyway.

Here's the simplest code that reproduces the behavior:

$q = db_query("SELECT id,old FROM testing", "redirects");
while(list($id, $doc) = db_rows($q)) {
        $new = fix_document($doc);
        $new = db_escape($new);
        db_query("UPDATE testing SET new='$new' WHERE id=$id", "redirects");
}
db_free($q);

function fix_document($string) {
        $dom = new DomDocument('1.0', 'UTF-8');
        @$dom->loadHTML($string);
        $dom->preserveWhiteSpace = false;
        return $dom->saveHTML();
}

Note: it is not the db functions. If I do this:

function fix_document($string) {
        return $string;
}

The content is unaltered.

Anyone with any ideas? Are there any options I can feed to the DOM stuff?
It's converting the text to HTML entities, which I don't want either.

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
