The character set Windows uses by default (its "ANSI" code page, which is
Windows-1252 on Western European systems) is not the same as UTF-8. That
causes problems when you feed Windows text into an interface that is
expecting UTF-8. I know it drives me crazy.

If you pull up a web page that is in French, and check the page encoding in
your browser, you can try changing it from UTF-8 to Windows or vice versa.
You should see that the accented characters change, so you'll have an
example in front of you.
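
If you'd rather see it without a browser, here is a rough sketch of the same
effect. Python and the sample word "café" are my own choices for
illustration; "cp1252" is Python's name for the Windows Western European
code page.

```python
# Sketch only: the sample word and the use of Python are assumptions
# made for illustration, not part of the original discussion.
text = "caf\u00e9"  # "café"

# UTF-8 bytes read as Windows-1252: the accented letter turns into two
# characters, the way it does when you switch the browser's encoding by hand.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(utf8_bytes, misread)

# The reverse: Windows-1252 bytes are not valid UTF-8 here, so a lenient
# decoder substitutes the replacement character U+FFFD.
cp1252_bytes = text.encode("cp1252")
lenient = cp1252_bytes.decode("utf-8", errors="replace")
print(cp1252_bytes, lenient)
```

The first case is the classic two-characters-for-one garbage; the second is
what usually shows up as a question mark or an empty box.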

The browser will typically render the page according to the character set
specified in the HTTP Content-Type header or in a meta tag in the HTML head;
failing that, it makes a best guess or falls back to its default. Although
this only affects the rendering of the page, so far as the browser is
concerned, it does affect copy and paste. If you copy from a page that is
rendered in the Windows character set, and paste it into an interface (even
another browser window) that is UTF-8, then you can get unexpected (garbage)
characters.

The same thing applies with editors. Even Notepad allows saving a file as
UTF-8, but that only changes how the file is written to disk; it does
nothing for text that was already read with the wrong character set before
it got into the editor.

To make matters worse, a console window uses (by default) yet another
character set: the OEM code page (CP437 or CP850, for example), not the ANSI
one.
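
As a rough illustration, here is how the single byte 160 (0xA0) from the
quoted message below reads under different code pages. Python is my choice
here, and cp437 and cp850 are only examples of console code pages; I am
assuming, not asserting, that one of them was in use on the machines in
question.

```python
# Sketch only: the code pages listed are assumptions for illustration.
raw = bytes([0xA0])  # the byte that chr(160) / char(160) produces

# Decode the same byte under the Windows code page and two console ones.
decoded = {codec: raw.decode(codec) for codec in ("cp1252", "cp437", "cp850")}
for codec, char in decoded.items():
    print(codec, repr(char))

# A no-break space (U+00A0) stored as UTF-8 is two bytes, not one.
nbsp_utf8 = "\u00a0".encode("utf-8")
print(list(nbsp_utf8))
```

Under cp1252 the byte is a no-break space, while under cp437 and cp850 it is
an accented "a". In UTF-8 the no-break space is the byte pair 194, 160,
which would explain a result of 194 from ascii() on the first byte and a
replace() on char(160) alone matching nothing, if the column data is UTF-8.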

In any case, what I have been doing with my applications is to translate the
incoming text from Windows to UTF-8. First, though, I check to see if the
text is already UTF-8 by doing a dummy translation from UTF-8 to UTF-8; if
the results are unchanged, then I know that particular text was already
UTF-8 and that it shouldn't be remapped.
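
A minimal sketch of that check, in Python rather than whatever your
application is written in. The function name to_utf8 and the cp1252 fallback
are hypothetical, chosen for illustration; it uses a strict UTF-8 decode as
the "dummy translation".

```python
# Sketch only: to_utf8 and the cp1252 fallback are illustrative assumptions.
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else remap it from fallback."""
    try:
        raw.decode("utf-8")  # strict decode: raises on invalid sequences
        return raw           # already UTF-8, so leave it alone
    except UnicodeDecodeError:
        # Not UTF-8: treat the bytes as Windows text and re-encode them.
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return raw.decode(fallback, errors="replace").encode("utf-8")


print(to_utf8(b"caf\xe9"))      # Windows-1252 input
print(to_utf8(b"caf\xc3\xa9"))  # input that is already UTF-8
```

This is a heuristic, not a guarantee: a few Windows byte sequences happen to
be valid UTF-8 as well, and those will pass through untranslated.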

You will also run into this problem if you copy and paste from a PDF, I
suspect.

This whole thing gives me a headache. I hope someone else who really
understands this stuff will respond, so we can both learn.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341


> -----Original Message-----
> From: Amer Neely [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 13, 2007 3:53 AM
> To: mysql@lists.mysql.com
> Subject: Re: Removing space characters ... char(160)? ... char(194)?
>
> > Hi all.
> >
> > I'm trying to weed out garbage that comes from copying and pasting stuff
> > from a web page.
> >
> > Some of the data has spaces, but a *different* kind of space ... a
> > char(160) kind ... I think ... I figured this out by copying the space
> > character and pasting it into mysql thus:
> >
> > select ascii(' ');
> >
> >  ... where the space was pasted in.
> >
> > So I'm using:
> >
> > update tmp_AAPT_OnlineAnalyser_ChargeTypeSummary set Service_Number =
> > replace( Service_Number, char(160), '' );
> >
> >  ... but this returns:
> >
> > Query OK, 0 rows affected (0.00 sec)
> > Rows matched: 313  Changed: 0  Warnings: 0
> >
> > So it's not finding char(160) in Service_Number. If I try another way to
> > get at the space character, I get a different result:
> >
> > select ascii( right( Service_Number, 1 ) ) from
> > tmp_AAPT_OnlineAnalyser_ChargeTypeSummary;
> >
> >  ... gives me a big set of results, all 194 ( ie char(194) ). But when I
> > compare both the characters:
> >
> > select char(160), char(194);
> >
> >  ... I get:
> >
> > +-----------+-----------+
> > | char(160) | char(194) |
> > +-----------+-----------+
> > | <A0>      | <C2>      |
> > +-----------+-----------+
> >
> >  ... and both the <A0> and <C2> results are in reverse video. The <A0>
> > *looks* like the stuff I'm getting at the end of fields when I just do a
> > select from the table in the MySQL command-line client, eg the 1st
> > record has Service_Number:
> >
> > 0298437600<A0>
> >  ( <A0> is reversed ).
> >
> > Lastly, maybe I shouldn't add this, but when I construct the space
> > character from a Perl app running under Windows 2000:
> >
> > my $space_character = chr(160);
>
> When I do: perl -e "print chr(160);"
> I get: á
>
> This is also with Win2K and ActiveState.
>
> I've been following several threads on character sets and collation as
> well. I have a database that contains accented data (Canadian French)
> that doesn't render correctly in a browser window. I'm going to try
> converting it and the tables to utf8 Unicode. Then make sure the
> character set for the HTML is also utf8.
>
> >
> > and then insert it into the SQL:
> >
> > my $sql = "update tmp_AAPT_OnlineAnalyser_ChargeTypeSummary set
> > Service_Number = replace( Service_Number, '" . $space_character . "', '' )";
> >
> > it works! But the *exact* same Perl code running on a Linux client fails
> > ( doesn't update the field anyway ). It defies logic.
> >
> > Who knows what's going on?
>
>
> --
> Amer Neely
> w: www.softouch.on.ca/
> b: www.softouch.on.ca/blog/
> Perl | MySQL programming for all data entry forms.
> "We make web sites work!"
>
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe:
> http://lists.mysql.com/[EMAIL PROTECTED]
>
>



