Re: [Tutor] unicode decode/encode issue

Steven D'Aprano Mon, 26 Sep 2016 10:43:22 -0700

On Mon, Sep 26, 2016 at 12:59:04PM -0400, bruce wrote:

> When I look at the input content, I have :
> 
>  u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> 
> So, any pointers on replacing the \u2013 with a simple '-' (dash) (or I
> could even handle just a ' ' (space)


You misinterpret what you see. \u2013 *is* a dash (it's an en-dash):

py> import unicodedata
py> unicodedata.name(u'\u2013')
'EN DASH'

Try printing the string, and you will see what it looks like:

py> content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
py> print content
English 120 Course Syllabus – Fall – 2006
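
If you do still want a plain ASCII hyphen in place of the en-dash, here is 
a minimal sketch. It assumes the text is already a Unicode string rather 
than encoded bytes; the name `plain` is only for illustration. The 
parenthesised print works in both Python 2 and Python 3.

```python
content = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'

# replace() returns a new string; `content` itself is left unchanged.
plain = content.replace(u'\u2013', u'-')

print(plain)
# English 120 Course Syllabus - Fall - 2006
```

To get a space instead, pass u' ' as the second argument.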


Python strings include a lot of escape codes. Simple byte strings 
include:

\t tab
\n newline
\r carriage return
\0 ASCII null byte
etc.

plus escapes that specify a byte by its hex value:

\xDD (two digit hex code, between hex 00 and hex FF)

That lets you enter any byte between (decimal) 0 and 255. For example:

\x20

is the hex code 20 (decimal 32), which is a space:

py> '\x20' == ' '
True


Unicode strings allow the same escape codes as byte strings, plus 
special Unicode escape codes:

\uDDDD (four digit hex codes, for codes between 0 and 65535)

\UDDDDDDDD (eight digit hex codes, for codes between 0 and 1114111)

\N{name}  (Unicode names)
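
As a quick check, the three Unicode escape forms above are just different 
spellings of the same character. This sketch uses the en-dash from the 
original question; the variable names are made up for the example.

```python
import unicodedata

four_digit = u'\u2013'        # \uDDDD form
eight_digit = u'\U00002013'   # \UDDDDDDDD form, zero-padded to eight digits
by_name = u'\N{EN DASH}'      # \N{name} form

# All three literals produce the same one-character string.
print(four_digit == eight_digit == by_name)   # True
print(len(by_name))                           # 1
print(hex(ord(four_digit)))                   # 0x2013
print(unicodedata.name(eight_digit))          # EN DASH
```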


Remember to print the string to see the actual characters rather than 
the escape codes.



-- 
Steve
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
