Re: [Tutor] unicode decode/encode issue

bruce Mon, 26 Sep 2016 14:30:30 -0700

Hey folks. (peter!)

Thanks for the reply.


I wound up doing:

  #s=s.replace('\u2013', '-')
  #s=s.replace(u'\u2013', '-')
  #s=s.replace(u"\u2013", "-")
  #s=re.sub(u"\u2013", "-", s)
  s=s.encode("ascii", "ignore")
  s=s.replace(u"\u2013", "-")
  s=s.replace("&#8211;", "-")  ##<<< this was actually in the raw content apparently

  print repr(s)
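
For what it's worth, the literal &#8211; is an HTML numeric character
reference for U+2013 (EN DASH). Rather than special-casing that one string,
the references could be decoded first. A minimal sketch only, assuming the
content is HTML text; the import fallback is there so it runs on both
Python 3 and Python 2, and `sample` is made-up data, not the real page:

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 fallback
    unescape = HTMLParser().unescape

# Hypothetical input resembling the raw page content
sample = u"English 120 Course Syllabus &#8211; Fall &#8211; 2006"

# Turn character references such as &#8211; into real Unicode characters
decoded = unescape(sample)

# A single replace() now covers both the literal dash and the decoded one
cleaned = decoded.replace(u"\u2013", u"-")
```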

The text no longer has the Unicode 'dash'.

I'll revisit and simplify later. One or two of the above lines should be
able to be removed, and still have the unicode issue resolved.
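
A possible simplification, as a sketch only: in the lines above, encode()
with "ignore" runs first and already deletes the dash, so the replace() on
u"\u2013" that follows has nothing left to match. Doing the replacements
first and encoding last keeps the '-' in the output. The function name
dashes_to_ascii is made up for illustration:

```python
def dashes_to_ascii(text):
    """Swap EN DASH (and its HTML reference) for '-', then drop other non-ASCII."""
    text = text.replace(u"&#8211;", u"-")
    text = text.replace(u"\u2013", u"-")
    # encode() comes last: with "ignore" it deletes anything still non-ASCII
    # and returns a byte string
    return text.encode("ascii", "ignore")

result = dashes_to_ascii(u"English 120 Course Syllabus \u2013 Fall &#8211; 2006")
```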

Thanks


On Mon, Sep 26, 2016 at 1:54 PM, Peter Otten <[email protected]> wrote:

> bruce wrote:
>
> > Hi.
> >
> > I've got a "basic" situation that should be simple. So it must be a user
> > (me) issue!
> >
> >
> > I've got a page from a web fetch. I'm simply trying to go from utf-8 to
> > ascii. I'm not worried about any cruft that might get stripped out as the
> > data is generated from a us site. (It's a college/class dataset).
> >
> > I know this is a unicode issue. I know I need to have a much more
> > robust/pythonic/correct approach. I will later, but for now, just want to
> > resolve this issue, and get it off my plate so to speak.
> >
> > I've looked at stackoverflow, as well as numerous other sites, so I turn
> > to the group for a pointer or two...
> >
> > The unicode that I'm dealing with is u'\u2013'
> >
> > The basic things I've done up to now are:
> >
> >   s=content
> >   s=ascii_strip(s)
> >   s=s.replace('\u2013', '-')
> >   s=s.replace(u'\u2013', '-')
> >   s=s.replace(u"\u2013", "-")
> >   s=re.sub(u"\u2013", "-", s)
> >   print repr(s)
> >
> > When I look at the input content, I have :
> >
> >  u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> >
> > So, any pointers on replacing the \u2013 with a simple '-' (dash)? (Or I
> > could even handle just a ' ' (space).)
>
> I suppose you want to replace the EN DASH with HYPHEN-MINUS. For that both
>
> >   s=s.replace(u'\u2013', '-')
> >   s=s.replace(u"\u2013", "-")
>
> should work (the Python interpreter sees no difference between the two).
> Let's try:
>
> >>> s = u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> >>> t = s.replace(u"\u2013", "-")
> >>> s == t
> False
> >>> s
> u'English 120 Course Syllabus \u2013 Fall \u2013 2006'
> >>> t
> u'English 120 Course Syllabus - Fall - 2006'
>
> So it looks like you did not actually try the code you posted.
>
> To remove all non-ascii codepoints you can use encode():
>
> >>> s.encode("ascii", "ignore")
> 'English 120 Course Syllabus  Fall  2006'
>
> (Note that the result is a byte string)
>
>
> _______________________________________________
> Tutor maillist  -  [email protected]
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>