Re: ascii to latin1

2006-05-10 Thread Luis P. Mendes
> > With regard to º, Richie already gave you food for thought: if you > want "1 DE MO" to match "1º DE MO", remove that symbol from the key > (linha_key = linha_key.translate({ord(u"º"): None})); if you don't want such > fuzzy matching, keep it. > Tha
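A minimal sketch of the key-stripping idea quoted above, written for Python 3 (the thread predates it, so this is an adaptation, not the poster's code). It relies on the fact that `translate` looks characters up by integer code point, so the mapping is keyed with `ord(...)`. `ORDINAL_MAP` and `strip_ordinal` are hypothetical names used only for illustration.

```python
# Remove the masculine ordinal indicator (U+00BA) from a search key.
# translate() looks characters up by code point, so the key must be an int;
# mapping a code point to None deletes that character.
ORDINAL_MAP = {ord("º"): None}


def strip_ordinal(key):
    """Return key with every U+00BA character removed (hypothetical helper)."""
    return key.translate(ORDINAL_MAP)


print(strip_ordinal("1º DE MO"))
```

In Python 3 a mapping keyed by the one-character string instead of its code point is never matched, so the text would come back unchanged.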

Re: ascii to latin1

2006-05-10 Thread Serge Orlov
Luis P. Mendes wrote: > Errors occur when I assign the result of ''.join(cp for cp in de_str if > not unicodedata.category(cp).startswith('M')) to a variable. The same > happens with de_str. When I print the strings everything is ok. > > Here's a short example of data: > 115448,DAÇÃO > 117788,DA

Re: ascii to latin1

2006-05-09 Thread richie
[Luis] > The script converted the ÇÃ from the first line, but not the º from > the second one. That's because º, 0xba, MASCULINE ORDINAL INDICATOR is classed as a letter and not a diacritic: http://www.fileformat.info/info/unicode/char/00ba/index.htm You can't encode it in ascii because it's n
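The classification described here can be checked with the standard `unicodedata` module. This is a small Python 3 sketch, not code from the thread; `describe` is a hypothetical helper. It prints the name and general category of º next to the two combining marks that NFD produces for Ç and Ã.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))


# º is a letter category; the combining cedilla and tilde are marks ("M...").
for ch in ("º", "\u0327", "\u0303"):
    print(describe(ch))

# NFD splits Ã into a base letter plus a combining mark, but leaves º alone,
# so a filter that drops only "M" categories cannot remove it.
print([unicodedata.name(c) for c in unicodedata.normalize("NFD", "Ã")])
print(unicodedata.normalize("NFD", "º") == "º")
```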

Re: ascii to latin1

2006-05-09 Thread Peter Otten
Luis P. Mendes wrote: > The script converted the ÇÃ from the first line, but not the º from the > second one. Still in *, I also don't get a list as [115448,DAÇÃO] but a > [u'115448,DAÇÃO'] element, which doesn't suit my needs. > > Would you mind telling me what should I change? Sometimes you a

Re: ascii to latin1

2006-05-09 Thread Luis P. Mendes
>> When I used the "NFD" option, I came across many errors on these and >> possibly other codes: \xba, \xc9, \xcd. > > What errors? normalize method is not supposed to give any errors. You > mean it doesn't work as expected? Well, I have to admit that

Re: ascii to latin1

2006-05-09 Thread Richie Hindle
[Serge] > I have to admit that using > normalize is a far from perfect way to implement search. The most > advanced algorithm is published by Unicode guys: > If you read it you'll understand > it's not so easy. I only have to look at the length of the docum

Re: ascii to latin1

2006-05-09 Thread Serge Orlov
Luis P. Mendes wrote: > Richie Hindle wrote: > > [Serge] > >> def search_key(s): > >> de_str = unicodedata.normalize("NFD", s) > >> return ''.join(cp for cp in de_str if not > >>unicodedata.category(cp).startswith('M

Re: ascii to latin1

2006-05-09 Thread Serge Orlov
Richie Hindle wrote: > [Serge] > > def search_key(s): > > de_str = unicodedata.normalize("NFD", s) > > return ''.join(cp for cp in de_str if not > >unicodedata.category(cp).startswith('M')) > > Lovely bit of code - thanks for posting it! Well, it is not so good. Please

Re: ascii to latin1

2006-05-09 Thread Richie Hindle
[Luis] > When I used the "NFD" option, I came across many errors on these and > possibly other codes: \xba, \xc9, \xcd. What errors? This works fine for me, printing "Ecoute": import unicodedata def search_key(s): de_str = unicodedata.normalize("NFD", s) return ''.join([cp for cp in de_
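For reference, here is a self-contained Python 3 adaptation of the `search_key` function quoted in this thread (the original posts targeted Python 2 unicode strings). The sample inputs are assumptions: the "Écoute" and "DAÇÃO" strings come from the thread, and the last one is simply the three code points Luis mentions (\xba, \xc9, \xcd).

```python
import unicodedata


def search_key(s):
    """Decompose s (NFD) and drop every combining mark (categories M*)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(
        cp for cp in decomposed
        if not unicodedata.category(cp).startswith("M")
    )


for text in ("Écoute", "DAÇÃO", "\xba\xc9\xcd"):
    print(repr(text), "->", repr(search_key(text)))
```

The accented capitals lose their marks, while º (\xba) passes through untouched because it has no canonical decomposition.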

Re: ascii to latin1

2006-05-09 Thread Luis P. Mendes
Richie Hindle wrote: > [Serge] >> def search_key(s): >> de_str = unicodedata.normalize("NFD", s) >> return ''.join(cp for cp in de_str if not >>unicodedata.category(cp).startswith('M')) > > Lovely bit of code - thanks fo

Re: ascii to latin1

2006-05-09 Thread Richie Hindle
[Serge] > def search_key(s): > de_str = unicodedata.normalize("NFD", s) > return ''.join(cp for cp in de_str if not >unicodedata.category(cp).startswith('M')) Lovely bit of code - thanks for posting it! You might want to use "NFKD" to normalize things like LATIN SMALL
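The difference between the two normalization forms can be sketched as follows (Python 3, standard library only). The `form` parameter is an addition for illustration, and the sample characters are assumptions chosen here, not necessarily the one named in the post: U+FB01 (a ligature) and º both have compatibility decompositions that only "NFKD" applies.

```python
import unicodedata


def search_key(s, form="NFD"):
    """Normalize s with the given form, then drop combining marks (M*)."""
    decomposed = unicodedata.normalize(form, s)
    return "".join(
        cp for cp in decomposed
        if not unicodedata.category(cp).startswith("M")
    )


# "\ufb01m" starts with the single-character ligature U+FB01.
for text in ("\ufb01m", "1º DE MO", "DAÇÃO"):
    print(repr(text),
          "NFD:", repr(search_key(text, "NFD")),
          "NFKD:", repr(search_key(text, "NFKD")))
```

Note that "NFKD" turns º into a plain "o" and does not delete it, so "1º DE MO" becomes "1o DE MO", not "1 DE MO".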

Re: ascii to latin1

2006-05-08 Thread Serge Orlov
r > 'televisao', 'televisão' or even 'télévisao' (this last one doesn't > exist in Portuguese) is successful. > > So, instead of only one search, there will be several used. > > Is there anything already coded, or will I have to try to do it all b

Re: ascii to latin1

2006-05-08 Thread Rene Pijlman
Luis P. Mendes: >I'm developing a django based intranet web server that has a search page. > >Data contained in the database is mixed. Some of the words are >accented, some are not but they should be. This is because the >collection of data began a long time ago when ascii was the only way to go

Re: ascii to latin1

2006-05-08 Thread Robert Kern
Luis P. Mendes wrote: > example: > if the word searched is 'televisão', I want that a search by either > 'televisao', 'televisão' or even 'télévisao' (this last one doesn't > exist in Portuguese) is successful. The ICU library has the capability to transliterate strings via certain rulesets. One
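As a point of comparison, the matching behaviour asked for in the quoted example can also be approximated without ICU. This Python 3 sketch uses only the standard library and is not the ICU transliteration approach described in this message. `fold` and `matches` are hypothetical helpers, and the case folding step is an added assumption.

```python
import unicodedata


def fold(s):
    """Accent-insensitive, case-insensitive comparison key (hypothetical)."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(
        cp for cp in decomposed
        if not unicodedata.category(cp).startswith("M")
    )
    return stripped.casefold()


def matches(query, stored):
    """True when query and stored differ only by accents or letter case."""
    return fold(query) == fold(stored)


stored = "televisão"
for query in ("televisao", "televisão", "télévisao", "Televisão", "telefone"):
    print(query, matches(query, stored))
```

Computing the folded key once per stored value and once per query keeps this to a single search instead of several.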

ascii to latin1

2006-05-08 Thread Luis P. Mendes
Hi, I'm developing a Django-based intranet web server that has a search page. Data contained in the database is mixed. Some of the words are accented, some are not but they should be. This is because the collection of data began a long time ago wh