Re: extracting from web pages but got disordered words sometimes

2007-01-27 Thread Frank Potter
Thank you. I tried again and figured it out. The problem was in how I used Beautiful Soup: I worked with it a year ago, also on Chinese HTML pages, and no errors occurred. I read the old code and found the difference. Convert the page to unicode before feeding it to Beautiful Soup, then every
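
For anyone finding this thread later, here is a minimal sketch of the "decode first, then parse" idea described above. It is not the original poster's code: the sample page, the `decode_page` helper, and the `TitleGrabber` class are made up for illustration, and it uses the standard library's `html.parser` in place of Beautiful Soup so that it runs without extra installs. The decoded string is what you would hand to Beautiful Soup instead.

```python
from html.parser import HTMLParser


def decode_page(raw_bytes, charset="gb2312"):
    """Turn the raw page bytes into a unicode string before any parsing.

    errors="replace" is an assumption: it substitutes a marker for bytes
    that are not valid in the charset instead of raising an exception.
    """
    return raw_bytes.decode(charset, errors="replace")


class TitleGrabber(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Hypothetical page content, encoded the way the server would send it.
sample = "<html><head><title>设计 - - 论文</title></head></html>".encode("gb2312")

text = decode_page(sample)   # unicode string, safe to hand to a parser
grabber = TitleGrabber()
grabber.feed(text)
print(grabber.title)
```

The point is only the ordering: the bytes are decoded with the page's declared charset once, up front, and everything after that works on unicode text.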

Re: extracting from web pages but got disordered words sometimes

2007-01-27 Thread Paul McGuire
After looking at the pyparsing results, I think I see the problem with your original code. You are selecting only the characters after the rightmost "-" character, but you really want to select everything to the right of "- -". In some of the titles, the encoded Chinese includes a "-" charac
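
A small sketch of the difference described above, assuming a made-up title string; the function names are hypothetical and this is not the code from the original post. Taking everything after the rightmost "-" cuts into any title text that itself contains a "-", while splitting on the "- -" separator keeps it whole.

```python
def after_rightmost_dash(title):
    # Keeps only what follows the LAST "-", so a "-" inside the
    # wanted text truncates the result.
    return title.rsplit("-", 1)[-1].strip()


def after_double_dash(title, sep="- -"):
    # Keeps everything to the right of the first "- -" separator.
    # If the separator is missing, return the whole title unchanged.
    before, found, after = title.partition(sep)
    return after.strip() if found else title.strip()


# Hypothetical title with a "-" inside the part we want to keep.
title = "Example Site - - A-B testing notes"

print(after_rightmost_dash(title))  # B testing notes
print(after_double_dash(title))     # A-B testing notes
```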

Re: extracting from web pages but got disordered words sometimes

2007-01-27 Thread Paul McGuire
On Jan 27, 5:18 am, "Frank Potter" <[EMAIL PROTECTED]> wrote:
> There are ten web pages I want to deal with.
> from http://www.af.shejis.com/new_lw/html/125926.shtml
> to http://www.af.shejis.com/new_lw/html/125936.shtml
>
> Each of them uses the charset of Chinese "gb2312", and firefox
> displ

extracting from web pages but got disordered words sometimes

2007-01-27 Thread Frank Potter
There are ten web pages I want to deal with, from http://www.af.shejis.com/new_lw/html/125926.shtml to http://www.af.shejis.com/new_lw/html/125936.shtml. Each of them uses the Chinese charset "gb2312", and Firefox displays all of them correctly, that is, as readable Chinese. My job is,