Hi All,

I am getting the same error in Zeppelin 0.7.2 as well, with the following code. I had reported the same error in 0.7.1 (please find the earlier mail below).

> def textPreProcessor(text):
>     for w in text.split():
>         regex = re.compile('[%s]' % re.escape(string.punctuation))
>         no_punctuation = unicode(regex.sub(' ', w), 'utf8')
>         tokens = word_tokenize(no_punctuation)
>         lowercased = [t.lower() for t in tokens]
>         no_stopwords = [w for w in lowercased if not w in stopwordsX]
>         stemmed = [stemmerX.stem(w) for w in no_stopwords]
>         return [w for w in stemmed if w]
>
> - docs = sc.textFile(hdfs_path+training_data, use_unicode=False).repartition(96)
> - docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()

*Note: In version 0.7.0 the code ran fine without use_unicode and without unicode(regex.sub(' ', w),'utf8').*

*Please help to fix this issue.*

Regards,
Meethu Mathew

On Fri, Apr 21, 2017 at 11:26 AM, Meethu Mathew <meethu.mat...@flytxt.com> wrote:

> Hi,
>
> Thanks for the response.
>
> @ moon soo Lee: The interpreter setting is the same in 0.7.0 and 0.7.1.
>
> @ Felix Cheng: The Python version is the same.
>
> The code is as follows:
>
> *PYSPARK*
>
>> def textPreProcessor(text):
>>     for w in text.split():
>>         regex = re.compile('[%s]' % re.escape(string.punctuation))
>>         no_punctuation = unicode(regex.sub(' ', w), 'utf8')
>>         tokens = word_tokenize(no_punctuation)
>>         lowercased = [t.lower() for t in tokens]
>>         no_stopwords = [w for w in lowercased if not w in stopwordsX]
>>         stemmed = [stemmerX.stem(w) for w in no_stopwords]
>>         return [w for w in stemmed if w]
>>
>> - docs = sc.textFile(hdfs_path+training_data, use_unicode=False).repartition(96)
>> - docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
>
> *Error:*
>
> - UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position 17: invalid start byte
> - The same error occurs when use_unicode=False is not used.
> - The error changes to "'ascii' codec can't decode byte 0x97 in position 3: ordinal not in range(128)" when no_punctuation = regex.sub(' ', w) is used instead of no_punctuation = unicode(regex.sub(' ', w),'utf8').
>
> *Note: In version 0.7.0 the code ran fine without use_unicode and without unicode(regex.sub(' ', w),'utf8').*
>
> *PYTHON*
>
>> def textPreProcessor(text_column):
>>     processed_text = []
>>     for text in text_column:
>>         for w in text.split():
>>             regex = re.compile('[%s]' % re.escape(string.punctuation))  # regular expression for punctuation
>>             no_punctuation = unicode(regex.sub(' ', text_), 'utf8')
>>             tokens = word_tokenize(no_punctuation)
>>             lowercased = [t.lower() for t in tokens]
>>             no_stopwords = [w for w in lowercased if not w in stopwordsX]
>>             stemmed = [stemmerX.stem(w) for w in no_stopwords]
>>             processed_text.append([w for w in stemmed if w])
>>     return processed_text
>
> - new_training = pd.read_csv(training_data, header=None, delimiter=delimiter, error_bad_lines=False, usecols=[label_column,text_column], names=['label','msg']).dropna()
> - new_training['processed_msg'] = textPreProcessor(new_training['msg'])
>
> This Python code works and I get a result. In version 0.7.0, I get output without using the unicode function.
>
> Hope the problem is clear now.
>
> Regards,
> Meethu Mathew
>
> On Fri, Apr 21, 2017 at 3:07 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>
>> And are they running with the same Python version? What is the Python version?
>>
>> _____________________________
>> From: moon soo Lee <m...@apache.org>
>> Sent: Thursday, April 20, 2017 11:53 AM
>> Subject: Re: UnicodeDecodeError in zeppelin 0.7.1
>> To: <users@zeppelin.apache.org>
>>
>> Hi,
>>
>> 0.7.1 didn't change any encoding type as far as I know. One difference is that the 0.7.1 official artifact was built with JDK8 while 0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming 0.7.2 binary). But I'm not sure that could change the encoding type used by pyspark and spark.
>>
>> Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
>>
>> Thanks,
>> moon
>>
>> On Wed, Apr 19, 2017 at 5:30 AM Meethu Mathew <meethu.mat...@flytxt.com> wrote:
>>
>>> Hi,
>>>
>>> I just migrated from Zeppelin 0.7.0 to Zeppelin 0.7.1 and I am facing this error while creating an RDD (in pyspark).
>>>
>>>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
>>>
>>> I was able to create the RDD without any error after adding use_unicode=False as follows:
>>>
>>>> sc.textFile("file.csv",use_unicode=False)
>>>
>>> But it fails when I try to stem the text. I get a similar error when trying to apply stemming to the text using the python interpreter.
>>>
>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
>>>
>>> All of this code works in version 0.7.0. There is no change in the dataset or the code. Is there any change in the encoding type in the new version of Zeppelin?
>>>
>>> Regards,
>>> Meethu Mathew
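
A note on the bytes in the tracebacks above: 0x80, 0x97 and 0x9b can only appear as continuation bytes in UTF-8, which is why the 'utf8' codec reports "invalid start byte". The 'ascii' error is what Python 2 typically raises when a byte string is implicitly converted to unicode. One possible workaround, sketched below, is to keep use_unicode=False and decode each line explicitly with a fallback. The helper name decode_line and the choice of cp1252 as the second encoding are assumptions made for illustration; the thread does not say what encoding the data actually uses.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def decode_line(raw, encodings=("utf-8", "cp1252")):
    """Decode a byte string, trying each candidate encoding in order.

    Hypothetical helper, not part of PySpark or Zeppelin. cp1252 is only
    a guess based on the byte values reported in the tracebacks.
    """
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to decode
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the line and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", "replace")


# Valid UTF-8 is decoded by the first candidate.
print(repr(decode_line(b"caf\xc3\xa9")))
# 0x97 is not valid UTF-8 here, so the cp1252 candidate is used instead.
print(repr(decode_line(b"a\x97b")))
```

In the PySpark job this could be applied as sc.textFile(path, use_unicode=False).map(decode_line) before the preprocessing step, so that the regex and the stemmer only receive unicode text. Whether this matches the real data has not been verified here.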