Hi,
Thanks for the response.
@moon soo Lee: The interpreter settings are the same in 0.7.0 and 0.7.1.
@Felix Cheung: The Python version is the same.
The code is as follows:
*PYSPARK*
    def textPreProcessor(text):
        for w in text.split():
            regex = re.compile('[%s]' % re.escape(string.punctuation))
            no_punctuation = unicode(regex.sub(' ', w), 'utf8')
            tokens = word_tokenize(no_punctuation)
            lowercased = [t.lower() for t in tokens]
            no_stopwords = [w for w in lowercased if not w in stopwordsX]
            stemmed = [stemmerX.stem(w) for w in no_stopwords]
            return [w for w in stemmed if w]

    docs = sc.textFile(hdfs_path + training_data, use_unicode=False).repartition(96)
    docs.map(lambda features: sentimentObject.textPreProcessor(
        features.split(delimiter)[text_colum])).count()

*Error:*
- UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position 17:
  invalid start byte
- The same error occurs when use_unicode=False is not used.
- The error changes to "'ascii' codec can't decode byte 0x97 in position 3:
  ordinal not in range(128)" when no_punctuation = regex.sub(' ', w) is
  used instead of no_punctuation = unicode(regex.sub(' ', w), 'utf8').
*Note: In version 0.7.0 the code ran fine without using use_unicode or
unicode(regex.sub(' ', w), 'utf8').*
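
For reference, both messages can be reproduced with plain byte strings, outside Zeppelin, Spark and NLTK. This is only a minimal sketch: the sample bytes are made up (they are not lines from the actual dataset) and `decode_error` is a hypothetical helper name. It shows that byte 0x9b cannot start a UTF-8 sequence and that byte 0x97 is outside the ASCII range, which is what the two codecs complain about.

```python
# Made-up sample bytes; NOT taken from the real training data.
raw_utf8_fail = b"price \x9b 10"   # 0x9b is a continuation byte, invalid as a start byte
raw_ascii_fail = b"caf\x97"        # 0x97 is above 0x7f, so the ascii codec rejects it


def decode_error(data, codec):
    """Return the UnicodeDecodeError raised while decoding, or None on success."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as exc:
        return exc
    return None


utf8_err = decode_error(raw_utf8_fail, "utf-8")
ascii_err = decode_error(raw_ascii_fail, "ascii")

# Each exception carries the offending position and the reason text
# that appears at the end of the error message.
print(utf8_err.start, utf8_err.reason)
print(ascii_err.start, ascii_err.reason)
```

If the input file contains such bytes, it is probably not pure UTF-8, so the decode step would fail regardless of where in the pipeline it happens.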
*PYTHON*
    def textPreProcessor(text_column):
        processed_text = []
        for text in text_column:
            for w in text.split():
                # regex for punctuation
                regex = re.compile('[%s]' % re.escape(string.punctuation))
                no_punctuation = unicode(regex.sub(' ', text_), 'utf8')
                tokens = word_tokenize(no_punctuation)
                lowercased = [t.lower() for t in tokens]
                no_stopwords = [w for w in lowercased if not w in stopwordsX]
                stemmed = [stemmerX.stem(w) for w in no_stopwords]
                processed_text.append([w for w in stemmed if w])
        return processed_text

    new_training = pd.read_csv(training_data, header=None, delimiter=delimiter,
                               error_bad_lines=False,
                               usecols=[label_column, text_column],
                               names=['label', 'msg']).dropna()
    new_training['processed_msg'] = textPreProcessor(new_training['msg'])

This Python code works and I get a result. In version 0.7.0, I got the
output without using the unicode function.
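
One possible way to make the preprocessing independent of whatever default codec the interpreter picks is to decode each raw line explicitly, once, before the regex step. The following is a rough sketch under stated assumptions, not the code from the notebook: `to_text` and `strip_punctuation` are hypothetical names, UTF-8 is assumed as the intended encoding, undecodable bytes are replaced rather than rejected, and a plain `split()` stands in for `word_tokenize` so the example has no NLTK dependency.

```python
import re
import string

# Same punctuation pattern as in the code above, compiled once.
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))


def to_text(raw, encoding="utf-8"):
    """Decode bytes to text; bytes that cannot be decoded become U+FFFD."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw


def strip_punctuation(raw):
    """Decode first, then replace punctuation with spaces and lowercase tokens."""
    text = to_text(raw)
    # split() is a stand-in for word_tokenize in this sketch.
    return [t.lower() for t in PUNCT_RE.sub(' ', text).split()]


# Made-up input containing the 0x97 byte from the 'ascii' error message.
tokens = strip_punctuation(b"Good\x97 service, thanks!")
print(tokens)
```

With `errors="replace"` the bad byte survives as a replacement character instead of raising, so it stays visible in the output and can be traced back to the source data.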
Hope the problem is clear now.
Regards,
Meethu Mathew
On Fri, Apr 21, 2017 at 3:07 AM, Felix Cheung <[email protected]>
wrote:
> And are they running with the same Python version? What is the Python
> version?
>
> _____________________________
> From: moon soo Lee <[email protected]>
> Sent: Thursday, April 20, 2017 11:53 AM
> Subject: Re: UnicodeDecodeError in zeppelin 0.7.1
> To: <[email protected]>
>
>
>
> Hi,
>
> 0.7.1 didn't change any encoding type as far as I know.
> One difference is that the official 0.7.1 artifact was built with JDK8 while
> 0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming 0.7.2 binary).
> But I'm not sure that can change the pyspark and spark encoding type.
>
> Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
>
> Thanks,
> moon
>
> On Wed, Apr 19, 2017 at 5:30 AM Meethu Mathew <[email protected]>
> wrote:
>
>> Hi,
>>
>> I just migrated from zeppelin 0.7.0 to zeppelin 0.7.1 and I am facing
>> this error while creating an RDD (in pyspark).
>>
>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
>>> invalid start byte
>>
>>
>> I was able to create the RDD without any error after adding
>> use_unicode=False as follows
>>
>>> sc.textFile("file.csv",use_unicode=False)
>>
>>
>> But it fails when I try to stem the text. I get a similar error
>> when trying to apply stemming to the text using the python interpreter.
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
>>> ordinal not in range(128)
>>
>> All of this code works in version 0.7.0. There is no change in the
>> dataset or the code. Is there any change in the encoding type in the new
>> version of zeppelin?
>>
>> Regards,
>>
>>
>> Meethu Mathew
>>
>>
>
>