Hi All,
I am getting the same error in Zeppelin 0.7.2 as well, with the following code. I had
reported the same error in 0.7.1 (please find the mail below).
> def textPreProcessor(text):
>     for w in text.split():
>         regex = re.compile('[%s]' % re.escape(string.punctuation))
>         no_punctuation = unicode(regex.sub(' ', w), 'utf8')
>         tokens = word_tokenize(no_punctuation)
>         lowercased = [t.lower() for t in tokens]
>         no_stopwords = [w for w in lowercased if not w in stopwordsX]
>         stemmed = [stemmerX.stem(w) for w in no_stopwords]
>         return [w for w in stemmed if w]
>
> - docs = sc.textFile(hdfs_path+training_data, use_unicode=False).repartition(96)
> - docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
>
*Note: In version 0.7.0 the code ran fine without
using use_unicode=False and unicode(regex.sub(' ', w),'utf8').*
*Please help to fix this issue.*
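
To illustrate the kind of failure reported in this thread, here is a minimal sketch of a tolerant decoding helper. It is not a confirmed fix. The helper name to_text and the choice of Windows-1252 as a fallback are assumptions: the bytes in the reported errors (0x80, 0x97, 0x9b) are all printable characters in Windows-1252 and are not valid UTF-8 start bytes, which hints that the data may not be pure UTF-8, but the actual encoding of the dataset is not known from this thread.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper (not part of the original code): decode a byte string
# as UTF-8 and fall back to Windows-1252 when that fails. The fallback
# encoding is an assumption about the data, not something confirmed here.

def to_text(raw, primary="utf-8", fallback="cp1252"):
    """Return unicode text for `raw`, trying `primary` and then `fallback`."""
    if not isinstance(raw, bytes):
        # Already decoded text: pass it through unchanged.
        return raw
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values (e.g. 0x81) undefined, so
        # substitute U+FFFD for those instead of raising again.
        return raw.decode(fallback, "replace")


# 0x97 is a UTF-8 continuation byte, so it cannot start a character
# and the UTF-8 attempt fails; the cp1252 fallback then applies.
print(repr(to_text(b"a \x97 b")))
```

If the records are read with use_unicode=False, a helper like this could be applied to each record before the regex step, in place of the bare unicode(..., 'utf8') call. Whether that is appropriate depends on how the file was actually encoded.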
Regards,
Meethu Mathew
On Fri, Apr 21, 2017 at 11:26 AM, Meethu Mathew <[email protected]>
wrote:
>
> Hi,
>
> Thanks for the response.
>
> @ moon soo Lee: The interpreter setting is the same in 0.7.0 and 0.7.1.
>
> @ Felix Cheung: The Python version is the same.
>
> The code is as follows:
>
> *PYSPARK*
>
>> def textPreProcessor(text):
>>     for w in text.split():
>>         regex = re.compile('[%s]' % re.escape(string.punctuation))
>>         no_punctuation = unicode(regex.sub(' ', w), 'utf8')
>>         tokens = word_tokenize(no_punctuation)
>>         lowercased = [t.lower() for t in tokens]
>>         no_stopwords = [w for w in lowercased if not w in stopwordsX]
>>         stemmed = [stemmerX.stem(w) for w in no_stopwords]
>>         return [w for w in stemmed if w]
>
>> - docs = sc.textFile(hdfs_path+training_data, use_unicode=False).repartition(96)
>> - docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
>
> *Error:*
>
> - UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position
> 17: invalid start byte
>
>
> - The same error occurs when *use_unicode=False* is not used.
>
>
> - The error changes to *'ascii' codec can't decode byte 0x97 in position 3:
> ordinal not in range(128)* when *no_punctuation = regex.sub(' ', w)* is
> used instead of *no_punctuation = unicode(regex.sub(' ', w),'utf8')*.
>
> *Note: In version 0.7.0 the code ran fine without using
> use_unicode=False and unicode(regex.sub(' ', w),'utf8').*
>
> *PYTHON*
>
>> def textPreProcessor(text_column):
>>     processed_text = []
>>     for text in text_column:
>>         for w in text.split():
>>             # regular expression for punctuation
>>             regex = re.compile('[%s]' % re.escape(string.punctuation))
>>             no_punctuation = unicode(regex.sub(' ', text_), 'utf8')
>>             tokens = word_tokenize(no_punctuation)
>>             lowercased = [t.lower() for t in tokens]
>>             no_stopwords = [w for w in lowercased if not w in stopwordsX]
>>             stemmed = [stemmerX.stem(w) for w in no_stopwords]
>>             processed_text.append([w for w in stemmed if w])
>>     return processed_text
>
>
> - new_training = pd.read_csv(training_data, header=None,
>   delimiter=delimiter, error_bad_lines=False,
>   usecols=[label_column, text_column], names=['label', 'msg']).dropna()
> - new_training['processed_msg'] = textPreProcessor(new_training['msg'])
>
> This Python code works and I get a result. In version 0.7.0, I
> get output without using the unicode function.
>
> Hope the problem is clear now.
>
> Regards,
> Meethu Mathew
>
>
> On Fri, Apr 21, 2017 at 3:07 AM, Felix Cheung <[email protected]>
> wrote:
>
>> And are they running with the same Python version? What is the Python
>> version?
>>
>> _____________________________
>> From: moon soo Lee <[email protected]>
>> Sent: Thursday, April 20, 2017 11:53 AM
>> Subject: Re: UnicodeDecodeError in zeppelin 0.7.1
>> To: <[email protected]>
>>
>>
>>
>> Hi,
>>
>> 0.7.1 didn't change any encoding type as far as I know.
>> One difference is that the 0.7.1 official artifact was built with JDK8 while
>> 0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming 0.7.2 binary).
>> But I'm not sure that can change the encoding used by pyspark and spark.
>>
>> Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
>>
>> Thanks,
>> moon
>>
>> On Wed, Apr 19, 2017 at 5:30 AM Meethu Mathew <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I just migrated from Zeppelin 0.7.0 to Zeppelin 0.7.1 and I am facing
>>> this error while creating an RDD (in pyspark).
>>>
>>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
>>>> invalid start byte
>>>
>>>
>>> I was able to create the RDD without any error after adding
>>> use_unicode=False as follows
>>>
>>>> sc.textFile("file.csv",use_unicode=False)
>>>
>>>
>>> But it fails when I try to stem the text. I am getting a similar error
>>> when trying to apply stemming to the text using the python interpreter.
>>>
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
>>>> ordinal not in range(128)
>>>
>>> All of this code works in version 0.7.0. There is no change in the
>>> dataset or the code. Is there any change in the encoding type in the new
>>> version of Zeppelin?
>>>
>>> Regards,
>>>
>>>
>>> Meethu Mathew
>>>
>>>
>>
>>
>
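
The thread asks whether both installations run with the same interpreter settings and Python version. One way to compare them is to run a small diagnostic paragraph under each Zeppelin version and diff the output. The sketch below is only a suggestion: the function name encoding_report is hypothetical, it inspects only the driver-side Python process (not the Spark executors), and nothing in this thread confirms that any of these settings is the cause.

```python
# Diagnostic sketch: print the settings that influence implicit
# str/unicode conversion in the current Python process.
import locale
import os
import sys


def encoding_report():
    """Collect encoding-related settings of the running interpreter."""
    return {
        "python_version": sys.version.split()[0],
        "default_encoding": sys.getdefaultencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Environment variables are None when unset.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }


for key, value in sorted(encoding_report().items()):
    print("%s = %s" % (key, value))
```

A difference in any of these values between 0.7.0 and the later versions would be a reasonable place to start looking. Identical output would suggest the cause lies elsewhere.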