Hi All,

I am getting the same error in Zeppelin 0.7.2 as well, with the following
code. I had reported the same error in 0.7.1 (please find the earlier mail
below).

​

> def textPreProcessor(text):
>     for w in text.split():
>         regex = re.compile('[%s]' % re.escape(string.punctuation))
>         no_punctuation = unicode(regex.sub(' ', w), 'utf8')
>         tokens = word_tokenize(no_punctuation)
>         lowercased = [t.lower() for t in tokens]
>         no_stopwords = [w for w in lowercased if not w in stopwordsX]
>         stemmed = [stemmerX.stem(w) for w in no_stopwords]
>         return [w for w in stemmed if w]



>    - docs = sc.textFile(hdfs_path+training_data, use_unicode=False).repartition(96)
>    - docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
>
> ​
*Note: In version 0.7.0 the code was running fine without
use_unicode=False and without unicode(regex.sub(' ', w),'utf8').*
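
For reference, the 'utf8' message can be reproduced outside Zeppelin. The
following is only a minimal sketch, not the actual job: the sample byte
strings are made up (the second one contains byte 0x9b, as in the traceback),
and decode_lenient and strip_punctuation are hypothetical helper names. It
assumes the input is mostly UTF-8 with a few stray bytes, decodes with
errors="replace" before applying the regex, and is written so that it should
run on Python 3 as well as Python 2.7.

```python
# -*- coding: utf-8 -*-
import re
import string

# Hypothetical sample data: one valid UTF-8 record and one record containing
# byte 0x9b, which can never start a UTF-8 sequence.
raw_records = [u"caf\u00e9 is open!".encode("utf-8"), b"price \x9b 20, ok?"]

# Same character class as in the thread: every ASCII punctuation character.
punctuation = re.compile(u"[%s]" % re.escape(string.punctuation))


def decode_lenient(raw):
    """Decode bytes as UTF-8, substituting U+FFFD for undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


def strip_punctuation(raw):
    """Decode first, then replace punctuation with spaces on the text."""
    return punctuation.sub(u" ", decode_lenient(raw))


# Strict decoding fails on the second record, as in the reported traceback.
try:
    raw_records[1].decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

cleaned = [strip_punctuation(r) for r in raw_records]
```

If the file is actually in a single-byte encoding such as Latin-1 or
Windows-1252 (an assumption that would need checking against the source
data), passing that codec name to decode would keep the characters instead of
replacing them.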

*Please help to fix this issue.*

Regards,
Meethu Mathew


On Fri, Apr 21, 2017 at 11:26 AM, Meethu Mathew <meethu.mat...@flytxt.com>
wrote:

> ​​
> Hi,
>
> Thanks for the response.
>
> @ moon soo Lee: The interpreter settings are the same in 0.7.0 and 0.7.1.
>
> @ Felix Cheung: The Python version is the same.
>
> The code is as follows:
>
> *PYSPARK*
>
>> def textPreProcessor(text):
>>     for w in text.split():
>>         regex = re.compile('[%s]' % re.escape(string.punctuation))
>>         no_punctuation = unicode(regex.sub(' ', w), 'utf8')
>>         tokens = word_tokenize(no_punctuation)
>>         lowercased = [t.lower() for t in tokens]
>>         no_stopwords = [w for w in lowercased if not w in stopwordsX]
>>         stemmed = [stemmerX.stem(w) for w in no_stopwords]
>>         return [w for w in stemmed if w]
>
>
>
>>    - docs = sc.textFile(hdfs_path+training_data, use_unicode=False).repartition(96)
>>    - docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
>>
>>
> *Error:*
>
>    - UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position
>    17: invalid start byte
>
>
>    - The same error occurs when *use_unicode=False* is not used.
>
>
>    - The error changes to "'ascii' codec can't decode byte 0x97 in position 3:
>    ordinal not in range(128)" when *no_punctuation = regex.sub(' ', w)* is
>    used instead of *no_punctuation = unicode(regex.sub(' ', w),'utf8')*.
>
> *Note: In version 0.7.0 the code was running fine without
> use_unicode=False and without unicode(regex.sub(' ', w),'utf8').*
>
> *PYTHON*
>
>> def textPreProcessor(text_column):
>>     processed_text = []
>>     for text in text_column:
>>         for w in text.split():
>>             # regular expression for punctuation
>>             regex = re.compile('[%s]' % re.escape(string.punctuation))
>>             no_punctuation = unicode(regex.sub(' ', text_), 'utf8')
>>             tokens = word_tokenize(no_punctuation)
>>             lowercased = [t.lower() for t in tokens]
>>             no_stopwords = [w for w in lowercased if not w in stopwordsX]
>>             stemmed = [stemmerX.stem(w) for w in no_stopwords]
>>             processed_text.append([w for w in stemmed if w])
>>     return processed_text
>
>
>    - new_training = pd.read_csv(training_data, header=None, delimiter=delimiter, error_bad_lines=False, usecols=[label_column,text_column], names=['label','msg']).dropna()
>    - new_training['processed_msg'] = textPreProcessor(new_training['msg'])
>
> This Python code works and I get a result. In version 0.7.0, I
> get the output without using the unicode function.
>
> Hope the problem is clear now.
>
> Regards,
> Meethu Mathew
>
>
> On Fri, Apr 21, 2017 at 3:07 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> And are they running with the same Python version? What is the Python
>> version?
>>
>> _____________________________
>> From: moon soo Lee <m...@apache.org>
>> Sent: Thursday, April 20, 2017 11:53 AM
>> Subject: Re: UnicodeDecodeError in zeppelin 0.7.1
>> To: <users@zeppelin.apache.org>
>>
>>
>>
>> Hi,
>>
>> 0.7.1 didn't change any encoding type as far as I know.
>> One difference is that the official 0.7.1 artifact was built with JDK8 while
>> 0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming 0.7.2
>> binary). But I'm not sure that could change the encoding used by pyspark
>> and spark.
>>
>> Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
>>
>> Thanks,
>> moon
>>
>> On Wed, Apr 19, 2017 at 5:30 AM Meethu Mathew <meethu.mat...@flytxt.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I just migrated from zeppelin 0.7.0 to zeppelin 0.7.1 and I am facing
>>> this error while creating an RDD (in pyspark).
>>>
>>>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
>>>> invalid start byte
>>>
>>>
>>> I was able to create the RDD without any error after adding
>>> use_unicode=False as follows
>>>
>>>> sc.textFile("file.csv",use_unicode=False)
>>>
>>>
>>> But it fails when I try to stem the text. I get a similar error
>>> when trying to apply stemming to the text using the python interpreter.
>>>
>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
>>>> ordinal not in range(128)
>>>
>>> All of this code works in version 0.7.0. There is no change in the
>>> dataset or the code. Is there any change in the encoding type in the new
>>> version of zeppelin?
>>>
>>> Regards,
>>>
>>>
>>> Meethu Mathew
>>>
>>>
>>
>>
>
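
Since the thread asks whether the Python version and interpreter settings
match across Zeppelin versions, one way to compare them is to print the
encoding-related settings in a paragraph on each installation. This is only a
sketch: encoding_report is a hypothetical helper name, and whether any
difference it shows actually explains the error is an assumption that would
need to be checked.

```python
import locale
import os
import sys


def encoding_report():
    """Collect settings that commonly influence text decoding in Python."""
    return {
        "python_version": sys.version.split()[0],
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


report = encoding_report()
for key in sorted(report):
    print("%s = %s" % (key, report[key]))
```

Running the same function inside an RDD map (for example over a one-element
RDD) would show the values seen by the executors, which may differ from those
of the driver.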
