Maybe an incompatible Hive package or Hive metastore? On Tue, Jun 2, 2015 at 3:25 PM, Ignacio Zendejas <i...@node.io> wrote:
> From RELEASE:
>
> "Spark 1.3.1 built for Hadoop 2.4.0
>
> Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
> -Pkinesis-asl -Pspark-ganglia-lgpl -Phadoop-provided -Phive
> -Phive-thriftserver"
>
> And this stack trace may be more useful:
> http://pastebin.ca/3016483
>
> On Tue, Jun 2, 2015 at 3:13 PM, Ignacio Zendejas <i...@node.io> wrote:
>
>> I've run into an error when trying to create a DataFrame. Here's the code:
>>
>> --
>> from pyspark import StorageLevel
>> from pyspark.sql import Row
>>
>> table = 'blah'
>> ssc = HiveContext(sc)
>>
>> data = sc.textFile('s3://bucket/some.tsv')
>>
>> def deserialize(s):
>>     p = s.strip().split('\t')
>>     p[-1] = float(p[-1])
>>     return Row(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
>>                created_at=p[3], layer_id=p[4], score=p[5])
>>
>> blah = data.map(deserialize)
>> df = sqlContext.inferSchema(blah)
>> ---
>>
>> I've also tried s3n and createDataFrame. Our setup is on EMR
>> instances, using the setup script Amazon provides. After lots of
>> debugging, I suspect there's a problem with this setup.
>>
>> What's weird is that if I run this in the pyspark shell and re-run the
>> last line (inferSchema/createDataFrame), it actually works.
>>
>> We're getting warnings like this:
>> http://pastebin.ca/3016476
>>
>> Here's the actual error:
>> http://www.pastebin.ca/3016473
>>
>> Any help would be greatly appreciated.
>>
>> Thanks,
>> Ignacio
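For anyone trying to reproduce this, the parsing step can be exercised without a Spark cluster. The sketch below mirrors the quoted deserialize() but returns a plain dict instead of pyspark.sql.Row so it runs standalone; the sample TSV line is made up, not from the actual data:

```python
def deserialize(s):
    # Split one TSV line and cast the trailing score column to float,
    # mirroring the deserialize() in the quoted code. For a 6-column
    # row, p[-1] and p[5] are the same element, so 'score' ends up
    # as a float after the in-place cast.
    p = s.strip().split('\t')
    p[-1] = float(p[-1])
    return {'normalized_page_sha1': p[0], 'name': p[1], 'phrase': p[2],
            'created_at': p[3], 'layer_id': p[4], 'score': p[5]}

# Hypothetical sample line with the six expected columns:
row = deserialize("abc123\tsome-name\tsome phrase\t2015-06-02\t7\t0.95\n")
```

Note, too, that the quoted script builds `ssc = HiveContext(sc)` but then calls `sqlContext.inferSchema(...)`; the pyspark shell predefines `sqlContext`, so both names exist there, which may be one source of the shell-vs-script difference.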