Maybe an incompatible Hive package or Hive metastore? On Tue, Jun 2, 2015 at 3:25 PM, Ignacio Zendejas <i...@node.io> wrote:
> From RELEASE:
>
> "Spark 1.3.1 built for Hadoop 2.4.0
>
> Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
> -Pkinesis-asl -Pspark-ganglia-lgpl -Phadoop-provided -Phive
> -Phive-thriftserver"
>
> And this stack trace may be more useful:
> http://pastebin.ca/3016483
>
> On Tue, Jun 2, 2015 at 3:13 PM, Ignacio Zendejas <i...@node.io> wrote:
>
>> I've run into an error when trying to create a DataFrame. Here's the code:
>>
>> --
>> from pyspark import StorageLevel
>> from pyspark.sql import Row
>>
>> table = 'blah'
>> ssc = HiveContext(sc)
>>
>> data = sc.textFile('s3://bucket/some.tsv')
>>
>> def deserialize(s):
>>     p = s.strip().split('\t')
>>     p[-1] = float(p[-1])
>>     return Row(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
>>                created_at=p[3], layer_id=p[4], score=p[5])
>>
>> blah = data.map(deserialize)
>> df = sqlContext.inferSchema(blah)
>> ---
>>
>> I've also tried s3n and createDataFrame. Our setup is on EMR
>> instances, using the setup script Amazon provides. After lots of
>> debugging, I suspect there's a problem with this setup.
>>
>> What's weird is that if I run this in the pyspark shell and re-run the
>> last line (inferSchema/createDataFrame), it actually works.
>>
>> We're getting warnings like this:
>> http://pastebin.ca/3016476
>>
>> Here's the actual error:
>> http://www.pastebin.ca/3016473
>>
>> Any help would be greatly appreciated.
>>
>> Thanks,
>> Ignacio
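For anyone trying to reproduce this, the parsing step can be exercised without a Spark cluster. The sketch below mirrors the quoted deserialize() but returns a plain dict instead of pyspark.sql.Row so it runs standalone; the sample TSV line is made up, not from the actual data:

```python
def deserialize(s):
    # Split one TSV line and cast the trailing score column to float,
    # mirroring the deserialize() in the quoted code. For a 6-column
    # row, p[-1] and p[5] are the same element, so 'score' ends up
    # as a float after the in-place cast.
    p = s.strip().split('\t')
    p[-1] = float(p[-1])
    return {'normalized_page_sha1': p[0], 'name': p[1], 'phrase': p[2],
            'created_at': p[3], 'layer_id': p[4], 'score': p[5]}

# Hypothetical sample line with the six expected columns:
row = deserialize("abc123\tsome-name\tsome phrase\t2015-06-02\t7\t0.95\n")
```

Note, too, that the quoted script builds `ssc = HiveContext(sc)` but then calls `sqlContext.inferSchema(...)`; the pyspark shell predefines `sqlContext`, so both names exist there, which may be one source of the shell-vs-script difference.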