From RELEASE: "Spark 1.3.1 built for Hadoop 2.4.0
Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
-Pkinesis-asl -Pspark-ganglia-lgpl -Phadoop-provided -Phive
-Phive-thriftserver"

And this stacktrace may be more useful: http://pastebin.ca/3016483

On Tue, Jun 2, 2015 at 3:13 PM, Ignacio Zendejas <i...@node.io> wrote:

> I've run into an error when trying to create a DataFrame. Here's the code:
>
> --
> from pyspark import StorageLevel
> from pyspark.sql import Row
>
> table = 'blah'
> ssc = HiveContext(sc)
>
> data = sc.textFile('s3://bucket/some.tsv')
>
> def deserialize(s):
>     p = s.strip().split('\t')
>     p[-1] = float(p[-1])
>     return Row(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
>                created_at=p[3], layer_id=p[4], score=p[5])
>
> blah = data.map(deserialize)
> df = sqlContext.inferSchema(blah)
> ---
>
> I've also tried s3n and using createDataFrame. Our setup is on EMR
> instances, using the setup script Amazon provides. After lots of
> debugging, I suspect there'll be a problem with this setup.
>
> What's weird is that if I run this on the pyspark shell and re-run the
> last line (inferSchema/createDataFrame), it actually works.
>
> We're getting warnings like this:
> http://pastebin.ca/3016476
>
> Here's the actual error:
> http://www.pastebin.ca/3016473
>
> Any help would be greatly appreciated.
>
> Thanks,
> Ignacio
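One way to rule out the parser itself while debugging the cluster-side failure is to run the quoted deserialize() logic locally, without Spark. Below is a minimal Spark-free sketch of that logic: the field names come from the original post, but the sample TSV line and the use of a plain dict as a stand-in for pyspark.sql.Row are assumptions for illustration only.

```python
# Spark-free sanity check of the quoted deserialize() logic.
# Field names are from the original post; the sample line is invented.

def deserialize(s):
    p = s.strip().split('\t')
    # score is the last (6th) column, so mutating p[-1] also changes p[5]
    p[-1] = float(p[-1])
    # Stand-in for pyspark.sql.Row: a plain dict with the same fields.
    return dict(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
                created_at=p[3], layer_id=p[4], score=p[5])

sample = "abc123\tsome-name\tsome phrase\t2015-06-02\tlayer-1\t0.75\n"
row = deserialize(sample)
print(row["score"])   # 0.75 (a float, because p[-1] aliases p[5])
print(len(row))       # 6
```

If this runs cleanly on a few real lines from the TSV, the parsing is sound and the NPE is more likely in the EMR/Hive setup, which would also fit the observation that re-running the last line in the pyspark shell succeeds.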