Thanks for the suggestion Ayan, it has not solved my problem but I did get
sqlContext to execute the SQL and return dataframe object. SQL is running fine
in the pyspark interpreter but not passing to SQL note (though it works fine
for a different dataset) - guess I'll take this question to the Zeppelin list.
Thanks again, more tips welcome if anyone sees anything funny...
------------
%pyspark
from pyspark.sql.types import Row, StructType, StructField, IntegerType,
StringType, DecimalType
from os import getcwd
sqlContext = SQLContext(sc)
datafile = sc.textFile("/Users/mitty01/data/geonames/CA.txt")
geonames = datafile.map(lambda s: s.split("\t")).map(lambda s: Row(
geonameid=int(s[0]), asciiname=str(s[2]), latitude=float(s[4]),
longitude=float(s[5]), elevation=str(s[16]), featureclass=str(s[6]),
featurecode=str(s[7]), countrycode=str(s[8]) ))
gndf = sqlContext.inferSchema(geonames)
gndf.registerTempTable('geonames')
#print gndf.count()
print "-----------"
print gndf.columns
print "-----------"
print gndf.first()
print "-----------"
gndf.schema
print "-----------"
sqlContext.sql("SELECT * FROM geonames LIMIT 10")
==== OUTPUT ====
----------- [u'asciiname', u'countrycode', u'elevation', u'featureclass',
u'featurecode', u'geonameid', u'latitude', u'longitude'] -----------
Row(asciiname=u'Swiftsure Bank', countrycode=u'CA', elevation=u'-9999',
featureclass=u'U', featurecode=u'BNKU', geonameid=4030308, latitude=48.55321,
longitude=-125.02235) -----------
StructType(List(StructField(asciiname,StringType,true),StructField(countrycode,StringType,true),StructField(elevation,StringType,true),StructField(featureclass,StringType,true),StructField(featurecode,StringType,true),StructField(geonameid,LongType,true),StructField(latitude,DoubleType,true),StructField(longitude,DoubleType,true)))
----------- DataFrame[asciiname: string, countrycode: string, elevation:
string, featureclass: string, featurecode: string, geonameid: bigint, latitude:
double, longitude: double]
==================
%sql
SELECT *
FROM geonames
LIMIT 1
no such table List(geonames); line 2 pos 5
________________________________
From: ayan guha <[email protected]>
Sent: May 11, 2015 12:27 AM
To: Tyler Mitchell
Cc: user
Subject: Re: Python -> SQL (geonames dataset)
Try this
Res = ssc.sql("your SQL without limit")
Print red.first()
Note: your SQL looks wrong as count will need a group by clause.
Best
Ayan
On 11 May 2015 16:22, "Tyler Mitchell"
<[email protected]<mailto:[email protected]>> wrote:
I'm using Python to setup a dataframe, but for some reason it is not being made
available to SQL. Code (from Zeppelin) below. I don't get any error when
loading/prepping the data or dataframe. Any tips?
(Originally I was not hardcoding the Row() structure, as my other tutorial
added it by default, not sure why it didn't work here, but that might be
besides the point.)
Any guesses greatly appreciated as I dig my teeth in here for the first time.
Thanks!
-------
%pyspark
from pyspark.sql.types import Row, StructType, StructField, IntegerType,
StringType, DecimalType
from os import getcwd
sqlContext = SQLContext(sc)
datafile = sc.textFile("/Users/tyler/data/geonames/CA.txt")
geonames = datafile.map(lambda s: s.split("\t")).map(lambda s: Row(
geonameid=int(s[0]), asciiname=str(s[2]), latitude=float(s[4]),
longitude=float(s[5]), elevation=str(s[16]), featureclass=str(s[6]),
featurecode=str(s[7]), countrycode=str(s[8]) ))
gndf = sqlContext.inferSchema(geonames)
gndf.registerAsTable("geonames")
#print gndf.count()
print "-----------"
print gndf.columns
print "-----------"
print gndf.first()
print "-----------"
gndf.schema
============
OUTPUT
============
[u'asciiname', u'countrycode', u'elevation', u'featureclass', u'featurecode',
u'geonameid', u'latitude', u'longitude']
-----------
Row(asciiname=u'100 Mile House', countrycode=u'CA', elevation=u'928',
featureclass=u'P', featurecode=u'PPL', geonameid=5881639, latitude=51.64982,
longitude=-121.28594)
-----------
StructType(List(StructField(asciiname,StringType,true),StructField(countrycode,StringType,true),StructField(elevation,StringType,true),StructField(featureclass,StringType,true),StructField(featurecode,StringType,true),StructField(geonameid,LongType,true),StructField(latitude,DoubleType,true),StructField(longitude,DoubleType,true)))
=============
%sql
SELECT geonameid, count(1) value
FROM geonames
LIMIT 1
no such table List(geonames); line 2 pos 5