Hi I am running into trouble with a nested query using python. To try and debug it, I first wrote the query I want using sqlite3
select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count) from Frequency as freq, (select term, docid, count from Frequency) as freqTranspose where freq.term = freqTranspose.term group by freq.docid, freqTranspose.docid ; Sparksql has trouble parsing the "(select ) as freqTranspose ³ line Here is what my input data looks like $ head -n 3 reuters.db.csv docid,term,count 10000_txt_earn,net,1 10000_txt_earn,rogers,4 The output from sqlite3 is $ head -n 6 3hSimilarityMatrix.slow.sql.out freq.docid freqTranspose.docid sum(freq.count * freqTranspose.count) -------------- ------------------- ------------------------------------- 10000_txt_earn 10000_txt_earn 127 10000_txt_earn 10054_txt_earn 33 10000_txt_earn 10080_txt_crude 146 10000_txt_earn 10088_txt_acq 11 $ My code example pretty much follows http://spark.apache.org/docs/latest/sql-programming-guide.html dataFile = sc.textFile("reuters.db.csv²) lines = dataFile.map(lambda l: l.split(",²)) def mapLines(line) : ret = {} ret['docid'] = line[0] ret['term'] = line[1] ret['count'] = line[2] return ret frequency = lines.map(mapLines) schemaFrequency = sqlContext.inferSchema(frequency) schemaFrequency.registerAsTable("frequency²) Okay here is where I run into trouble sqlCmd = "select \ freq.docid, \ freqTranspose.docid \ from \ frequency as freq, \ (select term, docid, count from frequency) \ " similarities = sqlContext.sql(sqlCmd) /Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib /py4j-0.8.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o40.sql. : java.lang.RuntimeException: [1.153] failure: ``('' expected but `from' found select freq.docid, freqTranspose.docid from frequency as freq, (select term, docid, count from frequency) Simple sql seems to ³parse² I.e. Select freq.docid from frequency as freq Any suggestions would be greatly appreciated. Andy P.s. I should note, I think I am using version 1.0 ?