Hi 

I am running into trouble with a nested query using python. To try and debug
it, I first wrote the query I want using sqlite3

select freq.docid, freqTranspose.docid, sum(freq.count *
freqTranspose.count) from
       Frequency as freq,
       (select term, docid, count from Frequency) as freqTranspose
       where freq.term = freqTranspose.term
       group by freq.docid, freqTranspose.docid
       ;


Sparksql has trouble parsing the "(select ) as freqTranspose ³ line

Here is what my input data looks like
$ head -n 3 reuters.db.csv
docid,term,count
10000_txt_earn,net,1
10000_txt_earn,rogers,4

The output from sqlite3 is
$ head -n 6 3hSimilarityMatrix.slow.sql.out
freq.docid      freqTranspose.docid  sum(freq.count * freqTranspose.count)
--------------  -------------------  -------------------------------------
10000_txt_earn  10000_txt_earn       127
10000_txt_earn  10054_txt_earn       33
10000_txt_earn  10080_txt_crude      146
10000_txt_earn  10088_txt_acq        11
$ 


My code example pretty much follows
http://spark.apache.org/docs/latest/sql-programming-guide.html

dataFile = sc.textFile("reuters.db.csv²)
lines = dataFile.map(lambda l: l.split(",²))
def mapLines(line) :
    ret = {}
    ret['docid'] = line[0]
    ret['term'] = line[1]
    ret['count'] = line[2]
    return ret
frequency = lines.map(mapLines)
schemaFrequency = sqlContext.inferSchema(frequency)
schemaFrequency.registerAsTable("frequency²)

Okay here is where I run into trouble

sqlCmd = "select \
            freq.docid, \
            freqTranspose.docid \
          from \
              frequency as freq, \
              (select term, docid, count from frequency)  \
        "
similarities = sqlContext.sql(sqlCmd)


/Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib
/py4j-0.8.1-src.zip/py4j/protocol.py in get_return_value(answer,
gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o40.sql.
: java.lang.RuntimeException: [1.153] failure: ``('' expected but `from'
found

select             freq.docid,             freqTranspose.docid
from               frequency as freq,               (select term, docid,
count from frequency)
                   

Simple sql seems to ³parse² I.e. Select  freq.docid from frequency as freq

Any suggestions would be greatly appreciated.

Andy

P.s. I should note, I think I am using version 1.0 ?




Reply via email to