Hi Michael,

Changing my col name to something other than "count" fixed the parse error.
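For the archive, here is roughly what the change looks like in my code (just a sketch; the new name docCount is arbitrary, anything that is not a reserved word should do):

def mapLines(line):
    ret = {}
    ret['docid'] = line[0]
    ret['term'] = line[1]
    ret['docCount'] = line[2]  # renamed from 'count', which the basic SQL parser treats as a reserved word
    return ret

frequency = lines.map(mapLines)
schemaFrequency = sqlContext.inferSchema(frequency)
schemaFrequency.registerAsTable("frequency")

# queries then reference docCount instead of count, e.g.
# select term, docid, docCount from frequency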
Many thanks,

Andy

From: Michael Armbrust <mich...@databricks.com>
Reply-To: <user@spark.apache.org>
Date: Sunday, July 13, 2014 at 1:18 PM
To: <user@spark.apache.org>
Cc: "u...@spark.incubator.apache.org" <u...@spark.incubator.apache.org>
Subject: Re: SparkSql newbie problems with nested selects

> Hi Andy,
>
> The SQL parser is pretty basic (we plan to improve this for the 1.2 release).
> In this case I think part of the problem is that one of your variables is
> "count", which is a reserved word. Unfortunately, we don't have the ability
> to escape identifiers at this point.
>
> However, I did manage to get your query to parse using the HiveQL parser,
> provided by HiveContext.
>
> hiveCtx.hql("""
>   select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count)
>   from
>     Frequency freq JOIN
>     (select term, docid, count from Frequency) freqTranspose
>   where freq.term = freqTranspose.term
>   group by freq.docid, freqTranspose.docid""")
>
> Michael
>
>
> On Sun, Jul 13, 2014 at 12:43 PM, Andy Davidson
> <a...@santacruzintegration.com> wrote:
>> Hi
>>
>> I am running into trouble with a nested query using Python. To try and debug
>> it, I first wrote the query I want using sqlite3:
>>
>> select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count)
>> from
>>   Frequency as freq,
>>   (select term, docid, count from Frequency) as freqTranspose
>> where freq.term = freqTranspose.term
>> group by freq.docid, freqTranspose.docid
>> ;
>>
>> SparkSQL has trouble parsing the "(select ...) as freqTranspose" line.
>>
>> Here is what my input data looks like:
>>
>> $ head -n 3 reuters.db.csv
>> docid,term,count
>> 10000_txt_earn,net,1
>> 10000_txt_earn,rogers,4
>>
>> The output from sqlite3 is:
>>
>> $ head -n 6 3hSimilarityMatrix.slow.sql.out
>> freq.docid      freqTranspose.docid  sum(freq.count * freqTranspose.count)
>> --------------  -------------------  -------------------------------------
>> 10000_txt_earn  10000_txt_earn       127
>> 10000_txt_earn  10054_txt_earn       33
>> 10000_txt_earn  10080_txt_crude      146
>> 10000_txt_earn  10088_txt_acq        11
>> $
>>
>> My code pretty much follows
>> http://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> dataFile = sc.textFile("reuters.db.csv")
>> lines = dataFile.map(lambda l: l.split(","))
>>
>> def mapLines(line):
>>     ret = {}
>>     ret['docid'] = line[0]
>>     ret['term'] = line[1]
>>     ret['count'] = line[2]
>>     return ret
>>
>> frequency = lines.map(mapLines)
>> schemaFrequency = sqlContext.inferSchema(frequency)
>> schemaFrequency.registerAsTable("frequency")
>>
>> Okay, here is where I run into trouble:
>>
>> sqlCmd = "select \
>>             freq.docid, \
>>             freqTranspose.docid \
>>           from \
>>             frequency as freq, \
>>             (select term, docid, count from frequency) \
>>          "
>> similarities = sqlContext.sql(sqlCmd)
>>
>> /Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>>     298         raise Py4JJavaError(
>>     299             'An error occurred while calling {0}{1}{2}.\n'.
>> --> 300             format(target_id, '.', name), value)
>>     301     else:
>>     302         raise Py4JError(
>>
>> Py4JJavaError: An error occurred while calling o40.sql.
>> : java.lang.RuntimeException: [1.153] failure: ``('' expected but `from' found
>>
>> select freq.docid, freqTranspose.docid from frequency as freq, (select term, docid, count from frequency)
>>
>> Simple SQL seems to "parse", i.e. select freq.docid from frequency as freq
>>
>> Any suggestions would be greatly appreciated.
>>
>> Andy
>>
>> P.S. I should note, I think I am using version 1.0?
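For completeness, Michael's HiveContext suggestion translated to PySpark would look roughly like the following (an untested sketch on my end, assuming a Spark build with Hive support; with HiveQL there is no need to rename the count column):

from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
# register the same RDD of dicts as a table in the Hive context
hiveCtx.inferSchema(frequency).registerAsTable("Frequency")

similarities = hiveCtx.hql("""
    select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count)
    from
      Frequency freq JOIN
      (select term, docid, count from Frequency) freqTranspose
    where freq.term = freqTranspose.term
    group by freq.docid, freqTranspose.docid""")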