Looks correct to me. Try for example:

from pyspark.sql.functions import *
df.withColumn("value", explode(df['values'])).show()
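For reference, a minimal, self-contained sketch of what that explode call does. This is illustrative only: the column names "name" and "values" are assumptions based on the gist, and the sample row below is made up.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode

sc = SparkContext("local", appName="explode-example")
sqlContext = SQLContext(sc)

# One row with an array-typed "values" column (made-up sample data).
df = sqlContext.createDataFrame(
    [("point.1", [14.1, 14.2, 14.3])],
    ["name", "values"])

# explode() emits one output row per array element, so this single input
# row becomes three rows, each carrying one element in the new "value"
# column alongside the original columns.
df.withColumn("value", explode(df["values"])).show()

The original array column is kept next to the flattened one; chain .drop("values") before .show() if you only want scalar columns.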
On Mon, Oct 5, 2015 at 2:15 PM, Fernando Paladini <fnpalad...@gmail.com> wrote:

> Update:
>
> I've updated my code and now I have the following JSON:
> https://gist.github.com/paladini/27bb5636d91dec79bd56
> In the same link you can check the output from "spark-submit
> myPythonScript.py", where I call "myDataframe.show()". The following is
> printed by Spark (among other useless debug information):
>
>
> Is that correct for the given JSON input
> <https://gist.github.com/paladini/27bb5636d91dec79bd56> (gist link
> above)? How can I test whether Spark can understand this DataFrame and
> make complex manipulations with it?
>
> Thank you! Hope you can help me soon :3
> Fernando Paladini.
>
> 2015-10-05 15:23 GMT-03:00 Fernando Paladini <fnpalad...@gmail.com>:
>
>> Thank you for the replies, and sorry about the delay: my e-mail client
>> sent this conversation to Spam (??).
>>
>> I'll take a look at your tips and come back later to post my questions /
>> progress. Again, thank you so much!
>>
>> 2015-09-30 18:37 GMT-03:00 Michael Armbrust <mich...@databricks.com>:
>>
>>> I think the problem here is that you are passing in parsed JSON that is
>>> stored as a dictionary (which is converted to a HashMap when going into
>>> the JVM). You should instead be passing in the path to the JSON file
>>> (formatted as Akhil suggests) so that Spark can do the parsing in
>>> parallel. The other option would be to construct an RDD of JSON strings
>>> and pass that to the json method.
>>>
>>> On Wed, Sep 30, 2015 at 2:28 AM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> Each JSON doc should be on a single line, I guess.
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>>
>>>> Note that the file that is offered as *a json file* is not a typical
>>>> JSON file. Each line must contain a separate, self-contained valid JSON
>>>> object. As a consequence, a regular multi-line JSON file will most often
>>>> fail.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, Sep 29, 2015 at 11:07 AM, Fernando Paladini <
>>>> fnpalad...@gmail.com> wrote:
>>>>
>>>>> Hello guys,
>>>>>
>>>>> I'm very new to Spark and I'm having some trouble reading a JSON
>>>>> into a DataFrame on PySpark.
>>>>>
>>>>> I'm getting a JSON object from an API response and I would like to
>>>>> store it in Spark as a DataFrame (I've read that DataFrame is better
>>>>> than RDD; is that accurate?). From what I've read
>>>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext>
>>>>> in the documentation, I just need to call the method
>>>>> sqlContext.read.json in order to do what I want.
>>>>>
>>>>> *Following is the code from my test application:*
>>>>> json_object = json.loads(response.text)
>>>>> sc = SparkContext("local", appName="JSON to RDD")
>>>>> sqlContext = SQLContext(sc)
>>>>> dataframe = sqlContext.read.json(json_object)
>>>>> dataframe.show()
>>>>>
>>>>> *The problem is that when I run "spark-submit myExample.py" I get the
>>>>> following error:*
>>>>> 15/09/29 01:18:54 INFO BlockManagerMasterEndpoint: Registering block
>>>>> manager localhost:48634 with 530.0 MB RAM, BlockManagerId(driver,
>>>>> localhost, 48634)
>>>>> 15/09/29 01:18:54 INFO BlockManagerMaster: Registered BlockManager
>>>>> Traceback (most recent call last):
>>>>>   File "/home/paladini/ufxc/lisha/learning/spark-api-kairos/test1.py",
>>>>> line 35, in <module>
>>>>>     dataframe = sqlContext.read.json(json_object)
>>>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>>>> line 144, in json
>>>>>   File
>>>>> "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>>>> line 538, in __call__
>>>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
>>>>> 36, in deco
>>>>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>>>> line 304, in get_return_value
>>>>> py4j.protocol.Py4JError: An error occurred while calling o21.json.
>>>>> Trace:
>>>>> py4j.Py4JException: Method json([class java.util.HashMap]) does not
>>>>> exist
>>>>>     at
>>>>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>>>>     at
>>>>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>>>>>     at py4j.Gateway.invoke(Gateway.java:252)
>>>>>     at
>>>>> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>>>>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>>>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> *What am I doing wrong?*
>>>>> Check out this gist
>>>>> <https://gist.github.com/paladini/2e2ea913d545a407b842> to see the
>>>>> JSON I'm trying to load.
>>>>>
>>>>> Thanks!
>>>>> Fernando Paladini
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Fernando Paladini
>>
>
>
>
> --
> Fernando Paladini
>
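Pulling Michael's and Akhil's suggestions together, below is a minimal sketch of the fix. It is an illustration, not code from the thread: the sample payload is made up (the real one is in the gist), and it assumes a Spark 1.5.x deployment, matching the py4j 0.8.2.1 paths in the traceback, where sqlContext.jsonRDD() parses an RDD of JSON strings (newer releases also accept such an RDD directly in sqlContext.read.json).

import json
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", appName="JSON to DataFrame")
sqlContext = SQLContext(sc)

# Stand-in for response.text from the API call; assumes the body is a
# JSON array of objects (wrap a single object in a list if it isn't).
raw = '[{"name": "point.1", "value": 14.1}, {"name": "point.2", "value": 14.2}]'

# Re-serialize each record so that every RDD element is one separate,
# self-contained JSON document, as the programming guide requires.
records = json.loads(raw)
json_rdd = sc.parallelize([json.dumps(r) for r in records])

# Hand Spark strings, not the parsed dict/list: a dict is what crossed
# the gateway as a java.util.HashMap and caused the Py4JError above.
dataframe = sqlContext.jsonRDD(json_rdd)
dataframe.show()

The other route Michael mentions is to write the records to a file, one JSON object per line, and pass the path to sqlContext.read.json("/path/to/data.json") so the parsing itself happens in parallel.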