Thank you for the replies and sorry about the delay; my e-mail client sent this conversation to Spam (??). I'll take a look at your tips and come back later to post my questions / progress. Again, thank you so much!
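Just to check that I've understood Michael's suggestion below: I should stop handing read.json the parsed dict and instead give Spark either a file path or raw JSON strings. A minimal, untested sketch of what I plan to try ("data.json" is just an illustrative path, and `response` is the API response object from my original script):

import json

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", appName="JSON to DataFrame")
sqlContext = SQLContext(sc)

# Option 1: give Spark the *path* to a file that has one JSON object
# per line, so the workers can parse it in parallel:
dataframe = sqlContext.read.json("data.json")

# Option 2: build an RDD of JSON *strings* (not a parsed dict) and let
# Spark parse those; assumes response.text holds a JSON array of records:
records = json.loads(response.text)
json_rdd = sc.parallelize([json.dumps(r) for r in records])
dataframe = sqlContext.jsonRDD(json_rdd)

dataframe.show()

If I read the docs right, on newer releases sqlContext.read.json(json_rdd) should also accept the RDD of strings directly, but on 1.x the jsonRDD method seems to be the way.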
2015-09-30 18:37 GMT-03:00 Michael Armbrust <mich...@databricks.com>:

> I think the problem here is that you are passing in parsed JSON that is
> stored as a dictionary (which is converted to a hashmap when going into
> the JVM). You should instead be passing in the path to the JSON file
> (formatted as Akhil suggests) so that Spark can do the parsing in
> parallel. The other option would be to construct an RDD of JSON strings
> and pass that to the json method.
>
> On Wed, Sep 30, 2015 at 2:28 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Each JSON doc should be on a single line, I guess.
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>
>> Note that the file that is offered as *a json file* is not a typical
>> JSON file. Each line must contain a separate, self-contained valid JSON
>> object. As a consequence, a regular multi-line JSON file will most often
>> fail.
>>
>> Thanks
>> Best Regards
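(If I follow Akhil's point correctly, my pretty-printed API response has to be flattened to one self-contained object per line before Spark's file reader will take it. A rough, untested sketch, assuming the response body is a JSON array and "data.json" is just an example path:

import json

records = json.loads(response.text)
with open("data.json", "w") as out:
    for record in records:
        # one self-contained JSON object per line, as the docs require
        out.write(json.dumps(record) + "\n")

sqlContext.read.json("data.json") should then be able to parse the file in parallel.)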
>> On Tue, Sep 29, 2015 at 11:07 AM, Fernando Paladini <fnpalad...@gmail.com>
>> wrote:
>>
>>> Hello guys,
>>>
>>> I'm very new to Spark and I'm having some trouble reading a JSON
>>> document into a DataFrame on PySpark.
>>>
>>> I'm getting a JSON object from an API response and I would like to
>>> store it in Spark as a DataFrame (I've read that DataFrames are better
>>> than RDDs; is that accurate?). From what I've read
>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext>
>>> in the documentation, I just need to call the method
>>> sqlContext.read.json in order to do what I want.
>>>
>>> *Following is the code from my test application:*
>>> json_object = json.loads(response.text)
>>> sc = SparkContext("local", appName="JSON to RDD")
>>> sqlContext = SQLContext(sc)
>>> dataframe = sqlContext.read.json(json_object)
>>> dataframe.show()
>>>
>>> *The problem is that when I run "spark-submit myExample.py" I get the
>>> following error:*
>>> 15/09/29 01:18:54 INFO BlockManagerMasterEndpoint: Registering block
>>> manager localhost:48634 with 530.0 MB RAM, BlockManagerId(driver,
>>> localhost, 48634)
>>> 15/09/29 01:18:54 INFO BlockManagerMaster: Registered BlockManager
>>> Traceback (most recent call last):
>>>   File "/home/paladini/ufxc/lisha/learning/spark-api-kairos/test1.py",
>>> line 35, in <module>
>>>     dataframe = sqlContext.read.json(json_object)
>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>> line 144, in json
>>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>> line 538, in __call__
>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
>>> 36, in deco
>>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>> line 304, in get_return_value
>>> py4j.protocol.Py4JError: An error occurred while calling o21.json. Trace:
>>> py4j.Py4JException: Method json([class java.util.HashMap]) does not exist
>>>     at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>>     at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>>>     at py4j.Gateway.invoke(Gateway.java:252)
>>>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> *What am I doing wrong?*
>>> Check out this gist
>>> <https://gist.github.com/paladini/2e2ea913d545a407b842> to see the JSON
>>> I'm trying to load.
>>>
>>> Thanks!
>>> Fernando Paladini
>>>
>>
>

--
Fernando Paladini