Thank you for the replies, and sorry about the delay: my e-mail client sent
this conversation to Spam (??).

I'll take a look at your tips and come back later to post my questions /
progress. Again, thank you so much!

2015-09-30 18:37 GMT-03:00 Michael Armbrust <mich...@databricks.com>:

> I think the problem here is that you are passing in parsed JSON that is
> stored as a dictionary (which is converted to a HashMap when going into the
> JVM).  You should instead be passing in the path to the JSON file
> (formatted as Akhil suggests) so that Spark can do the parsing in
> parallel.  The other option would be to construct an RDD of JSON strings
> and pass that to the json method.
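>
> For example (a minimal sketch; "people.json" and the records list are
> hypothetical stand-ins for your own data):
>
> # Option 1: give Spark the path, so it parses the file in parallel
> # (one JSON object per line, as Akhil describes).
> df = sqlContext.read.json("people.json")
>
> # Option 2: build an RDD of JSON strings and pass that in. On older
> # 1.x releases the equivalent call is sqlContext.jsonRDD(rdd).
> import json
> rdd = sc.parallelize([json.dumps(rec) for rec in records])
> df = sqlContext.read.json(rdd)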
>
> On Wed, Sep 30, 2015 at 2:28 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Each JSON doc should be on a single line, I guess.
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>
>> Note that the file that is offered as *a json file* is not a typical
>> JSON file. Each line must contain a separate, self-contained valid JSON
>> object. As a consequence, a regular multi-line JSON file will most often
>> fail.
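>>
>> For example, a hypothetical two-record file like this loads fine, while
>> the same records pretty-printed across multiple lines would not:
>>
>> {"name": "Michael", "age": 29}
>> {"name": "Andy", "age": 30}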
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Sep 29, 2015 at 11:07 AM, Fernando Paladini <fnpalad...@gmail.com
>> > wrote:
>>
>>> Hello guys,
>>>
>>> I'm very new to Spark and I'm having some trouble reading JSON into a
>>> DataFrame in PySpark.
>>>
>>> I'm getting a JSON object from an API response and I would like to store
>>> it in Spark as a DataFrame (I've read that a DataFrame is better than an
>>> RDD; is that accurate?). From what I've read
>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext>
>>> in the documentation, I just need to call the method sqlContext.read.json
>>> to do what I want.
>>>
>>> *Following is the code from my test application:*
>>> import json
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>>
>>> # "response" is the API response mentioned above
>>> json_object = json.loads(response.text)
>>> sc = SparkContext("local", appName="JSON to RDD")
>>> sqlContext = SQLContext(sc)
>>> dataframe = sqlContext.read.json(json_object)
>>> dataframe.show()
>>>
>>> *The problem is that when I run "spark-submit myExample.py" I get the
>>> following error:*
>>> 15/09/29 01:18:54 INFO BlockManagerMasterEndpoint: Registering block
>>> manager localhost:48634 with 530.0 MB RAM, BlockManagerId(driver,
>>> localhost, 48634)
>>> 15/09/29 01:18:54 INFO BlockManagerMaster: Registered BlockManager
>>> Traceback (most recent call last):
>>>   File "/home/paladini/ufxc/lisha/learning/spark-api-kairos/test1.py",
>>> line 35, in <module>
>>>     dataframe = sqlContext.read.json(json_object)
>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>> line 144, in json
>>>   File
>>> "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line
>>> 538, in __call__
>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
>>> 36, in deco
>>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>> line 304, in get_return_value
>>> py4j.protocol.Py4JError: An error occurred while calling o21.json. Trace:
>>> py4j.Py4JException: Method json([class java.util.HashMap]) does not exist
>>>     at
>>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>>     at
>>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>>>     at py4j.Gateway.invoke(Gateway.java:252)
>>>     at
>>> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> *What am I doing wrong?*
>>> Check out this gist
>>> <https://gist.github.com/paladini/2e2ea913d545a407b842> to see the JSON
>>> I'm trying to load.
>>>
>>> Thanks!
>>> Fernando Paladini
>>>
>>
>>
>


-- 
Fernando Paladini
