Looks correct to me. Try for example:

from pyspark.sql.functions import *
df.withColumn("value", explode(df['values'])).show()
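For reference, a minimal, self-contained sketch of what that explode call does. This is illustrative only: the column names "name" and "values" are assumptions based on the gist, and the sample row below is made up.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode

sc = SparkContext("local", appName="explode-example")
sqlContext = SQLContext(sc)

# One row with an array-typed "values" column (made-up sample data).
df = sqlContext.createDataFrame(
    [("point.1", [14.1, 14.2, 14.3])],
    ["name", "values"])

# explode() emits one output row per array element, so this single input
# row becomes three rows, each carrying one element in the new "value"
# column alongside the original columns.
df.withColumn("value", explode(df["values"])).show()

The original array column is kept next to the flattened one; chain .drop("values") before .show() if you only want scalar columns.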
On Mon, Oct 5, 2015 at 2:15 PM, Fernando Paladini <fnpalad...@gmail.com> wrote:

> Update:
>
> I've updated my code and now I have the following JSON:
> https://gist.github.com/paladini/27bb5636d91dec79bd56
> In the same link you can check the output from "spark-submit
> myPythonScript.py", where I call "myDataframe.show()". The following is
> printed by Spark (among other useless debug information):
>
>
> Is that correct for the given JSON input
> <https://gist.github.com/paladini/27bb5636d91dec79bd56> (gist link
> above)? How can I test whether Spark can understand this DataFrame and
> make complex manipulations with it?
>
> Thank you! Hope you can help me soon :3
> Fernando Paladini.
>
> 2015-10-05 15:23 GMT-03:00 Fernando Paladini <fnpalad...@gmail.com>:
>
>> Thank you for the replies, and sorry about the delay: my e-mail client
>> sent this conversation to Spam (??).
>>
>> I'll take a look at your tips and come back later to post my questions /
>> progress. Again, thank you so much!
>>
>> 2015-09-30 18:37 GMT-03:00 Michael Armbrust <mich...@databricks.com>:
>>
>>> I think the problem here is that you are passing in parsed JSON that is
>>> stored as a dictionary (which is converted to a HashMap when going into
>>> the JVM). You should instead be passing in the path to the JSON file
>>> (formatted as Akhil suggests) so that Spark can do the parsing in
>>> parallel. The other option would be to construct an RDD of JSON strings
>>> and pass that to the json method.
>>>
>>> On Wed, Sep 30, 2015 at 2:28 AM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> Each JSON doc should be on a single line, I guess.
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>>
>>>> Note that the file that is offered as *a json file* is not a typical
>>>> JSON file. Each line must contain a separate, self-contained valid JSON
>>>> object. As a consequence, a regular multi-line JSON file will most often
>>>> fail.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, Sep 29, 2015 at 11:07 AM, Fernando Paladini <
>>>> fnpalad...@gmail.com> wrote:
>>>>
>>>>> Hello guys,
>>>>>
>>>>> I'm very new to Spark and I'm having some trouble reading a JSON
>>>>> into a DataFrame on PySpark.
>>>>>
>>>>> I'm getting a JSON object from an API response and I would like to
>>>>> store it in Spark as a DataFrame (I've read that DataFrame is better
>>>>> than RDD; is that accurate?). From what I've read
>>>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext>
>>>>> in the documentation, I just need to call the method
>>>>> sqlContext.read.json in order to do what I want.
>>>>>
>>>>> *Following is the code from my test application:*
>>>>> json_object = json.loads(response.text)
>>>>> sc = SparkContext("local", appName="JSON to RDD")
>>>>> sqlContext = SQLContext(sc)
>>>>> dataframe = sqlContext.read.json(json_object)
>>>>> dataframe.show()
>>>>>
>>>>> *The problem is that when I run "spark-submit myExample.py" I get the
>>>>> following error:*
>>>>> 15/09/29 01:18:54 INFO BlockManagerMasterEndpoint: Registering block
>>>>> manager localhost:48634 with 530.0 MB RAM, BlockManagerId(driver,
>>>>> localhost, 48634)
>>>>> 15/09/29 01:18:54 INFO BlockManagerMaster: Registered BlockManager
>>>>> Traceback (most recent call last):
>>>>>   File "/home/paladini/ufxc/lisha/learning/spark-api-kairos/test1.py",
>>>>> line 35, in <module>
>>>>>     dataframe = sqlContext.read.json(json_object)
>>>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>>>> line 144, in json
>>>>>   File
>>>>> "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>>>> line 538, in __call__
>>>>>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
>>>>> 36, in deco
>>>>>   File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>>>> line 304, in get_return_value
>>>>> py4j.protocol.Py4JError: An error occurred while calling o21.json.
>>>>> Trace:
>>>>> py4j.Py4JException: Method json([class java.util.HashMap]) does not
>>>>> exist
>>>>>     at
>>>>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>>>>     at
>>>>> py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>>>>>     at py4j.Gateway.invoke(Gateway.java:252)
>>>>>     at
>>>>> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>>>>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>>>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> *What am I doing wrong?*
>>>>> Check out this gist
>>>>> <https://gist.github.com/paladini/2e2ea913d545a407b842> to see the
>>>>> JSON I'm trying to load.
>>>>>
>>>>> Thanks!
>>>>> Fernando Paladini
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Fernando Paladini
>>
>
>
>
> --
> Fernando Paladini
>
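Pulling Michael's and Akhil's suggestions together, below is a minimal sketch of the fix. It is an illustration, not code from the thread: the sample payload is made up (the real one is in the gist), and it assumes a Spark 1.5.x deployment, matching the py4j 0.8.2.1 paths in the traceback, where sqlContext.jsonRDD() parses an RDD of JSON strings (newer releases also accept such an RDD directly in sqlContext.read.json).

import json
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", appName="JSON to DataFrame")
sqlContext = SQLContext(sc)

# Stand-in for response.text from the API call; assumes the body is a
# JSON array of objects (wrap a single object in a list if it isn't).
raw = '[{"name": "point.1", "value": 14.1}, {"name": "point.2", "value": 14.2}]'

# Re-serialize each record so that every RDD element is one separate,
# self-contained JSON document, as the programming guide requires.
records = json.loads(raw)
json_rdd = sc.parallelize([json.dumps(r) for r in records])

# Hand Spark strings, not the parsed dict/list: a dict is what crossed
# the gateway as a java.util.HashMap and caused the Py4JError above.
dataframe = sqlContext.jsonRDD(json_rdd)
dataframe.show()

The other route Michael mentions is to write the records to a file, one JSON object per line, and pass the path to sqlContext.read.json("/path/to/data.json") so the parsing itself happens in parallel.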