That's a great link Michael, thanks!

For us it was about attempting to support dynamic schemas, which is a
bit of an anti-pattern.

Ultimately it just comes down to owning your transforms; all the basic
tools are there.
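For what it's worth, here's a minimal sketch of the pattern in plain Python
(outside Spark; the schema and field names are just illustrative): parse each
event against a schema you own, and push anything left over into an 'extras'
field for later analysis:

```python
import json

# The schema we "own": the only fields we promise to flatten into columns.
# (Field names here are made up for illustration.)
SCHEMA = ["certUUID", "effDt", "fileFmt", "status"]

def flatten(raw):
    """Parse one JSON event against a fixed schema; anything we don't
    recognise is kept in an 'extras' field instead of being dropped."""
    obj = json.loads(raw)
    # Pull out the columns we know about; missing keys become None.
    row = {key: obj.pop(key, None) for key in SCHEMA}
    # Whatever survived the pops is leftover; stash it for later analysis.
    row["extras"] = json.dumps(obj) if obj else None
    return row

row = flatten('{"certUUID": "abc-123", "fileFmt": "rjrCsv", "surprise": 1}')
```

In Spark that's the same shape as an rdd.map over the raw strings followed by
toDF with your fixed column list; it's verbose, but you own it.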



On 18 July 2017 at 11:03, Michael Armbrust <mich...@databricks.com> wrote:

> Here is an overview of how to work with complex JSON in Spark:
> https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
> (works in streaming and batch)
>
> On Tue, Jul 18, 2017 at 10:29 AM, Riccardo Ferrari <ferra...@gmail.com>
> wrote:
>
>> What's against:
>>
>> df.rdd.map(...)
>>
>> or
>>
>> dataset.foreach()
>>
>> https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.sql.Dataset@foreach(f:T=>Unit):Unit
>>
>> Best,
>>
>> On Tue, Jul 18, 2017 at 6:46 PM, lucas.g...@gmail.com <
>> lucas.g...@gmail.com> wrote:
>>
>>> I've been wondering about this for a while.
>>>
>>> We wanted to do something similar for generically saving thousands of
>>> individual homogeneous events into well-formed Parquet.
>>>
>>> Ultimately I couldn't find something I wanted to own and pushed back on
>>> the requirements.
>>>
>>> It seems the canonical answer is that you need to 'own' the schema of
>>> the JSON and parse it out manually into your dataframe.  There's
>>> nothing challenging about it, just verbose code.  If your 'info' is a
>>> consistent schema then you'll be fine.  For us it was 12 wildly diverging
>>> schemas and I didn't want to own the transforms.
>>>
>>> I also recommend persisting anything that isn't part of your schema in
>>> an 'extras' field.  So when you parse out your JSON, if you've got anything
>>> left over, drop it in there for later analysis.
>>>
>>> I can provide some sample code, but I think it's pretty straightforward /
>>> you can google it.
>>>
>>> What you can't seem to do efficiently is dynamically generate a
>>> dataframe from random JSON.
>>>
>>>
>>> On 18 July 2017 at 01:57, Chetan Khatri <chetan.opensou...@gmail.com>
>>> wrote:
>>>
>>>> Tried implicits - didn't work!
>>>>
>>>> from_json isn't supported in Spark 2.0.1; any alternate solution would
>>>> be welcome, please.
>>>>
>>>>
>>>> On Tue, Jul 18, 2017 at 12:18 PM, Georg Heiler <
>>>> georg.kf.hei...@gmail.com> wrote:
>>>>
>>>>> You need to have the Spark implicits in scope.
>>>>> Richard Xin <richardxin...@yahoo.com.invalid> schrieb am Di. 18. Juli
>>>>> 2017 um 08:45:
>>>>>
>>>>>> I believe you could use JOLT (bazaarvoice/jolt
>>>>>> <https://github.com/bazaarvoice/jolt>) to flatten it to a JSON
>>>>>> string and then to a dataframe or dataset.
>>>>>>
>>>>>> On Monday, July 17, 2017, 11:18:24 PM PDT, Chetan Khatri <
>>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> Explode is not working in this scenario; it errors because in Spark
>>>>>> explode requires an array or map column, not a string.
>>>>>> On Tue, Jul 18, 2017 at 11:39 AM, 刘虓 <ipf...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> have you tried to use explode?
>>>>>>
>>>>>> Chetan Khatri <chetan.opensou...@gmail.com> wrote on Tue, 18 Jul 2017 at 2:06 PM:
>>>>>>
>>>>>> Hello Spark Dev's,
>>>>>>
>>>>>> Can you please guide me, how to flatten JSON to multiple columns in
>>>>>> Spark.
>>>>>>
>>>>>> *Example:*
>>>>>>
>>>>>> Sr No | Title           | ISBN       | Info
>>>>>> 1     | Calculus Theory | 1234567890 | (JSON below)
>>>>>>
>>>>>> [{"cert":[{
>>>>>>     "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa",
>>>>>>     "certUUID":"03ea5a1a-5530-4fa3-8871-9d1ebac627c4",
>>>>>>     "effDt":"2016-05-06T15:04:56.279Z",
>>>>>>     "fileFmt":"rjrCsv",
>>>>>>     "status":"live"}],
>>>>>>   "expdCnt":"15",
>>>>>>   "mfgAcctNum":"531093",
>>>>>>   "oUUID":"23d07397-4fbe-4897-8a18-b79c9f64726c",
>>>>>>   "pgmRole":["RETAILER"],
>>>>>>   "pgmUUID":"1cb5dd63-817a-45bc-a15c-5660e4accd63",
>>>>>>   "regUUID":"cc1bd898-657d-40dc-af5d-4bf1569a1cc4",
>>>>>>   "rtlrsSbmtd":["009415da-c8cd-418d-869e-0a19601d79fa"]}]
>>>>>>
>>>>>> I want to get single row with 11 columns.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>
>>>
>>
>
