Thank you Jules / all for the guidance. I made a schema based on my JSON; could
you check whether this schema is correct?

*JSON String*

[{"cert":[{
"authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa",
"certUUID":"03ea5a1a-5530-4fa3-8871-9d1ebac627c4",
"effDt":"2016-05-06T15:04:56.279Z",
"fileFmt":"rjrCsv","status":"live"}],
"expdCnt":"15",
"mfgAcctNum":"531093",
"oUUID":"23d07397-4fbe-4897-8a18-b79c9f64726c",
"pgmRole":["RETAILER"],
"pgmUUID":"1cb5dd63-817a-45bc-a15c-5660e4accd63",
"regUUID":"cc1bd898-657d-40dc-af5d-4bf1569a1cc4",
"rtlrsSbmtd":["009415da-c8cd-418d-869e-0a19601d79fa"]}]

*Try1*

import org.apache.spark.sql.types._

val schema = ArrayType(StructType(Seq(
  StructField("cert", ArrayType(StructType(Seq(
    StructField("authSbmtr", StringType, true),
    StructField("certUUID", StringType, true),
    StructField("effDt", StringType, true),
    StructField("fileFmt", StringType, true),
    StructField("status", StringType, true)
  ))), true),
  StructField("expdCnt", StringType, true),
  StructField("mfgAcctNum", StringType, true),
  StructField("oUUID", StringType, true),
  StructField("pgmRole", ArrayType(StringType), true),
  StructField("pgmUUID", StringType, true),
  StructField("regUUID", StringType, true),
  StructField("rtlrsSbmtd", ArrayType(StringType), true)
)))


*try2*

import org.apache.spark.sql.types._

val schema = new StructType()
  .add("cert", ArrayType(new StructType()
    .add("authSbmtr", StringType)
    .add("certUUID", StringType)
    .add("effDt", StringType)
    .add("fileFmt", StringType)
    .add("status", StringType)
  ))
  .add("expdCnt", StringType)
  .add("mfgAcctNum", StringType)
  .add("oUUID", StringType)
  .add("pgmRole", ArrayType(StringType))
  .add("pgmUUID", StringType)
  .add("regUUID", StringType)
  .add("rtlrsSbmtd", ArrayType(StringType))
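As a usage sketch (assuming a running SparkSession named `spark`, a hypothetical
input path, and the `cert` field declared in `schema` as an array of structs):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("flatten-json").getOrCreate()
import spark.implicits._

// Apply the explicit schema instead of letting Spark infer one
// ("certs.json" is a hypothetical path holding the JSON string above).
val df = spark.read.schema(schema).json("certs.json")

// explode() turns each element of the cert array into its own row;
// selecting the cert.* fields then flattens the struct into columns.
val flat = df
  .withColumn("cert", explode($"cert"))
  .select(
    $"cert.authSbmtr", $"cert.certUUID", $"cert.effDt",
    $"cert.fileFmt", $"cert.status",
    $"expdCnt", $"mfgAcctNum", $"oUUID",
    $"pgmRole", $"pgmUUID", $"regUUID", $"rtlrsSbmtd")

flat.show(false)
```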

On Wed, Jul 19, 2017 at 6:42 PM, Jules Damji <ju...@databricks.com> wrote:

>
> Another tutorial that complements this, and shows how to work with and
> extract data from nested JSON columns:
> https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html
>
> Cheers,
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
>
>
> On Jul 19, 2017, at 3:00 AM, Chetan Khatri <chetan.opensou...@gmail.com>
> wrote:
>
> As I am a beginner, pseudocode would be highly appreciated if someone
> could provide it.
>
> On Tue, Jul 18, 2017 at 11:43 PM, lucas.g...@gmail.com <
> lucas.g...@gmail.com> wrote:
>
>> That's a great link Michael, thanks!
>>
>> For us it was around attempting to provide for dynamic schemas which is a
>> bit of an anti-pattern.
>>
>> Ultimately it just comes down to owning your transforms, all the basic
>> tools are there.
>>
>>
>>
>> On 18 July 2017 at 11:03, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Here is an overview of how to work with complex JSON in Spark:
>>> https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
>>> (works in streaming and batch)
>>>
>>> On Tue, Jul 18, 2017 at 10:29 AM, Riccardo Ferrari <ferra...@gmail.com>
>>> wrote:
>>>
>>>> What's wrong with:
>>>>
>>>> df.rdd.map(...)
>>>>
>>>> or
>>>>
>>>> dataset.foreach()
>>>>
>>>> https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.sql.Dataset@foreach(f:T=>Unit):Unit
>>>>
>>>> Best,
>>>>
>>>> On Tue, Jul 18, 2017 at 6:46 PM, lucas.g...@gmail.com <
>>>> lucas.g...@gmail.com> wrote:
>>>>
>>>>> I've been wondering about this for a while.
>>>>>
>>>>> We wanted to do something similar for generically saving thousands of
>>>>> individual homogeneous events into well-formed parquet.
>>>>>
>>>>> Ultimately I couldn't find something I wanted to own and pushed back
>>>>> on the requirements.
>>>>>
>>>>> It seems the canonical answer is that you need to 'own' the schema of
>>>>> the JSON and parse it out manually into your dataframe.  There's
>>>>> nothing challenging about it, just verbose code.  If your 'info' has a
>>>>> consistent schema then you'll be fine.  For us it was 12 wildly diverging
>>>>> schemas and I didn't want to own the transforms.
>>>>>
>>>>> I also recommend persisting anything that isn't part of your schema in
>>>>> an 'extras' field.  So when you parse out your JSON, if you've got anything
>>>>> left over, drop it in there for later analysis.
>>>>>
>>>>> I can provide some sample code but I think it's pretty straightforward
>>>>> / you can google it.
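>>>>> A rough sketch of that 'extras' pattern (all names here are hypothetical,
>>>>> not from an actual codebase): keep the fields you own as typed values and
>>>>> park everything else in a leftover map:

```scala
// Hypothetical event model: two owned fields plus an extras map that
// captures any JSON keys the schema doesn't know about.
case class Event(id: String, ts: String, extras: Map[String, String])

def toEvent(fields: Map[String, String]): Event = {
  val known = Set("id", "ts")
  Event(
    id = fields.getOrElse("id", ""),
    ts = fields.getOrElse("ts", ""),
    // Anything outside the known schema is preserved for later analysis.
    extras = fields.filterKeys(k => !known.contains(k)).toMap
  )
}
```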
>>>>>
>>>>> What you can't seem to do efficiently is dynamically generate a
>>>>> dataframe from random JSON.
>>>>>
>>>>>
>>>>> On 18 July 2017 at 01:57, Chetan Khatri <chetan.opensou...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I tried the implicits - that didn't work!
>>>>>>
>>>>>> from_json isn't supported in Spark 2.0.1; any alternate solution would
>>>>>> be welcome, please.
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 18, 2017 at 12:18 PM, Georg Heiler <
>>>>>> georg.kf.hei...@gmail.com> wrote:
>>>>>>
>>>>>>> You need to have the Spark implicits in scope.
>>>>>>> Richard Xin <richardxin...@yahoo.com.invalid> wrote on Tue, 18
>>>>>>> July 2017 at 08:45:
>>>>>>>
>>>>>>>> I believe you could use JOLT (bazaarvoice/jolt
>>>>>>>> <https://github.com/bazaarvoice/jolt>) to flatten it to a json
>>>>>>>> string and then to dataframe or dataset.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Monday, July 17, 2017, 11:18:24 PM PDT, Chetan Khatri <
>>>>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Explode is not working in this scenario; the error says that explode
>>>>>>>> in Spark can only be used on an array or map column, not a string.
>>>>>>>> On Tue, Jul 18, 2017 at 11:39 AM, 刘虓 <ipf...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> have you tried to use explode?
>>>>>>>>
>>>>>>>> Chetan Khatri <chetan.opensou...@gmail.com> wrote on Tue, Jul 18, 2017 at 2:06 PM:
>>>>>>>>
>>>>>>>> Hello Spark Dev's,
>>>>>>>>
>>>>>>>> Can you please guide me, how to flatten JSON to multiple columns in
>>>>>>>> Spark.
>>>>>>>>
>>>>>>>> *Example:*
>>>>>>>>
>>>>>>>> Sr No | Title           | ISBN       | Info
>>>>>>>> 1     | Calculus Theory | 1234567890 | (JSON below)
>>>>>>>>
>>>>>>>> [{"cert":[{
>>>>>>>> "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa",
>>>>>>>> "certUUID":"03ea5a1a-5530-4fa3-8871-9d1ebac627c4",
>>>>>>>> "effDt":"2016-05-06T15:04:56.279Z",
>>>>>>>> "fileFmt":"rjrCsv","status":"live"}],
>>>>>>>> "expdCnt":"15",
>>>>>>>> "mfgAcctNum":"531093",
>>>>>>>> "oUUID":"23d07397-4fbe-4897-8a18-b79c9f64726c",
>>>>>>>> "pgmRole":["RETAILER"],
>>>>>>>> "pgmUUID":"1cb5dd63-817a-45bc-a15c-5660e4accd63",
>>>>>>>> "regUUID":"cc1bd898-657d-40dc-af5d-4bf1569a1cc4",
>>>>>>>> "rtlrsSbmtd":["009415da-c8cd-418d-869e-0a19601d79fa"]}]
>>>>>>>>
>>>>>>>> I want to get a single row with 11 columns.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
