[Spark Structured Streaming] Metrics for latency or performance of checkpointing

2019-02-23 Thread subramgr
Hi all, We are using Spark 2.3 and we are facing a weird issue. The streaming job works perfectly fine in one environment but in another environment it does not. We suspect that in the low-performing environment, checkpointing of the state is not as performant as in the other. Is there any metric
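One place such numbers are exposed is the per-micro-batch progress object: its `durationMs` map (in Spark 2.3 the keys include `walCommit`, the write-ahead-log/checkpoint commit time) and its `stateOperators` array carry state-store sizes and timings. A minimal sketch of a listener that surfaces them — the object and helper names here are illustrative, not from the thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object CheckpointMetrics {
  // Pure helper: extract the WAL/checkpoint commit time from the
  // durationMs map reported with every micro-batch (Spark 2.3 key name).
  def commitMs(durationMs: java.util.Map[String, java.lang.Long]): Option[Long] =
    Option(durationMs.get("walCommit")).map(_.longValue)

  // Register a listener that logs timing and state-store metrics
  // after each micro-batch completes.
  def register(spark: SparkSession): Unit =
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(e: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(e: QueryProgressEvent): Unit = {
        val p = e.progress
        println(s"batch=${p.batchId} durationMs=${p.durationMs}")
        p.stateOperators.foreach { s =>
          println(s"  numRowsTotal=${s.numRowsTotal} memoryUsedBytes=${s.memoryUsedBytes}")
        }
      }
    })
}
```

Comparing `walCommit` and `stateOperators` across the two environments would show whether state checkpointing is in fact the slow part.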

Re: Spark 2.4 partitions and tasks

2019-02-23 Thread Yeikel
I am following up on this question because I have a similar issue. When is it that we need to control the parallelism manually? Skewed partitions?
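Skew is the usual reason: the default shuffle partitioning can leave a few oversized partitions that dominate task time, and `repartition` by a well-distributed key (or to an explicit partition count) rebalances them. A small sketch under those assumptions — the object name is illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ControlParallelism {
  // Repartition a DataFrame to an explicit partition count, hash-partitioned
  // by a column; returns the resulting number of partitions.
  def repartitioned(): Int = {
    val spark = SparkSession.builder
      .master("local[4]")
      .appName("parallelism")
      .getOrCreate()
    import spark.implicits._

    val df = (1 to 1000).toDF("id")
    // Hash-partition by "id" into exactly 8 partitions instead of relying
    // on spark.sql.shuffle.partitions (200 by default).
    val balanced = df.repartition(8, $"id")
    val n = balanced.rdd.getNumPartitions
    spark.stop()
    n
  }
}
```

Lowering `spark.sql.shuffle.partitions` helps when data is small; repartitioning by a less skewed key helps when one key dominates.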

RE: Difference between Typed and untyped transformation in dataset API

2019-02-23 Thread email
From what I understand, if the transformation is untyped it will return a DataFrame, otherwise it will return a Dataset. In the source code you will see that the return type is a DataFrame instead of a Dataset, and they should also be annotated with @group untypedrel. Thus, you could check th
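The distinction can be seen directly in the return types: a column-based `select` loses the element type and yields a `DataFrame` (an alias for `Dataset[Row]`), while `map` keeps a concrete element type. A minimal sketch, with illustrative names:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object TypedVsUntyped {
  case class Person(name: String, age: Int)

  // Returns (columns of the untyped result, first element of the typed result).
  def demo(): (Seq[String], String) = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("typed-vs-untyped")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("ann", 30)).toDS()

    // Untyped transformation: operates on column names,
    // erases Person to Row -> DataFrame.
    val untyped: DataFrame = ds.select($"name")

    // Typed transformation: a Scala function over Person,
    // keeps a concrete element type -> Dataset[String].
    val typed: Dataset[String] = ds.map(_.name)

    val res = (untyped.columns.toSeq, typed.first())
    spark.stop()
    res
  }
}
```

The compiler catches mistakes in `map(_.name)` at compile time; a typo in `$"name"` only fails at analysis time.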

RE: How can I parse an "unnamed" json array present in a column?

2019-02-23 Thread email
What you suggested works in Spark 2.3, but in the version that I am using (2.1) it produces the following exception: found: org.apache.spark.sql.types.ArrayType required: org.apache.spark.sql.types.StructType ds.select(from_json($"news", schema) as "news_parsed").show(false)

Re: How can I parse an "unnamed" json array present in a column?

2019-02-23 Thread Magnus Nilsson
Use spark.sql.types.ArrayType instead of a Scala Array as the root type when you define the schema and it will work. Regards, Magnus On Fri, Feb 22, 2019 at 11:15 PM Yeikel wrote: > I have an "unnamed" json array stored in a *column*. > > The format is the following: > > column name: news >
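A sketch of that suggestion on Spark 2.3+ (the `DataType` overload of `from_json` that accepts an `ArrayType` root is not available in 2.1, matching the exception above); the column and field names are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

object UnnamedJsonArray {
  // Parse a column holding a top-level JSON array of objects and
  // return the "title" field of each element.
  def titles(): Seq[String] = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("unnamed-json-array")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("""[{"title":"a"},{"title":"b"}]""").toDF("news")

    // Root type is ArrayType(StructType), not a Scala Array.
    val schema = ArrayType(new StructType().add("title", StringType))

    val parsed = df.select(from_json($"news", schema) as "news_parsed")
    val result = parsed
      .select(explode($"news_parsed") as "item")
      .select($"item.title")
      .as[String]
      .collect()
      .toSeq
    spark.stop()
    result
  }
}
```

`explode` then turns the parsed array into one row per element, so the struct fields can be selected normally.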