Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Michael Armbrust

I think you should be able to write an Aggregator. You probably want to run
in update mode if you are looking for it to output any group that has changed
in the batch.
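
A minimal sketch of what that could look like, assuming a hypothetical
Event(key, count) case class and a count-summing merge standing in for the
types in the original job:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Event(key: String, count: Long)

// Typed aggregator that merges all events sharing a key (illustrative names).
object MergeEvents extends Aggregator[Event, Event, Event] {
  def zero: Event = Event("", 0L)
  // fold one input event into the running buffer
  def reduce(buf: Event, e: Event): Event = merge(buf, e)
  // combine two partial buffers; this is what enables partial aggregation
  // before the shuffle, analogous to reduceByKey's map-side combine
  def merge(a: Event, b: Event): Event =
    Event(if (a.key.isEmpty) b.key else a.key, a.count + b.count)
  def finish(buf: Event): Event = buf
  def bufferEncoder: Encoder[Event] = Encoders.product[Event]
  def outputEncoder: Encoder[Event] = Encoders.product[Event]
}

Wired into a streaming query in update mode (again a sketch, assuming events
is a streaming Dataset[Event]):

val query = events
  .groupByKey(_.key)
  .agg(MergeEvents.toColumn)
  .writeStream
  .outputMode("update")  // emit only the groups that changed in each micro-batch
  .format("console")
  .start()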

On Wed, Oct 25, 2017 at 5:52 PM, Piyush Mukati wrote:

> Hi,
> we are migrating some jobs from DStream to Structured Streaming.
>
> Currently, to handle aggregations, we call map and reduceByKey on each RDD,
> like:
> rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b))
>
> The final output of each RDD is merged to the sink, which supports
> aggregation (like a coprocessor in HBase).
>
> In the new Dataset API, I cannot find any suitable API to aggregate over
> the micro-batch.
> Most of the aggregation APIs use the state store and provide global
> aggregations (with append mode, they do not emit changes to existing
> buckets).
> The problem we suspect is:
>  1) The state store is tightly linked to the job definition, while in our
> case we may want to edit the job while keeping the previously calculated
> aggregates as they are.
>
> The desired result can be achieved with the Dataset APIs below:
> dataset.groupByKey(a => a._1).mapGroups((key, valueItr) => merge(valueItr))
> but on inspecting the physical plan, it does not perform any merge before
> the sort.
>
>  Is anyone aware of an API or other workaround to get the desired result?
>
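
Following up on the mapGroups observation quoted above: a quick way to see the
difference is to compare the physical plans (a sketch, reusing the
hypothetical Event dataset and MergeEvents aggregator from the reply above,
with spark.implicits._ in scope):

// mapGroups: every value for a key is shuffled before any merging happens
ds.groupByKey(_.key)
  .mapGroups((key, events) => events.reduce((a, b) => MergeEvents.merge(a, b)))
  .explain()

// typed Aggregator: the plan shows partial aggregation before the exchange
ds.groupByKey(_.key)
  .agg(MergeEvents.toColumn)
  .explain()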


Spark-XML maintenance

2017-10-26 Thread comtef
I've used Spark for a couple of years and I've found a way to contribute to
the cause :).
I've found a blocker in the Spark XML extension
(https://github.com/databricks/spark-xml). I'd like to know if this is the
right place to discuss issues about this extension?

I've opened a PR to address this problem, but it's been open for a few months
now without any review...






Re: Spark-XML maintenance

2017-10-26 Thread Jörn Franke
I would raise this issue with Databricks - it is their repository.

> On 26. Oct 2017, at 18:43, comtef  wrote:
> 
> I've used Spark for a couple of years and I've found a way to contribute to
> the cause :).
> I've found a blocker in the Spark XML extension
> (https://github.com/databricks/spark-xml). I'd like to know if this is the
> right place to discuss issues about this extension?
> 
> I've opened a PR to address this problem, but it's been open for a few
> months now without any review...
> 



Re: Kicking off the process around Spark 2.2.1

2017-10-26 Thread Felix Cheung
Yes! I can take on RM for 2.2.1.

We are still working out what to do with temp files created by Hive and Java 
that cause the policy issue with CRAN and will report back shortly, hopefully.


From: Sean Owen 
Sent: Wednesday, October 25, 2017 4:39:15 AM
To: Holden Karau
Cc: Felix Cheung; dev@spark.apache.org
Subject: Re: Kicking off the process around Spark 2.2.1

It would be reasonably consistent with the timing of other x.y.1 releases, and
having more release managers sounds useful, yeah.

Note also that in theory the code freeze for 2.3.0 starts in about 2 weeks.

On Wed, Oct 25, 2017 at 12:29 PM Holden Karau <hol...@pigscanfly.ca> wrote:
Now that Spark 2.1.2 is out, it seems like a good time to get started on the
Spark 2.2.1 release. There are some streaming fixes I'm aware of that would be
good to get into a release; is there anything else people are working on for
2.2.1 that we should be tracking?

To switch it up, I'd like to suggest Felix as the RM for this, since there are
also likely some R packaging changes to be included in the release. This also
gives us a chance to see if my updated release documentation is enough for a
new RM to get started from.

What do folks think?
--
Twitter: https://twitter.com/holdenkarau


Re: Spark-XML maintenance

2017-10-26 Thread Reynold Xin
Adding Hyukjin, who has been maintaining it.

The easiest is probably to leave comments in the repo.

On Thu, Oct 26, 2017 at 9:44 AM Jörn Franke  wrote:

> I would raise this issue with Databricks - it is their repository.
>
> > On 26. Oct 2017, at 18:43, comtef  wrote:
> >
> > I've used Spark for a couple of years and I've found a way to contribute
> > to the cause :).
> > I've found a blocker in the Spark XML extension
> > (https://github.com/databricks/spark-xml). I'd like to know if this is
> > the right place to discuss issues about this extension?
> >
> > I've opened a PR to address this problem, but it's been open for a few
> > months now without any review...


Does anyone know how to build and run Spark on JDK 9?

2017-10-26 Thread Zhang, Liyun
Hi all:
1.   I want to build Spark on JDK 9 and test it with Hadoop in a JDK 9 env. I
searched for JIRAs related to JDK 9 and only found SPARK-13278. Does this mean
Spark can now build and run successfully on JDK 9?


Best Regards
Kelly Zhang/Zhang,Liyun



Re: Does anyone know how to build and run Spark on JDK 9?

2017-10-26 Thread Reynold Xin
It probably depends on the Scala version we use in Spark supporting Java 9
first.
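
As a quick sanity check (a sketch, not from the original thread), a Spark job
can print the JVM and Scala versions it actually sees, since Java 9 support
hinges on the Scala version Spark is compiled against:

object VersionCheck {
  def main(args: Array[String]): Unit = {
    // the JVM executing this code
    println("java.version  = " + System.getProperty("java.version"))
    // the Scala library on the classpath
    println("scala.version = " + scala.util.Properties.versionNumberString)
  }
}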

On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun  wrote:

> Hi all:
>
> 1.   I want to build Spark on JDK 9 and test it with Hadoop in a JDK 9
> env. I searched for JIRAs related to JDK 9 and only found SPARK-13278.
> Does this mean Spark can now build and run successfully on JDK 9?
>
>
>
>
>
> Best Regards
>
> Kelly Zhang/Zhang,Liyun
>
>
>


RE: Does anyone know how to build and run Spark on JDK 9?

2017-10-26 Thread Zhang, Liyun
Thanks for your suggestion; it seems that Scala 2.12.4 supports JDK 9.


Scala 2.12.4 is now available.

Our benchmarks show a further reduction in compile times since 2.12.3 of
5-10%.

Improved Java 9 friendliness, with more to come!

Best Regards
Kelly Zhang/Zhang,Liyun





From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, October 27, 2017 10:26 AM
To: Zhang, Liyun ; dev@spark.apache.org; 
u...@spark.apache.org
Subject: Re: Does anyone know how to build and run Spark on JDK 9?

It probably depends on the Scala version we use in Spark supporting Java 9 
first.

On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun <liyun.zh...@intel.com> wrote:
Hi all:
1.   I want to build Spark on JDK 9 and test it with Hadoop in a JDK 9 env. I
searched for JIRAs related to JDK 9 and only found SPARK-13278. Does this mean
Spark can now build and run successfully on JDK 9?


Best Regards
Kelly Zhang/Zhang,Liyun