Hi,
We can use combineByKey to achieve this:
// createCombiner keeps the first (product, key) tuple as-is;
// mergeValue and mergeCombiners nest the partial results into tuples.
val finalRDD = tempRDD.combineByKey(
  (x: (Any, Any)) => x,
  (acc: (Any, Any), x) => (acc, x),
  (acc1: (Any, Any), acc2: (Any, Any)) => (acc1, acc2))
finalRDD.collect.foreach(println)
(amazon,((book1, tech),(book2,tech)))
(barns&noble, (book,tech))
(eBa
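If you would rather get a proper collection per brand instead of nested tuples, one variant (just a sketch, not from the original reply) accumulates the values into a List:

// Sketch: build a List of (product, key) pairs per brand.
val combined = tempRDD.combineByKey(
  (v: (String, String)) => List(v),                                   // start a list with the first value
  (acc: List[(String, String)], v: (String, String)) => v :: acc,     // add a value within a partition
  (a: List[(String, String)], b: List[(String, String)]) => a ::: b)  // merge partial lists across partitions
combined.collect.foreach(println)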
On Wed, Mar 30, 2016 at 4:33 AM, Steve Loughran wrote:
>
>> On 29 Mar 2016, at 22:19, Michael Segel wrote:
>>
>> Hi,
>>
>> So yeah, I know that Spark jobs running on a Hadoop cluster will inherit its
>> security from the underlying YARN job.
>> However… that’s not really saying much when you think about some use cases.
Some answers and more questions inline
> - UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects
> as parameters. I cannot take in a case class object in place of the
> corresponding Row object, even if the schema matches because the Row object
> will always be passed in at Runtime.
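To make that constraint concrete (an illustration added here, not from the original message): a struct column reaches a Scala UDF as a Row, so the function has to pull fields out by name or position rather than declaring a matching case class as its parameter type. The column and field names below are made up.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// The struct arrives as a Row at runtime; extract the field we need by name.
val categoryOf = udf { (book: Row) => book.getAs[String]("category") }

// Hypothetical usage on a DataFrame with a struct column named "book":
// df.select(categoryOf(col("book")))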
+1 to Matei's reasoning.
On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia
wrote:
> I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the
> entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's
> the default version we built with in 1.x. We want to make the transition from 1.x to 2.0 as easy as possible.
One clarification: there *are* Python interpreters running on executors so
that Python UDFs and RDD API code can be executed. Some slightly-outdated
but mostly-correct reference material for this can be found at
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.
See also: search
Just to clarify, this is possible via UDF1/2/3 etc. and registering those
with the desired return schema. It just felt wrong that the only way to do
this in Scala was to use these classes, which live in the Java package.
Maybe the relevant question is, why are these in a Java package?
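For anyone trying the same thing, a rough sketch of that Scala-side registration with an explicit return schema (the function name, column names, and schema below are invented, and this assumes the sqlContext available in a 1.6 shell):

import org.apache.spark.sql.Row
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Desired schema of the struct the UDF returns.
val bookSchema = StructType(Seq(
  StructField("title", StringType),
  StructField("category", StringType)))

// Register a Java-style UDF1 with an explicit return DataType; it can then be
// called from SQL, e.g. sqlContext.sql("SELECT extractBook(record) FROM events").
sqlContext.udf.register("extractBook", new UDF1[Row, Row] {
  override def call(record: Row): Row = record.getAs[Row]("book")
}, bookSchema)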
On Wed, Mar 30
I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the entire
2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's the default
version we built with in 1.x. We want to make the transition from 1.x to 2.0 as
easy as possible. In 2.0, we'll have the default downloads
oh wow, had no idea it got ripped out
On Wed, Mar 30, 2016 at 11:50 AM, Mark Hamstra
wrote:
> No, with 2.0 Spark really doesn't use Akka:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744
>
> On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers wrote:
My concern is that for some of those stuck using 2.10 because of some
library dependency, three months isn't sufficient time to refactor their
infrastructure to be compatible with Spark 2.0.0 if that requires Scala
2.11. The additional 3-6 months would make it much more feasible for those
users to
No, with 2.0 Spark really doesn't use Akka:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744
On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers wrote:
> Spark still runs on akka. So if you want the benefits of the latest akka
> (not saying we do, was just an example) then you need to drop scala 2.10.
Hi all,
I've been trying for the last couple of days to define a UDF which takes in
a deeply nested Row object and performs some extraction to pull out a
portion of the Row and return it. This Row object is nested not just
with StructTypes but also with a bunch of ArrayTypes and MapTypes. From this compl
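As a rough sketch of that kind of extraction (with invented field names): nested structs arrive in the UDF as Row, arrays as Seq, and maps as Map, so the function walks the structure with getAs.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Invented layout: root struct -> "items" array of structs -> "attributes" map.
// Pull the "name" entry out of the first item's attributes map.
val firstItemName = udf { (root: Row) =>
  val items = root.getAs[Seq[Row]]("items")  // ArrayType(StructType(...))
  items.headOption
    .flatMap(_.getAs[Map[String, String]]("attributes").get("name"))  // MapType(StringType, StringType)
    .orNull
}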
Spark still runs on akka. So if you want the benefits of the latest akka
(not saying we do, was just an example) then you need to drop scala 2.10
On Mar 30, 2016 10:44 AM, "Cody Koeninger" wrote:
> I agree with Mark in that I don't see how supporting scala 2.10 for
> spark 2.0 implies supporting it for all of spark 2.x.
Yeah, it is not crazy to drop support for something foundational like this
in a feature release, but it is something ideally coupled to a major release.
You could at least say it is probably a decision to keep supporting it through
the end of the year, given how releases are likely to go. Given the
availabi
I agree with Mark in that I don't see how supporting scala 2.10 for
spark 2.0 implies supporting it for all of spark 2.x
Regarding Koert's comment on akka, I thought all akka dependencies
have been removed from spark after SPARK-7997 and the recent removal
of external/akka
On Wed, Mar 30, 2016 at
Dropping Scala 2.10 support has to happen at some point, so I'm not
fundamentally opposed to the idea; but I've got questions about how we go
about making the change and what degree of negative consequences we are
willing to accept. Until now, we have been saying that 2.10 support will
be continued
About that pro, I think it's more the opposite: many libraries have
stopped maintaining scala 2.10 versions. Bugs will no longer be fixed for
scala 2.10 and new libraries will not be available for scala 2.10 at all,
making them unusable in spark.
Take for example akka, a distributed messaging l
Maybe the question should be how far back Spark should be compatible?
There is nothing stopping people from running Spark 1.6.x with JDK 7, Scala 2.10,
or Hadoop < 2.6.
But if they want Spark 2.x, they should consider a migration to JDK 8 and Scala
2.11.
Or am I getting it all wrong?
Raymond Honderdo
Steve, those are good points; I had forgotten Hadoop had those issues. We
run with JDK 8, Hadoop is built for JDK 7 compatibility, and we are running Hadoop
2.7 on our clusters, so by the time Spark 2.0 is out I would expect a mix of
Hadoop 2.7 and 2.8. We also don't use SPNEGO.
I didn't quite
(This should fork as its own thread, though it began during discussion
of whether to continue Java 7 support in Spark 2.x.)
Simply: I would like to more clearly take the temperature of all
interested parties about whether to support Scala 2.10 in the Spark
2.x lifecycle. Some of the arguments appear
Can I note that if Spark 2.0 is going to be Java 8+ only, then that means
Hadoop 2.6.x should be the minimum Hadoop version.
https://issues.apache.org/jira/browse/HADOOP-11090
Where things get complicated is that situation of Hadoop services on Java 7
and Spark on Java 8 in its own JVM.
I'm not
> On 29 Mar 2016, at 22:19, Michael Segel wrote:
>
> Hi,
>
> So yeah, I know that Spark jobs running on a Hadoop cluster will inherit its
> security from the underlying YARN job.
> However… that’s not really saying much when you think about some use cases.
>
> Like using the thrift service
On 30 Mar 2016, at 04:44, Selvam Raman <sel...@gmail.com> wrote:
Hi,
I am using Spark 1.6.0 prebuilt for Hadoop 2.6.0 on my Windows machine.
I was trying to use the Databricks CSV format to read a CSV file. I used the
below command.
I got a null pointer exception. Any help would be greatly appreciated.
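For reference, reading a CSV with the spark-csv package on 1.6 usually looks roughly like the sketch below; the package version and file path are only illustrative, and this assumes the sqlContext from the 1.6 shell.

// Launched with something like: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // treat the first line as a header
  .option("inferSchema", "true")  // infer column types instead of all-strings
  .load("C:/data/sample.csv")     // hypothetical path on the Windows machine
df.show()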
Have a look at
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
Thanks
Best Regards
On Wed, Mar 30, 2016 at 12:09 AM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:
>
> Hi All,
>
> I have written a Spark program on my dev box,
> IDE: IntelliJ
>
I don't think at this point there is anything on security beyond what is
written here already http://spark.apache.org/docs/latest/security.html
Thanks
Best Regards
On Wed, Mar 30, 2016 at 4:19 AM, Michael Segel
wrote:
> Hi,
>
> So yeah, I know that Spark jobs running on a Hadoop cluster will inherit its security from the underlying YARN job.
Isn't that what tempRDD.groupByKey does?
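A sketch of that, using the tempRDD from the quoted question below:

// groupByKey collects every (product, key) pair for a brand into one Iterable,
// giving an RDD[(String, Iterable[(String, String)])].
val grouped = tempRDD.groupByKey()
grouped.collect.foreach(println)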
Thanks
Best Regards
On Wed, Mar 30, 2016 at 7:36 AM, Suniti Singh
wrote:
> Hi All,
>
> I have an RDD having the data in the following form :
>
> tempRDD: RDD[(String, (String, String))]
>
> (brand , (product, key))
>
> ("amazon",("book1","tech"))
>
> ("eB