Your lambda expressions on the RDDs in the SecondRollup class are closing
around the context, and Spark has special logic to ensure that all variables in
a closure used on an RDD are Serializable - I hate linking to Quora, but
there's a good explanation here:
http://www.quora.com/What-does-Clos
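For illustration, a minimal sketch of the pattern that typically triggers this, plus one common fix; only the class name SecondRollup comes from the thread, the body is invented:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Problematic pattern: the lambda references a constructor parameter,
    // which in Scala becomes a field, so the closure captures `this` --
    // and with it the non-serializable SparkContext.
    class SecondRollup(sc: SparkContext, threshold: Int) {
      def run(lines: RDD[String]): RDD[String] =
        lines.filter(line => line.length > threshold)   // captures `this`
    }

    // Common fix: copy what the closure needs into a local val, so only a
    // small serializable value is shipped to the executors.
    class SecondRollupFixed(sc: SparkContext, threshold: Int) {
      def run(lines: RDD[String]): RDD[String] = {
        val localThreshold = threshold
        lines.filter(line => line.length > localThreshold)
      }
    }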
I believe groupByKey currently requires that all items for a specific key fit
into a single executor's memory:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
This previous discussion has some pointers if you must
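A minimal sketch of the difference, using a word count; the path and names are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("wordcount-sketch"))
    val pairs = sc.textFile("hdfs:///logs/*")        // path is illustrative
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // groupByKey ships every value for a key to a single executor before
    // summing, so a hot key can blow past that executor's memory.
    val grouped = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side first, so only partial sums
    // are shuffled.
    val reduced = pairs.reduceByKey(_ + _)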
To be fair, this is a long-standing issue due to optimizations for object reuse
in the Hadoop API, and isn't necessarily a failing in Spark - see this blog
post
(https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/)
from 2011 documenting a
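A sketch of the pitfall and the usual workaround, assuming the records come in through the old mapred API; the types and path are assumptions, not from the original thread:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("reuse-sketch"))

    // hadoopFile hands back the *same* Writable instances for every record,
    // so collecting or caching the raw objects yields identical values.
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///logs")

    // Copy the data out of the reused Writables before caching/collecting.
    val safe = raw.map { case (offset, text) => (offset.get, text.toString) }
    safe.cache()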
Check out this recent post by Cheng Lian regarding dynamic partitioning in
Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html
On June 13, 2015, at 5:41 AM, Hao Wang wrote:
Hi,
I have a bunch of large log files on Hadoop. Each line contains a log and its
severity. Is
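A minimal sketch of what that looks like with the Spark 1.4 DataFrame writer; the schema, column name, and paths below are assumptions, not from the original thread:

    import org.apache.spark.sql.SQLContext

    case class LogLine(severity: String, message: String)   // illustrative schema

    val sqlContext = new SQLContext(sc)                      // sc: existing SparkContext
    import sqlContext.implicits._

    val logs = sc.textFile("hdfs:///logs/*")
      .map(_.split("\t", 2))
      .collect { case Array(sev, msg) => LogLine(sev, msg) }
      .toDF()

    // Writes one directory per severity value, e.g. .../severity=ERROR/part-*
    logs.write.partitionBy("severity").parquet("hdfs:///logs_by_severity")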
The context that is created by spark-shell is actually an instance of
HiveContext. If you want to use it programmatically in your driver, you need to
make sure that your context is a HiveContext, and not a SQLContext.
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
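For a standalone driver, that looks roughly like this (the query and table name are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-sketch"))

    // spark-shell gives you a HiveContext automatically; a standalone driver
    // has to construct one to get access to Hive tables.
    val sqlContext = new HiveContext(sc)
    sqlContext.sql("SELECT severity, count(*) FROM logs GROUP BY severity").show()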
If you are working on large structures, you probably want to look at the GraphX
extension to Spark:
https://spark.apache.org/docs/latest/graphx-programming-guide.html
On June 14, 2015, at 10:50 AM, lisp wrote:
Hi there,
I have a large amount of objects, which I have to partition into chunks w
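A minimal GraphX sketch, just to show the shape of the API; the vertex and edge data are made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch"))

    // Vertices are (Long id, attribute) pairs; edges carry their own attribute.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0)))
    val graph    = Graph(vertices, edges)

    // connectedComponents is one built-in way to group related objects.
    val components = graph.connectedComponents().vertices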
In general, you should avoid making direct changes to the Spark source code. If
you are using Scala, you can seamlessly blend your own methods on top of the
base RDDs using implicit conversions.
Regards,
Will
On June 16, 2015, at 7:53 PM, raggy wrote:
I am trying to submit a spark application
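A sketch of that approach; the method name and body are invented, the point is only the implicit-class pattern:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    object CustomRddOps {
      // Adds a custom method to every RDD without touching RDD.scala.
      implicit class RichRDD[T: ClassTag](rdd: RDD[T]) {
        // Example operation: keep only the first element of each partition.
        def firstPerPartition(): RDD[T] =
          rdd.mapPartitions(_.take(1))
      }
    }

    // Usage in the driver:
    //   import CustomRddOps._
    //   val heads = myRdd.firstPerPartition()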
The programming models for the two frameworks are conceptually rather
different; I haven't worked with Storm for quite some time, but based on my old
experience with it, I would equate Spark Streaming more with Storm's Trident
API than with the raw Bolt API. Even then, there are signific
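For comparison, a minimal Spark Streaming sketch showing the micro-batch model; the host, port, and batch interval are arbitrary:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Spark Streaming chops the input into small batches (2-second RDDs here)
    // and runs ordinary RDD operations on each batch, rather than processing
    // one tuple at a time the way a raw Storm bolt does.
    val conf = new SparkConf().setAppName("streaming-sketch")
    val ssc  = new StreamingContext(conf, Seconds(2))

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()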
duce(). A member
on here suggested I make the change in RDD.scala to accomplish that. Also, this
is for a research project, and not for commercial use.
So, any advice on how I can get spark-submit to use my custom-built jars
would be very useful.
Thanks,
Raghav
> On Jun 16, 2015, at 6:57 PM,
It sounds like accumulators are not necessary in Spark Streaming - see this
post (
http://apache-spark-user-list.1001560.n3.nabble.com/Shared-variable-in-Spark-Streaming-td11762.html)
for more details.
On June 21, 2015, at 7:31 PM, anshu shukla wrote:
In Spark Streaming, since we are already
Kryo serialization is used internally by Spark for spilling or shuffling
intermediate results, not for writing out an RDD as an action. Look at Sandy
Ryza's examples for some hints on how to do this:
https://github.com/sryza/simplesparkavroapp
Regards,
Will
On July 3, 2015, at 2:45 AM, Dominik
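To make the distinction concrete, a sketch of Kryo enabled as the internal serializer while the output of an action still goes through a normal output format; the paths are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.serializer only governs Spark's internal serialization (shuffle
    // blocks, spills, serialized caching); it does not change the on-disk
    // format produced by an action like saveAsTextFile.
    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val sums = sc.parallelize(1 to 1000)
      .map(i => (i % 10, i))
      .reduceByKey(_ + _)                     // this shuffle uses Kryo internally

    sums.saveAsTextFile("hdfs:///out/sums")   // plain text output, not Kryo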
That code doesn't appear to be registering classes with Kryo, which means the
fully-qualified classname is stored with every Kryo record. The Spark
documentation has more on this:
https://spark.apache.org/docs/latest/tuning.html#data-serialization
Regards,
Will
On July 5, 2015, at 2:31 AM, Gav
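A sketch of registering classes up front; the case class is made up:

    import org.apache.spark.{SparkConf, SparkContext}

    case class LogEvent(timestamp: Long, severity: String, message: String)

    // Registered classes are written as small numeric ids instead of their
    // fully qualified names, which shrinks every serialized record.
    val conf = new SparkConf()
      .setAppName("kryo-registration-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[LogEvent]))

    val sc = new SparkContext(conf)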