Re: SparkContext & Threading

2015-06-05 Thread Will Briggs
Your lambda expressions on the RDDs in the SecondRollup class are closing over the context, and Spark has special logic to ensure that all variables in a closure used on an RDD are Serializable - I hate linking to Quora, but there's a good explanation here: http://www.quora.com/What-does-Clos
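A minimal sketch of the failure mode and the usual fix, assuming a class shaped roughly like the thread's SecondRollup (the field name and values here are hypothetical): copy any field you need into a local val before using it in the lambda, so the closure captures only that val instead of `this` and the SparkContext.

```scala
import org.apache.spark.SparkContext

// Hypothetical reconstruction of the pattern from the thread's SecondRollup class.
class SecondRollup(sc: SparkContext) {
  val multiplier = 2

  def run(): Array[Int] = {
    val rdd = sc.parallelize(1 to 10)
    // rdd.map(_ * multiplier) would capture `this` (and therefore `sc`),
    // failing with "Task not serializable".
    val m = multiplier        // copy the field into a local val
    rdd.map(_ * m).collect()  // the closure now captures only `m`
  }
}
```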

Re: SparkContext & Threading

2015-06-06 Thread Will Briggs
Lee McFadden wrote: On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrote: Your lambda expressions on the RDDs in the SecondRollup class are closing over the context, and Spark has special logic to ensure that all variables in a closure used on an RDD are Serializable - I hate linking to Quora

Re: write multiple outputs by key

2015-06-06 Thread Will Briggs
I believe groupByKey currently requires that all items for a specific key fit into a single executor's memory: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html This previous discussion has some pointers if you must
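A small sketch of the best practice the linked page describes, with made-up sample data: reduceByKey combines values map-side before the shuffle, so no single executor ever has to hold all values for a key.

```scala
import org.apache.spark.SparkContext

object WordCounts {
  def run(sc: SparkContext): Unit = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // groupByKey ships every value for a key to one executor before summing:
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines map-side first, so only partial sums cross the network:
    val viaReduce = pairs.reduceByKey(_ + _)

    viaReduce.collect().foreach(println)
  }
}
```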

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Will Briggs
To be fair, this is a long-standing issue due to optimizations for object reuse in the Hadoop API, and isn't necessarily a failing in Spark - see this blog post (https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/) from 2011 documenting a
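A hedged illustration of the reuse pitfall and the common workaround (the path and the Text key/value types are assumptions): materialize immutable copies of the reused Writables before calling distinct().

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.SparkContext

object DistinctKeys {
  def run(sc: SparkContext, path: String): Unit = {
    // The Hadoop input formats reuse the same Writable instances across
    // records, so distinct() on the raw objects can see one mutating value.
    val raw = sc.sequenceFile(path, classOf[Text], classOf[Text])

    // Copy to immutable Strings before distinct() to break the aliasing:
    val keys = raw.map { case (k, _) => k.toString }.distinct()
    keys.collect().foreach(println)
  }
}
```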

Re: How to split log data into different files according to severity

2015-06-13 Thread Will Briggs
Check out this recent post by Cheng Lian regarding dynamic partitioning in Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html On June 13, 2015, at 5:41 AM, Hao Wang wrote: Hi, I have a bunch of large log files on Hadoop. Each line contains a log and its severity. Is
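A sketch of what that looks like with Spark 1.4's DataFrameWriter, assuming the logs have been parsed into a DataFrame with a "severity" column (the paths and schema here are placeholders):

```scala
import org.apache.spark.sql.SQLContext

object SplitBySeverity {
  def run(sqlContext: SQLContext): Unit = {
    // Assumed input: log records with a "severity" column.
    val logs = sqlContext.read.json("hdfs:///logs/input")

    // Spark 1.4's DataFrameWriter creates one directory per distinct key:
    logs.write
      .partitionBy("severity")   // e.g. .../severity=ERROR/part-...
      .format("parquet")
      .save("hdfs:///logs/by-severity")
  }
}
```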

Re: Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread Will Briggs
The context that is created by spark-shell is actually an instance of HiveContext. If you want to use it programmatically in your driver, you need to make sure that your context is a HiveContext, and not a SQLContext. https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables H
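A minimal driver sketch, assuming a Hive-enabled build of Spark (the input path and table name are hypothetical): construct a HiveContext instead of a SQLContext so saveAsTable can create persistent tables.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveDriver {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-tables"))

    // A plain SQLContext can only register TEMPORARY tables;
    // HiveContext persists tables through the Hive metastore.
    val sqlContext = new HiveContext(sc)

    val df = sqlContext.read.json("hdfs:///some/input.json") // assumed path
    df.write.saveAsTable("my_table")
  }
}
```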

Re: creation of RDD from a Tree

2015-06-14 Thread Will Briggs
If you are working on large structures, you probably want to look at the GraphX extension to Spark: https://spark.apache.org/docs/latest/graphx-programming-guide.html On June 14, 2015, at 10:50 AM, lisp wrote: Hi there, I have a large amount of objects, which I have to partition into chunks w
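For a feel of the API, a toy sketch of representing a tree in GraphX (the vertex labels and edge attributes are made up): vertices and edges are just RDDs, so the structure is partitioned across the cluster automatically.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}

object TreeGraph {
  def run(sc: SparkContext): Unit = {
    // A toy tree: vertex ids with labels, edges pointing parent -> child.
    val vertices = sc.parallelize(Seq((1L, "root"), (2L, "left"), (3L, "right")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "child"), Edge(1L, 3L, "child")))

    val tree = Graph(vertices, edges)
    println(tree.numVertices) // 3
  }
}
```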

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Will Briggs
In general, you should avoid making direct changes to the Spark source code. If you are using Scala, you can seamlessly add your own methods on top of the base RDDs using implicit conversions. Regards, Will On June 16, 2015, at 7:53 PM, raggy wrote: I am trying to submit a spark application
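A sketch of the implicit-conversion approach (countEvery is a hypothetical method, standing in for whatever was being added to RDD.scala directly):

```scala
import org.apache.spark.rdd.RDD

// An implicit class adds methods to RDD without touching Spark's sources.
object RddExtensions {
  implicit class RichRdd[T](val rdd: RDD[T]) extends AnyVal {
    // Hypothetical custom method: count every n-th element.
    def countEvery(n: Int): Long =
      rdd.zipWithIndex().filter { case (_, i) => i % n == 0 }.count()
  }
}

// Usage: import RddExtensions._; then myRdd.countEvery(10)
```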

Re: Spark or Storm

2015-06-16 Thread Will Briggs
The programming models for the two frameworks are conceptually rather different; I haven't worked with Storm for quite some time, but based on my old experience with it, I would equate Spark Streaming more with Storm's Trident API, rather than with the raw Bolt API. Even then, there are signific
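To make the micro-batch point concrete, a minimal Spark Streaming sketch (the socket source and batch interval are arbitrary choices): each interval is processed as one small RDD, which is much closer to Trident's batch model than to Storm's per-tuple Bolts.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-demo")
    // Every 5-second window becomes one small RDD (a micro-batch).
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // assumed source
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```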

Re: Submitting Spark Applications using Spark Submit

2015-06-16 Thread Will Briggs
duce(). A member on here suggested I make the change in RDD.scala to accomplish that. Also, this is for a research project, and not for commercial use. So, any advice on how I can get the spark submit to use my custom built jars would be very useful. Thanks, Raghav > On Jun 16, 2015, at 6:57 PM,

Re: Using Accumulators in Streaming

2015-06-21 Thread Will Briggs
It sounds like accumulators are not necessary in Spark Streaming - see this post (http://apache-spark-user-list.1001560.n3.nabble.com/Shared-variable-in-Spark-Streaming-td11762.html) for more details. On June 21, 2015, at 7:31 PM, anshu shukla wrote: In Spark Streaming, since we are already

Re: Kryo fails to serialise output

2015-07-03 Thread Will Briggs
Kryo serialization is used internally by Spark for spilling or shuffling intermediate results, not for writing out an RDD as an action. Look at Sandy Ryza's examples for some hints on how to do this: https://github.com/sryza/simplesparkavroapp Regards, Will On July 3, 2015, at 2:45 AM, Dominik
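A small sketch of where Kryo actually applies (paths are placeholders): the spark.serializer setting governs internal shuffle and spill serialization, while the on-disk format of your output is determined by the action you call, not by Kryo.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KryoScope {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-scope")
      // Affects internal shuffle/spill serialization only; it does not
      // change the on-disk format produced by output actions.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(Seq("a", "b"))
    rdd.saveAsTextFile("hdfs:///out/text") // format chosen by the action itself
  }
}
```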

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Will Briggs
That code doesn't appear to be registering classes with Kryo, which means the fully-qualified classname is stored with every Kryo record. The Spark documentation has more on this: https://spark.apache.org/docs/latest/tuning.html#data-serialization Regards, Will On July 5, 2015, at 2:31 AM, Gav
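A sketch of the registration step the tuning guide describes (the Record case class is a hypothetical stand-in for the job's actual data types):

```scala
import org.apache.spark.SparkConf

case class Record(key: String, value: Long) // hypothetical record type

object KryoRegistration {
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Unregistered classes are written with their full class name in every
    // record; registered classes are encoded as compact integer ids instead.
    .registerKryoClasses(Array(classOf[Record], classOf[Array[Record]]))
}
```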