Re: Confusing RDD function

2016-03-08 Thread Hemminger Jeff
…transformation and thus is not actually applied until some action (like 'foreach') is called on the resulting RDD. You can find more information in the Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations. best, --
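The reply's point — that a map is only declared, not run, until an action forces it — can be illustrated without Spark at all, since Scala collection views behave the same way. This is a plain-Scala analogy, not Spark code; the function and variable names are mine:

```scala
object LazyDemo {
  // A view, like an RDD transformation, records the map without running it;
  // work happens only when an "action" (here, toList) forces evaluation.
  // Returns (calls before forcing, forced result, calls after forcing).
  def lazyDouble(xs: Seq[Int]): (Int, Seq[Int], Int) = {
    var calls = 0
    val v = xs.view.map { n => calls += 1; n * 2 } // transformation only declared
    val before = calls                             // still 0: nothing has run
    val forced = v.toList                          // the "action" runs the function
    (before, forced, calls)
  }

  def main(args: Array[String]): Unit =
    println(lazyDouble(List(1, 2, 3)))             // (0,List(2, 4, 6),3)
}
```

The `before` count of 0 is the whole point: as with an RDD, declaring the map costs nothing until something consumes the result.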

Confusing RDD function

2016-03-08 Thread Hemminger Jeff
I'm currently developing a Spark Streaming application. I have a function that receives an RDD and an object instance as a parameter, and returns an RDD: def doTheThing(a: RDD[A], b: B): RDD[C]. Within the function, I do some processing within a map of the RDD. Like this: def doTheThing(a: RD…
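The shape of the function in this question can be sketched in plain Scala, with Seq standing in for RDD and hypothetical A, B, C types (the snippet is truncated, so the body here is my guess at the general pattern, not the author's actual code):

```scala
object DoTheThingSketch {
  // Hypothetical stand-ins for the A, B, C types in the thread.
  final case class A(raw: String)
  final case class B(prefix: String)
  final case class C(value: String)

  // Same signature shape as the thread's doTheThing(a: RDD[A], b: B): RDD[C],
  // with Seq standing in for RDD. Note the map's closure captures `b`; in
  // real Spark that closure is shipped to executors, so `b` would need to
  // be serializable.
  def doTheThing(a: Seq[A], b: B): Seq[C] =
    a.map(x => C(b.prefix + x.raw))

  def main(args: Array[String]): Unit =
    println(doTheThing(Seq(A("x"), A("y")), B("p-")))
}
```

As the reply in this thread notes, in Spark the map inside such a function is lazy: doTheThing returns immediately, and the per-element work only runs when an action is called on the returned RDD.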

Re: String operation in filter with a special character

2015-10-05 Thread Hemminger Jeff
…names. On Mon, Oct 5, 2015 at 12:59 AM, Hemminger Jeff wrote: I have a rather odd use case. I have a DataFrame column name with a + value in it. The app performs some processing steps before determining the column name, and it would be…

Re: spark-ec2 config files.

2015-10-05 Thread Hemminger Jeff
The spark-ec2 script generates Spark config files from templates. Those are located here: https://github.com/amplab/spark-ec2/tree/branch-1.5/templates/root/spark/conf Note the link is referring to the 1.5 branch. Is this what you are looking for? Jeff On Mon, Oct 5, 2015 at 8:56 AM, Renato Perini…

String operation in filter with a special character

2015-10-04 Thread Hemminger Jeff
I have a rather odd use case. I have a DataFrame column name with a + value in it. The app performs some processing steps before determining the column name, and it would be much easier to code if I could use the DataFrame filter operations with a String. This demonstrates the issue I am having: …
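For a column name containing an operator character like +, Spark SQL expression strings accept the name if it is wrapped in backticks (e.g. filter("`a+b` > 0")), so the + is read as part of one identifier rather than as addition. A small plain-Scala helper sketching that quoting rule — the helper name is mine, and the doubling of embedded backticks follows Spark SQL's quoted-identifier convention as I understand it:

```scala
object ColQuote {
  // Wrap a column name in backticks so a Spark SQL expression string
  // treats it as a single identifier (e.g. "`a+b` > 0" instead of
  // "a+b > 0", which would parse '+' as an operator). Backticks inside
  // the name are escaped by doubling.
  def quote(name: String): String =
    "`" + name.replace("`", "``") + "`"

  def main(args: Array[String]): Unit =
    println(quote("a+b") + " > 0")  // `a+b` > 0
}
```

The alternative is to avoid expression strings entirely and use the Column API (col("a+b") > 0), which takes the name literally.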

What happens when cache is full?

2015-09-12 Thread Hemminger Jeff
I am trying to understand the process of caching, and specifically what the behavior is when the cache is full. Please excuse me if this question is a little vague; I am trying to build my understanding of this process. I have an RDD that I perform several computations with; I persist it with IN_ME…
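For the question above: with a memory-only storage level, when the cache fills Spark drops cached partitions (in roughly least-recently-used order) and recomputes them from lineage when they are next needed. A toy plain-Scala LRU cache to illustrate just the eviction idea — this is an analogy, not Spark's actual block manager; names and capacity are mine:

```scala
import scala.collection.mutable

object LruSketch {
  // Toy LRU cache: when capacity is exceeded, the least-recently-used
  // entry is dropped. In Spark, a dropped cached partition is not lost;
  // it is recomputed from its lineage the next time it is needed.
  final class Lru[K, V](capacity: Int) {
    private val entries = mutable.LinkedHashMap.empty[K, V]
    def get(k: K): Option[V] = entries.remove(k).map { v =>
      entries.put(k, v); v          // re-insert to mark as most recent
    }
    def put(k: K, v: V): Unit = {
      entries.remove(k)
      entries.put(k, v)
      if (entries.size > capacity) entries.remove(entries.head._1)
    }
    def keys: List[K] = entries.keys.toList
  }

  def main(args: Array[String]): Unit = {
    val c = new Lru[String, Int](2)
    c.put("p0", 0); c.put("p1", 1)
    c.get("p0")                     // touch p0, so p1 is now least recent
    c.put("p2", 2)                  // evicts p1
    println(c.keys)                 // List(p0, p2)
  }
}
```

A storage level that also uses disk (rather than memory only) changes the behavior: partitions that do not fit are spilled to disk instead of being dropped and recomputed.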

Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Hemminger Jeff
…On Fri, Aug 28, 2015 at 12:44 PM, Jason wrote: You could try using an external key value store (like HBase, Redis) and perform lookups/updates inside of your mappers (you'd need to create the connection wit…
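The quoted advice — query an external key-value store from inside the mappers, creating the connection once per partition rather than once per record (in Spark, via mapPartitions) — can be sketched in plain Scala. Here `grouped` stands in for Spark partitions and FakeStore stands in for HBase/Redis; all names and types are mine:

```scala
object PartitionLookupSketch {
  // Fake key-value "store"; in the thread this would be HBase or Redis.
  // `opens` counts connection setups so we can see the amortization.
  final class FakeStore {
    var opens = 0
    private val table = Map("a" -> 1, "b" -> 2)
    def open(): this.type = { opens += 1; this }
    def lookup(k: String): Option[Int] = table.get(k)
  }

  // mapPartitions-style pattern: one connection per partition, not per
  // record, so connection-setup cost is paid once per partition while
  // every record in that partition reuses the same connection.
  def enrich(keys: Seq[String], partitionSize: Int,
             store: FakeStore): Seq[(String, Option[Int])] =
    keys.grouped(partitionSize).flatMap { partition =>
      val conn = store.open()                  // once per partition
      partition.map(k => k -> conn.lookup(k))  // many lookups per connection
    }.toList

  def main(args: Array[String]): Unit = {
    val store = new FakeStore
    println(enrich(Seq("a", "b", "a", "c"), partitionSize = 2, store))
    println(store.opens)  // 2: one connection per partition
  }
}
```

Compared with re-broadcasting a ~3G table, this trades broadcast cost for per-lookup network latency, so batching lookups within each partition matters in practice.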

Alternative to Large Broadcast Variables

2015-08-28 Thread Hemminger Jeff
Hi, I am working on a Spark application that is using a large (~3G) broadcast variable as a lookup table. The application refines the data in this lookup table in an iterative manner. So this large variable is broadcast many times during the lifetime of the application process. From what I ha…