Open source project: Example Spark project using Parquet as a columnar store with Thrift objects.

2014-08-13 Thread bdamos
Hi Spark community, We're excited about Spark at Adobe Research and have just open sourced an example project writing and reading Thrift objects to Parquet with Spark. The project is on GitHub, and we're happy for any feedback: https://github.com/adobe-research/spark-parquet-thrift-example Regar

Open source project: Deploy Spark to a cluster with Puppet and Fabric.

2014-08-13 Thread bdamos
Hi Spark community, We're excited about Spark at Adobe Research and have just open sourced a project we use to automatically provision a Spark cluster and submit applications. The project is on GitHub, and we're happy for any feedback from the community: https://github.com/adobe-research/spark-clu

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
Sean Owen-2 wrote:
> Can you not just filter the range you want, then groupBy timestamp/86400? That sounds like your solution 1 and is about as fast as it gets, I think. Are you thinking you would have to filter out each day individually from there, and that's why it would be slow? I don't
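
The filter-then-group idea quoted above can be sketched without a cluster. This is a minimal illustration in plain Python, not Spark code: the list of `(timestamp, payload)` tuples stands in for the RDD, and the function name `group_by_day`, the sample records, and the half-open interval `[start, end)` are assumptions made for the example. Integer division by 86400 gives the number of whole days since the Unix epoch, so the buckets follow UTC day boundaries.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def group_by_day(records, start, end):
    """Keep records with start <= ts < end, then bucket them by ts // 86400.

    `records` is an iterable of (unix_timestamp, payload) pairs; this shape
    is hypothetical and stands in for the elements of the RDD.
    """
    days = defaultdict(list)
    for ts, payload in records:
        if start <= ts < end:  # the "filter the range you want" step
            days[ts // SECONDS_PER_DAY].append((ts, payload))  # the "groupBy" step
    return dict(days)


# Hypothetical sample data spanning four UTC days.
records = [(0, "a"), (86399, "b"), (86400, "c"), (172800, "d"), (259200, "e")]
by_day = group_by_day(records, start=0, end=259200)
```

Each record is visited once and nothing is collected and re-parallelized per day, which is the property the suggestion relies on. In Spark the two steps would presumably map to a filter followed by a group-by on the same key.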

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
ssimanta wrote:
>> Solution 2 is to map the objects into a pair RDD where the key is the number of the day in the interval, then group by key, collect, and parallelize the resulting grouped data. However, I worry collecting large data sets is going to be a serious performance bottleneck.

How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
Hi, I have an RDD that represents data over a time interval, and I want to select some subinterval of my data and partition it by day based on a Unix time field in the data. What is the best way to do this with Spark? I have currently implemented 2 solutions, both of which seem suboptimal. Solution 1

Re: NullPointerExceptions when using val or broadcast on a standalone cluster.

2014-06-17 Thread bdamos
Hi, I think this is a bug in Spark, because changing my program to use a main method instead of the App trait fixes this problem. I've filed this as SPARK-2175; apologies if this turns out to be a duplicate. https://issues.apache.org/jira/browse/SPARK-2175 Regards, Brandon.

NullPointerExceptions when using val or broadcast on a standalone cluster.

2014-06-12 Thread bdamos
Hi, I'm consistently getting NullPointerExceptions when trying to use String val objects defined in my main application -- even for broadcast vals! I'm deploying on a standalone cluster with a master and 4 workers on the same machine, which is not the machine I'm submitting from. The following exa