SparkSQL Dataframe : partitionColumn, lowerBound, upperBound, numPartitions in context of reading from MySQL

2016-03-30 Thread Soumya Simanta
I'm trying to understand what the following configurations mean and their implication on reading data from a MySQL table. I'm looking for options that will impact my read throughput when reading data from a large table. Thanks. partitionColumn, lowerBound, upperBound, numPartitions These op
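
For reference, a minimal sketch of how these four options are typically passed to the JDBC reader (the URL, table, column name, and bounds below are made up; Spark issues numPartitions parallel queries, each covering a range of partitionColumn between lowerBound and upperBound):

  // hypothetical MySQL table with a numeric primary key "id"
  val df = sqlContext.read.format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/mydb")
    .option("dbtable", "large_table")
    .option("partitionColumn", "id")    // must be a numeric column
    .option("lowerBound", "1")          // together with upperBound, only defines the
    .option("upperBound", "1000000")    // partition stride, not a filter on the data
    .option("numPartitions", "16")      // number of parallel JDBC connections/reads
    .load()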

Re: Yarn client mode: Setting environment variables

2016-02-17 Thread Soumya Simanta
Can you give some examples of what variables you are trying to set? On Thu, Feb 18, 2016 at 1:01 AM, Lin Zhao wrote: > I've been trying to set some environment variables for the spark executors > but haven't had much luck. I tried editing conf/spark-env.sh but it > doesn't get through to the

Re: Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
On Thu, Jan 29, 2015 at 11:35 AM, Soumya Simanta wrote: >> On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote: >>> Does the error change on build with and without the built options?

Re: Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda < ar...@sigmoidanalytics.com> wrote: > Does the error change on build with and without the built options? > What do you mean by build options? I'm just doing ./sbt/sbt assembly from $SPARK_HOME > Did you try using maven? and doing the proxy sett

Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using ./sbt/sbt assembly and I get the following error. I've set the http and https proxy as well as JAVA_OPTS. Any idea what I'm missing? [warn] one warning found org.apache.maven.model.building.ModelBuildingException: 1 problem

Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Soumya Simanta
I'm deploying Spark using the "Click to Deploy" Hadoop -> "Install Apache Spark" option on Google Compute Engine. I can run Spark jobs on the REPL and read data from Google storage. However, I'm not sure how to access the Spark UI in this deployment. Can anyone help? Also, it deploys Spark 1.1. Is there

Re: Sharing sqlContext between Akka router and "routee" actors ...

2014-12-18 Thread Soumya Simanta
Why do you need a router? I mean, can't you do it with just one actor that has the SQLContext inside it? On Thu, Dec 18, 2014 at 9:45 PM, Manoj Samel wrote: > Hi, > > Akka router creates a sqlContext and creates a bunch of "routees" actors > with sqlContext as parameter. The actors then execute q

Re: Trying to understand a basic difference between these two configurations

2014-12-05 Thread Soumya Simanta
lse. > > On Fri, Dec 5, 2014 at 7:31 PM, Soumya Simanta > wrote: > > I'm trying to understand the conceptual difference between these two > > configurations in term of performance (using Spark standalone cluster) > > > > Case 1: > > > > 1 Node >

Trying to understand a basic difference between these two configurations

2014-12-05 Thread Soumya Simanta
I'm trying to understand the conceptual difference between these two configurations in terms of performance (using a Spark standalone cluster). Case 1: 1 Node 60 cores 240G of memory 50G of data on local file system Case 2: 6 Nodes 10 cores per node 40G of memory per node 50G of data on HDFS nodes

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Soumya Simanta
Thanks Sean. adding user@spark.apache.org again. On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen wrote: > On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta > wrote: > > Is the MapReduce API "simpler" or the implementation? Almost, every Spark > > presentation has a sl

Re: MongoDB Bulk Inserts

2014-11-21 Thread Soumya Simanta
tFile(inputFile) > .map(parser.parse) > .mapPartitions(bulkLoad) > > But the Iterator[T] of mapPartitions is always empty, even though I know > map is generating records. > > > On Thu Nov 20 2014 at 9:25:54 PM Soumya Simanta > wrote: > >&g

Re: MongoDB Bulk Inserts

2014-11-20 Thread Soumya Simanta
On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson wrote: > I'm trying to use MongoDB as a destination for an ETL I'm writing in > Spark. It appears I'm gaining a lot of overhead in my system databases > (and possibly in the primary documents themselves); I can only assume it's > because I'm left

Parsing a large XML file using Spark

2014-11-18 Thread Soumya Simanta
If there is one big XML file (e.g., the 44GB Wikipedia dump, or the larger dump that has all revision information as well) stored in HDFS, is it possible to parse it in parallel/faster using Spark? Or do we have to use something like a PullParser or Iteratee? My current solution is to read the single X

Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
all the > in-depth tuning tricks of all products. However, realistically, there is a > big gap in terms of documentation. Hope the Spark folks will make a > difference. :-) > > Du > > > From: Soumya Simanta > Date: Friday, October 31, 2014 at 4:04 PM > To: "user@s

SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL "not completing" http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required.

Re: sbt/sbt compile error [FATAL]

2014-10-29 Thread Soumya Simanta
Are you trying to compile the master branch ? Can you try branch-1.1 ? On Wed, Oct 29, 2014 at 6:55 AM, HansPeterS wrote: > Hi, > > I have cloned sparked as: > git clone g...@github.com:apache/spark.git > cd spark > sbt/sbt compile > > Apparently http://repo.maven.apache.org/maven2 is no longer

Re: run multiple spark applications in parallel

2014-10-28 Thread Soumya Simanta
ster --driver-memory > 1g --executor-memory 1g --executor-cores 1 UBER.JAR > ${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1 > > > The box has > > 24 CPUs, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz > > 32 GB RAM > > > Thanks, > > Josh > > On Tue, Oct

Re: run multiple spark applications in parallel

2014-10-28 Thread Soumya Simanta
Try reducing the resources (cores and memory) of each application. > On Oct 28, 2014, at 7:05 PM, Josh J wrote: > > Hi, > > How do I run multiple spark applications in parallel? I tried to run on yarn > cluster, though the second application submitted does not run. > > Thanks, > Josh

Re: install sbt

2014-10-28 Thread Soumya Simanta
sbt is just a jar file. So you really don't need to install anything. Once you run the jar file (sbt-launch.jar) it can download the required dependencies. I use an executable script called sbt that has the following contents. SBT_OPTS="-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled -

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-27 Thread Soumya Simanta
You need to change the Scala compiler from IntelliJ to “sbt incremental compiler” (see the screenshot below). You can access this by going to “preferences” -> “scala”. NOTE: This is supported only for certain versions of the IntelliJ Scala plugin. See this link for details. http://blog.jetbrains.c

Re: Spark as Relational Database

2014-10-26 Thread Soumya Simanta
ng >>> queries with SQL. >>> >>> I create the input to MLib by doing a massive JOIN query. So, I am >>> creating a single collection by combining many collections. This sort of >>> operation is very inefficient in Mongo, Cassandra or HDFS. >>> >&

Re: Spark as Relational Database

2014-10-26 Thread Soumya Simanta
n is very inefficient in Mongo, Cassandra or HDFS. >> >> I could store my data in a relational database, and copy the query >> results to Spark for processing. However, I was hoping I could keep >> everything in Spark. >> >> On Sat, Oct 25, 2014 at 11:34 PM, Soumy

Re: Spark as Relational Database

2014-10-25 Thread Soumya Simanta
1. What data store do you want to store your data in ? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at 1
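
If the second option (process in Spark, then store the result in a relational database) is chosen, a rough sketch could look like the following (the JDBC URL, table, and the shape of processedRDD are assumptions, not from the thread):

  // write each partition over its own JDBC connection; assumes the driver jar is on the classpath
  processedRDD.foreachPartition { rows =>
    val conn = java.sql.DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass")
    val stmt = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
    rows.foreach { case (k, v) =>
      stmt.setString(1, k)
      stmt.setInt(2, v)
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.close()
  }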

Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode

2014-10-20 Thread Soumya Simanta
I'm working on a cluster where I need to start the workers separately and connect them to a master. I'm following the instructions here and using branch-1.1 http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually and I can start the master using ./sbin/start-master.sh

Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Soumya Simanta
I have a SchemaRDD that I want to convert to an RDD that contains Strings. How do I convert the Row inside the SchemaRDD to a String?
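
One way to do this (a sketch, assuming schemaRDD is the SchemaRDD in question; in this Spark version a Row behaves like a Seq of its column values):

  // join the column values of each Row into one delimited String
  val strings: org.apache.spark.rdd.RDD[String] = schemaRDD.map(row => row.mkString(","))
  // or pick out individual columns, e.g. schemaRDD.map(row => row(0).toString)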

Storing shuffle files on a Tachyon

2014-10-07 Thread Soumya Simanta
Is it possible to store spark shuffle files on Tachyon ?

Creating a feature vector from text before using with MLLib

2014-10-01 Thread Soumya Simanta
I'm trying to understand the intuition behind the features method that Aaron used in one of his demos. I believe this feature will just work for detecting the character set (i.e., language used). Can someone help ? def featurize(s: String): Vector = { val n = 1000 val result = new Array[Doub
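
A hedged guess at how the truncated featurize continues: hash character bigrams into a fixed-size array (the bucket count of 1000 comes from the snippet above; everything else is an assumption):

  import org.apache.spark.mllib.linalg.{Vector, Vectors}

  def featurize(s: String): Vector = {
    val n = 1000
    val result = new Array[Double](n)
    val bigrams = s.sliding(2).toArray
    for (b <- bigrams) {
      val i = ((b.hashCode % n) + n) % n              // non-negative bucket index
      result(i) += 1.0 / math.max(bigrams.length, 1)  // normalized bigram frequency
    }
    Vectors.dense(result)
  }

Because the features are just hashed character n-gram frequencies, this works for character-set/language detection but carries no word-level semantics.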

Re: Accumulator Immutability?

2014-09-22 Thread Soumya Simanta
I don't know the exact implementation of accumulator. You can look at the sources. But for Scala look at the following REPL session. scala> val al = new ArrayList[String]() al: java.util.ArrayList[String] = [] scala> al.add("a") res1: Boolean = true scala> al res2: java.util.ArrayList[Strin

Setting serializer to KryoSerializer from command line for spark-shell

2014-09-20 Thread Soumya Simanta
Hi, I want to set the serializer for my spark-shell to Kryo, i.e., set spark.serializer to org.apache.spark.serializer.KryoSerializer. Can I do it without setting a new SparkConf? Thanks -Soumya
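
For reference, a sketch of the usual options (whether the command-line form is available depends on the Spark version; spark-defaults.conf and --conf both avoid touching code):

  // programmatic form, if you do end up building the context yourself;
  // the same property can instead go into conf/spark-defaults.conf or be passed as
  // --conf spark.serializer=org.apache.spark.serializer.KryoSerializer to spark-shell
  val conf = new org.apache.spark.SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")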

Re: Problem with giving memory to executors on YARN

2014-09-20 Thread Soumya Simanta
in your boxes? > looks like you are assigning 32 cores "per" executor - is that what you > want? are there other applications running on the cluster? you might want > to check YARN UI to see how many containers are getting allocated to your > application. > > > On Sep 1

Problem with giving memory to executors on YARN

2014-09-19 Thread Soumya Simanta
I'm launching a Spark shell with the following parameters ./spark-shell --master yarn-client --executor-memory 32g --driver-memory 4g --executor-cores 32 --num-executors 8 but when I look at the Spark UI it shows only 209.3 GB total memory. Executors (10) - *Memory:* 55.9 GB Used (209.3 GB

Re: rsync problem

2014-09-19 Thread Soumya Simanta
One possible reason may be that the checkpointing directory $SPARK_HOME/work is rsynced as well. Try emptying the contents of the work folder on each node and try again. On Fri, Sep 19, 2014 at 4:53 AM, rapelly kartheek wrote: > I > * followed this command:rsync -avL --progress path/to/spark

Re: Spark as a Library

2014-09-16 Thread Soumya Simanta
It depends on what you want to do with Spark. The following has worked for me. Let the container handle the HTTP request and then talk to Spark using another HTTP/REST interface. You can use the Spark Job Server for this. Embedding Spark inside the container is not a great long term solution IMO b

Re: About SpakSQL OR MLlib

2014-09-15 Thread Soumya Simanta
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int, fuel:Int) 1. Create a PairedRDD of (age,Car) tuples (pairedRDD) 2. Create a new function fc //returns the interval lower and upper bound def fc(x:Int, interval:Int) : (Int,Int) = { val floor = x - (x%interval)
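
A hedged completion of the truncated sketch above (carsRDD and the 5-year interval are assumptions):

  // returns the interval lower and upper bound for a value x
  def fc(x: Int, interval: Int): (Int, Int) = {
    val floor = x - (x % interval)
    (floor, floor + interval)
  }

  // key each Car by its age bucket, then aggregate per bucket, e.g. count cars per bucket
  val byAgeBucket = carsRDD.map(c => (fc(c.age, 5), c))
  byAgeBucket.mapValues(_ => 1).reduceByKey(_ + _).collect().foreach(println)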

Re: Spark and Scala

2014-09-12 Thread Soumya Simanta
An RDD is a fault-tolerant distributed structure. It is the primary abstraction in Spark. I would strongly suggest that you have a look at the following to get a basic idea. http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html http://spark.apache.org/docs/latest/quick-start.htm

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-17 Thread Soumya Simanta
Brandon, Thanks for sharing this. Looks very promising. The project mentions - "process petabytes of data in real-time". I'm curious to know if the architecture implemented in the Github repo was used to process petabytes? If yes, how many nodes did you use for this and did you use Spark standalo

Re: Running Spark shell on YARN

2014-08-16 Thread Soumya Simanta
arse(URI.java:3038) at java.net.URI.<init>(URI.java:753) at org.apache.hadoop.fs.Path.initialize(Path.java:203) ... 62 more Spark context available as sc. On Fri, Aug 15, 2014 at 3:49 PM, Soumya Simanta wrote: > After changing the allocation I'm getting the following in my logs. No > idea what this means. > >

Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
ug 15, 2014 at 2:47 PM, Sandy Ryza wrote: > We generally recommend setting yarn.scheduler.maximum-allocation-mbto the > maximum node capacity. > > -Sandy > > > On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta > wrote: > >> I just checked the YARN config and loo

Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I just checked the YARN config and it looks like I need to change this value. Should it be upgraded to 48G (the max memory allocated to YARN) per node? yarn.scheduler.maximum-allocation-mb 6144 java.io.BufferedInputStream@2e7e1ee On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta wrote: > And

Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I've been using the standalone cluster all this time and it worked fine. Recently I've been using another Spark cluster that is based on YARN and I have no experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm

Script to deploy spark to Google compute engine

2014-08-13 Thread Soumya Simanta
Before I start doing something on my own I wanted to check if someone has created a script to deploy the latest version of Spark to Google Compute Engine. Thanks -Soumya

Re: Transform RDD[List]

2014-08-11 Thread Soumya Simanta
Try something like this. scala> val a = sc.parallelize(List(1,2,3,4,5)) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :12 scala> val b = sc.parallelize(List(6,7,8,9,10)) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at :12 scala>

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
> Barring the statements to create the spark context, if I copy paste the lines > of my code in spark shell, runs perfectly giving the desired output. > > ~Sarath > >> On Wed, Jul 16, 2014 at 7:48 PM, Soumya Simanta >> wrote: >> When you submit your job, it sho

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
t; wrong, all are info messages. > > What else do I need check? > > ~Sarath > > On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta > wrote: > >> Check your executor logs for the output or if your data is not big >> collect it in the driver and print it. >>

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
Check your executor logs for the output or if your data is not big collect it in the driver and print it. > On Jul 16, 2014, at 9:21 AM, Sarath Chandra > wrote: > > Hi All, > > I'm trying to do a simple record matching between 2 files and wrote following > code - > > import org.apache.sp
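
A rough sketch of the record matching itself, done with a plain join (paths, delimiter, and key position are assumptions, not Sarath's actual code):

  val fileA = sc.textFile("hdfs:///path/fileA").map(_.split(",")).map(r => (r(0), r))
  val fileB = sc.textFile("hdfs:///path/fileB").map(_.split(",")).map(r => (r(0), r))
  val matches = fileA.join(fileB)          // records whose key appears in both files
  matches.take(10).foreach(println)        // or collect() in the driver if the result is small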

Re: Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-14 Thread Soumya Simanta
Please look at the following. https://github.com/ooyala/spark-jobserver http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language https://github.com/EsotericSoftware/kryo You can train your model, convert it to PMML, and return that to your client, OR you can train your model and write that mod

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Soumya Simanta
Try running a simple standalone program if you are using Scala and see if you are getting any data. I use this to debug any connection/twitter4j issues. import twitter4j._ //put your keys and creds here object Util { val config = new twitter4j.conf.ConfigurationBuilder() .setOAuthConsumer
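
A hedged completion of the truncated snippet (keys are placeholders; the original was likely used with TwitterStream, but the same config can be checked against the REST API first):

  import scala.collection.JavaConverters._
  import twitter4j.TwitterFactory
  import twitter4j.conf.ConfigurationBuilder

  object Util {
    // put your keys and creds here
    val config = new ConfigurationBuilder()
      .setOAuthConsumerKey("<consumer-key>")
      .setOAuthConsumerSecret("<consumer-secret>")
      .setOAuthAccessToken("<access-token>")
      .setOAuthAccessTokenSecret("<access-token-secret>")
      .build()

    def main(args: Array[String]): Unit = {
      val twitter = new TwitterFactory(config).getInstance()
      // if credentials and connectivity are fine, this prints a few tweets
      twitter.getHomeTimeline().asScala.take(5).foreach(s => println(s.getText))
    }
  }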

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Soumya Simanta
Do you have a proxy server ? If yes you need to set the proxy for twitter4j > On Jul 11, 2014, at 7:06 PM, SK wrote: > > I dont get any exceptions or error messages. > > I tried it both with and without VPN and had the same outcome. But I can > try again without VPN later today and report b

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
> I think my best option is to partition my data in directories by day > before running my Spark application, and then direct > my Spark application to load RDD's from each directory when > I want to load a date range. How does this sound? > > If your upstream system can write data by day then it m

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
If you are on 1.0.0 release you can also try converting your RDD to a SchemaRDD and run a groupBy there. The SparkSQL optimizer "may" yield better results. It's worth a try at least. On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta wrote: > > > > >> >>

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
> > Solution 2 is to map the objects into a pair RDD where the > key is the number of the day in the interval, then group by > key, collect, and parallelize the resulting grouped data. > However, I worry collecting large data sets is going to be > a serious performance bottleneck. > > Why do you ha
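
A sketch of that approach without the collect step, keeping everything distributed (eventsRDD, timestampMs, and intervalStartMs are assumptions):

  val msPerDay = 24L * 60 * 60 * 1000
  // key each record by its day index relative to the start of the interval
  val byDay = eventsRDD.map(e => ((e.timestampMs - intervalStartMs) / msPerDay, e))
  val grouped = byDay.groupByKey()                        // stays on the cluster
  grouped.mapValues(_.size).collect().foreach(println)    // only per-day counts reach the driver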

Re: Comparative study

2014-07-07 Thread Soumya Simanta
Daniel, Do you mind sharing the size of your cluster and the production data volumes ? Thanks Soumya > On Jul 7, 2014, at 3:39 PM, Daniel Siegmann wrote: > > From a development perspective, I vastly prefer Spark to MapReduce. The > MapReduce API is very constrained; Spark's API feels muc

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
On Jul 1, 2014, at 7:47 PM, Marco Shaw wrote: >> >> They are recorded... For example, 2013: http://spark-summit.org/2013 >> >> I'm assuming the 2014 videos will be up in 1-2 weeks. >> >> Marco >> >> >>> On Tue, Jul 1, 2014 at 3:1

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Are these sessions recorded ? On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos wrote: > > > > > > > *General Session / Keynotes : > http://www.ustream.tv/channel/spark-summit-2014 > Track A > : http://www.ustream.tv/channel/track-a1 >

Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
that? > > > On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta > wrote: > >> >> You can add a back pressured enabled component in front that feeds data >> into Spark. This component can control in input rate to spark. >> >> > On Jun 18, 2014, at 6:

Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
You can add a back-pressure-enabled component in front that feeds data into Spark. This component can control the input rate into Spark. > On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier wrote: > > Hi to all, > in my use case I'd like to receive events and call an external service as > they pa

Re: Unable to run a Standalone job

2014-05-22 Thread Soumya Simanta
Try cleaning your maven (.m2) and ivy cache. > On May 23, 2014, at 12:03 AM, Shrikar archak wrote: > > Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1. > > Thanks, > Shrikar > > >> On Thu, May 22, 2014 at 8:53 PM, Tathagata Das >> wrote: >> How are you getting Spark with 1.

Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Soumya Simanta
Suggestion - try to get an idea of your hardware requirements by running a sample on Amazon's EC2 or Google compute engine. It's relatively easy (and cheap) to get started on the cloud before you invest in your own hardware IMO. On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar wrote: > Hi, >

Re: Historical Data as Stream

2014-05-17 Thread Soumya Simanta
@Laeeq - please see this example. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala#L47-L49 On Sat, May 17, 2014 at 2:06 PM, Laeeq Ahmed wrote: > @Soumya Simanta > > Right now it's just a proof of concept. Lat
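
Roughly what the linked example does: watch an HDFS directory with textFileStream, so historical files dropped into it are replayed as a stream (the path and batch interval here are assumptions):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val sparkConf = new SparkConf().setAppName("HistoricalDataAsStream")
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  val lines = ssc.textFileStream("hdfs:///path/to/incoming")
  lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
  ssc.start()
  ssc.awaitTermination()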

Re: Historical Data as Stream

2014-05-16 Thread Soumya Simanta
A file is just a stream with a fixed length. Usually streams don't end, but in this case it would. On the other hand, if you read your file as a stream you may not be able to use the entire data in the file for your analysis. Spark (given enough memory) can process large amounts of data quickly. > On M

Re: Proper way to create standalone app with custom Spark version

2014-05-16 Thread Soumya Simanta
Install your custom spark jar to your local maven or ivy repo. Use this custom jar in your pom/sbt file. > On May 15, 2014, at 3:28 AM, Andrei wrote: > > (Sorry if you have already seen this message - it seems like there were some > issues delivering messages to the list yesterday) > > We
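
A minimal sketch of the sbt side, assuming the custom build was published locally under a distinguishing version string (the version shown is made up):

  // build.sbt
  // sbt's local Ivy repository is searched by default after "sbt publish-local";
  // Resolver.mavenLocal is only needed if the jar went into the local Maven repo instead
  resolvers += Resolver.mavenLocal
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-custom-SNAPSHOT"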

Stable Hadoop version supported ?

2014-05-15 Thread Soumya Simanta
Currently I'm using HDFS version hadoop0.20.2-cdh3u6 with Spark 0.9.1. I want to upgrade to Spark 1.0.0 soon and would also like to upgrade my HDFS version as well. What's the recommended version of HDFS to use with Spark 1.0.0? I don't know much about YARN but I would just like to use the Spark sta

Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-15 Thread Soumya Simanta
I've a Spark cluster with 3 worker nodes. - Workers: 3 - Cores: 48 Total, 48 Used - Memory: 469.8 GB Total, 72.0 GB Used I want to process a single compressed file (*.gz) on HDFS. The file is 1.5GB compressed and 11GB uncompressed. When I try to read the compressed file from HDFS i

Re: How to use spark-submit

2014-05-11 Thread Soumya Simanta
--arg >>> ${spark.master} >>> --arg >>> ${my app arg 1} >>> --arg >>> ${my arg 2} >>> >>> >>&g

Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
cks to > be read and for subsequent processing. > On 11 May 2014 09:01, "Soumya Simanta" wrote: > >> >> >> I've a Spark cluster with 3 worker nodes. >> >> >>- *Workers:* 3 >>- *Cores:* 48 Total, 48 Used >>- *Memory:

Is there a way to load a large file from HDFS faster into Spark

2014-05-10 Thread Soumya Simanta
I've a Spark cluster with 3 worker nodes. - Workers: 3 - Cores: 48 Total, 48 Used - Memory: 469.8 GB Total, 72.0 GB Used I want to process a single compressed file (*.gz) on HDFS. The file is 1.5GB compressed and 11GB uncompressed. When I try to read the compressed file from HDFS i
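
One relevant detail: a gzipped file is not splittable, so it initially loads as a single partition on one core. A common workaround is to repartition right after reading (the partition count below just matches the cluster's 48 cores):

  val lines = sc.textFile("hdfs:///path/to/file.gz").repartition(48)
  println(lines.count())   // the read itself is still single-threaded, but later stages use all cores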

Re: How to use spark-submit

2014-05-05 Thread Soumya Simanta
Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks > On May 5, 2014, at 10:24 PM, Stephen Boesch wrote: > > > I ha

Problem with sharing class across worker nodes using spark-shell on Spark 1.0.0

2014-05-05 Thread Soumya Simanta
Hi, I'm trying to run a simple Spark job that uses a 3rd party class (in this case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT. I'm starting my bin/spark-shell with the following command. ./spark-shell --driver-class-path "$LIBPATH/jodatime2.3/joda-convert-1.2.jar:$LIBPATH/j

Caused by: java.lang.OutOfMemoryError: unable to create new native thread

2014-05-05 Thread Soumya Simanta
I just upgraded my Spark version to 1.0.0_SNAPSHOT. commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3 Author: witgo Date: Fri May 2 12:40:27 2014 -0700 I'm running a standalone cluster with 3 workers. - *Workers:* 3 - *Cores:* 48 Total, 0 Used - *Memory:* 469.8 GB Total, 0.0 B Used

Re: Announcing Spark SQL

2014-03-26 Thread Soumya Simanta
Very nice. Any plans to make the SQL typesafe using something like Slick ( http://slick.typesafe.com/) Thanks ! On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust wrote: > Hey Everyone, > > This already went out to the dev list, but I wanted to put a pointer here > as well to a new feature we a

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Soumya Simanta
@Diana - you can set up sbt manually for your project by following the instructions here: http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html (Manual Installation). Manual installation requires downloa

Re: sstream.foreachRDD

2014-03-04 Thread Soumya Simanta
I think you need to call collect. > On Mar 4, 2014, at 11:18 AM, Adrian Mocanu wrote: > > Hi > I’ve noticed that if in the driver of a spark app I have a foreach and add > stream elements to a list from the stream, the list contains no elements at > the end of the processing. > > Take this
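
A sketch of what "call collect" means here (stream and driverSideList are assumptions; without collect(), a plain foreach only mutates serialized copies of the list on the executors):

  import scala.collection.mutable.ListBuffer

  val driverSideList = ListBuffer[String]()
  stream.foreachRDD { rdd =>
    driverSideList ++= rdd.collect()   // materialize the micro-batch in the driver first
  }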

Help with building and running examples with GraphX from the REPL

2014-02-25 Thread Soumya Simanta
I'm not able to run the GraphX examples from the Scala REPL. Can anyone point to the correct documentation that talks about the configuration and/or how to build GraphX for the REPL ? Thanks

Re: Filter on Date by comparing

2014-02-24 Thread Soumya Simanta
ting > correctly. Back in 0.7.x days though, there was an issue where some of the > Joda libraries wouldn't correctly serialize with Kryo, but I think that's > since been fixed: > https://groups.google.com/forum/#!topic/cascalog-user/35cdnNIamKU > > HTH, > Andrew &

Running GraphX example from Scala REPL

2014-02-24 Thread Soumya Simanta
I'm trying to run the GraphX examples from the Scala REPL. However, it complains that it cannot find RDD. :23: error: not found: type RDD val users: RDD[(VertexId, (String, String))] = I'm using a Feb 3 commit of incubator Spark. Should I do anything differently to build GraphX? Or is

Filter on Date by comparing

2014-02-24 Thread Soumya Simanta
I want to filter an RDD by comparing dates. myRDD.filter( x => new DateTime(x.getCreatedAt).isAfter(start) ).count I'm using the JodaTime library but I get an exception about a Joda-Time class not being serializable. Is there a way to configure this, or is there an easier alternative for this problem? org.apac
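
One workaround that sidesteps the serialization issue: compute the boundary once in the driver as epoch millis and compare plain longs inside the closure (this assumes getCreatedAt returns a java.util.Date):

  val startMillis = start.getMillis   // start: org.joda.time.DateTime, evaluated in the driver
  myRDD.filter(x => x.getCreatedAt.getTime > startMillis).count()

Alternatively, registering the Joda-Time classes with Kryo is the configuration route.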