Preventing an RDD from shuffling

2015-12-16 Thread sparkuser2345
Is there a way to prevent an RDD from shuffling in a join operation without repartitioning it? I'm reading an RDD from sharded MongoDB, joining that with an RDD of incoming data (+ some additional calculations), and writing the resulting RDD back to MongoDB. It would make sense to shuffle only th
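
One way this is usually handled (a sketch, not taken from the thread itself; the partition count and the mongoRdd/incomingRdd names are assumptions): pre-partition the MongoDB RDD with an explicit partitioner and cache it, then partition the incoming RDD with the same partitioner, so the join becomes a narrow dependency and only the small incoming side moves.

import org.apache.spark.HashPartitioner

// Hypothetical key-value RDDs; in the real job mongoRdd would come from the
// MongoDB connector and incomingRdd from the new data.
val partitioner = new HashPartitioner(8)
val mongoRdd    = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(partitioner).cache()
val incomingRdd = sc.parallelize(Seq(("a", 10), ("c", 30))).partitionBy(partitioner)

// Both sides share the same partitioner, so the join runs without an
// additional shuffle of the (already partitioned) MongoDB side.
val joined = mongoRdd.join(incomingRdd)
joined.collect().foreach(println)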

Dropping parquet file partitions

2016-03-01 Thread sparkuser2345
Is there a way to drop parquet file partitions through Spark? I'm partitioning a parquet file by a date field and I would like to drop old partitions in a file system agnostic manner. I guess I could read the whole parquet file into a DataFrame, filter out the dates to be dropped, and overwrite the
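
A sketch of the read-filter-rewrite work-around mentioned above, assuming a Spark 1.6-style sqlContext, a partition column called "date", and hypothetical paths and cutoff:

// Read the partitioned dataset, keep only the partitions that should survive,
// and write the result out again partitioned by the same column.
val df = sqlContext.read.parquet("/data/events")
df.filter(df("date") >= "2016-01-01")
  .write
  .mode("overwrite")
  .partitionBy("date")
  .parquet("/data/events_pruned")

Writing to a separate path (rather than overwriting the input in place) avoids reading and clobbering the same directory in one job; whether the old directory can then simply be swapped out depends on the file system.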

Experiences about NoSQL databases with Spark

2015-11-24 Thread sparkuser2345
I'm interested in knowing which NoSQL databases you use with Spark and what your experiences have been. On a general level, I would like to use Spark Streaming to process incoming data, fetch relevant aggregated data from the database, and update the aggregates in the DB based on the incoming records.
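
For what it's worth, the read-modify-write pattern described above is often written as a foreachRDD / foreachPartition loop, independent of which NoSQL database sits behind it. A rough sketch; fetchAggregate and saveAggregate are hypothetical placeholders, not a real client API:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("agg-updater"), Seconds(10))
val incoming = ssc.socketTextStream("localhost", 9999)   // stand-in source

incoming
  .map(line => (line.split(",")(0), 1L))                 // (key, count) per record
  .reduceByKey(_ + _)
  .foreachRDD { rdd =>
    rdd.foreachPartition { part =>
      // open one DB connection per partition here
      part.foreach { case (key, delta) =>
        // val current = fetchAggregate(key)   // hypothetical DB read
        // saveAggregate(key, current + delta) // hypothetical DB write
      }
    }
  }

ssc.start()
ssc.awaitTermination()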

How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread sparkuser2345
Hi, I am trying to fit a logistic regression model with cross validation in Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where each element is a pair of RDDs containing the training and test data: (training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint], test_d
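
Since RDD operations cannot be nested inside other RDD operations, the fold loop has to run on the driver; each SVMWithSGD.train call is itself a distributed job. A sketch assuming data_kfolded: Array[(RDD[LabeledPoint], RDD[LabeledPoint])] of (training, test) pairs:

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.SVMWithSGD

// Fit one model per fold on the driver and measure its test accuracy.
val numIterations = 100
val foldResults = data_kfolded.map { case (training, test) =>
  val model = SVMWithSGD.train(training, numIterations)
  val accuracy = test
    .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
    .mean()
  (model, accuracy)
}

Replacing data_kfolded.map with data_kfolded.par.map submits the per-fold jobs concurrently from the driver, which is usually what "parallelizing over folds" ends up meaning in practice.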

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-07 Thread sparkuser2345
…with the classes in MLlib now, so you'll have to roll your own using the underlying SGD / BFGS primitives. On Sat, Jul 5, 2014 at 10:45 AM, Christopher Nguyen wrote: > Hi sparkuser2345, I'm infer

Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Hi, I'm running Spark in an EMR cluster and I'm able to read from S3 using the REPL without problems:

val input_file = "s3:///test_data.txt"
val rawdata = sc.textFile(input_file)
val test = rawdata.collect

but when I try to run a simple standalone application reading the same data, I get an erro
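
One thing worth trying in the standalone app (a guess, with placeholder keys and bucket name): set the S3 credentials directly on the SparkContext's Hadoop configuration instead of relying on them being picked up from the environment:

// Hypothetical credentials and bucket; fs.s3n.* is the native S3 filesystem.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val rawdata = sc.textFile("s3n://your-bucket/test_data.txt")
println(rawdata.count())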

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
I'm getting the same "Input path does not exist" error even after setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using the format "s3:///test_data.txt" for the input file.

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Evan R. Sparks wrote > Try s3n:// Thanks, that works! In the REPL, I can successfully load the data using both s3:// and s3n://, so why the difference?

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Matei Zaharia wrote > If you use s3n:// for both, you should be able to pass the exact same file to load as you did to save. I'm trying to write a file to s3n in a Spark app and to read it in another one using the same file name, but without luck. Writing data to s3n as val data = Array(1.0, 1
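
A minimal version of what (I assume) the write/read round trip looks like; the bucket and path are placeholders. saveAsTextFile produces a directory of part-xxxxx files, and textFile pointed at that same directory should pick them all up:

// Write: this creates s3n://your-bucket/test_output/part-00000, part-00001, ...
val data = sc.parallelize(1 to 100)
data.saveAsTextFile("s3n://your-bucket/test_output")

// Read back by directory name, not by an individual part file.
val readBack = sc.textFile("s3n://your-bucket/test_output")
println(readBack.count())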

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
sparkuser2345 wrote > I'm using Spark 1.0.0. The same works when:
- Using Spark 0.9.1
- Saving to and reading from local file system (Spark 1.0.0)
- Saving to and reading from HDFS (Spark 1.0.0)

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Ashish Rangole wrote > Specify a folder instead of a file name for input and output code, as in:
> Output: s3n://your-bucket-name/your-data-folder
> Input (when consuming the above output): s3n://your-bucket-name/your-data-folder/*
Unfortunately no luck: Exception in thread "main" o
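
For completeness, a quick check along the lines of the suggestion above (hypothetical paths, assuming the write from the previous message succeeded): the directory itself and directory/* should both enumerate the same part files:

val viaDir  = sc.textFile("s3n://your-bucket/test_output")
val viaGlob = sc.textFile("s3n://your-bucket/test_output/*")
println(viaDir.count() == viaGlob.count())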

Unable to access worker web UI or application UI (EC2)

2014-08-08 Thread sparkuser2345
I'm running Spark 1.0.0 on EMR. I'm able to access the master web UI but not the worker web UIs or the application detail UI ("Server not found"). I added the following inbound rule to the ElasticMapreduce-slave security group but it didn't help: Type = All TCP, Port range = 0 - 65535, Source = My

Parallelizing a task makes it freeze

2014-08-11 Thread sparkuser2345
I have an array 'dataAll' of key-value pairs where each value is an array of arrays. I would like to parallelize a task over the elements of 'dataAll' to the workers. In the dummy example below, the number of elements in 'dataAll' is 3, but in the real application it would be tens to hundreds. Without
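
A stripped-down version of the dummy set-up as I read it, with invented sizes and a trivial per-element computation, just to show the shape of the parallelization:

// Three (key, matrix) pairs on the driver; in the real case there would be
// tens to hundreds, and the matrices would be much larger.
val dataAll: Array[(Int, Array[Array[Double]])] =
  (1 to 3).map(k => (k, Array.fill(4, 5)(scala.util.Random.nextDouble()))).toArray

// One task per element; each task reduces its matrix to a single number here.
val result = sc.parallelize(dataAll, dataAll.length)
  .map { case (key, value) => (key, value.map(_.sum).sum) }
  .collect()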

Re: Parallelizing a task makes it freeze

2014-08-12 Thread sparkuser2345
). What are the limiting factors on the size of the elements of an RDD? sparkuser2345 wrote > I have an array 'dataAll' of key-value pairs where each value is an array of arrays. I would like to parallelize a task over the elements of 'dataAll' to the workers. In
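
One knob that was often relevant in that Spark generation (an assumption about the cause, not a confirmed fix): data passed through sc.parallelize travels inside the serialized tasks, and driver/executor messages were capped by spark.akka.frameSize (in MB, default 10), so very large elements could stall the job:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical configuration raising the frame size to 64 MB before creating sc.
val conf = new SparkConf()
  .setAppName("large-elements")
  .set("spark.akka.frameSize", "64")
val sc = new SparkContext(conf)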