Is there a way to prevent an RDD from shuffling in a join operation without
repartitioning it?
I'm reading an RDD from sharded MongoDB, joining that with an RDD of
incoming data (+ some additional calculations), and writing the resulting
RDD back to MongoDB. It would make sense to shuffle only th
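A rough sketch of the kind of setup I mean, assuming the usual workaround of partitioning the MongoDB side once with an explicit partitioner and persisting it, so that the join shuffles only the other RDD (an existing SparkContext sc is assumed, and the RDD contents and partition count are placeholders):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD functions (older Spark style)

// Placeholder RDDs standing in for the MongoDB data and the incoming batch.
val mongoRdd    = sc.parallelize(Seq((1, "stored-1"), (2, "stored-2")))
val incomingRdd = sc.parallelize(Seq((1, 10.0), (2, 20.0)))

// Partition the large side once and persist it; a join against an RDD
// that already has a known partitioner shuffles only the other side.
val partitioner      = new HashPartitioner(8)
val mongoPartitioned = mongoRdd.partitionBy(partitioner).persist()

// Only incomingRdd is shuffled here; mongoPartitioned keeps its partitions.
val joined = mongoPartitioned.join(incomingRdd)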
Is there a way to drop parquet file partitions through Spark? I'm
partitioning a parquet file by a date field and I would like to drop old
partitions in a file-system-agnostic manner. I guess I could read the whole
parquet file into a DataFrame, filter out the dates to be dropped, and
overwrite the
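Roughly the read/filter/overwrite approach described above, as a sketch only; it assumes a Spark 1.4+ SQLContext, and the paths, date column, and cutoff value are placeholders:

import org.apache.spark.sql.{SQLContext, SaveMode}

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.parquet("/data/events")   // placeholder input path

// Keep only the dates that should survive and write the result to a new
// location (overwriting the path being read is not safe), then swap it in.
df.filter(df("date") >= "2015-01-01")              // placeholder cutoff date
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("date")
  .parquet("/data/events_pruned")                  // placeholder output path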
I'm interested in knowing which NoSQL databases you use with Spark and what
are your experiences.
On a general level, I would like to use Spark Streaming to process incoming
data, fetch relevant aggregated data from the database, and update the
aggregates in the DB based on the incoming records.
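The pattern I have in mind on the Spark side is roughly the following sketch; the socket source, the key extraction, and the commented-out database calls are placeholders for the real ingestion and NoSQL client:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream functions (older Spark style)

val conf = new SparkConf().setAppName("streaming-aggregates")
val ssc  = new StreamingContext(conf, Seconds(10))

// Placeholder source and keying; the real input would be the incoming records.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.map(key => (key, 1L)).reduceByKey(_ + _)

// For each micro-batch: fetch the current aggregate from the database,
// combine it with the new counts, and write the result back.
counts.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // val db = connectToNoSqlStore()              // placeholder client, one per partition
    records.foreach { case (key, delta) =>
      // val current = db.get(key).getOrElse(0L)   // read existing aggregate (placeholder)
      // db.put(key, current + delta)              // write updated aggregate (placeholder)
      ()
    }
  }
}

ssc.start()
ssc.awaitTermination()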
Hi,
I am trying to fit a logistic regression model with cross validation in
Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where
each element is a pair of RDDs containing the training and test data:
(training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint],
test_d
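The kind of fold loop I have in mind looks roughly like this (a sketch only; how data_kfolded is built and the iteration count are placeholders):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// One (training, test) pair of RDDs per fold, as described above.
val data_kfolded: Array[(RDD[LabeledPoint], RDD[LabeledPoint])] = ???

val numIterations = 100
val foldErrors = data_kfolded.map { case (training, test) =>
  val model = SVMWithSGD.train(training, numIterations)
  // Fraction of misclassified points on the held-out fold.
  val predictionsAndLabels = test.map(p => (model.predict(p.features), p.label))
  predictionsAndLabels.filter { case (pred, label) => pred != label }.count.toDouble / test.count
}
val cvError = foldErrors.sum / foldErrors.size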
> with the classes in MLlib now so
> you'll have to roll your own using underlying sgd / bfgs primitives.
> On Sat, Jul 5, 2014 at 10:45 AM, Christopher Nguyen <ctn@...> wrote:
>
>> Hi sparkuser2345,
>> I'm infer
Hi,
I'm running Spark in an EMR cluster and I'm able to read from S3 using REPL
without problems:
val input_file = "s3:///test_data.txt"
val rawdata = sc.textFile(input_file)
val test = rawdata.collect
but when I try to run a simple standalone application reading the same data,
I get an erro
I'm still getting the same "Input path does not exist" error after setting the
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using
the format "s3:///test_data.txt" for the input file.
Evan R. Sparks wrote
> Try s3n://
Thanks, that works! In the REPL, I can successfully load the data using both
s3:// and s3n://; why the difference?
Matei Zaharia wrote
> If you use s3n:// for both, you should be able to pass the exact same file
> to load as you did to save.
I'm trying to write a file to s3n in a Spark app and to read it in another
one using the same file name, but without luck. Writing data to s3n as
val data = Array(1.0, 1
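A sketch of the round trip in question, with dummy values and a placeholder bucket name; saveAsTextFile writes a folder of part files, and the same folder path is used when reading it back:

// In the writing application:
val data = Array(1.0, 2.0, 3.0)                                // placeholder values
sc.parallelize(data).saveAsTextFile("s3n://your-bucket/saved_data")

// In the reading application, the same path loads the folder of part files:
val loaded = sc.textFile("s3n://your-bucket/saved_data").map(_.toDouble)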
sparkuser2345 wrote
> I'm using Spark 1.0.0.
The same works when:
- Using Spark 0.9.1.
- Saving to and reading from local file system (Spark 1.0.0)
- Saving to and reading from HDFS (Spark 1.0.0)
Ashish Rangole wrote
> Specify a folder instead of a file name in your input and output code, as in:
>
> Output:
> s3n://your-bucket-name/your-data-folder
>
> Input: (when consuming the above output)
>
> s3n://your-bucket-name/your-data-folder/*
Unfortunately no luck:
Exception in thread "main" o
I'm running Spark 1.0.0 on EMR. I'm able to access the master web UI but not
the worker web UIs or the application detail UI ("Server not found").
I added the following inbound rule to the ElasticMapReduce-slave security
group but it didn't help:
Type = All TCP
Port range = 0 - 65535
Source = My
I have an array 'dataAll' of key-value pairs where each value is an array of
arrays. I would like to parallelize a task over the elements of 'dataAll' to
the workers. In the dummy example below, the number of elements in 'dataAll'
is 3, but in the real application it would be tens to hundreds.
Without
What are the limiting factors to the size of the elements of an RDD?
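A sketch of the distribution I mean, with a dummy dataAll of three elements and a placeholder compute function (one task per element; an existing SparkContext sc is assumed):

// Placeholder data with the same shape: key-value pairs whose values are arrays of arrays.
val dataAll: Array[(Int, Array[Array[Double]])] =
  Array(1, 2, 3).map(k => (k, Array(Array(k.toDouble))))

// Placeholder per-element computation standing in for the real task.
def compute(arrays: Array[Array[Double]]): Double = arrays.map(_.sum).sum

// One partition per element so each worker task handles one (key, value) pair.
val results = sc.parallelize(dataAll, dataAll.length)
  .map { case (key, arrays) => (key, compute(arrays)) }
  .collect()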
sparkuser2345 wrote
> I have an array 'dataAll' of key-value pairs where each value is an array
> of arrays. I would like to parallelize a task over the elements of
> 'dataAll' to the workers. In