Try repartitioning the RDD to, say, 2x the number of cores available before
saveAsTextFile.
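For example (just a sketch; the input/output paths, the transformation, and the
16-partition figure below assume an 8-core setup and are not from this thread):

val data = sc.textFile("hdfs:///input/path")        // placeholder input path
val result = data.map(line => line.toUpperCase)     // placeholder transformation
result
  .repartition(16)                                  // roughly 2x the available cores
  .saveAsTextFile("hdfs:///output/path")            // placeholder output path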
-
Lalit Yadav
la...@sigmoidanalytics.com
Hi Mukesh,
Once you create a streaming job, a DAG is created which contains your job
plan, i.e. all map transformations and all action operations to be performed
on each batch of the streaming application.
So, once your job is started, the input DStream takes the data input from
the specified source and all
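As a rough sketch of what such a job looks like (not your actual job; the
socket source, batch interval and word-count steps below are assumed purely
for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream functions on older Spark versions

val conf = new SparkConf().setAppName("StreamingDagExample")
val ssc = new StreamingContext(conf, Seconds(10))       // assumed 10-second batches

// These transformations and the output operation form the DAG that is
// re-applied to every batch once the job starts.
val lines = ssc.socketTextStream("localhost", 9999)     // assumed input source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // the DAG starts executing against each incoming batch
ssc.awaitTermination()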
You can try creating a Hadoop Configuration and setting the S3 configuration,
i.e. access keys etc.
Then, for reading files from S3, use newAPIHadoopFile and pass the config
object along with the key and value classes.
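Something along these lines (a sketch only; the s3n property names and the
bucket path are assumptions, so check them against the Hadoop/S3 connector
version you use):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration()
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")       // assumed property name
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")   // assumed property name

val files = sc.newAPIHadoopFile(
  "s3n://your-bucket/path/to/files",   // placeholder S3 path
  classOf[TextInputFormat],            // input format
  classOf[LongWritable],               // key class
  classOf[Text],                       // value class
  hadoopConf
)
val lines = files.map { case (_, text) => text.toString }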
-
Lalit Yadav
la...@sigmoidanalytics.com
While creating the SparkConf, set the property "spark.cores.max" to the
maximum number of cores to be used by the Spark job.
By default it is set to 1.
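For example (a minimal sketch; the value 4 is just an example):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CoresMaxExample")
  .set("spark.cores.max", "4")   // cap the job at 4 cores across the cluster
val sc = new SparkContext(conf)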
-
Lalit Yadav
la...@sigmoidanalytics.com
It doesn't work that way.
Following is the correct way:
val table = sc.textFile(args(1))
val histMap = table.map(x => {
  (x.split('|')(0).toInt, 1)
})
-
Lalit Yadav
la...@sigmoidanalytics.com
Hi,
Since the Cassandra object is not serializable, you can't open the
connection at the driver level and access the object inside foreachRDD (i.e. at
the worker level).
You have to open the connection inside foreachRDD only, perform the operation,
and then close the connection.
For example:
wordCounts.fore
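The snippet above is cut off, so here is only a sketch of the pattern;
CassandraClient and its open/write/close methods are hypothetical stand-ins
for whatever non-serializable client you use:

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val client = CassandraClient.open()   // hypothetical: opened on the worker, per partition
    try {
      records.foreach(record => client.write(record))
    } finally {
      client.close()                      // close before the task finishes
    }
  }
}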
Hi TD,
Thanks a lot for your reply :)
I am already looking into creating a new DStream for SQS messages. It would
be very helpful if you could provide some guidance regarding the same.
The main motive of integrating SQS with Spark Streaming is to make my jobs
run with high availability.
As of no
Hi Sanjeet,
I have been using Spark Streaming for processing files present in S3 and
HDFS.
I am also using SQS messages for the same purpose as yours, i.e. as a pointer
to an S3 file.
As of now, I have a separate SQS job which receives messages from the SQS
queue and gets the corresponding file from S3.
Now,
You can try repartition/coalesce to make the final RDD into a single
partition before saveAsTextFile.
This should bring the content of the whole RDD into a single part-
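For example (a sketch, assuming an existing RDD called finalRdd and a
placeholder output path):

finalRdd
  .coalesce(1)                                    // or repartition(1)
  .saveAsTextFile("hdfs:///output/single-file")   // placeholder output path
// Note: all data is funnelled through a single task, so this only makes
// sense when the final RDD is reasonably small.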
-
Lalit Yadav
la...@sigmoidanalytics.com
Hi,
Thanks TD for your reply. I am still not able to resolve the problem for my
use case.
I have, let's say, 1000 different RDDs, and I am applying a transformation
function on each RDD, and I want the output of all the RDDs combined into a
single output RDD. For this, I am doing the following:
tempRD
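The snippet above is cut off; as a sketch of one way to combine many
transformed RDDs into a single output RDD (inputRdds, transform and the output
path below are placeholders, not the original code):

import org.apache.spark.rdd.RDD

// Placeholder inputs: a Seq of RDDs and a transformation applied to each one.
val transformed: Seq[RDD[String]] = inputRdds.map(rdd => rdd.map(transform))

// sc.union combines them in one step, which keeps the lineage much shallower
// than chaining union calls one by one.
val combined: RDD[String] = sc.union(transformed)
combined.saveAsTextFile("hdfs:///output/combined")   // placeholder output path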
Hi,
You can use map functions like flatMapValues and mapValues, which will
apply the map function to each pair RDD contained in your input
pair DStream and return a pair DStream.
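For example (a sketch, assuming pairStream is a DStream[(String, Int)] from
your job):

import org.apache.spark.streaming.StreamingContext._   // pair DStream functions on older Spark versions

val doubled  = pairStream.mapValues(count => count * 2)        // one output value per input pair
val expanded = pairStream.flatMapValues(count => 1 to count)   // zero or more output values per pair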
On Fri, Jun 13, 2014 at 8:48 AM, ryan_seq [via Apache Spark User List] <
ml-node+s1001560n7550...@n3.nabble.com> wrote:
If we do cache() + count() after, say, every 50 iterations, the whole process
becomes very slow.
I have tried checkpoint(), cache() + count(), and saveAsObjectFiles().
Nothing works.
Materializing RDDs leads to a drastic decrease in performance, and if we don't
materialize, we face a StackOverflowError.
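For reference, here is a sketch of the pattern being described (the loop body,
the step() function, initialRdd and the checkpoint directory are assumed, not
taken from this thread):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // assumed checkpoint location

var current = initialRdd                               // placeholder starting RDD
for (i <- 1 to numIterations) {
  current = step(current)                              // placeholder per-iteration transformation
  if (i % 50 == 0) {
    current.cache()
    current.checkpoint()                               // truncate the lineage
    current.count()                                    // force materialization now
  }
}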
On W
Hi,
We got Spork working on Spark 0.9.0.
Repository available at:
https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
Please share your feedback.
-
Lalit Yadav
la...@sigmoidanalytics.com
Hi,
I have been following Aniket's spork github repository.
https://github.com/aniket486/pig
I have done all the changes mentioned in the recently modified pig-spark file.
I am using:
hadoop 2.0.5 alpha
spark-0.8.1-incubating
mesos 0.16.0
##PIG variables
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/