If you are using textFile() to read data in, it also takes a parameter for the
minimum number of partitions to create. Would that not work for you?
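For illustration, a minimal sketch in Java; the input path and partition count are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MinPartitionsExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("min-partitions-example"));

            // The second argument is the minimum number of partitions to
            // create; Spark may create more, but not fewer.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt", 64);
            System.out.println("partitions: " + lines.partitions().size());

            sc.stop();
        }
    }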
On Oct 2, 2014 7:00 AM, "jamborta" wrote:
> Hi all,
>
> I have been testing repartitioning to ensure that my algorithms get a
> similar amount of data.
Hello Mark,
I am no expert but I can answer some of your questions.
On Oct 2, 2014 2:15 AM, "Mark Mandel" wrote:
>
> Hi,
>
> So I'm super confused about how to take my Spark code and actually deploy
and run it on a cluster.
>
> Let's assume I'm writing in Java, and we'll take a simple example su
Hello Sanjay,
This can be done, and is a very effective way to debug.
1) Compile and package your project to get a fat jar
2) In your SparkConf, use setJars and give the location of this jar. Also set
your master to local in the SparkConf (see the sketch after these steps)
3) Use this SparkConf when creating JavaSparkContext
4) Debug
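A minimal sketch of steps 2 and 3 in Java; the jar path and app name are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalDebugExample {
        public static void main(String[] args) {
            // Step 2: point setJars at the fat jar from step 1
            // and use a local master.
            SparkConf conf = new SparkConf()
                    .setAppName("local-debug")
                    .setMaster("local")
                    .setJars(new String[] {"target/myjob-assembly.jar"});

            // Step 3: create the context from this configuration.
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Step 4: set breakpoints in your job logic and attach
            // your IDE's debugger to this JVM.

            sc.stop();
        }
    }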
Hello,
I have written a standalone Spark job which I run through the Ooyala Job
Server. The program is working correctly; now I'm looking into how to
optimize it.
My program without optimization took 4 hours to run. The first optimizations
were using the KryoSerializer and compiling regex patterns and reusing them.
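For illustration, a sketch of those two changes in Java; the config key and serializer class are the standard Spark ones, while the pattern and app name are placeholders:

    import java.util.regex.Pattern;
    import org.apache.spark.SparkConf;

    public class OptimizedJobConfig {
        // Compile the pattern once and reuse it for every record,
        // instead of calling Pattern.compile() per record.
        static final Pattern FIELD_DELIM = Pattern.compile("\\t");

        public static SparkConf build() {
            return new SparkConf()
                    .setAppName("optimized-job")
                    // Switch from Java serialization to Kryo.
                    .set("spark.serializer",
                         "org.apache.spark.serializer.KryoSerializer");
        }
    }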
I solved this issue by putting the hbase-protocol jar on the Hadoop classpath,
and not on the Spark classpath.
export HADOOP_CLASSPATH="/path/to/jar/hbase-protocol-0.98.1-cdh5.1.0.jar"
On Tue, Aug 26, 2014 at 5:42 PM, Ashish Jain wrote:
> Hello,
>
> I'm using the following
Hello,
I'm using the following version of Spark - 1.0.0+cdh5.1.0+41
(1.cdh5.1.0.p0.27).
I've tried to specify the libraries Spark uses in the following ways (a sketch
of the first two follows the list):
1) Adding them to the Spark context
2) Specifying the jar path in
a) spark.executor.extraClassPath
b) spark.executor.extraLibraryPath
3)
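For reference, a sketch of approaches 1) and 2) set programmatically in Java; all paths here are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExtraClassPathExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("extra-classpath-example")
                    // 2a) prepend a jar to each executor's classpath
                    .set("spark.executor.extraClassPath",
                         "/opt/libs/my-dep.jar")
                    // 2b) add a native library directory for executors
                    .set("spark.executor.extraLibraryPath", "/opt/native");

            JavaSparkContext sc = new JavaSparkContext(conf);
            // 1) ship a jar with the job via the Spark context
            sc.addJar("/opt/libs/my-dep.jar");
            sc.stop();
        }
    }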