Hi,
Regarding the Docker scripts: I know I can change the base image easily, but is
there any specific reason why the base image is hadoop_1.2.1? Why is this
preferred over the Hadoop 2 (HDP2, CDH5) distributions?
Now that Amazon supports Docker, could this replace the ec2 scripts?
Kind regards
Dimitri
--
If I use groupByKey like so...
JavaPairRDD<String, List<String>> twos = ones.groupByKey(3).cache();
How would I write the contents of the List of Strings to a file or to Hadoop?
Do I need to transform the JavaPairRDD to a JavaRDD and call saveAsTextFile?
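A minimal sketch of one way to do that, assuming ones is a JavaPairRDD<String, String> and that the output path is a placeholder (on newer Spark versions groupByKey returns Iterable<String> values rather than List<String>): map each grouped entry to a formatted line, then write it with saveAsTextFile.

import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

JavaPairRDD<String, List<String>> twos = ones.groupByKey(3).cache();

// One output line per key: "key<TAB>[v1, v2, ...]" (relies on List.toString()).
JavaRDD<String> lines = twos.map(
    new Function<Tuple2<String, List<String>>, String>() {
        public String call(Tuple2<String, List<String>> kv) {
            return kv._1() + "\t" + kv._2();
        }
    });

lines.saveAsTextFile("hdfs:///path/to/output");   // placeholder path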
--
I am probably not stating my problem correctly and have not yet fully understood
the Java Spark API.
This is what I would like to do: read a file (which is not sorted), sort the
file by a key extracted from each line, then partition the initial file into k
files. The only restriction is that all lines a
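A minimal sketch of that flow (assumptions: the key is the first whitespace-separated token of each line, k is the desired number of output files, and all paths are placeholders):

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

JavaSparkContext ctx = new JavaSparkContext("local[2]", "SortAndSplit");
int k = 4;                                        // number of output files (assumption)
JavaRDD<String> input = ctx.textFile("hdfs:///path/to/input");

// Key each line by its first whitespace-separated token (assumption).
JavaPairRDD<String, String> keyed = input.mapToPair(
    new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String line) {
            return new Tuple2<String, String>(line.split("\\s+", 2)[0], line);
        }
    });

// Sort by key, then place all lines with the same key into the same one of
// k partitions; saveAsTextFile writes one part-* file per partition.
keyed.sortByKey()
     .partitionBy(new HashPartitioner(k))
     .values()
     .saveAsTextFile("hdfs:///path/to/output");

Note that hash-partitioning after the sort only guarantees that equal keys end up in the same file, not a global order across files; sortByKey(true, k) alone produces k range-partitioned, sorted files if that is what is needed.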
Hi,
Can anyone point out any Java examples on the web other than those offered by
the Spark documentation?
To be more specific, an example using external jars and property files on the
classpath.
Thanks in advance
Dimitri
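For the external-jars and property-file part, a minimal sketch (MyApp, app.properties, and the jar paths are all hypothetical):

import java.io.InputStream;
import java.util.Properties;
import org.apache.spark.api.java.JavaSparkContext;

public class MyApp {                                  // hypothetical driver class
    public static void main(String[] args) throws Exception {
        // Read a property file packaged on the driver's classpath.
        Properties props = new Properties();
        InputStream in = MyApp.class.getResourceAsStream("/app.properties");  // hypothetical file
        props.load(in);
        in.close();

        // Ship the application jar plus any extra dependency jars to the workers,
        // so their classes are on the executors' classpath as well.
        String[] jars = new String[] {
            "target/my-app.jar",                      // hypothetical paths
            "lib/some-dependency.jar"
        };
        JavaSparkContext ctx = new JavaSparkContext(
            props.getProperty("spark.master", "local[2]"),
            "MyApp",
            System.getenv("SPARK_HOME"),
            jars);

        // ... build and run the job here ...
        ctx.stop();
    }
}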
--
On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote:
> Is there a reason for Spark using the older Akka?
>
> On Sun, Mar 2, 2014 at 1:53 PM, 1esha wrote:
>
> The problem is in akka-remote. It contains files compiled with 2.4.*. When
> you run it with 2.5.* on the classpath i
Thanks for the example.
However, my main problem is this: I would like to create a Spark app that will
sort and partition the initial file k times based on a key.
JavaSparkContext ctx = new JavaSparkContext("spark://dmpour:7077",
    "BasicFileSplit", System.getenv("SPARK_HOME"),
    J
Hello,
Has anyone got any ideas? I am not quite sure whether my problem is an exact fit
for Spark, since in this section of my program I am not really doing a reduce
job, simply a group-by and a partition.
Would calling pipe on the partitioned JavaRDD do the trick? Are there any
examples using pipe?
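On the pipe question, a minimal sketch (assuming partitioned is the JavaRDD<String> produced by the sortByKey/partitionBy pipeline above, and /usr/local/bin/process.sh is a hypothetical script that reads lines on stdin and writes lines on stdout):

// pipe() launches the command once per partition and streams that partition's
// elements through its stdin; each line the command prints becomes an element
// of the resulting RDD.
JavaRDD<String> processed = partitioned.pipe("/usr/local/bin/process.sh");
processed.saveAsTextFile("hdfs:///path/to/piped-output");   // placeholder path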
How exactly does rdd.mapPartitions get executed once in each VM?
I am running mapPartitions and the call function does not seem to execute its
code.
JavaPairRDD twos = input.map(new Split())
    .sortByKey()
    .partitionBy(new HashPartitioner(k));
twos.values().saveAsTextFile(args[2]);
JavaRDD ls = twos.va
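One common reason the function appears never to run is that mapPartitions is a transformation: nothing executes until an action forces evaluation. A sketch under that assumption (lines is assumed to be a JavaRDD<String>; the signature shown is the 2014-era Java API, where call returns an Iterable, while recent releases return an Iterator):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<String> perPartition = lines.mapPartitions(
    new FlatMapFunction<Iterator<String>, String>() {
        public Iterable<String> call(Iterator<String> it) {
            // Runs once per partition, on whichever worker holds that partition.
            int count = 0;
            while (it.hasNext()) { it.next(); count++; }
            List<String> out = new ArrayList<String>();
            out.add("lines in this partition: " + count);
            return out;
        }
    });

// Without an action (count, collect, saveAsTextFile, ...) the function above never runs.
perPartition.saveAsTextFile("hdfs:///path/to/partition-counts");   // placeholder path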
Is it possible to do this:
JavaRDD parttionedRdds = input.map(new Split())
    .sortByKey()
    .partitionBy(new HashPartitioner(k))
    .values();
parttionedRdds.saveAsTextFile(args[2]);
// Then run my SingletonFunction (my app depends on the saved files)
parttionedRdds.map(new SingletonFunc());
The partti
Hi all,
Say I have an input file which I would like to split into k partitions using a
HashPartitioner.
Calling rdd.saveAsTextFile("hdfs://..."); will save the k partitions as files
part-0 ... part-k.
Is there a way to save each partition in a specific folder instead? i.e.
src/
  part0/part-0
  part1/part-00
Can you provide an example?
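One commonly suggested approach, sketched below, is to write through the old Hadoop API with a MultipleTextOutputFormat subclass that turns each record's key into a sub-directory name (class and path names here are placeholders, and pairs is assumed to be a JavaPairRDD<String, String>):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;

// Routes each record to <output>/<key>/part-NNNNN instead of <output>/part-NNNNN.
public class KeyBasedOutput extends MultipleTextOutputFormat<Object, Object> {
    @Override
    protected String generateFileNameForKeyValue(Object key, Object value, String name) {
        return key.toString() + "/" + name;           // key becomes the folder name
    }
}

// In the driver:
pairs.saveAsHadoopFile(
    "hdfs:///path/to/output",                         // placeholder path
    String.class,                                     // key class
    String.class,                                     // value class
    KeyBasedOutput.class);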
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754p3823.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I am not exactly sure how to use MultipleOutputs in Spark. I have been looking
into Apache Crunch; its user guide (http://crunch.apache.org/user-guide.html)
states that:
Multiple outputs: Spark doesn't have a concept of multiple outputs; when you
write a data set to disk, the pipeline that creates tha