Docker Scripts

2014-07-09 Thread dmpour23
Hi, regarding the Docker scripts: I know I can change the base image easily, but is there any specific reason why the base image is hadoop_1.2.1? Why is this preferred to the Hadoop 2 (HDP2, CDH5) distributions? Now that Amazon supports Docker, could this replace the ec2 scripts? Kind regards, Dimitri

Re: ReduceByKey or groupByKey to Count?

2014-02-26 Thread dmpour23
If I use groupByKey like so... JavaPairRDD<String, List<String>> twos = ones.groupByKey(3).cache(); ...how would I write the contents of the List of Strings to a file or to Hadoop? Do I need to transform the JavaPairRDD to a JavaRDD and call saveAsTextFile?
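A minimal sketch of one way to do it, assuming the Spark 0.9-era Java API (where groupByKey returns JavaPairRDD<K, List<V>>); the output path is illustrative. Mapping each (key, values) pair to a line of text yields a plain JavaRDD<String> that saveAsTextFile can write:

    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    JavaPairRDD<String, List<String>> twos = ones.groupByKey(3).cache();
    JavaRDD<String> lines = twos.map(
        new Function<Tuple2<String, List<String>>, String>() {
          public String call(Tuple2<String, List<String>> kv) {
            // key, then its grouped values, tab-separated on one line
            StringBuilder sb = new StringBuilder(kv._1());
            for (String v : kv._2()) {
              sb.append('\t').append(v);
            }
            return sb.toString();
          }
        });
    lines.saveAsTextFile("hdfs://namenode/out"); // illustrative path

Note that JavaPairRDD also has saveAsTextFile directly, but it writes Tuple2.toString(); mapping to strings first gives control over the output format.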

Re: ReduceByKey or groupByKey to Count?

2014-02-27 Thread dmpour23
I am probably not stating my problem correctly and have not yet fully understood the Java Spark API. This is what I would like to do: read a file (which is not sorted), sort the file by a key extracted from each line, then partition the initial file into k files. The only restriction is that all lines a
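For what it's worth, a sketch of that pipeline under the same 0.9-era Java API, given an existing JavaSparkContext sc; the key-extraction rule (first tab-separated field) and the paths are assumptions:

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    JavaRDD<String> input = sc.textFile("hdfs://.../input.txt");
    final int k = 4; // number of output files, illustrative
    JavaPairRDD<String, String> byKey = input.map(
        new PairFunction<String, String, String>() {
          public Tuple2<String, String> call(String line) {
            // assumption: the sort key is the first tab-separated field
            return new Tuple2<String, String>(line.split("\t", 2)[0], line);
          }
        });
    byKey.sortByKey()
         .partitionBy(new HashPartitioner(k)) // equal keys land in the same file
         .values()
         .saveAsTextFile("hdfs://.../out");   // part-00000 ... part-0000(k-1)

One caveat: partitionBy shuffles again, so the global ordering from sortByKey is not preserved within each output file; if per-file sort order matters, sort within partitions after partitioning.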

Spark Java example using external Jars

2014-03-13 Thread dmpour23
Hi, can anyone point out any examples on the web other than the Java examples offered by the Spark documentation? To be more specific, an example using external jars and property files in the classpath. Thanks in advance, Dimitri
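A sketch of both pieces, assuming Spark 0.9's JavaSparkContext constructor that takes an array of jar paths; all paths and names here are illustrative:

    import java.util.Properties;
    import org.apache.spark.api.java.JavaSparkContext;

    // Jars listed here are shipped to the workers and added to their classpath.
    String[] jars = { "target/myapp.jar", "lib/some-dependency.jar" };
    JavaSparkContext sc = new JavaSparkContext(
        "spark://master:7077", "MyApp",
        System.getenv("SPARK_HOME"), jars);

    // Property files are simplest to bundle inside the application jar
    // (e.g. under src/main/resources) and read off the classpath:
    Properties props = new Properties();
    props.load(MyApp.class.getResourceAsStream("/app.properties"));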

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-18 Thread dmpour23
On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote:
> Is there a reason for Spark using the older Akka?
>
> On Sun, Mar 2, 2014 at 1:53 PM, 1esha wrote:
> > The problem is in akka-remote. It contains files compiled with 2.4.*. When
> > you run it with 2.5.* in classpath i

Re: Spark Java example using external Jars

2014-03-20 Thread dmpour23
Thanks for the example. However, my main problem is this: I would like to create a Spark app that will sort and partition the initial file k times based on a key. JavaSparkContext ctx = new JavaSparkContext("spark://dmpour:7077", "BasicFileSplit", System.getenv("SPARK_HOME"), J
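One hedged guess at how that truncated constructor call usually continues in examples from that era: the last argument is the list of jars to ship to the workers, and JavaSparkContext.jarOfClass can locate the jar containing the driver class:

    JavaSparkContext ctx = new JavaSparkContext(
        "spark://dmpour:7077", "BasicFileSplit",
        System.getenv("SPARK_HOME"),
        // finds the jar that holds this class, so workers can load it
        JavaSparkContext.jarOfClass(BasicFileSplit.class));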

Re: Spark Java example using external Jars

2014-03-24 Thread dmpour23
Hello, has anyone got any ideas? I am not quite sure my problem is an exact fit for Spark, since in this section of my program I am not really doing a reduce job, simply a group-by and partition. Would calling pipe on the partitioned JavaRDD do the trick? Are there any examples usin
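pipe does work on an already-partitioned JavaRDD: each partition's elements are written to the external command's stdin, and the command's stdout lines become the elements of the new RDD. A minimal sketch, with the script path illustrative:

    // Runs once per partition; the script sees that partition's lines on stdin.
    JavaRDD<String> piped = partitioned.pipe("/path/to/group_and_emit.sh");
    piped.saveAsTextFile("hdfs://.../piped-out"); // action, triggers the pipe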

Re: Running a task once on each executor

2014-03-27 Thread dmpour23
How exactly can rdd.mapPartitions be made to execute once in each VM? I am running mapPartitions, and the call function seems not to execute the code. JavaPairRDD twos = input.map(new Split()).sortByKey().partitionBy(new HashPartitioner(k)); twos.values().saveAsTextFile(args[2]); JavaRDD ls = twos.va
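Two things worth noting, on the assumption this is the usual lazy-evaluation trap: mapPartitions runs once per partition (an executor JVM may process several partitions, so it is not strictly once per VM), and like all transformations it only runs when an action forces it. A sketch against the 0.9/1.x Java API, where FlatMapFunction returns an Iterable:

    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    JavaRDD<String> ls = twos.values().mapPartitions(
        new FlatMapFunction<Iterator<String>, String>() {
          public Iterable<String> call(Iterator<String> it) {
            // executed once per partition, not once per element
            int n = 0;
            while (it.hasNext()) { it.next(); n++; }
            return Collections.singletonList("partition size: " + n);
          }
        });
    List<String> out = ls.collect(); // without an action like this, nothing runs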

Re: Running a task once on each executor

2014-03-28 Thread dmpour23
Is it possible to do this: JavaRDD parttionedRdds = input.map(new Split()).sortByKey().partitionBy(new HashPartitioner(k)).values(); parttionedRdds.saveAsTextFile(args[2]); // Then run my SingletonFunction (my app depends on the saved files) parttionedRdds.map(new SingletonFunc()); The partti
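A sketch of the per-JVM singleton part, plus the detail that likely matters most: map is lazy, so without a subsequent action the SingletonFunc never runs. Names are taken from the message above; the body is an assumption:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    class SingletonFunc implements Function<String, String> {
      // static state lives once per executor JVM, not per task or per record
      private static boolean initialized = false;

      public String call(String line) throws Exception {
        synchronized (SingletonFunc.class) {
          if (!initialized) {
            // e.g. load the files written by saveAsTextFile above
            initialized = true;
          }
        }
        return line;
      }
    }

    parttionedRdds.saveAsTextFile(args[2]);      // action: files are written
    JavaRDD<String> mapped = parttionedRdds.map(new SingletonFunc());
    mapped.count();                              // action: forces the map to run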

how to save RDD partitions in different folders?

2014-04-04 Thread dmpour23
Hi all, say I have an input file which I would like to partition using HashPartitioner k times. Calling rdd.saveAsTextFile("hdfs://"); will save k files as part-0 ... part-k. Is there a way to save each partition in a specific folder? I.e. src: part0/part-0, part1/part-00
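saveAsTextFile always writes under a single directory, so out of the box the answer is no. A brute-force workaround is one filtered pass per partition index (k passes over the data; paths illustrative); the cleaner route via Hadoop's MultipleTextOutputFormat is sketched in a later message below:

    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    for (int i = 0; i < k; i++) {               // k declared final in scope
      final int idx = i;
      byKey.filter(new Function<Tuple2<String, String>, Boolean>() {
             public Boolean call(Tuple2<String, String> kv) {
               // replicate HashPartitioner's placement: non-negative hash mod k
               int h = kv._1().hashCode() % k;
               if (h < 0) h += k;
               return h == idx;
             }
           })
           .values()
           .saveAsTextFile("hdfs://.../part" + idx); // one folder per partition
    }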

Re: how to save RDD partitions in different folders?

2014-04-07 Thread dmpour23
Can you provide an example?

Re: how to save RDD partitions in different folders?

2014-04-22 Thread dmpour23
I am not exactly sure how to use MultipleOutputs in Spark. I have been looking into Apache Crunch; its guide (http://crunch.apache.org/user-guide.html) states that: Multiple outputs: Spark doesn't have a concept of multiple outputs; when you write a data set to disk, the pipeline that creates tha
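For the old mapred API, the usual workaround is to subclass Hadoop's MultipleTextOutputFormat and hand it to saveAsHadoopFile; the output path and the folder-naming rule below are assumptions:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class KeyBasedOutput extends MultipleTextOutputFormat<String, String> {
      @Override
      protected String generateFileNameForKeyValue(String key, String value, String name) {
        return "part" + key + "/" + name;   // e.g. part0/part-00000
      }
      @Override
      protected String generateActualKey(String key, String value) {
        return null;                        // write only the value into each file
      }
    }

    // Driver side: keys select the folder, values are the lines to write.
    pairRdd.saveAsHadoopFile("hdfs://.../out",
        String.class, String.class, KeyBasedOutput.class);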