actually correct?
Hope this helps..
Regards,
Gylfi.
Hi.
Have you tried to repartition the finalRDD before saving?
This link might help.
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter3/save_the_rdd_to_files.html
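Something along these lines (untested sketch; the paths and the partition count are just placeholders, and sc is your SparkContext):

val finalRDD = sc.textFile("hdfs:///input/data")
finalRDD.repartition(10).saveAsTextFile("hdfs:///output/data")   // writes 10 part-files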
Regards,
Gylfi.
spark.speculation.multiplier
spark.speculation.quantile
See https://spark.apache.org/docs/latest/configuration.html under
Scheduling.
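For example, in code (the values shown are just the documented defaults; tune them for your job):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.speculation", "true")             // turn speculative execution on
  .set("spark.speculation.multiplier", "1.5")   // how many times slower than the median before a copy is launched
  .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before speculation kicks in
val sc = new SparkContext(conf)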
Regards,
Gylfi.
I am afraid I am out of ideas ;)
Regards and good luck,
Gylfi.
Hi.
I am just wondering if the rdd was actually modified.
Did you test it by printing rdd.partitions.length before and after?
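For example (assuming "rdd" is the RDD in question):

println(rdd.partitions.length)    // number of parts before
val rdd2 = rdd.repartition(10)    // transformations return a new RDD; "rdd" itself is unchanged
println(rdd2.partitions.length)   // should print 10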
Regards,
Gylfi.
explicit version..
A simpler version would be something like this:
val flattenedIntRDD: RDD[Int] = intArraysRDD.flatMap(array => array.toList)
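For instance, a minimal sketch assuming intArraysRDD is an RDD[Array[Int]]:

val intArraysRDD = sc.parallelize(Seq(Array(1, 2, 3), Array(4, 5)))
val flattened = intArraysRDD.flatMap(_.toList)   // RDD[Int] containing 1, 2, 3, 4, 5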
However, to understand your problem exactly, you need to explain better what
the RDD you want to create should look like.
Regards,
Gylfi.
more parts before line 52 by calling
"rddname".repartition(10) for example and see if it runs faster..
Regards,
Gylfi.
e count printed out.
After the operation both RDDs are "destroyed" again.
If you run myrdd2.count again, both myrdd and myrdd2 are created again from
scratch.
If your transformation is expensive, you may want to keep the data around,
and for that you must use .persist() or .cache(), etc.
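A small sketch (myrdd and expensiveTransform stand in for your own RDD and transformation):

val myrdd2 = myrdd.map(expensiveTransform).persist()   // nothing is computed yet (lazy)
myrdd2.count()   // first action: runs the transformation and caches the result
myrdd2.count()   // second action: answered from the cached partitions, no recomputation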
Regards,
Gylfi.
You may want to look into using the pipe command ..
http://blog.madhukaraphatak.com/pipe-in-spark/
http://spark.apache.org/docs/0.6.0/api/core/spark/rdd/PipedRDD.html
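A minimal sketch; each element is written to the command's stdin (one per line) and every output line becomes an element of the new RDD:

val shouted = sc.parallelize(Seq("alpha", "beta")).pipe("tr a-z A-Z")
shouted.collect().foreach(println)   // prints ALPHA and BETA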
You could even try changing the block size of the input data on HDFS (can be
done on a per file basis) and that would get all workers going right from
the get-go in Spark.
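If I remember correctly, something like the following sets the block size for a single file as you upload it (the 64 MB value is just an example):

hdfs dfs -D dfs.blocksize=67108864 -put bigfile.csv /data/bigfile.csv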
Hi.
Assuming you have the data in an RDD, you can save your RDD (regardless of
structure) with "nameRDD".saveAsObjectFile("path"), where "path" can be
"hdfs:///myfolderonHDFS" or a location on the local file system.
Alternatively, you can also use .saveAsTextFile("path").
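A short sketch (the paths, "myRDD" and the element type MyRecord are placeholders):

myRDD.saveAsObjectFile("hdfs:///user/me/myrdd_obj")   // binary, Java-serialized objects
myRDD.saveAsTextFile("hdfs:///user/me/myrdd_txt")     // plain text, one toString per line
val restored = sc.objectFile[MyRecord]("hdfs:///user/me/myrdd_obj")   // read the object file back later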
structure as it is not synced between workers after it is broadcasted.
To broadcast, your data must be serializable.
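For a small driver-side value the pattern is roughly this (names are placeholders):

val lookup = Map("a" -> 1, "b" -> 2)   // must be serializable
val bc = sc.broadcast(lookup)          // shipped once to each executor, read-only after that
val tagged = wordsRDD.map(w => (w, bc.value.getOrElse(w, 0)))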
If the data you are trying to broadcast is a distributed RDD (and thus, I
assume, large), perhaps what you need is some form of join operation (or
cogroup)?
Regards,
Gylfi
How does that sound? Does this make any sense? :)
Regards,
Gylfi.
You could map over the whole list, always outputting the lower value as the
key, and then use distinct() to remove the duplicate tuples.
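In Spark that could look like this, with distinct() doing the de-duplication:

val pairs = sc.parallelize(Seq((1, 2), (2, 1), (3, 4)))
val deduped = pairs.map { case (a, b) => if (a <= b) (a, b) else (b, a) }.distinct()
// deduped now holds (1, 2) and (3, 4); the reversed (2, 1) is gone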
By default, Spark will actually not keep the data at all; it will just store
"how" to recreate the data.
The programmer can however choose to keep the data once instantiated by
calling .persist() or .cache() on the RDD.
.cache() will store the data in-memory only and fail if it will not fit.
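If memory-only is too strict, persist() also takes an explicit storage level (someRDD/otherRDD are placeholders):

import org.apache.spark.storage.StorageLevel

someRDD.persist(StorageLevel.MEMORY_AND_DISK)   // partitions that do not fit in RAM are spilled to disk
otherRDD.cache()                                // shorthand for persist(StorageLevel.MEMORY_ONLY)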
Hi.
You may want to look into Indexed RDDs
https://github.com/amplab/spark-indexedrdd
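For comparison, the built-in lookup() on a plain pair RDD scans for the key (only the matching partition if the RDD has a known partitioner); IndexedRDD, per its README, adds indexed point lookups (get/multiget) on top of that. A minimal sketch of the built-in call:

val kv = sc.parallelize(Seq((1L, "a"), (2L, "b")))
kv.lookup(2L)   // returns Seq("b")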
Regards,
Gylfi.
Hi.
What is slow exactly?
In the first code version:
When you run persist() + count(), you store the result in RAM.
The map + reduceByKey is then done on the in-memory data.
In the latter case (all in one line) you are doing both steps at the same
time.
So you are saying that if you sum-up the time to
HDFS has a default replication factor of 3
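As a rough check, that means the raw space on HDFS is about 3x the logical data size: 3.8 TB x 3 ≈ 11.4 TB, which is close to the 11.59 TB you are seeing.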
107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
... 13 more"
I have already set the akka.timeout to 300, etc.
Anyone have any ideas on what the problem could be?
Regards,
Gylfi.
"spark.storage.memoryFraction 0.05"
If you want to store a lot of data in memory, I think this must be a higher
fraction.
The default is 0.6 (not 0.0X).
To change the output directory you can set "spark.local.dir=/path/to/dir",
and you can even specify multiple directories (for example if you have
multiple disks).
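In code that would be something like this (the directories are just examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.6")                    // back to the default fraction
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")   // comma-separated list of scratch directories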
Hi.
Your code is like this, right?
"joined_dataset = show_channel.join(show_views)
joined_dataset.take(4)"
Well, the result of .take(4) is a plain array, not an RDD, so it does not
support any RDD operations.
Could that be the problem?
Otherwise more code is needed to understand the issue.
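To illustrate the difference (in Scala syntax, but the same holds in PySpark):

val joined = show_channel.join(show_views)   // join is a transformation: "joined" is still an RDD
val sample = joined.take(4)                  // take() is an action: "sample" is a local array on the driver
// RDD operations (map, filter, join, ...) work on "joined", not on "sample"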
1) Start by looking at ML-lib or KeystoneML
2) If you can't find an impl., start by analyzing the access patterns and
data manipulations you will need to implement.
3) Then figure out if it fits Spark structures.. and when you realize it
doesn't, you start speculating on how you can twist or st
Hi.
Can't you do a filter, to get only the ABC shows, map that into a keyed
instance of the show,
and then do a reduceByKey to sum up the views?
Something like this in Scala code:
// filter for the channel, then make a new pair (show, view count)
val myAnswer = joined_dataset.filter(_._2._1 == "ABC")
  .map { case (show, (channel, views)) => (show, views) }
  .reduceByKey(_ + _)
Can't you just access it by element, like with [0] and [1]?
http://www.tutorialspoint.com/python/python_tuples.htm
Look at KeystoneML; there is an image processing pipeline there.
Depending on how much RAM you have per node, you may want to re-block the
data on HDFS for optimal performance.
Hope this helps,
Gylfi.
-serialization-in-spark/
Perhaps you can use it as a base to write a "back-to-binary" override?
Sorry for not giving a more detailed answer.
Regards,
Gylfi.