Re: Using spark.memory.useLegacyMode true does not yield expected behavior

2016-04-11 Thread Tom Hubregtsen
Solved: Call spark-submit with --driver-memory 512m --driver-java-options "-Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2". Thanks to: https://issues.apache.org/jira/browse/SPARK-14367
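
For reference, a minimal Scala sketch of applying the same legacy-mode settings programmatically (the app name is hypothetical; the fraction values are the ones quoted above):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: reproduce the Spark 1.3-style memory layout on Spark 1.6 by
    // switching to the legacy memory manager and restating the old fractions.
    val conf = new SparkConf()
      .setAppName("LegacyMemoryExample")           // hypothetical app name
      .set("spark.memory.useLegacyMode", "true")   // fall back to the pre-1.6 memory manager
      .set("spark.shuffle.memoryFraction", "0.2")  // heap fraction for shuffle aggregation
      .set("spark.storage.memoryFraction", "0.6")  // heap fraction for cached blocks
      .set("spark.storage.unrollFraction", "0.2")  // share of storage used for unrolling blocks
    val sc = new SparkContext(conf)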

Using spark.memory.useLegacyMode true does not yield expected behavior

2016-03-29 Thread Tom Hubregtsen
Hi, I am trying to get the same memory behavior in Spark 1.6 as I had in Spark 1.3 with default settings. I set --driver-java-options "--Dspark.memory.useLegacyMode=true -Dspark.shuffle.memoryFraction=0.2 -Dspark.storage.memoryFraction=0.6 -Dspark.storage.unrollFraction=0.2" in Spark 1.6. But …

50% performance decrease when using local file vs hdfs

2015-07-24 Thread Tom Hubregtsen
…to not use HDFS) * Bonus question: Should I use a different API to get better performance? Thanks for any responses! Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/50-performance-decrease-when-using-local-file-vs-hdfs-tp23987.html
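
For illustration, a small Scala sketch of the two read paths being compared (the input paths are placeholders and sc is an existing SparkContext); the API is the same, only the URI scheme differs:

    // Sketch: same textFile API, different storage backends; paths are placeholders.
    // With file://, the file must be present at that path on every worker node.
    val localRdd = sc.textFile("file:///data/input.txt")

    // With hdfs://, blocks are distributed and reads can be scheduled for locality.
    val hdfsRdd = sc.textFile("hdfs:///data/input.txt")

    println(s"local: ${localRdd.count()}, hdfs: ${hdfsRdd.count()}")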

Re: Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
…avoid confusion :) Best regards, Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Info-from-the-event-timeline-appears-to-contradict-dstat-info-tp23862p23865.html

Info from the event timeline appears to contradict dstat info

2015-07-15 Thread Tom Hubregtsen
…work included in any of these 7 labels? Thanks in advance, Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Info-from-the-event-timeline-appears-to-contradict-dstat-info-tp23862.html

Re: Un-persist RDD in a loop

2015-06-23 Thread Tom Hubregtsen
I believe that as you are not persisting anything into the memory space defined by spark.storage.memoryFraction, you also have nothing to clear from this area using unpersist. FYI: the data will be kept in the OS buffer/on disk at the point of the reduce (as this involves a wide dependency -> …
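
A hypothetical Scala sketch of the loop pattern being discussed (the RDD contents, input path, and iteration count are made up; sc is an existing SparkContext): unpersist() only releases blocks that were explicitly persisted, while the shuffle output from the wide dependency stays on disk / in the OS buffer cache either way.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Hypothetical loop: unpersist() frees only explicitly persisted blocks;
    // the shuffle files written by reduceByKey remain on disk regardless.
    var current: RDD[(String, Int)] = sc.textFile("hdfs:///input").map(w => (w, 1))
    for (i <- 1 to 3) {
      val reduced = current.reduceByKey(_ + _)    // wide dependency -> shuffle to disk
      reduced.persist(StorageLevel.MEMORY_ONLY)   // now there is something to unpersist later
      reduced.count()                             // materialize before dropping the previous RDD
      current.unpersist()                         // no-op on the first pass: it was never persisted
      current = reduced
    }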

PartitionBy/Partitioner for dataFrames?

2015-06-21 Thread Tom Hubregtsen
…only available on pairRDDs; this might have something to do with it.) I am using the Spark master branch. The error: [error] /home/th/spark-1.5.0/spark/IBM_ARL_teraSort_v4-01/src/main/scala/IBM_ARL_teraSort.scala:107: value partitionBy is not a member of org.apache.spark.sql.DataFrame Thanks, …
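
A hedged Scala sketch of one possible workaround (the column name "key", the partition count, and the surrounding df/sqlContext are assumptions): since partitionBy is only defined on pair RDDs, drop to an RDD of key/value pairs, partition it, and rebuild the DataFrame.

    import org.apache.spark.HashPartitioner

    // Hypothetical workaround: DataFrame has no partitionBy, but a pair RDD does.
    // "key" is a made-up column name; df and sqlContext are assumed to exist.
    val pairs = df.rdd.map(row => (row.getAs[Int]("key"), row))
    val partitioned = pairs.partitionBy(new HashPartitioner(200)).values
    val partitionedDf = sqlContext.createDataFrame(partitioned, df.schema)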

DataFrames for non-SQL computation?

2015-06-11 Thread Tom Hubregtsen
I've looked a bit into what DataFrames are, and it seems that most posts on the subject are related to SQL, but it does seem to be very efficient. My main question is: Are DataFrames also beneficial for non-SQL computations? For instance, I want to: - sort k/v pairs (in particular, is the naive v…
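
For comparison, a small Scala sketch of the same sort done both ways (the data is made up; sc and sqlContext are assumed to exist); whether the DataFrame route pays off for non-SQL work is exactly the open question above.

    // Made-up k/v data; sc and sqlContext are assumed to exist.
    val kv = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

    // Plain RDD route: sortByKey on a pair RDD.
    val sortedRdd = kv.sortByKey()

    // DataFrame route: the same data sorted through the DataFrame API.
    import sqlContext.implicits._
    val sortedDf = kv.toDF("key", "value").sort("key")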

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
"I'm not sure, but I wonder if because you are using the Spark REPL that it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle. What happens if you setup your same set of comm

Re: Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
Thanks for the responses. "Try removing toDebugString and see what happens." The toDebugString is performed after [d] (the action), as [e]. By then all stages are already executed.
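
For clarity, a minimal Scala sketch of the ordering described above (the actual chain [a]-[c] is not reproduced here; the input path and operations are placeholders, and sc is an existing SparkContext):

    // Sketch of the ordering only: a shuffle-producing lineage, then the action,
    // then toDebugString ([d] = the action, [e] = the toDebugString call).
    val grouped = sc.textFile("hdfs:///input")   // placeholder input path
      .map(line => (line.length, line))
      .groupByKey()                              // introduces a shuffle boundary
    grouped.count()                              // [d] the action that triggers the job
    println(grouped.toDebugString)               // [e] printed after the stages have run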

Extra stage that executes before triggering computation with an action

2015-04-29 Thread Tom Hubregtsen
…what is running in this Job/stage 0? Thanks, Tom Hubregtsen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extra-stage-that-executes-before-triggering-computation-with-an-action-tp22707.html

Re: Spark TeraSort source request

2015-04-13 Thread Tom Hubregtsen
…ce >> code. >> I've tried to search the Spark User forum archives, seeing requests from >> people, indicating a demand, but did not succeed in finding the actual >> source code. >> >> My question: >> Could you guys please make the source code …

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
…exist, (ii) it > existed but the user could not navigate to it or (iii) it existed but > was not actually a directory. > > So please double-check all that. > > On Mon, Mar 30, 2015 at 5:11 PM, Tom Hubregtsen > wrote: > > Stack trace: > > 15/03/30 17:37:30 INFO storage…

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
…always helps to show the command line you're actually running, and > if there's an exception, the first few frames of the stack trace.) > > On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen > wrote: > > Updated spark-defaults and spark-env: > > "Log directory /hom…

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
Updated spark-defaults and spark-env: "Log directory /home/hduser/spark/spark-events does not exist." (It also did not work with the default /tmp/spark-events.) On 30 March 2015 at 18:03, Marcelo Vanzin wrote: > Are those config values in spark-defaults.conf? I don't think you can > use "~" t…
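
A minimal Scala sketch of the event-log settings with an absolute path (the app name is hypothetical; the directory is the one from the error message above and must exist before the application starts):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: point the event log at an absolute path (no "~"); the directory
    // must already exist before the SparkContext is created.
    val conf = new SparkConf()
      .setAppName("EventLogExample")                                 // hypothetical app name
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "/home/hduser/spark/spark-events")  // absolute path, no "~"
    val sc = new SparkContext(conf)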

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
…1.pdf>. > It is expected to scale sub-linearly; i.e., O(log N), where N is the > number of machines in your cluster. > We evaluated up to 100 machines, and it does follow O(log N) scaling. > > -- > Mosharaf Chowdhury > http://www.mosharaf.com/ > > On Wed, Mar 11, 2015 at…
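
For context, a minimal Scala usage sketch (the lookup table and data are made up; sc is an existing SparkContext); the torrent-style distribution discussed above is what ships the broadcast value to the executors:

    // Made-up broadcast example: the small lookup table is shipped to the
    // executors using the distribution strategy described above.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(k => lookup.value.getOrElse(k, 0))
      .sum()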

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Thanks, Mosharaf, for the quick response! Can you maybe give me some pointers to an explanation of this strategy? Or elaborate a bit more on it? Which parts are involved in which way? Where are the time penalties, and how scalable is this implementation? Thanks again, Tom On 11 March 2015 at 16:01…