Not getting event logs >= spark 1.3.1

2015-06-15 Thread Tsai Li Ming
Hi, I have this in my spark-defaults.conf (same for hdfs): spark.eventLog.enabled true spark.eventLog.dir file:/tmp/spark-events spark.history.fs.logDirectory file:/tmp/spark-events While the app is running, there is a “.inprogress” directory. However when the job complet
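
A minimal sketch of the same event-log settings applied through SparkConf instead of spark-defaults.conf; the application name is illustrative, the directory must already exist on the driver host, and spark.history.fs.logDirectory is omitted because it only matters to the history server:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: enable event logging programmatically (same keys as in the
    // spark-defaults.conf quoted above). Assumes /tmp/spark-events exists
    // and is writable on the driver host.
    val conf = new SparkConf()
      .setAppName("event-log-test")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "file:/tmp/spark-events")

    val sc = new SparkContext(conf)
    sc.parallelize(1 to 100).count()   // run something so an event log is produced
    sc.stop()                          // the .inprogress file is finalized when the app stops cleanly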

Re: Not getting event logs >= spark 1.3.1

2015-06-16 Thread Tsai Li Ming
Forgot to mention this is in standalone mode. Is my configuration wrong? Thanks, Liming On 15 Jun, 2015, at 11:26 pm, Tsai Li Ming wrote: > Hi, > > I have this in my spark-defaults.conf (same for hdfs): > spark.eventLog.enabled true > spark.eventLog.dir

Issues building 1.4.0 using make-distribution

2015-06-17 Thread Tsai Li Ming
Hi, I downloaded the source from the Downloads page and ran the make-distribution.sh script. # ./make-distribution.sh --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package The script has “-x” set at the beginning. ++ /tmp/a/spark-1.4.0/build/mvn help:evaluate -Dexpression=project.ve

Documentation for external shuffle service in 1.4.0

2015-06-17 Thread Tsai Li Ming
Hi, I can’t seem to find any documentation on this feature in 1.4.0? Regards, Liming
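
For context, the external shuffle service is switched on with spark.shuffle.service.enabled and, in the 1.4.0 docs, is described mainly alongside dynamic allocation in the job-scheduling page. A minimal sketch of the related settings (dynamic allocation is the usual companion feature, not a requirement):

    import org.apache.spark.SparkConf

    // Sketch: enable the external shuffle service; executors then fetch
    // shuffle files from the service instead of from each other, which is
    // what dynamic allocation relies on when it removes executors.
    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.enabled", "true")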

The usage of OpenBLAS

2015-06-26 Thread Tsai Li Ming
Hi, I found out that the instructions for OpenBLAS have been changed by the author of netlib-java in: https://github.com/apache/spark/pull/4448 since Spark 1.3.0 In that PR, I asked whether there’s still a need to compile OpenBLAS with USE_THREAD=0, and also about Intel MKL. Is it still applica
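
One way to see which backend netlib-java actually loaded is to ask it directly; a small sketch, assuming Spark was built with the -Pnetlib-lgpl profile (or netlib-java is otherwise on the classpath) and run from spark-shell:

    import com.github.fommil.netlib.{BLAS, LAPACK}

    // Prints the implementation class that netlib-java resolved at runtime,
    // e.g. NativeSystemBLAS (system OpenBLAS/MKL), NativeRefBLAS, or the
    // pure-Java F2jBLAS fallback.
    println("BLAS:   " + BLAS.getInstance().getClass.getName)
    println("LAPACK: " + LAPACK.getInstance().getClass.getName)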

Re: Confused why I'm losing workers/executors when writing a large file to S3

2015-01-21 Thread Tsai Li Ming
I’m getting the same issue on Spark 1.2.0. Despite having set “spark.core.connection.ack.wait.timeout” in spark-defaults.conf and verified in the job UI (port 4040) environment tab, I still get the “no heartbeat in 60 seconds” error. spark.core.connection.ack.wait.timeout=3600 15/01/22 07:29:
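
For reference, a minimal sketch of the same timeout bump applied programmatically rather than via spark-defaults.conf; the value (in seconds, as in the post) is illustrative:

    import org.apache.spark.SparkConf

    // Sketch: raise the connection ack timeout for the application. Per the
    // post, the value also shows up in the Environment tab of the UI on
    // port 4040 once the job is running.
    val conf = new SparkConf()
      .set("spark.core.connection.ack.wait.timeout", "3600")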

Re: Logstash as a source?

2015-02-01 Thread Tsai Li Ming
I have been using a logstash alternative - fluentd to ingest the data into hdfs. I had to configure fluentd to not append the data so that spark streaming will be able to pick up the new logs. -Liming On 2 Feb, 2015, at 6:05 am, NORD SC wrote: > Hi, > > I plan to have logstash send log even
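
A small sketch of the consuming side of that setup, assuming the fluentd output directory on HDFS is /data/fluentd/incoming (illustrative) and the job is submitted with a master set elsewhere. textFileStream only notices files that newly appear in the directory, which is why appending to existing files does not work for this pattern:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: watch an HDFS directory that fluentd writes completed log
    // files into, in 30-second batches.
    val conf = new SparkConf().setAppName("log-ingest")
    val ssc = new StreamingContext(conf, Seconds(30))

    val logs = ssc.textFileStream("hdfs:///data/fluentd/incoming")
    logs.count().print()           // number of new log lines per batch

    ssc.start()
    ssc.awaitTermination()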

Re: data locality

2014-07-25 Thread Tsai Li Ming
Hi, In the standalone mode, how can we check data locality is working as expected when tasks are assigned? Thanks! On 23 Jul, 2014, at 12:49 am, Sandy Ryza wrote: > On standalone there is still special handling for assigning tasks within > executors. There just isn't special handling for w
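
As a rough way to probe this, the per-task locality level (PROCESS_LOCAL / NODE_LOCAL / RACK_LOCAL / ANY) is shown in the stage detail page of the web UI; the scheduler-side knobs are the spark.locality.wait settings. A hedged sketch, with illustrative millisecond values:

    import org.apache.spark.SparkConf

    // Sketch: how long the scheduler waits for a slot at a better locality
    // level before falling back to a worse one. These only influence
    // scheduling; the achieved level is what the UI reports per task.
    val conf = new SparkConf()
      .set("spark.locality.wait", "3000")        // default wait, in ms
      .set("spark.locality.wait.node", "3000")   // node-local specific override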

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Tsai Li Ming
Another observation I had was that when reading from the local filesystem with “file://”, the locality was stated as PROCESS_LOCAL, which was confusing. Regards, Liming On 13 Sep, 2014, at 3:12 am, Nicholas Chammas wrote: > Andrew, > > This email was pretty helpful. I feel like this stuff should be summarized in >

RDD memory and storage level option

2014-11-20 Thread Tsai Li Ming
Hi, This is on version 1.1.0. I did a simple test on the MEMORY_AND_DISK storage level. > var file = > sc.textFile(“file:///path/to/file.txt”).persist(StorageLevel.MEMORY_AND_DISK) > file.count() The file is 1.5GB and there is only 1 worker. I have requested 1GB of worker memory per node:
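
A self-contained version of that test as it would be typed into spark-shell (sc is the shell's SparkContext; the path is illustrative), with the import the snippet needs and the persist() spelling fixed:

    import org.apache.spark.storage.StorageLevel

    // Sketch: MEMORY_AND_DISK keeps what fits in memory and spills the
    // remaining partitions to local disk instead of recomputing them.
    val file = sc.textFile("file:///path/to/file.txt")
      .persist(StorageLevel.MEMORY_AND_DISK)

    file.count()   // first action materializes and caches the RDD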

Understanding stages in WebUI

2014-11-25 Thread Tsai Li Ming
Hi, I have the classic word count example: > file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + > _).collect() From the Job UI, I can only see 2 stages: 0-collect and 1-map. What happened to ShuffledRDD in reduceByKey? And both flatMap and map operations are collapsed i
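
One way to see where the ShuffledRDD went is to print the lineage: reduceByKey introduces a shuffle boundary, so flatMap and map are pipelined into the stage before it, and the ShuffledRDD appears inside the lineage rather than as a separately named stage. A sketch for spark-shell (sc is the shell's SparkContext; the path is illustrative):

    // Sketch: build the same word count and inspect its lineage.
    val counts = sc.textFile("file:///path/to/file.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    println(counts.toDebugString)   // shows the ShuffledRDD and its parent RDDs
    counts.collect()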

Spark temp dir (spark.local.dir)

2014-03-13 Thread Tsai Li Ming
Hi, I'm confused about the -Dspark.local.dir and SPARK_WORKER_DIR (--work-dir). What's the difference? I have set -Dspark.local.dir for all my worker nodes but I'm still seeing directories being created in /tmp when the job is running. I have also tried setting -Dspark.local.dir when I run the
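
For context, the two settings serve different purposes: SPARK_WORKER_DIR (--work-dir) is where a standalone worker keeps per-application work directories (logs, jars), while spark.local.dir is where shuffle output and spilled data are written. Per the replies later in this archive, workers normally take spark.local.dir from their own spark-env.sh; a driver-side setting is passed on to executors for that job. A minimal sketch of the driver-side form, with an illustrative path:

    import org.apache.spark.SparkConf

    // Sketch: scratch directory for shuffle and spill files for this
    // application. The path must exist and be writable on every node.
    val conf = new SparkConf()
      .set("spark.local.dir", "/data/spark-scratch")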

Re: Spark temp dir (spark.local.dir)

2014-03-13 Thread Tsai Li Ming
>> spark.local.dir can and should be set both on the executors and on the >> driver (if the driver broadcast variables, the files will be stored in this >> directory) Do you mean the worker nodes? Don’t think they are jetty connectors and the directories are empty: /tmp/spark-3e330cdc-7540-4313-

JVM memory in local threading (SparkLR example)

2014-03-13 Thread Tsai Li Ming
Hi, Couple of questions here: 0. I modified SparkLR.scala to change the N (# of data points) and D (# of dimensions), and ran it with: # bin/run-example -Dspark.executor.memory=40g org.apache.spark.examples.SparkLR local[23] 500 And here’s the process table: /net/home/ltsai/jdk1.7.0_51/bin/jav

Configuring shuffle write directory

2014-03-22 Thread Tsai Li Ming
Hi, Each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always written to /tmp despite being set when the worker node is started. Specifying spark.local.dir for the driver program seems to override the executors’ setting? Is there

Kmeans example reduceByKey slow

2014-03-23 Thread Tsai Li Ming
Hi, At the reduceByKey stage, it takes a few minutes before the tasks start working. I have -Dspark.default.parallelism=127 cores (n-1). CPU/Network/IO is idling across all nodes when this is happening. And there is nothing particular in the master log file. From the spark-shell: 14/03/23 1
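
For context, a hedged sketch of the k-means call using the later MLlib API (the thread is on 0.9, where the interface differs slightly): parse the points, repartition explicitly rather than relying only on spark.default.parallelism, and cache before the iterative run. Path, k, and iteration count are illustrative, and sc is the spark-shell SparkContext:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Sketch: space-separated numeric features, one point per line.
    val points = sc.textFile("file:///path/to/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .repartition(127)   // match the parallelism the post aims for
      .cache()

    val model = KMeans.train(points, 50, 10)   // k = 50, maxIterations = 10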

Re: Kmeans example reduceByKey slow

2014-03-23 Thread Tsai Li Ming
> Xiangrui > > On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming wrote: >> Hi, >> >> At the reduceByKey stage, it takes a few minutes before the tasks start >> working. >> >> I have -Dspark.default.parallelism=127 cores (n-1). >> >> CPU/Net

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
n, Mar 23, 2014 at 11:53 PM, Tsai Li Ming wrote: >> Hi, >> >> This is on a 4 nodes cluster each with 32 cores/256GB Ram. >> >> (0.9.0) is deployed in a stand alone mode. >> >> Each worker is configured with 192GB. Spark executor memory is also 192GB.

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
> the initialization stage. If your data is sparse, the latest change to > KMeans will help with the speed, depending on how sparse your data is. > -Xiangrui > > On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming wrote: >> Thanks, let me try with a smaller K. >> >

Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
Anyone can help? How can I configure a different spark.local.dir for each executor? On 23 Mar, 2014, at 12:11 am, Tsai Li Ming wrote: > Hi, > > Each of my worker node has its own unique spark.local.dir. > > However, when I run spark-shell, the shuffle writes are always wri

Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
conf/spark-env.sh on those workers. > > Matei > > On Mar 27, 2014, at 9:04 PM, Tsai Li Ming wrote: > >> Anyone can help? >> >> How can I configure a different spark.local.dir for each executor? >> >> >> On 23 Mar, 2014, at 12:11 am, Tsai L

Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Tsai Li Ming
Hi, My worker nodes have more memory than the host that I’m submitting my driver program, but it seems that SPARK_MEM is also setting the Xmx of the spark shell? $ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x7f
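
For context, later Spark releases split the two concerns: spark.executor.memory controls the executor heaps on the workers, while the driver or shell heap is set separately when that JVM is launched (it cannot be raised from inside an already-running shell). A minimal sketch with an illustrative value:

    import org.apache.spark.SparkConf

    // Sketch: give only the executors the large heap; the driver/shell JVM
    // keeps whatever -Xmx it was started with.
    val conf = new SparkConf()
      .set("spark.executor.memory", "100g")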

Re: Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Tsai Li Ming
rsion of Spark. > > > On Thu, Mar 27, 2014 at 10:48 PM, Tsai Li Ming wrote: > Hi, > > My worker nodes have more memory than the host that I’m submitting my driver > program, but it seems that SPARK_MEM is also setting the Xmx of the spark > shell? > > $ SPARK_MEM=

Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
far as I can tell, spark.local.dir should *not* > be set there, so workers should get it from their spark-env.sh. It’s true > that if you set spark.local.dir in the driver it would pass that on to the > workers for that job. > > Matei > > On Mar 27, 2014, at 9:57 PM, Tsai Li Ming

Re: WikipediaPageRank Data Set

2014-03-29 Thread Tsai Li Ming
I’m interested in obtaining the data set too. Thanks! On 27 Mar, 2014, at 9:45 pm, Niko Stahl wrote: > Hello, > > I would like to run the WikipediaPageRank example, but the Wikipedia dump XML > files are no longer available on Freebase. Does anyone know an alternative > source for the data? >

Hadoop LR comparison

2014-03-31 Thread Tsai Li Ming
Hi, Is the code available for Hadoop to calculate the Logistic Regression hyperplane? I’m looking at the Examples: http://spark.apache.org/examples.html, where there is the 110s vs 0.9s in Hadoop vs Spark comparison. Thanks!
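
The Spark half of that comparison is an iterative logistic-regression gradient loop; the Hadoop figure comes from re-running an equivalent gradient step as a full MapReduce job per iteration. A hedged, self-contained sketch of the Spark side for spark-shell (sc is the shell's SparkContext; the file layout, dimension count, and iteration count are illustrative, and plain arrays are used instead of the examples' Vector class):

    import scala.math.exp
    import scala.util.Random

    case class Point(x: Array[Double], y: Double)

    val D = 10                                   // number of feature dimensions
    val points = sc.textFile("file:///path/to/lr_data.txt").map { line =>
      val parts = line.split(' ').map(_.toDouble)
      Point(parts.tail, parts.head)              // first column: label, rest: features
    }.cache()                                    // caching is what makes iterations cheap

    var w = Array.fill(D)(Random.nextDouble())   // random initial weights

    for (i <- 1 to 10) {                         // 10 gradient-descent iterations
      val gradient = points.map { p =>
        val dot   = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }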

Re: Hadoop LR comparison

2014-03-31 Thread Tsai Li Ming
-- > Web: http://alpinenow.com/ > > > On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming wrote: > Hi, > > Is the code available for Hadoop to calculate the Logistic Regression > hyperplane? > > I’m looking at the Examples: > http: