RE: HdfsWordCount only counts some of the words

2014-09-23 Thread Liu, Raymond
It should count all the words, so you probably need to post more details on how you run it, plus the logs, output, etc. Best Regards, Raymond Liu

RE: spark.local.dir and spark.worker.dir not used

2014-09-23 Thread Liu, Raymond
When did you check the dir's contents? When the application finishes, those dirs will be cleaned up. Best Regards, Raymond Liu

RE: memory size for caching RDD

2014-09-04 Thread Liu, Raymond
Best Regards, Raymond Liu From: 牛兆捷: Oh I see. I want to implement something like this: sometimes I need to

RE: memory size for caching RDD

2014-09-03 Thread Liu, Raymond
You don't need to. The memory is not statically allocated to the RDD cache; the fraction is just an upper limit. If the RDD cache does not use up that memory, it remains available for other usage, except for the parts also controlled by other memoryFraction settings, e.g. spark.shuffle.memoryFraction, which likewise sets an upper limit
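A minimal sketch of what those limits look like in practice, assuming Spark 1.x and purely illustrative values (not taken from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values: these fractions are upper limits on how much of the
    // executor heap RDD caching and shuffle buffers may use; memory the cache
    // does not actually fill stays available to the tasks themselves.
    val conf = new SparkConf()
      .setAppName("memory-fraction-sketch")
      .set("spark.storage.memoryFraction", "0.6")   // cap for cached RDD blocks
      .set("spark.shuffle.memoryFraction", "0.2")   // cap for shuffle aggregation
    val sc = new SparkContext(conf)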

RE: RDDs

2014-09-03 Thread Liu, Raymond
Actually, a replicated RDD and a parallel job on the same RDD are two unrelated concepts. A replicated RDD just stores its data on multiple nodes; it helps with HA and provides a better chance of data locality. It is still one RDD, not two separate RDDs. As for running two jobs on th
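A rough illustration of the replication side of this, assuming an existing SparkContext sc and a placeholder input path:

    import org.apache.spark.storage.StorageLevel

    // Placeholder input path: the "_2" storage levels keep two replicas of each
    // cached partition on different nodes. That helps fault tolerance and data
    // locality, but it is still one RDD, not two.
    val data = sc.textFile("hdfs:///some/input")
    data.persist(StorageLevel.MEMORY_ONLY_2)
    println(data.count())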

RE: resize memory size for caching RDD

2014-09-03 Thread Liu, Raymond
AFAIK, no. Best Regards, Raymond Liu From: 牛兆捷: Dear all: Spark uses memory to cache RDDs, and the memory size is specified by "spark.storage.memoryFracti

RE: RDDs

2014-09-03 Thread Liu, Raymond
Not sure what you were referring to by "replicated rdd". If you actually mean RDD, then yes, read the API doc and the paper as Tobias mentioned. If your focus is on the word "replicated", then that is for fault tolerance, and is probably mostly used in the streaming case for receiver-created RD
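A small sketch of that receiver case, assuming an existing SparkContext sc; the host, port and batch interval are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A receiver-created stream whose received blocks are replicated to a
    // second node (the "_2" storage levels) for fault tolerance.
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("somehost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()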

RE: how to filter value in spark

2014-08-31 Thread Liu, Raymond
You could use cogroup to combine the RDDs into one RDD for cross-reference processing, e.g. a.cogroup(b).filter{ case (_, (l, r)) => l.nonEmpty && r.nonEmpty }.map{ case (k, (l, r)) => (k, l) } Best Regards, Raymond Liu
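A self-contained sketch of that idea with made-up example data, assuming an existing SparkContext sc:

    import org.apache.spark.SparkContext._

    // Made-up data: keep only the keys of `a` that also appear in `b`,
    // carrying a's values along (cogroup groups both sides by key).
    val a = sc.parallelize(Seq(("apple", 1), ("pear", 2), ("plum", 3)))
    val b = sc.parallelize(Seq(("apple", "x"), ("plum", "y")))

    val joined = a.cogroup(b)
      .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
      .flatMap { case (k, (l, _)) => l.map(v => (k, v)) }

    joined.collect().foreach(println)   // (apple,1) and (plum,3), order may vary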

RE: The concurrent model of spark job/stage/task

2014-08-31 Thread Liu, Raymond
1, 2: As the docs mention, "if they were submitted from separate threads" means, say, you fork threads from your main thread and invoke an action in each of them. Jobs and stages are always numbered in order, but the numbers reflect the order in which they were generated, not necessarily the order in which they execute. In your case, if you just call mu
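A sketch of the "separate threads" part, assuming an existing SparkContext sc and placeholder input paths; the two actions become two independent jobs, numbered by submission order:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Two actions invoked from separate threads become two independent jobs
    // that the scheduler is free to run concurrently.
    val rddA = sc.textFile("hdfs:///input/a")
    val rddB = sc.textFile("hdfs:///input/b")

    val jobA = Future { rddA.count() }   // job 0: numbered by submission order
    val jobB = Future { rddB.count() }   // job 1: may finish before job 0

    println(Await.result(jobA, Duration.Inf))
    println(Await.result(jobB, Duration.Inf))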

RE: What is a Block Manager?

2014-08-26 Thread Liu, Raymond
Best Regards, Raymond Liu From: Victor Tso-Guillen: We're a single-app deployment, so we want to launch as many executors as the system has workers. We accompli

RE: What is a Block Manager?

2014-08-26 Thread Liu, Raymond
Basically, a Block Manager manages the storage for most of the data in Spark; to name a few: blocks that represent cached RDD partitions, intermediate shuffle data, broadcast data, etc. It is per executor, and in standalone mode you normally have one executor per worker. You don't control how m

RE: spark.default.parallelism bug?

2014-08-26 Thread Liu, Raymond
Hi Grzegorz, From my understanding, for the cogroup operation (which is used by intersection), if spark.default.parallelism is not set by the user, it won't bother with the default value; it will use the partition number (the max among all the RDDs in the cogroup operation) to build up a parti
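A quick way to see that behavior, as a sketch with illustrative partition counts and assuming spark.default.parallelism is left unset:

    // intersection() cogroups the two RDDs and builds its partitioner from the
    // larger of the inputs' partition counts rather than from a default value.
    val a = sc.parallelize(1 to 100, 4)
    val b = sc.parallelize(50 to 150, 10)

    val common = a.intersection(b)
    println(common.partitions.length)   // expected: 10 in this setup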

RE: Request for help in writing to Textfile

2014-08-25 Thread Liu, Raymond
You can try to manipulate the string you want to output before saveAsTextFile, something like: modify.flatMap(x => x).map { x => val s = x.toString; s.subSequence(1, s.length - 1) } There should be a more optimized way. Best Regards, Raymond Liu
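A runnable sketch of that approach with made-up data and a placeholder output path, assuming an existing SparkContext sc:

    // Strip the surrounding parentheses that Tuple2.toString adds before
    // writing the records out as text.
    val modify = sc.parallelize(Seq(Seq(("a", 1), ("b", 2)), Seq(("c", 3))))

    modify
      .flatMap(x => x)                    // flatten the nested collections
      .map { x =>
        val s = x.toString                // e.g. "(a,1)"
        s.subSequence(1, s.length - 1)    // drop the outer "(" and ")"
      }
      .saveAsTextFile("hdfs:///output/cleaned")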

RE: About StorageLevel

2014-06-26 Thread Liu, Raymond
I think there is a shuffle stage involved, and the later count job will depend directly on the first job's shuffle stage's output data, as long as it is still available. Thus it will be much faster. Best Regards, Raymond Liu
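A sketch of the situation being described, with made-up data and assuming an existing SparkContext sc:

    import org.apache.spark.SparkContext._

    // The second count() can reuse the first job's shuffle map output while
    // it is still available, even without an explicit cache()/persist().
    val reduced = sc.parallelize(1 to 1000000)
      .map(i => (i % 100, 1))
      .reduceByKey(_ + _)

    reduced.count()   // runs the shuffle map stage and the result stage
    reduced.count()   // typically skips the map stage and reuses its output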

RE: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-06-05 Thread Liu, Raymond
If a task has no locality preference, it will also show up as PROCESS_LOCAL; I think we probably need to name that NO_PREFER to make it clearer. Not sure whether this is your case. Best Regards, Raymond Liu
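For the thread's broader question of when the level drops from PROCESS_LOCAL, the waits between locality levels are configurable; a sketch with illustrative values, assuming 1.x-era settings that take milliseconds:

    import org.apache.spark.SparkConf

    // How long the scheduler waits for a more local slot before falling back
    // PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY.
    val conf = new SparkConf()
      .set("spark.locality.wait", "3000")
      .set("spark.locality.wait.process", "3000")
      .set("spark.locality.wait.node", "3000")
      .set("spark.locality.wait.rack", "3000")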

RE: yarn-client mode question

2014-05-21 Thread Liu, Raymond
It seems you are asking whether the Spark-related jars need to be deployed to the YARN cluster manually before you launch the application? No, you don't, just like any other YARN application. And it doesn't matter whether it is yarn-client or yarn-cluster mode. Best Regards, Raymond Liu

RE: different in spark on yarn mode and standalone mode

2014-05-04 Thread Liu, Raymond
At the core, they are not that different. In standalone mode, you have the Spark master and Spark workers, which allocate the driver and executors for your Spark app, while in YARN mode the YARN resource manager and node managers do this work. Once the driver and executors have been launched, the rest of the res

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
2MB/s, and Kryo doubles that. So it seems to me that, when running the full-path code in my previous case, 32 cores with 50MB/s total throughput is reasonable? Best Regards, Raymond Liu From: Liu, Raymond: Later case, total throughput aggregate

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
I'm no expert. On Tue, Apr 29, 2014 at 10:14 PM, Liu, Raymond wrote: > For all the tasks, say 32 tasks in total. > Best Regards, > Raymond Liu From: Patrick Wendell: > Is this the serialization throug

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
directly instead of reading from HDFS, with a similar throughput result) Best Regards, Raymond Liu From: Liu, Raymond: For all the tasks, say 32 tasks in total. Best Regards, Raymond Liu From: Patrick Wendell

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
For all the tasks, say 32 tasks in total. Best Regards, Raymond Liu From: Patrick Wendell: Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond

How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
Hi, I am running a WordCount program which counts words from HDFS, and I noticed that the serializer part of the code takes a lot of CPU time. On a 16-core/32-thread node, the total throughput is around 50MB/s with JavaSerializer, and if I switch to KryoSerializer, it doubles to around 100-150
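A sketch of roughly this kind of job with the serializer switched to Kryo; the HDFS paths are placeholders:

    import org.apache.spark.SparkContext._
    import org.apache.spark.{SparkConf, SparkContext}

    // The same kind of word count with Kryo used for shuffle serialization
    // instead of the default JavaSerializer.
    val conf = new SparkConf()
      .setAppName("wordcount-kryo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs:///input/text")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///output/counts")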

About pluggable storage roadmap?

2014-04-29 Thread Liu, Raymond
Hi, I noticed that at the Spark 1.0 meetup, the 1.1-and-beyond roadmap mentioned support for pluggable storage strategies. We are also planning similar things, to enable the block manager to store data on more storage media. So is there any existing plan or design or rough idea on this

RE: Shuffle Spill Issue

2014-04-29 Thread Liu, Raymond
one entry per word occurrence. On Tue, Apr 29, 2014 at 7:48 AM, Liu, Raymond wrote: Hi Patrick, I am just doing a simple word count; the data is generated by the Hadoop random text writer. This does not seem to me to be related to compression; if I turn off compression on shuffle, the me

RE: Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
and on-disk shuffle amount is definitely a bit strange; the data gets compressed when written to disk, but unless you have a weird dataset (e.g. all zeros) I wouldn't expect it to compress _that_ much. On Mon, Apr 28, 2014 at 1:18 AM, Liu, Raymond wrote: Hi, I am running a s

RE: questions about debugging a spark application

2014-04-28 Thread Liu, Raymond
If you are using the trunk code, you should be able to configure Spark to use the event log to record the application/task UI contents for the history server, and then check out the application/task details later. Different configs need to be set for standalone mode vs. yarn/mesos mode.
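A minimal sketch of the event-log side of that configuration, with a placeholder log directory:

    import org.apache.spark.SparkConf

    // With event logging on, a history server pointed at the same directory
    // can rebuild the application/task UI after the application finishes.
    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-events")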

Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi, I am running a simple word count program on a Spark standalone cluster. The cluster is made up of 6 nodes, each running 4 workers, and each worker owns 10G of memory and 16 cores, for a total of 96 cores and 240G of memory. (It was also previously configured as 1 worker with 40G memory on each node.)