Re: rdd cache name

2016-03-02 Thread charles li
y cache size or cache off-heap or to disk. > > Xinh > > On Wed, Mar 2, 2016 at 1:48 AM, charles li > wrote: > >> hi, there, I feel a little confused about the *cache* in spark. >> >> first, is there any way to *customize the cached RDD name*, it's not >

is there any way to make the Web UI auto-refresh?

2016-03-15 Thread charles li
every time I can only get the latest info by refreshing the page, which is a little tedious. so is there any way to make the Web UI auto-refresh? great thanks -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao
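
The page itself does not auto-refresh, but since Spark 1.4 the same data is exposed as JSON under /api/v1 on the UI port, so a small poller is one workaround. A minimal sketch, assuming the driver UI is at localhost:4040 and a single running application:
---
import json
import time
from urllib.request import urlopen  # on Python 2, use urllib2.urlopen instead

UI = "http://localhost:4040"  # assumption: driver UI on the default port

def get_json(url):
    return json.loads(urlopen(url).read().decode("utf-8"))

while True:
    app_id = get_json(UI + "/api/v1/applications")[0]["id"]
    for job in get_json("%s/api/v1/applications/%s/jobs" % (UI, app_id)):
        print(job["jobId"], job["status"], job["numCompletedTasks"], "/", job["numTasks"])
    time.sleep(5)
---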

the "DAG Visualiztion" in 1.6 not works fine here

2016-03-15 Thread charles li
sometimes it just shows several *black dots*, and sometimes it cannot show the entire graph. has anyone met this before, and how did you fix it? -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

best way to do deep learning on Spark?

2016-03-19 Thread charles li
Hi, guys, I'm new to MLlib on Spark. after reading the documentation, it seems that MLlib does not support deep learning. I want to know: is there any way to implement deep learning on Spark? *Must I use a third-party package like Caffe or TensorFlow?* or *is a deep learning module listed in the MLlib de

Re: best way to do deep learning on Spark?

2016-03-19 Thread charles li
layers, etc. are > currently under development. Please refer to > https://issues.apache.org/jira/browse/SPARK-5575 > > > > Best regards, Alexander > > > > *From:* charles li [mailto:charles.up...@gmail.com] > *Sent:* Wednesday, March 16, 2016 7:01 PM > *To:* user
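
For later readers: the multilayer perceptron classifier tracked in SPARK-5575 has since landed in spark.ml (Scala in 1.5, Python in 1.6). A minimal sketch of the Python API in a pyspark shell (sqlContext predefined; the toy data and layer sizes are illustrative):
---
from pyspark.mllib.linalg import Vectors  # note: moves to pyspark.ml.linalg in 2.0
from pyspark.ml.classification import MultilayerPerceptronClassifier

# toy XOR-style data; "label" and "features" are the default column names
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense([0.0, 0.0])),
    (1.0, Vectors.dense([0.0, 1.0])),
    (1.0, Vectors.dense([1.0, 0.0])),
    (0.0, Vectors.dense([1.0, 1.0]))], ["label", "features"])

# layers = [input size, hidden layer size, number of classes]
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[2, 5, 2], seed=42)
model = mlp.fit(df)
model.transform(df).select("features", "prediction").show()
---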

Re: best practices: running a multi-user Jupyter notebook server

2016-03-20 Thread charles li
Hi, Andy, I think you can do that with some open source packages/libs built for IPython and Spark. here is one: https://github.com/litaotao/IPython-Dashboard On Thu, Mar 17, 2016 at 1:36 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > We are considering deploying a notebook serve

what happens if you cache an RDD multiple times?

2016-03-24 Thread charles li
I happened to see this problem on Stack Overflow: http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812 I think it's very interesting, and the answer posted by Aaron sounds promising, but I'm not sure, and I can't find the details o

Re: what happens if you cache an RDD multiple times?

2016-03-24 Thread charles li
age > */ > private[spark] def persistRDD(rdd: RDD[_]) { > persistentRdds(rdd.id) = rdd > } > > Hope this helps. > > Best > Yash > > On Thu, Mar 24, 2016 at 1:58 PM, charles li > wrote: > >> >> happened to see this problem on stackoverflow: >&g
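
In other words, persistentRdds is a map keyed by rdd.id, so caching the same RDD twice just rewrites the same entry; no second copy is stored. A quick way to see this from pyspark (a sketch, not the full answer):
---
rdd = sc.parallelize(range(100))
rdd.cache()
rdd.cache()                   # idempotent: same rdd.id, same map entry
print(rdd.getStorageLevel())  # one storage level, set once
rdd.count()                   # materializes the one and only cached copy
---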

since Spark cannot parallelize/serialize functions, how can I distribute algorithms on the same data?

2016-03-28 Thread charles li
use case: I have a dataset and want to run different algorithms on it and fetch the results. to do this, I think I should distribute my algorithms and run them on the dataset at the same time. am I right? but it seems that Spark cannot parallelize/serialize algorithms/function

Re: since Spark cannot parallelize/serialize functions, how can I distribute algorithms on the same data?

2016-03-28 Thread charles li
robably want to look at the map transformation, and the many more >> defined on RDDs. The function you pass in to map is serialized and the >> computation is distributed. >> >> >> On Monday, March 28, 2016, charles li wrote: >> >>> >>> use case: h
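
To make the reply concrete: Spark does serialize the functions you pass to transformations (cloudpickle handles this in PySpark), so one pattern is to cache the dataset once and apply each algorithm to it. A hedged sketch with stand-in algorithms:
---
data = sc.parallelize(range(1000)).cache()  # one shared, cached dataset

# stand-in "algorithms": any callable mapping an RDD to a result
algorithms = {
    "sum_of_squares": lambda rdd: rdd.map(lambda x: x * x).sum(),
    "max_value":      lambda rdd: rdd.max(),
}

results = {name: algo(data) for name, algo in algorithms.items()}
print(results)
---
To run them literally at the same time rather than one after another, each job can be submitted from its own driver-side thread; the Spark scheduler accepts concurrent jobs.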

confusion about the Spark SQL JSON format

2016-03-31 Thread charles li
as this post says, in Spark we can load a JSON file in the way below: *post*: https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html --- sqlContext.jsonFile(fil

Re: confusion about the Spark SQL JSON format

2016-03-31 Thread charles li
--- On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY wrote: > Hi, > Look at below image which is from json.org : > > [image: Inline image 1] > > The above image describes the object formulation of below JSON: > > Object 1=> {"name":"Yin", &
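
The usual trip-up here is that jsonFile / read.json expects one complete JSON object per line (JSON Lines), not one pretty-printed document spanning many lines. A small sketch reusing the blog post's records:
---
# each line is a self-contained object; a multi-line, pretty-printed file
# comes back as a single _corrupt_record column instead of a real schema
with open("/tmp/people.json", "w") as f:
    f.write('{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}\n')
    f.write('{"name":"Michael","address":{"city":null,"state":"California"}}\n')

df = sqlContext.read.json("/tmp/people.json")  # jsonFile() is the older spelling
df.printSchema()
---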

question about Reynold's talk: "The Future of Real Time"

2016-04-22 Thread charles li
hi, there, the talk *The Future of Real Time in Spark* here https://www.youtube.com/watch?v=oXkxXDG0gNk mentions "BI app integration" at 24:28 of the video. what does he mean by *BI app integration* in that talk? does that mean they will develop a BI tool like Zeppelin or Hue

Preview release of Spark 2.0

2016-05-29 Thread charles li
Here is the link: http://spark.apache.org/news/spark-2.0.0-preview.html congrats, haha, looking forward to 2.0.1, awesome project. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

Is there a way to dynamically load files [parquet or csv] in the map function?

2016-07-08 Thread charles li
hi, guys, is there a way to dynamically load files within the map function? i.e., can I code as below: [inline code screenshot] thanks a lot. -- *___* Quant | Engineer | Boy *___* *blog*: http://litaotao.github.io *github*: www.github.com/litaotao
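
One caveat up front: SparkContext and sqlContext live on the driver, so sc.textFile or read.parquet cannot be called inside map. Plain Python I/O inside mapPartitions does work, provided every worker can reach the file. A hedged sketch (the path and lookup format are invented for illustration):
---
import csv

def join_with_lookup(rows):
    # runs on the executors; /shared/lookup.csv is a hypothetical path that must be
    # readable from every worker (e.g. NFS, or shipped with --files)
    with open("/shared/lookup.csv") as f:
        table = dict(csv.reader(f))  # {key: value}
    for row in rows:
        yield (row, table.get(row))

result = sc.parallelize(["a", "b", "c"]).mapPartitions(join_with_lookup).collect()
---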

Re: Spark performance testing

2016-07-08 Thread charles li
Hi, Andrew, I found lots of material when asking Google for "*spark performance test*" - https://github.com/databricks/spark-perf - https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf - http://people.cs.vt.edu/~butt

Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-11 Thread charles li
cache uses the default storage level of persist, and it is lazy [nothing is actually cached] until the first time the RDD is computed. On Tue, Jan 12, 2016 at 5:13 AM, ponkin wrote: > Hi, > > Here is my use case : > I have kafka topic. The job is fairly simple - it reads topic and save > data to several hd
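
So when the data should actually be resident before reuse, the usual pattern is to follow cache() with a cheap action. A minimal sketch (the input path is hypothetical):
---
rdd = sc.textFile("hdfs:///some/topic/dump")
rdd.cache()   # lazy: only marks the RDD for persistence
rdd.count()   # first action computes the RDD and fills the cache
rdd.count()   # later actions read from the cached blocks
---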

rdd.foreach return value

2016-01-18 Thread charles li
code snippet: [inline image]. the 'print' actually prints info on the worker node, but I feel confused about where the 'return' value goes, since I get nothing on the driver node. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

Re: rdd.foreach return value

2016-01-18 Thread charles li
the way shown in the snippet. > > On Mon, Jan 18, 2016 at 7:34 PM, charles li > wrote: > >> code snippet >> >> >> ​ >> the 'print' actually print info on the worker node, but I feel confused >

Re: rdd.foreach return value

2016-01-18 Thread charles li
Unit = withScope { > > I don't think you can return element in the way shown in the snippet. > > On Mon, Jan 18, 2016 at 7:34 PM, charles li > wrote: > >> code snippet >> >> >> ​ >> the 'p

Re: rdd.foreach return value

2016-01-18 Thread charles li
s and calls the function being > passed. That's it. It doesn't collect the values and don't return any new > modified RDD. > > > On Mon, Jan 18, 2016 at 11:10 PM, charles li > wrote: > >> >> hi, great thanks to david and ted, I know that the content o
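
Summing the thread up in code: foreach runs purely for its side effects on the executors and returns nothing to the driver; to get values back, use a transformation plus an action such as collect. A sketch:
---
rdd = sc.parallelize([1, 2, 3])

def show(x):
    print(x)      # appears in the executor's stdout, not the driver console

rdd.foreach(show)                              # returns None on the driver
doubled = rdd.map(lambda x: x * 2).collect()   # values come back to the driver
print(doubled)                                 # [2, 4, 6]
---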

best practice: how to manage your Spark cluster?

2016-01-20 Thread charles li
I posted a thread before: pre-install third-party Python packages on the Spark cluster. currently I use *Fabric* to manage my cluster, but it's not enough for me, and I believe there is a much better way to *manage and monitor* the cluster. I believe there really exist some open source management tools whic

confusion about starting an IPython notebook with Spark between 1.3.x and 1.6.x

2016-01-31 Thread charles li
I used to use Spark 1.3.x and explore my data in an IPython [3.2] notebook, which was very stable. but I came across an error "Java gateway process exited before sending the driver its port number". my code is as below: ``` import pyspark from pyspark import SparkConf sc_conf = SparkCon
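
This error usually means the JVM launcher never started at all. One setup that commonly fixes it on 1.4+ is to set PYSPARK_SUBMIT_ARGS before creating the context; a hedged sketch (assumes SPARK_HOME is set and pyspark is importable; the master URL is illustrative):
---
import os
# the trailing "pyspark-shell" token is required when launching from plain Python
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[4] pyspark-shell"

from pyspark import SparkConf, SparkContext
sc_conf = SparkConf().setAppName("notebook")
sc = SparkContext(conf=sc_conf)
print(sc.version)
---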

how to introduce Spark to your colleague if he has no background in anything Spark-related

2016-01-31 Thread charles li
*Apache Spark™* is a fast and general engine for large-scale data processing. it's a good profile of Spark, but it's really too short for lots of people if they have little background in this field. ok, frankly, I'll give a tech talk about Spark later this week, and now I'm writing a slide about

questions about progress bar status [stuck]?

2016-02-01 Thread charles li
code:
---
total = int(1e8)
local_collection = range(1, total)
rdd = sc.parallelize(local_collection)
res = rdd.collect()
---
web ui status
---
[inline screenshot]
---
problems:
---
1. from the status bar, it seems that about half the tasks should be done, but it just says there is no

rdd cache priority

2016-02-04 Thread charles li
say I have 2 RDDs, RDD1 and RDD2, both 20G in memory, and I cache both of them using RDD1.cache() and RDD2.cache(). then in the further steps of my app, I never use RDD1 but use RDD2 lots of times. then here is my question: if there is only 40G memory in my cluster, and here I
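
As far as I know there is no cache-priority knob: cached blocks are simply evicted in LRU order when storage memory runs short, which already favors the recently used RDD2. If RDD1 is truly dead weight, freeing it explicitly is more predictable. A sketch:
---
RDD1.unpersist()  # drop RDD1's blocks now instead of waiting for LRU eviction
RDD2.cache()
RDD2.count()      # RDD2 can now use the freed storage memory
---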

spark.executor.memory: is it used just for cached RDDs, or for both cached RDDs and the tasks running on the worker?

2016-02-04 Thread charles li
if I set spark.executor.memory = 2G for each worker [10 in total], does it mean I can cache 20G of RDDs in memory? if so, how about the memory for the code running in each process on each worker? thanks. -- and are there any materials about memory management or resource management in Spark? I want to p
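
Roughly: no. spark.executor.memory is the whole heap of each executor, and only a slice of it holds cached blocks; before 1.6 that slice is spark.storage.memoryFraction (default 0.6, times a 0.9 safety factor), and from 1.6 on storage and execution share a unified region sized by spark.memory.fraction. A hedged pre-1.6 sketch of the arithmetic:
---
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.memory", "2g")
        # pre-1.6 defaults: ~2g * 0.6 * 0.9 = ~1.1g per executor for cached RDDs,
        # so 10 executors give roughly 11G of cache capacity, not 20G
        .set("spark.storage.memoryFraction", "0.6"))
---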

how to interview spark developers

2016-02-23 Thread charles li
hi, there, we are going to recruit several Spark developers. can someone give some ideas on interviewing candidates, say, Spark-related problems? great thanks. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-29 Thread charles li
since Spark is under active development, any book you pick up will be somewhat outdated to some degree. I would suggest learning it in several ways, as below: - the official Spark documentation; trust me, you will go through it several times if you want to learn it well: http://spark.

rdd cache name

2016-03-02 Thread charles li
hi, there, I feel a little confused about the *cache* in Spark. first, is there any way to *customize the cached RDD name*? it's not convenient for me when looking at the storage page: there is only the kind of RDD in the RDD Name column, and I hope to make it my customized name instead of kinds of 'rdd 1', 'rrd
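
On the naming question: RDD.setName does exactly this, in both the Scala and Python APIs. A minimal sketch:
---
rdd = sc.parallelize(range(100))
rdd.setName("user events 2016-03-02")  # this label replaces the default in the Storage tab
rdd.cache()
rdd.count()                            # materialize so it appears on the storage page
---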

Questions about disk IOs

2014-07-01 Thread Charles Li
Hi Spark, I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G. The weight size is around 1,300,000. I am quite confused that the performance is nearly the same whether the data is cached or not. The program is simple: points = sc.hadoopFile(int, SequenceFileInput

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
own On Jul 2, 2014, at 0:08, Xiangrui Meng wrote: > Try to reduce number of partitions to match the number of cores. We > will add treeAggregate to reduce the communication cost. > > PR: https://github.com/apache/spark/pull/1110 > > -Xiangrui > > On Tue, Jul 1, 2014 at

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
any partitions did you use and how many CPU cores in total? The > former shouldn't be much larger than the latter. Could you also check > the shuffle size from the WebUI? -Xiangrui > > On Fri, Jul 25, 2014 at 4:10 AM, Charles Li wrote: >> Hi Xiangrui, >> >> Thanks fo
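
Putting Xiangrui's advice into code: keep the partition count close to the total core count, and force the cache before timing the LBFGS iterations. A hedged sketch (the path, parser, and core count are placeholders):
---
num_cores = 64  # placeholder: total cores in the cluster

def parse_point(kv):
    # stand-in parser: real code would deserialize the record value here
    key, value = kv
    return value

points = (sc.sequenceFile("hdfs:///user/data/points")
            .map(parse_point)
            .coalesce(num_cores)   # partitions should not far exceed core count
            .cache())
points.count()  # materialize the cache before the iterative passes
---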

Re: Snappy error when driver is running in JBoss

2015-01-06 Thread Charles Li
Hi, thanks for the reply! I did an echo $CLASSPATH, but I got nothing. Since we are running inside JBoss, I guess the classpath is not set? People did mention that JBoss loads snappy-java multiple times, but I cannot find a way to solve that problem. Cheers On Jan 6, 2015, at 5:3