Re: Using spark streaming to load data from Kafka to HDFS

2015-08-22 Thread Xu (Simon) Chen
Last time I checked, Camus doesn't support storing data as parquet, which is a deal breaker for me. Otherwise it works well for my Kafka topics with low data volume. I am currently using spark streaming to ingest data, generate semi-realtime stats and publish to a dashboard, and dump full dataset
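
A minimal sketch of that kind of streaming ingest path, assuming Spark 1.4+ with spark-streaming-kafka on the classpath; the Event schema, ZooKeeper address, topic, consumer group, and output path are all illustrative, not the poster's actual setup:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

case class Event(ts: Long, key: String, value: String)   // hypothetical record layout

val sc = new SparkContext(new SparkConf().setAppName("kafka-to-parquet"))
val ssc = new StreamingContext(sc, Seconds(60))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Receiver-based Kafka stream; zkQuorum, consumer group, and topic map are placeholders.
val stream = KafkaUtils.createStream(ssc, "zk1:2181", "hdfs-dumper", Map("events" -> 1))

stream.foreachRDD { (rdd, time) =>
  // Parse each message into the case class and write one parquet directory per batch.
  val events = rdd.map { case (_, line) =>
    val f = line.split(",")
    Event(f(0).toLong, f(1), f(2))
  }.toDF()
  events.write.parquet(s"hdfs://test/events/batch-${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()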

Re: Computing mean and standard deviation by key

2014-08-01 Thread Xu (Simon) Chen
I meant not sure how to do variance in one shot :-) With mean in hand, you can obviously broadcast the variable, and do another map/reduce to calculate variance per key. On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen wrote: > val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKe
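
A sketch of doing both mean and variance per key in a single reduceByKey pass, using the standard sum / sum-of-squares trick; the sample data and key/value types are made up for illustration and assume the RDD has already been mapped down to (key, numeric value) pairs:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("stats-by-key"))
val rdd = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 4.0)))

// Accumulate (sum, sum of squares, count) per key in one pass,
// then derive mean and population standard deviation per key locally.
val stats = rdd
  .map { case (k, v) => (k, (v, v * v, 1L)) }
  .reduceByKey { case ((s1, q1, n1), (s2, q2, n2)) => (s1 + s2, q1 + q2, n1 + n2) }
  .mapValues { case (s, q, n) =>
    val mean = s / n
    val variance = q / n - mean * mean
    (mean, math.sqrt(variance))
  }

stats.collect().foreach(println)   // (key, (mean, stddev)) per key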

Re: Computing mean and standard deviation by key

2014-08-01 Thread Xu (Simon) Chen
val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x,y) => (x._1+y._1, x._2+y._2)).collect This gives you a list of (key, (tot, count)), from which you can easily calculate the mean. Not sure about variance. On Fri, Aug 1, 2014 at 2:55 PM, kriskalish wrote: > I have what seems like a relati

Re: access hdfs file name in map()

2014-08-01 Thread Xu (Simon) Chen
Hi Roberto, Ultimately, the info you need is set here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69 Being a spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as HadoopRDDWithEnv, which takes in an additional parameter (varnam
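
For reference, a sketch of a related approach that avoids subclassing: newer Spark releases expose HadoopRDD.mapPartitionsWithInputSplit, so the file name can be read from each partition's InputSplit directly. The path and the (file, field0, field2) output shape are illustrative, mirroring the original question:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.HadoopRDD

val sc = new SparkContext(new SparkConf().setAppName("filename-in-map"))

// hadoopFile returns a HadoopRDD under the hood; each partition's split knows its file.
val hadoopRdd = sc
  .hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")
  .asInstanceOf[HadoopRDD[LongWritable, Text]]

val withFileName = hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
  val file = split.asInstanceOf[FileSplit].getPath.getName
  iter.map { case (_, line) =>
    val p = line.toString.split(",")
    (file, p(0), p(2))   // the file name takes the place of the XXX placeholder
  }
}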

Task progress in ipython?

2014-06-26 Thread Xu (Simon) Chen
I am pretty happy with using pyspark with ipython notebook. The only issue is that I need to look at the console output or the spark UI to track task progress. I wonder if anyone has thought about, or better yet written, something to display progress bars on the same page when I evaluate a cell in ipynb? I know

performance difference between spark-shell and spark-submit

2014-06-09 Thread Xu (Simon) Chen
Hi all, I implemented a transformation on hdfs files with spark and first tested it in spark-shell (with yarn). I then implemented essentially the same logic as a spark program (scala), built a jar file, and used spark-submit to execute it on my yarn cluster. The weird thing is that the spark-submit approach is

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Xu (Simon) Chen
>> Not a stupid question! I would like to be able to do this. For now, you might try writing the data to tachyon <http://tachyon-project.org/> instead of HDFS. This is untested though, please report any issues you run into. >> Mich

cache spark sql parquet file in memory?

2014-06-06 Thread Xu (Simon) Chen
This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, thereby making spark sql queries faster. Thanks. -Simon
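
A minimal sketch of one way to keep such results in memory rather than on HDFS, using Spark SQL's in-memory columnar cache (cacheTable); the path and table name are illustrative, and registerTempTable is the 1.1+ name (registerAsTable in earlier releases):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("cached-parquet"))
val sqlContext = new SQLContext(sc)

// Load the parquet data once, register it, and pin it in the columnar cache
// so later SQL queries read from memory instead of HDFS.
val events = sqlContext.parquetFile("hdfs://test/path/events.parquet")
events.registerTempTable("events")
sqlContext.cacheTable("events")

// The first query materializes the cache; subsequent queries hit memory.
sqlContext.sql("SELECT count(*) FROM events").collect()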

Re: spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
the cache space in the spark. > -Sandy > On Jun 5, 2014, at 9:44 AM, "Xu (Simon) Chen" wrote: > I am slightly confused about the "--executor-memory" setting. My yarn cluster has a maximum container memory of 8192MB.

Re: compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
tence level is MEMORY_ONLY so that setting will have no impact. > On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon) Chen wrote: >> I have a working set larger than available memory, thus I am hoping to turn on rdd compression so that I can store more in-memory. Stran
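
Following the explanation above, a minimal sketch of a setup where spark.rdd.compress actually takes effect: the setting only applies to serialized storage levels, so the RDD has to be persisted with a *_SER level rather than the default MEMORY_ONLY. App name and path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("compressed-cache")
  .set("spark.rdd.compress", "true")   // compresses serialized partitions only
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs://test/path/*")
rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized, so compression applies
rdd.count()                                  // materializes the cache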

compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
I have a working set larger than available memory, thus I am hoping to turn on rdd compression so that I can store more in memory. Strangely, it made no difference. The number of cached partitions, fraction cached, and size in memory remain the same. Any ideas? I confirmed that rdd compression wasn

spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
I am slightly confused about the "--executor-memory" setting. My yarn cluster has a maximum container memory of 8192MB. When I specify "--executor-memory 8G" in my spark-shell, no container can be started at all. It only works when I lower the executor memory to 7G. But then, on yarn, I see 2 cont
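
A rough sketch of the arithmetic that usually explains this: on YARN, the container request is the executor heap plus an off-heap overhead, so an 8G executor does not fit in an 8192 MB container. The 384 MB figure is an assumed default for spark.yarn.executor.memoryOverhead; the exact default and config name vary by Spark version:

val executorMemoryMb = 8 * 1024    // --executor-memory 8G
val memoryOverheadMb = 384         // assumed spark.yarn.executor.memoryOverhead default
val containerRequestMb = executorMemoryMb + memoryOverheadMb   // 8576 MB
val yarnContainerMaxMb = 8192      // yarn.scheduler.maximum-allocation-mb
val fits = containerRequestMb <= yarnContainerMaxMb            // false -> no container starts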

Re: access hdfs file name in map()

2014-06-04 Thread Xu (Simon) Chen
N/M.. I wrote a HadoopRDD subclass and appended an env field of the HadoopPartition to the value in the compute function. It worked pretty well. Thanks! On Jun 4, 2014 12:22 AM, "Xu (Simon) Chen" wrote: > I don't quite get it.. > mapPartitionsWithIndex takes a function th

Re: Join : Giving incorrect result

2014-06-04 Thread Xu (Simon) Chen
Maybe your two workers have different assembly jar files? I just ran into a similar problem where my spark-shell was using a different jar file than my workers - I got really confusing results. On Jun 4, 2014 8:33 AM, "Ajay Srivastava" wrote: > Hi, > I am doing a join of two RDDs which is giving differ

Re: access hdfs file name in map()

2014-06-03 Thread Xu (Simon) Chen
spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456 > On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen wrote: >> Hello, >> A quick question about using spark to parse text-format CSV files stored on hdfs.

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
tribution.sh#L102 > Any luck if you use JDK 6 to compile? > On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen wrote: > OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html > And my jar file has

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html And my jar file has 70011 files. Fantastic.. On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen wrote: > I asked several people, no one seems to believe that we can do this: > $ PYTHONPATH=/p
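
A quick, hedged way to check an assembly jar against the classic zip limit of 65535 entries discussed in that python-list thread; the jar path below is a placeholder:

import java.util.zip.ZipFile

val jar = new ZipFile("/path/to/spark-assembly.jar")   // placeholder path
println(s"entries: ${jar.size()}")   // anything past 65535 needs zip64 support
jar.close()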

Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
ou used when building? Also which CDH-5 version are you building against? > On Mon, Jun 2, 2014 at 8:11 AM, Xu (Simon) Chen wrote: > OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon > On Sun, J

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
n application on YARN in general. In my experience, the steps outlined there are quite useful. > Let me know if you get it working (or not). > Cheers, > Andrew > 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen: >> Hi folks, >> I have

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
5.0.1 -DskipTests clean package" Anything that I might have missed? Thanks. -Simon On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen wrote: > 1) yes, that sc.parallelize(range(10)).count() has the same error. > 2) the files seem to be correct > 3) I have trouble at th

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
hich has more detailed information about how to debug running an application on YARN in general. In my experience, the steps outlined there are quite useful. > Let me know if you get it working (or not). > Cheers, > Andrew > 2014-06-02 17:24 GMT+02:00 Xu (

pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G. When I create a notebook, I can see workers being created, and indeed I see the spark UI running on my client

Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen wrote: > That helped a bit... Now I have a different failure: the start up process is stuck in an infinite loop outputting the following message: > 14/06

Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
instead of using two named resource managers? I wonder if somehow the YARN client can't detect this multi-master set-up. > On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen wrote: > Note that everything works fine in spark 0.9, which is packaged in CDH5:

Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
the classpath is being set up correctly. > - Patrick > On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen wrote: > Hi all, > I tried a couple of ways, but couldn't get it to work.. > The following seems to be what the

spark 1.0.0 on yarn

2014-05-31 Thread Xu (Simon) Chen
Hi all, I tried a couple of ways, but couldn't get it to work.. The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) suggests: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf

access hdfs file name in map()

2014-05-29 Thread Xu (Simon) Chen
Hello, A quick question about using spark to parse text-format CSV files stored on hdfs. I have something very simple: sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p => (XXX, p(0), p(2))) Here, I want to replace XXX with a string, which is the current csv filename for the l