executor processes are still there even after I killed the app and the workers

2014-05-10 Thread Nan Zhu
Hi all, with Spark 1.0 RC3, I found that the executor processes are still there even after I killed the app and the workers. Has anyone seen the same problem (it may also exist in other versions)? Best, -- Nan Zhu

Creating time-sequential pairs

2014-05-10 Thread Nicholas Pritchard
Hi Spark community, I have a design/algorithm question that I assume is common enough for someone else to have tackled before. I have an RDD of time-series data formatted as time-value tuples, RDD[(Double, Double)], and am trying to extract threshold crossings. In order to do so, I first want to t…

Re: Creating time-sequential pairs

2014-05-10 Thread Sean Owen
How about:

val data = sc.parallelize(Array((1,0.05),(2,0.10),(3,0.15)))
val pairs = data.join(data.map(t => (t._1 + 1, t._2)))

It's a self-join, but one copy has its ID incremented by 1. I don't know if it's performant, but it works, although the output is more like:

(2,(0.1,0.05))
(3,(0.15,0.1))

On…
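Fleshing Sean's sketch out into a self-contained program, with a threshold filter bolted on to match Nicholas's stated goal of extracting crossings (the filter and the threshold value are assumptions, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // PairRDDFunctions for join (needed on Spark 1.x)

object SequentialPairs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sequential-pairs").setMaster("local[2]"))

    // (timestamp, value) pairs; timestamps assumed to be consecutive integers
    val data = sc.parallelize(Array((1, 0.05), (2, 0.10), (3, 0.15)))

    // Self-join against a copy whose key is shifted forward by 1, so each
    // key t lines up with the value at t-1: (t, (value(t), value(t-1)))
    val pairs = data.join(data.map(t => (t._1 + 1, t._2)))

    // A crossing is a consecutive pair that straddles the threshold
    val threshold = 0.12
    val crossings = pairs.filter { case (_, (cur, prev)) =>
      (prev < threshold) != (cur < threshold)
    }
    crossings.collect().foreach(println)  // prints (3,(0.15,0.1))

    sc.stop()
  }
}

One caveat with the shift-by-one trick: it assumes integer timestamps with no gaps. With irregular Double timestamps, sorting and zipping against a shifted copy would be needed instead.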

Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread wxhsdp
Is there something wrong with the mailing list? Very few people see my thread. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/os-buffer-cache-does-not-cache-shuffle-output-file-tp5478p5521.html

Re: Schema view of HadoopRDD

2014-05-10 Thread Debasish Das
Hi, for each line that we read as textLine from HDFS, we have a schema. If there were an API that takes the schema as List[Symbol] and maps each token to its Symbol, it would be helpful... One solution is to keep the data on HDFS as Avro/Protobuf-serialized objects, but I am not sure that works with the HBase inp…
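Pending such an API, a small helper gets most of the way there. The toRecord function, schema, and path below are illustrative, not an existing Spark API:

import org.apache.spark.{SparkConf, SparkContext}

object SchemaView {
  // Pair each token of a delimited line with the corresponding Symbol
  // from a caller-supplied schema; short lines just yield a smaller map.
  def toRecord(schema: List[Symbol], line: String, sep: String = "\t"): Map[Symbol, String] =
    schema.zip(line.split(sep)).toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema-view").setMaster("local[2]"))
    val schema = List('id, 'name, 'score)
    val records = sc.textFile("hdfs:///path/to/data.tsv")  // hypothetical path
      .map(line => toRecord(schema, line))
    records.take(5).foreach(println)  // e.g. Map('id -> "1", 'name -> "a", 'score -> "0.5")
    sc.stop()
  }
}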

time exhausted in BlockFetcher

2014-05-10 Thread wxhsdp
Hi all, I'm tuning my app in local mode and found that a lot of time was spent in local block fetch. In stage 1, I read in the input data and do a repartition; in stage 2, I do some operation on the repartitioned RDD, so it involves a local block fetch. I find that the fetch…
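For reference, the job shape being described is roughly the following (file name and partition count are made up). The repartition introduces a shuffle boundary, so stage 2 starts by fetching the shuffle blocks stage 1 wrote; in local mode those are all local fetches:

import org.apache.spark.{SparkConf, SparkContext}

object LocalFetch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("local-fetch").setMaster("local[4]"))
    val input = sc.textFile("input.txt")  // stage 1: read the input data
    val shuffled = input.repartition(8)   // shuffle boundary between the stages
    val result = shuffled.map(_.length)   // stage 2: begins with a local block fetch
    println(result.count())
    sc.stop()
  }
}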

Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread Koert Kuipers
Yes, it seems broken. I got only a few emails in the last few days. On Fri, May 9, 2014 at 7:24 AM, wxhsdp wrote:
> Is there something wrong with the mailing list? Very few people see my thread
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/os-buffe…

Re: problem about broadcast variable in iteration

2014-05-10 Thread randylu
I run Spark 1.0.0, the newest under-development version. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p5480.html

Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread Aaron Davidson
It seems the mailing list was broken when you sent your original question, so I appended it to the end of this message. "Buffers" is relatively unimportant in today's Linux kernel; "cache" is used for both writing and reading [1]. What you are seeing seems to be the expected behavior: the data is wri…

Re: How to read a multipart s3 file?

2014-05-10 Thread kamatsuoka
For example, this app just reads a 4 GB file and writes a copy of it. It takes 41 seconds to write the file, then 3 more minutes to move all the temporary files. I guess this is an issue in the Hadoop/JetS3t code layer, not Spark. 14/05/06 20:11:41 INFO TaskSetManager: Finished TID 63 in 8688…
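The copy job in question is essentially the following (bucket and key names are hypothetical). The three extra minutes are consistent with the Hadoop output committer's rename step: on S3 a "rename" can only be implemented as a full copy followed by a delete:

import org.apache.spark.{SparkConf, SparkContext}

object S3Copy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-copy"))
    // Each task writes to a temporary file; the final commit "renames" it
    // into place, which on S3 means copying every byte again.
    sc.textFile("s3n://my-bucket/input/big-file")
      .saveAsTextFile("s3n://my-bucket/output/big-file")
    sc.stop()
  }
}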

Is there a way to load a large file from HDFS faster into Spark

2014-05-10 Thread Soumya Simanta
I have a Spark cluster with 3 worker nodes.

- *Workers:* 3
- *Cores:* 48 Total, 48 Used
- *Memory:* 469.8 GB Total, 72.0 GB Used

I want to process a single compressed (*.gz) file on HDFS. The file is 1.5 GB compressed and 11 GB uncompressed. When I try to read the compressed file from HDFS, I…
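The likely culprit, assuming the standard gzip limitation applies here, is that a .gz file is not splittable: sc.textFile hands all 11 GB to a single partition on a single core. Repartitioning immediately after the read spreads the decompressed lines across the cluster for the rest of the job:

import org.apache.spark.{SparkConf, SparkContext}

object LoadGzip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-gzip"))
    val lines = sc.textFile("hdfs:///data/big.gz")  // hypothetical path; arrives as 1 partition
    val spread = lines.repartition(48)              // fan out to the cluster's 48 cores
    println(spread.count())                         // downstream stages now run in parallel
    sc.stop()
  }
}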