Array in broadcast can't be serialized

2015-02-15 Thread Tao Xiao
I'm using Spark 1.1.0 and find that *ImmutableBytesWritable* can be serialized by Kryo but *Array[ImmutableBytesWritable]* can't be serialized even when I registered both of them in Kryo. The code is as follows: val conf = new SparkConf() .setAppName("Hello Spark")
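A minimal sketch of the registration being described, assuming the KryoRegistrator mechanism available in Spark 1.1; the registrator class name is a placeholder:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Register both the element class and the array class with Kryo.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[ImmutableBytesWritable])
        kryo.register(classOf[Array[ImmutableBytesWritable]])
      }
    }

    val conf = new SparkConf()
      .setAppName("Hello Spark")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")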

Re: Array in broadcast can't be serialized

2015-02-16 Thread Tao Xiao
main/scala/com/twitter/chill/WrappedArraySerializer.scala > > Cheers > > On Sat, Feb 14, 2015 at 6:36 PM, Tao Xiao > wrote: > >> I'm using Spark 1.1.0 and find that *ImmutableBytesWritable* can be >> serialized by Kryo but *Array[ImmutableBytesWritable]* can't

What does "appMasterRpcPort: -1" indicate ?

2014-08-29 Thread Tao Xiao
I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it. Following How-to: Run a Simple Apache Spark App in CDH 5 , I tried to submit my job in local mode, Spark Standalone mode and YARN mode. I successfully subm

Fwd: What does "appMasterRpcPort: -1" indicate ?

2014-08-30 Thread Tao Xiao
I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it. Following How-to: Run a Simple Apache Spark App in CDH 5 , I tried to submit my job in local mode, Spark Standalone mode and YARN mode. I successfully subm

How can a "deserialized Java object" be stored on disk?

2014-08-30 Thread Tao Xiao
Reading about RDD Persistence, I learned that the storage level "MEMORY_AND_DISK" means that "Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on d
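For reference, a minimal persist call using that storage level; rdd stands for any existing RDD, and the partitions that spill to disk are written in serialized form, so "deserialized" only describes the in-memory copy:

    import org.apache.spark.storage.StorageLevel

    // Partitions are kept as deserialized Java objects on the JVM heap;
    // partitions that do not fit are written to local disk and read back
    // (deserialized again) when they are needed.
    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()   // the first action materializes the cache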

What does "appMasterRpcPort: -1" indicate ?

2014-08-31 Thread Tao Xiao
I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it. Following How-to: Run a Simple Apache Spark App in CDH 5 , I tried to submit my job in local mode, Spark Standalone mode and YARN mode. I successfully submitted my job in local mode and Standalone mode, however, I noticed the following messag

Re: What does "appMasterRpcPort: -1" indicate ?

2014-08-31 Thread Tao Xiao
after the application master got started ("appMasterRpcPort: 0"). 2014-08-31 23:10 GMT+08:00 Yi Tian : > I think -1 means your application master has not been started yet. > > > On 31 Aug 2014, at 23:02, Tao Xiao wrote: > > I'm using CDH 5.1.0, which bundles Spark 1.0.0 with i

What are the appropriate privileges needed for writing files into the checkpoint directory?

2014-09-02 Thread Tao Xiao
I tried to run KafkaWordCount in a Spark standalone cluster. In this application, the checkpoint directory was set as follows: val sparkConf = new SparkConf().setAppName("KafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) ssc.checkpoint("checkpoint") After sub

Re: What are the appropriate privileges needed for writing files into the checkpoint directory?

2014-09-03 Thread Tao Xiao
I found the answer. The checkpoint directory should be on a fault-tolerant file system such as HDFS, so we should set it to an HDFS path rather than a local file system path. 2014-09-03 10:28 GMT+08:00 Tao Xiao : > I tried to run KafkaWordCount in a Spark standalone cluster. In t
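A sketch of the corrected setup, with a placeholder namenode address and checkpoint path:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // The checkpoint directory must live on a fault-tolerant file system
    // such as HDFS, not a relative local path like "checkpoint".
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints/KafkaWordCount")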

combineByKey throws ClassCastException

2014-09-14 Thread Tao Xiao
I followed an example presented in the tutorial Learning Spark to compute the per-key average as follows: val Array(appName) = args val sparkConf = new SparkConf() .setAppName(appName) val sc = new SparkContext(
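For context, a self-contained version of that per-key average pattern, with made-up sample data:

    // The combiner is a (sum, count) pair per key.
    val data = sc.parallelize(Seq(("a", 3.0), ("a", 5.0), ("b", 4.0)))
    val avgByKey = data.combineByKey(
        (v: Double) => (v, 1),                                             // createCombiner
        (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),       // mergeValue
        (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
      ).mapValues { case (sum, count) => sum / count }
    avgByKey.collect()   // Array((a,4.0), (b,4.0))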

Re: combineByKey throws ClassCastException

2014-09-16 Thread Tao Xiao
ombineByKey at :14 > > xj @ Tokyo > > On Mon, Sep 15, 2014 at 3:06 PM, Tao Xiao > wrote: > >> I followed an example presented in the tutorial Learning Spark >> <http://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html> >> to

How to sort rdd filled with existing data structures?

2014-09-24 Thread Tao Xiao
Hi, I have the following rdd: val conf = new SparkConf() .setAppName("<< Testing Sorting >>") val sc = new SparkContext(conf) val L = List( (new Student("XiaoTao", 80, 29), "I'm Xiaotao"), (new Student("CCC", 100, 24), "I'm CCC"), (new Student("Jack", 90,
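One way to express this kind of sort, sketched with an assumed Student(name, score, age) class and sortBy (available from Spark 1.0 on):

    case class Student(name: String, score: Int, age: Int)

    val L = List(
      (new Student("XiaoTao", 80, 29), "I'm Xiaotao"),
      (new Student("CCC", 100, 24), "I'm CCC"))
    val rdd = sc.parallelize(L)

    // Sort by a field of the key class rather than by the class itself,
    // which avoids having to make Student itself Ordered.
    val byScoreDesc = rdd.sortBy({ case (student, _) => student.score }, ascending = false)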

Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
I submitted a job in Yarn-Client mode, which simply reads from an HBase table containing tens of millions of records and then does a *count* action. The job runs for a much longer time than I expected, so I wonder whether it was because there was too much data to read. Actually, there are 20 nodes in

Re: Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
sparkConf = new SparkConf() .setAppName("<<< Reading HBase >>>") val sc = new SparkContext(sparkConf) val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result]) println(rdd.count
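A fuller sketch of that read path for reference; the table name is a placeholder, and the scanner-caching property is an assumption about the relevant knob (a larger value cuts down RPC round trips, a common cause of slow scans):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")      // placeholder table name
    hbaseConf.set("hbase.mapreduce.scan.cachedrows", "1000")     // rows fetched per scanner RPC

    val sparkConf = new SparkConf().setAppName("<<< Reading HBase >>>")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(rdd.count())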

Re: Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
You can > not achieve high level of parallelism unless you have 5-10 regions per RS > at least. What does it mean? You probably have too few regions. You can > verify that in HBase Web UI. > > -Vladimir Rodionov > > On Mon, Sep 29, 2014 at 7:21 PM, Tao Xiao > wrote: > &

Re: Reading from HBase is too slow

2014-10-01 Thread Tao Xiao
; This would show whether the slowdown is in HBase code or somewhere else. > > Cheers > > On Mon, Sep 29, 2014 at 11:40 PM, Tao Xiao > wrote: > >> I checked HBase UI. Well, this table is not completely evenly spread >> across the nodes, but I think to some extent it c

Re: Reading from HBase is too slow

2014-10-07 Thread Tao Xiao
; As far as I know, that feature is not in CDH 5.0.0 >> >> FYI >> >> On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov < >> vrodio...@splicemachine.com> wrote: >> >>> Using TableInputFormat is not the fastest way of reading data from >>> HBase.

Re: Reading from HBase is too slow

2014-10-08 Thread Tao Xiao
pt to submit my job can be seen in my second post. Please refer to that. 2014-10-08 13:44 GMT+08:00 Sean Owen : > How did you run your program? I don't see from your earlier post that > you ever asked for more executors. > > On Wed, Oct 8, 2014 at 4:29 AM, Tao Xiao wrote: >

Re: Reading from HBase is too slow

2014-10-08 Thread Tao Xiao
t; not like mappers. After all they may do much more in their lifetime than > just read splits from HBase so would not make sense to determine it by > something that the first line of the program does. > On Oct 8, 2014 8:00 AM, "Tao Xiao" wrote: > >> Hi Sean, >> >

ClassNotFoundException was thrown while trying to save rdd

2014-10-12 Thread Tao Xiao
Hi all, I'm using CDH 5.0.1 (Spark 0.9) and submitting a job in Spark Standalone Cluster mode. The job is quite simple as follows: object HBaseApp { def main(args:Array[String]) { testHBase("student", "/test/xt/saveRDD") } def testHBase(tableName: String, outFile:String)

Re: ClassNotFoundException was thrown while trying to save rdd

2014-10-12 Thread Tao Xiao
ing I did not change the object name and file name. 2014-10-13 0:00 GMT+08:00 Ted Yu : > Your app is named scala.HBaseApp > Does it read / write to HBase ? > > Just curious. > > On Sun, Oct 12, 2014 at 8:00 AM, Tao Xiao > wrote: > >> Hi all, >> >> I&

Re: ClassNotFoundException was thrown while trying to save rdd

2014-10-13 Thread Tao Xiao
ing your application jar to the sparkContext will resolve this issue. > > Eg: > sparkContext.addJar("./target/scala-2.10/myTestApp_2.10-1.0.jar") > > Thanks > Best Regards > > On Mon, Oct 13, 2014 at 8:42 AM, Tao Xiao > wrote: > >> In the beginning I trie
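That suggestion, as a compact sketch (the jar path is whatever the build produces):

    val conf = new SparkConf().setAppName("HBaseApp")
    val sc = new SparkContext(conf)
    // Ship the application jar so executor JVMs can load the job's classes;
    // conf.setJars(Seq(...)) before creating the context works as well.
    sc.addJar("./target/scala-2.10/myTestApp_2.10-1.0.jar")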

All executors run on just a few nodes

2014-10-19 Thread Tao Xiao
Hi all, I have a Spark-0.9 cluster, which has 16 nodes. I wrote a Spark application to read data from an HBase table, which has 86 regions spread over 20 RegionServers. I submitted the Spark app in Spark standalone mode and found that there were 86 executors running on just 3 nodes and it too

Re: All executors run on just a few nodes

2014-10-19 Thread Tao Xiao
ilable later). > > If this is the case, you could just sleep a few seconds before run the > job. or there are some patches related and providing other way to sync > executors status before running applications, but I haven’t track the > related status for a while. > > > Ray

How to kill a Spark job running in cluster mode ?

2014-11-11 Thread Tao Xiao
I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode, which means the driver is not running on the local node. So how can I kill such a job? Is there a command like "hadoop job -kill " which kills a running MapReduce job? Thanks

Re: How to kill a Spark job running in cluster mode ?

2014-11-12 Thread Tao Xiao
e has a kill link. You can try using that. >> >> Best Regards, >> Sonal >> Founder, Nube Technologies <http://www.nubetech.co> >> >> <http://in.linkedin.com/in/sonalgoyal> >> >> >> >> On Tue, Nov 11, 2014 at 7:28 PM, Tao Xiao &g

A partitionBy problem

2014-11-18 Thread Tao Xiao
Hi all, I tested the *partitionBy* feature in a wordcount application, and I'm puzzled by a phenomenon. In this application, I created an rdd from some text files in HDFS (about 100 GB in size), each of which has lines composed of words separated by the character "#". I wanted to count the occurrence fo
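For orientation, a sketch of the kind of wordcount-with-partitionBy job being described; the input path and partition count are placeholders:

    import org.apache.spark.HashPartitioner

    val words = sc.textFile("hdfs://namenode:8020/xt/text")   // placeholder path
      .flatMap(_.split("#"))
      .map(word => (word, 1L))

    // Fix the partition layout up front; reduceByKey then reuses the same
    // partitioner, so all records for a given word land in one partition.
    val counts = words.partitionBy(new HashPartitioner(64)).reduceByKey(_ + _)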

Can not write out data as snappy-compressed files

2014-12-08 Thread Tao Xiao
I'm using CDH 5.1.0 and Spark 1.0.0, and I'd like to write out data as snappy-compressed files but encountered a problem. My code is as follows: val InputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/text/new.txt" val OutputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/compressedText/" val
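One way to request Snappy output with saveAsTextFile, assuming the native Snappy libraries are present on every node (a common stumbling block):

    import org.apache.hadoop.io.compress.SnappyCodec

    val text = sc.textFile(InputTextFilePath)
    // saveAsTextFile accepts a compression codec class as a second argument.
    text.saveAsTextFile(OutputTextFilePath, classOf[SnappyCodec])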

Need some tutorials and examples about customized partitioner

2014-02-25 Thread Tao Xiao
I am a newbie to Spark and I need to know how RDD partitioning can be controlled in the process of shuffling. I have googled for examples but haven't found many concrete ones, whereas there are many good tutorials about Hadoop's shuffling and partitioner. Can anybody sho
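As a starting point, a minimal custom Partitioner might look like the sketch below; the partitioning rule is only an example:

    import org.apache.spark.Partitioner

    class MyPartitioner(partitions: Int) extends Partitioner {
      require(partitions > 0)
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        // Any deterministic mapping from key to [0, numPartitions) works here.
        val h = key.toString.hashCode % numPartitions
        if (h < 0) h + numPartitions else h
      }
    }

    // Used wherever a shuffle happens, e.g.
    //   pairRdd.partitionBy(new MyPartitioner(8))
    //   pairRdd.reduceByKey(new MyPartitioner(8), _ + _)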

Re: Need some tutorials and examples about customized partitioner

2014-02-25 Thread Tao Xiao
> Do not do collect if your data size is huge as this may OOM the driver, > write it to disk in that case. > > > > Scala > > Mayur Rustagi > Ph: +919632149971 > http://www.sigmoidanalytics.com <https://twitter.com/mayur_rustagi> > https://twitter.com/mayur_rust

Re: Need some tutorials and examples about customized partitioner

2014-02-27 Thread Tao Xiao
Also thanks Matei 2014-02-26 15:19 GMT+08:00 Matei Zaharia : > Take a look at the "advanced Spark features" talk here too: > http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/. > > Matei > > On Feb 25, 2014, at 6:22 PM, Tao Xiao wrote: > > Thank you Mayu

How to provide a custom Comparator to sortByKey?

2014-02-28 Thread Tao Xiao
I am using Spark 0.9. I have an array of tuples, and I want to sort these tuples using the *sortByKey* API as follows in the Spark shell: val A:Array[(String, String)] = Array(("1", "One"), ("9", "Nine"), ("3", "three"), ("5", "five"), ("4", "four")) val P = sc.parallelize(A) // MyComparator is an exa
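One way to supply a custom comparison, sketched against the Ordering-based sortByKey of later Spark releases (in 0.9 the key had to be viewable as Ordered, so the exact mechanism there differs); here the keys are compared by numeric value instead of lexicographically:

    val A: Array[(String, String)] =
      Array(("1", "One"), ("9", "Nine"), ("3", "three"), ("5", "five"), ("4", "four"))
    val P = sc.parallelize(A)

    // sortByKey resolves an implicit Ordering for the key type, so a custom
    // comparison can be supplied by putting one in scope.
    implicit val byNumericValue: Ordering[String] = Ordering.by[String, Int](_.toInt)
    val sorted = P.sortByKey()   // ("1","One"), ("3","three"), ("4","four"), ("5","five"), ("9","Nine")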