Re: HBase row count

2014-02-24 Thread Soumitra Kumar
I did try with 'hBaseRDD.cache()', but I don't see any improvement. My expectation is that with caching enabled, there should be no penalty for the 'hBaseRDD.count' call. On Mon, Feb 24, 2014 at 11:29 PM, Nick Pentreath wrote: > Yes, you're initiating a scan for each count call. The normal way to

Re: HBase row count

2014-02-24 Thread Nick Pentreath
Yes, you're initiating a scan for each count call. The normal way to improve this would be to use cache(), which is what you have in your commented out line: // hBaseRDD.cache() If you uncomment that line, you should see an improvement overall. If caching is not an option for some reason (maybe
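
A minimal sketch of the cache-before-count pattern being discussed here; the table name, configuration setup, and variable names are my own assumptions, not the poster's actual code:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // Assumed setup: a SparkContext named sc and an HBase table called "mytable".
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Cache before the first action so later counts reuse the in-memory copy
    // instead of re-scanning the HBase table. The first count still pays for the scan.
    hBaseRDD.cache()
    val first = hBaseRDD.count()   // triggers the scan and populates the cache
    val second = hBaseRDD.count()  // should be served from the cached partitions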

HBase row count

2014-02-24 Thread Soumitra Kumar
I have code that reads an HBase table and counts the number of rows containing a field. def readFields(rdd : RDD[(ImmutableBytesWritable, Result)]) : RDD[List[Array[Byte]]] = { return rdd.flatMap(kv => { // Set of interesting keys for this use case val keys = Li
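
The snippet above is cut off by the digest; below is a hedged sketch of the general pattern it describes (flatMapping over HBase Results to extract values for a set of interesting keys, then counting). The column family, qualifier names, and helper logic are illustrative assumptions, not the original code:

    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.rdd.RDD

    def readFields(rdd: RDD[(ImmutableBytesWritable, Result)]): RDD[List[Array[Byte]]] = {
      rdd.flatMap { case (_, result) =>
        // Hypothetical column family and qualifiers; the original list of
        // "interesting keys" is truncated in the message above.
        val family = Bytes.toBytes("cf")
        val keys = List("field1", "field2").map(k => Bytes.toBytes(k))
        val values = keys.flatMap(q => Option(result.getValue(family, q)))
        // Emit the row only if it actually contains at least one of the fields.
        if (values.nonEmpty) Some(values) else None
      }
    }

    // Counting rows that contain a field then becomes: readFields(hBaseRDD).count()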

Re: Spark performance optimization

2014-02-24 Thread Roshan Nair
Hi, We use sequence files as input as well. Spark creates a task for each part* file by default. We use RDD.coalesce (set to number of cores or 2*number of cores). This helps when there are many more part* files than the number of cores and each part* file is relatively small. Coalesce doesn't act
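
A hedged sketch of the coalesce approach described above; the input path, value types, and core count are placeholders:

    import org.apache.spark.SparkContext._

    // Assumed: a SparkContext named sc, many small part* sequence files under a
    // hypothetical HDFS path, and 8 cores available.
    val numCores = 8
    val raw = sc.sequenceFile[String, String]("hdfs:///data/input/part-*")

    // Coalesce the one-task-per-part-file splits down to roughly the core count
    // (or 2x it) so each task gets a reasonable amount of work.
    val input = raw.coalesce(2 * numCores)
    println(input.partitions.length)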

Re: Job initialization performance of Spark standalone mode vs YARN

2014-02-24 Thread Mayur Rustagi
Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Mon, Feb 24, 2014 at 10:22 PM, polkosity wrote: > Is there any difference in the performance of Spark standalone mode and > YARN > when it comes to initializ

Re: Spark performance optimization

2014-02-24 Thread Andrew Ash
Have you tried using a standalone Spark cluster vs a YARN one? I get the impression that standalone responds faster (the JVMs are already all running), but I haven't done any rigorous testing (and have only used standalone so far). On Mon, Feb 24, 2014 at 10:43 PM, polkosity wrote: > As ment

Spark performance optimization

2014-02-24 Thread polkosity
As mentioned in a previous post, I have an application which relies on a quick response. The application matches a client's image against a set of stored images. Image features are stored in a SequenceFile and passed over JNI to match in OpenCV, along with the features for the client's image. An

WARNING: Spark lists moving to spark.apache.org domain name

2014-02-24 Thread Matei Zaharia
Hi everyone, As you may have noticed, our lists are currently in the process of being migrated from @spark.incubator.apache.org domain names to @spark.apache.org, as part of the project becoming a top-level project. Please be aware that messages will come to the new lists and you'll have to adjus

Job initialization performance of Spark standalone mode vs YARN

2014-02-24 Thread polkosity
Is there any difference in the performance of Spark standalone mode and YARN when it comes to initializing a new Spark job? In my application, response time is absolutely critical, and I'm hoping to have the executors working within a few seconds of submitting the job. Both options ran quickly

Re: Can spark-streaming work with spark-on-yarn mode?

2014-02-24 Thread 林武康
Thank you Tathagata, I will try it out later. -Original Message- From: "Tathagata Das" Sent: 2014/2/22 11:12 To: "u...@spark.incubator.apache.org" Subject: Re: Can spark-streaming work with spark-on-yarn mode? Yes, Spark and Spark Streaming programs can be deployed on YARN. Here is the documentatio

Re: apparently non-critical errors running spark-ec2 launch

2014-02-24 Thread Nicholas Chammas
Alright, that's good to know. And I guess the first of these errors can be prevented by increasing the wait time via --wait. Thank you. Nick On Mon, Feb 24, 2014 at 9:04 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > Replies inline > > On Mon, Feb 24, 2014 at 5:26 PM, nichola

Re: How to get well-distribute partition

2014-02-24 Thread zhaoxw12
Thanks for your reply. For various reasons, I have to use Python in my program. I can't find the API of RangePartitioner. Could you give me more details? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-well-distribute-partition-tp2002p2013.html S

spark failure

2014-02-24 Thread Nathan Kronenfeld
I'm using Spark 0.8.1, and trying to run a job from a new remote client (it works fine when run directly from the master). When I try to run it, the job just fails without doing anything. Unfortunately, I also can't find anything that tells me why it fails. I'll add the bits of the logs belo

Re: How to get well-distribute partition

2014-02-24 Thread Mayur Rustagi
Easiest is to plug in your own partitioner if you know the nature of the data. If you don't, then you can sample the data to create the partition weights; you can use RangePartitioner out of the box. Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.co
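
A hedged Scala sketch of the RangePartitioner suggestion (the original question was about the Python API, where this class isn't exposed directly); the RDD contents and partition count are made up for illustration:

    import org.apache.spark.RangePartitioner
    import org.apache.spark.SparkContext._

    // Assumed: a SparkContext named sc and a key-value RDD with skewed keys.
    val pairs = sc.parallelize(1 to 100000).map(i => (i * i % 1000, i))

    // RangePartitioner samples the keys and picks range boundaries so partitions
    // end up with roughly equal record counts.
    val partitioner = new RangePartitioner(16, pairs)
    val balanced = pairs.partitionBy(partitioner)

    // Inspect the resulting partition sizes.
    balanced.mapPartitions(it => Iterator(it.size)).collect().foreach(println)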

Re: Filter on Date by comparing

2014-02-24 Thread Andrew Ash
It's in the data serialization section of the tuning guide, here: http://spark.incubator.apache.org/docs/latest/tuning.html#data-serialization On Mon, Feb 24, 2014 at 7:44 PM, Soumya Simanta wrote: > Thanks Andrew. I was expecting this to be the issue. > Are there any pointers about how to chan
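
For reference, a hedged sketch of enabling Kryo per that tuning-guide section (assuming Spark 0.9's SparkConf API; the registered class is a placeholder). Note this configures data serialization, so it does not by itself make closure-captured objects like a DateTimeFormatter serializable:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical application class to register with Kryo.
    case class MyRecord(id: Long, createdAt: String)

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyRecord])
      }
    }

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")
    val sc = new SparkContext(conf)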

Re: Filter on Date by comparing

2014-02-24 Thread Soumya Simanta
Thanks Andrew. I was expecting this to be the issue. Are there any pointers about how to change the serialization to Kryo? On Mon, Feb 24, 2014 at 10:17 PM, Andrew Ash wrote: > This is because Joda's DateTimeFormatter is not serializable (doesn't > implement the empty Serializable interface)

Re: Filter on Date by comparing

2014-02-24 Thread Ewen Cheslack-Postava
Or use RDD.filterWith to create whatever you need out of serializable parts so you only run it once per partition. Andrew Ash February 24, 2014 at 7:17 PM This is because Joda's DateTimeFormatter is not serializable (doesn't implement the empty Serializable interface) ht
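
A hedged sketch of the filterWith approach: the non-serializable DateTimeFormatter is built once per partition from serializable pieces (here just a pattern string). The record type, field name, and data are assumptions, not the original poster's schema:

    import org.joda.time.DateTime
    import org.joda.time.format.DateTimeFormat

    // Hypothetical record type and data; assumes a SparkContext named sc.
    case class Tweet(createdAt: String)
    val myRDD = sc.parallelize(Seq(Tweet("2014-02-10 12:00:00"), Tweet("2013-11-01 08:30:00")))

    val start = new DateTime(2014, 1, 1, 0, 0)
    val pattern = "yyyy-MM-dd HH:mm:ss"   // the String is serializable, the formatter is not

    // filterWith builds the formatter once per partition (from the partition index),
    // so the DateTimeFormatter itself is never shipped inside the closure.
    val recent = myRDD.filterWith(_ => DateTimeFormat.forPattern(pattern)) {
      (tweet, fmt) => fmt.parseDateTime(tweet.createdAt).isAfter(start)
    }
    recent.count()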

Re: Filter on Date by comparing

2014-02-24 Thread Andrew Ash
This is because Joda's DateTimeFormatter is not serializable (doesn't implement the empty Serializable interface) http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html One ugly thing I've done before is to instantiate a new DateTimeFormatter in every line, so like this:
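
The quoted example is cut off above; here is a hedged sketch of the per-record instantiation being described (wasteful but serialization-safe), with an assumed record type and a SparkContext named sc:

    import org.joda.time.DateTime
    import org.joda.time.format.DateTimeFormat

    case class Tweet(createdAt: String)
    val myRDD = sc.parallelize(Seq(Tweet("2014-02-10 12:00:00")))

    val start = new DateTime(2014, 1, 1, 0, 0)
    val pattern = "yyyy-MM-dd HH:mm:ss"   // only this String is captured by the closure

    val recent = myRDD.filter { tweet =>
      // Built fresh for every record: ugly, but nothing non-serializable is captured.
      val fmt = DateTimeFormat.forPattern(pattern)
      fmt.parseDateTime(tweet.createdAt).isAfter(start)
    }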

Re: java.io.NotSerializableException Of dependent Java lib.

2014-02-24 Thread yaoxin
Does this mean that every class I use in Spark must be serializable? Even the classes that I depend on? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-Of-dependent-Java-lib-tp1973p2006.html Sent from the Apache Spark User

Running GraphX example from Scala REPL

2014-02-24 Thread Soumya Simanta
I'm trying to run the GraphX examples from the Scala REPL. However, it complains that it cannot find RDD. :23: error: not found: type RDD val users: RDD[(VertexId, (String, String))] = I'm using a Feb 3 commit of incubator-spark. Should I do anything differently to build GraphX? Or is
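
That "not found: type RDD" error usually just means the type hasn't been imported into the REPL; below is a hedged guess at the imports that make the GraphX example compile (assuming a graphx-merged build and a SparkContext named sc), not a confirmed answer from the thread:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.graphx._

    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc"))))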

Filter on Date by comparing

2014-02-24 Thread Soumya Simanta
I want to filter an RDD by comparing dates. myRDD.filter( x => new DateTime(x.getCreatedAt).isAfter(start) ).count I'm using the Joda-Time library, but I get an exception about a Joda-Time class not being serializable. Is there a way to configure this, or is there an easier alternative for this problem? org.apac

Is it necessary to call setID in SparkHadoopWriter.scala

2014-02-24 Thread haosdent
Hi, folks. I read the Spark sources and couldn't understand why we should call setID(), setConfParams(), and commit() in SparkHadoopWriter.scala. I think we just need to create a RecordWriter and write(k, v). Is there anything I'm missing? -- Best Regards, Haosdent Huang