Re: Question about Spark best practice when counting records.

2015-02-27 Thread Kostas Sakellis
Hey Darin, Record count metrics are coming in Spark 1.3. Can you wait until it is released? Or do you need a solution in older versions of Spark? Kostas On Friday, February 27, 2015, Darin McBeath wrote: > I have a fairly large Spark job where I'm essentially creating quite a few > RDDs, do se
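For pre-1.3 Spark, a common workaround was to count records with an accumulator as data streams through a transformation, rather than paying for a separate count() action. Below is a plain-Python stand-in for that pattern (it runs locally over ordinary lists; the `Accumulator` class is a toy counterpart of what `sc.accumulator(0)` gives you in real Spark):

```python
# Plain-Python illustration of Spark's accumulator counting pattern:
# count records as a side effect of an existing pass over the data,
# instead of running a second full pass just to count.

class Accumulator:
    """Toy stand-in for a Spark accumulator: add-only from tasks."""
    def __init__(self, initial=0):
        self.value = initial

    def add(self, n):
        self.value += n

def transform_and_count(partitions, acc):
    """Apply a transformation while counting records via the accumulator."""
    for part in partitions:
        for record in part:
            acc.add(1)            # side effect: count the record
            yield record.upper()  # the "real" transformation

acc = Accumulator()
result = list(transform_and_count([["a", "b"], ["c"]], acc))
print(result, acc.value)  # -> ['A', 'B', 'C'] 3
```

In real Spark you would bump the accumulator inside a `map`/`foreach` closure and read `acc.value` on the driver after an action has run.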

Re: textFile partitions

2015-02-09 Thread Kostas Sakellis
The partitions parameter to textFile is the "minPartitions". So there will be at least that level of parallelism. Spark delegates to Hadoop to create the splits for that file (yes, even for a text file on local disk and not HDFS). You can take a look at the code in FileInputFormat - but briefly it will c
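Briefly, Hadoop's FileInputFormat computes a goal size by dividing the total bytes by the requested split count, then clamps it between the configured minimum split size and the block size. A sketch of that arithmetic (the block size and file size below are illustrative; the real logic lives in `FileInputFormat.computeSplitSize`):

```python
def compute_split_size(total_size, num_splits, min_size=1,
                       block_size=128 * 1024 * 1024):
    """Mirror of FileInputFormat's split sizing:
    splitSize = max(minSize, min(goalSize, blockSize))."""
    goal_size = total_size // max(1, num_splits)
    return max(min_size, min(goal_size, block_size))

# A 1 GiB file with minPartitions=10: the goal size (~102 MiB) is under
# the 128 MiB block size, so you get at least 10 splits of that size.
print(compute_split_size(1024 ** 3, 10))  # -> 107374182
```

This is why textFile's argument is a minimum: a small goal size yields more splits, but the block size caps how large any single split can be.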

Re: Whether standalone spark support kerberos?

2015-02-05 Thread Kostas Sakellis
Standalone mode does not support talking to a kerberized HDFS. If you want to talk to a kerberized (secure) HDFS cluster, I suggest you use Spark on YARN. On Wed, Feb 4, 2015 at 2:29 AM, Jander g wrote: > Hope someone helps me. Thanks. > > On Wed, Feb 4, 2015 at 6:14 PM, Jander g wrote: > >> We

Re: How many stages in my application?

2015-02-05 Thread Kostas Sakellis
Yes, there is currently no way to know automatically how many stages a job will generate. Like Mark said, RDD#toDebugString will give you some info about the RDD DAG, and from that you can determine, based on the dependency types (wide vs. narrow), whether there is a stage boundary. On Thu, Feb 5, 2015 at
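As a toy illustration of the rule above: every wide (shuffle) dependency introduces a stage boundary, so for a simple linear lineage the stage count is one plus the number of wide dependencies. A sketch under that simplification (real DAGs can branch, so this is not a general stage counter):

```python
def count_stages(dependencies):
    """dependencies: list of 'narrow' or 'wide' along a linear RDD lineage.
    Each wide (shuffle) dependency starts a new stage."""
    return 1 + sum(1 for d in dependencies if d == "wide")

# e.g. textFile -> map (narrow) -> reduceByKey (wide)
#      -> filter (narrow) -> groupByKey (wide)
print(count_stages(["narrow", "wide", "narrow", "wide"]))  # -> 3
```

This mirrors what you can read off toDebugString: indentation changes in its output correspond to shuffle boundaries.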

Re: Reg Job Server

2015-02-05 Thread Kostas Sakellis
On Thu, Feb 5, 2015 at 9:03 PM, Deep Pradhan wrote: > I read somewhere about Gatling. Can that be used to profile Spark jobs? > > On Fri, Feb 6, 2015 at 10:27 AM, Kostas Sakellis > wrote: > >> Which Spark Job server are you talking about? >> >> On Thu, Feb 5, 20

Re: spark driver behind firewall

2015-02-05 Thread Kostas Sakellis
Yes, the driver has to be able to accept incoming connections. All the executors connect back to the driver, sending heartbeats, map statuses, and metrics. It is critical, and I don't know of a way around it. You could look into using something like the https://github.com/spark-jobserver/spark-jobserver th
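If the firewall can be configured to whitelist specific ports (rather than blocking all inbound traffic), one partial mitigation is to pin the driver's listening ports instead of letting Spark pick random ones. An illustrative spark-defaults.conf fragment (the host/port values are placeholders; `spark.driver.host`, `spark.driver.port`, and `spark.blockManager.port` are real settings, though which port properties exist varies by Spark version):

```properties
# Pin the driver's endpoints so the firewall can whitelist them.
# Values below are placeholders for your environment.
spark.driver.host          driver-host.example.com
spark.driver.port          7077
spark.blockManager.port    7078
```

This does not remove the requirement that executors reach the driver; it only makes the required ports predictable.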

Re: Reg Job Server

2015-02-05 Thread Kostas Sakellis
Which Spark Job server are you talking about? On Thu, Feb 5, 2015 at 8:28 PM, Deep Pradhan wrote: > Hi, > Can Spark Job Server be used for profiling Spark jobs? >

Re: Spark Job running on localhost on yarn cluster

2015-02-05 Thread Kostas Sakellis
Kundan, I think your configuration here is incorrect; we need to adjust the memory and the number of executors. For your case, the cluster setup is: 5 nodes, 16 GB RAM, 8 cores each. The number of executors should be the total number of nodes in your cluster - in your case, 5. As for --executor-cores, it should
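The sizing advice above can be sketched as arithmetic. Assuming one executor per node, one core and 1 GB of RAM reserved for the OS and node daemons, and roughly 7% of memory set aside for YARN overhead (the reserved amounts and the exact overhead formula are assumptions that vary by Spark version and cluster), a hypothetical helper:

```python
def yarn_executor_sizing(nodes, ram_gb_per_node, cores_per_node,
                         reserved_cores=1, reserved_ram_gb=1,
                         overhead_fraction=0.07):
    """Rough YARN sizing: one executor per node, leaving headroom for
    the OS/NodeManager and YARN memory overhead. Illustrative only."""
    num_executors = nodes
    executor_cores = cores_per_node - reserved_cores
    usable_ram = ram_gb_per_node - reserved_ram_gb
    executor_memory_gb = int(usable_ram * (1 - overhead_fraction))
    return num_executors, executor_cores, executor_memory_gb

# The cluster from the thread: 5 nodes, 16 GB RAM, 8 cores each.
print(yarn_executor_sizing(5, 16, 8))  # -> (5, 7, 13)
```

Those three numbers map onto the spark-submit flags --num-executors, --executor-cores, and --executor-memory.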

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread Kostas Sakellis
Hey, If you are interested in more details there is also a thread about this issue here: http://apache-spark-developers-list.1001551.n3.nabble.com/Eliminate-copy-while-sending-data-any-Akka-experts-here-td7127.html Kostas On Tue, Sep 9, 2014 at 3:01 PM, jbeynon wrote: > Thanks Marcelo, that lo
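For readers hitting this on the Spark versions of that era: map output statuses were served to executors over Akka, so the usual mitigations were to give the driver more heap, keep the serialized map status message under Akka's frame size, and reduce the number of partitions where possible. An illustrative spark-defaults.conf fragment (values are placeholders; `spark.akka.frameSize` was a real setting at the time but was removed along with Akka in Spark 2.x):

```properties
# Give the driver more room for map output statuses.
spark.driver.memory     4g
# Max Akka message size in MB; the serialized map statuses must fit in one frame.
spark.akka.frameSize    128
```

The linked dev-list thread discusses the underlying fix (eliminating a copy when sending this data) rather than these tuning knobs.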