Accessing S3 files with s3n://

2015-08-07 Thread Akshat Aranya
Hi, I've been trying to track down some problems with Spark reads being very slow with s3n:// URIs (NativeS3FileSystem). After some digging around, I realized that this file system implementation fetches the entire file, which isn't really a Spark problem, but it really slows down things when try

Re: Accessing S3 files with s3n://

2015-08-10 Thread Akshat Aranya
l Das wrote: > > Depends on which operation you are doing, If you are doing a .count() on a > parquet, it might not download the entire file i think, but if you do a > .count() on a normal text file it might pull the entire file. > > Thanks > Best Regards > > On Sat, Aug 8,

Re: S3n, parallelism, partitions

2015-08-17 Thread Akshat Aranya
This will also depend on the file format you are using. A word of advice: you would be much better off with the s3a file system. As I found out recently the hard way, s3n has some issues with reading through entire files even when looking for headers. On Mon, Aug 17, 2015 at 2:10 AM, Akhil Das w
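
A minimal sketch of what switching to s3a can look like (bucket, path, and keys are placeholders; this assumes a Hadoop build that ships the S3AFileSystem, i.e. the hadoop-aws and AWS SDK jars are on the classpath):

    // Assumed: sc is an existing SparkContext.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
    // Same code as before; only the URI scheme changes from s3n:// to s3a://
    val lines = sc.textFile("s3a://my-bucket/path/to/data/*")
    println(lines.count())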

Number of parallel tasks

2015-02-25 Thread Akshat Aranya
I have Spark running in standalone mode with 4 executors, each with 5 cores (spark.executor.cores=5). However, when I'm processing an RDD with ~90,000 partitions, I only get 4 parallel tasks. Shouldn't I be getting 4x5=20 parallel task executions?

Missing tasks

2015-02-26 Thread Akshat Aranya
I am seeing a problem with a Spark job in standalone mode. Spark master's web interface shows a task RUNNING on a particular executor, but the logs of the executor do not show the task being ever assigned to it, that is, such a line is missing from the log: 15/02/25 16:53:36 INFO executor.CoarseG

Re: Getting to proto buff classes in Spark Context

2015-02-26 Thread Akshat Aranya
My guess would be that you are packaging too many things in your job, which is causing problems with the classpath. When your jar goes in first, you get the correct version of protobuf, but some other version of something else. When your jar goes in later, other things work, but protobuf breaks.

Spark SQL code generation

2015-04-06 Thread Akshat Aranya
Hi, I'm curious as to how Spark does code generation for SQL queries. Following through the code, I saw that an expression is parsed and compiled into a class using the Scala reflection toolbox. However, it's unclear to me whether the actual byte code is generated on the master or on each of the exe

Re: Spark SQL code generation

2015-04-06 Thread Akshat Aranya
, Michael Armbrust wrote: > It is generated and cached on each of the executors. > > On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya wrote: > >> Hi, >> >> I'm curious as to how Spark does code generation for SQL queries. >> >> Following through t

When are TaskCompletionListeners called?

2015-04-17 Thread Akshat Aranya
Hi, I'm trying to figure out when TaskCompletionListeners are called -- are they called at the end of the RDD's compute() method, or after the iteration through the iterator returned by compute() is completed? To put it another way, is this OK: class DatabaseRDD[T] extends RDD[T] { def comp
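
For context, a sketch of the pattern the question is about (DatabaseRDD, the connection helpers, and the types are hypothetical). Listeners registered with addTaskCompletionListener fire when the task finishes, i.e. after the iterator returned by compute() has been consumed, not when compute() itself returns:

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class DatabaseRDD[T: ClassTag](sc: SparkContext) extends RDD[T](sc, Nil) {
      override protected def getPartitions: Array[Partition] = Array.empty

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val conn = openConnection(split)                        // hypothetical helper
        // Close the connection when the task ends, after the iterator has been consumed.
        context.addTaskCompletionListener { _ => conn.close() }
        resultsIterator(conn)                                   // hypothetical lazy iterator over rows
      }

      private def openConnection(split: Partition): java.sql.Connection = ???
      private def resultsIterator(conn: java.sql.Connection): Iterator[T] = ???
    }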

Kryo serialization of classes in additional jars

2015-04-29 Thread Akshat Aranya
Hi, Is it possible to register kryo serialization for classes contained in jars that are added with "spark.jars"? In my experiment it doesn't seem to work, likely because the class registration happens before the jar is shipped to the executor and added to the classloader. Here's the general ide
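
For reference, a minimal sketch of the registration itself (the class is a stand-in for one shipped in an extra jar; this shows the general Kryo setup, not a fix for the spark.jars ordering issue described here):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, payload: Array[Byte])   // stand-in for a class from an added jar

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))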

Re: spark with standalone HBase

2015-04-30 Thread Akshat Aranya
Looking at your classpath, it looks like you've compiled Spark yourself. Depending on which version of Hadoop you've compiled against (looks like it's Hadoop 2.2 in your case), Spark will have its own version of protobuf. You should try by making sure both your HBase and Spark are compiled against

ClassNotFoundException for Kryo serialization

2015-05-01 Thread Akshat Aranya
Hi, I'm getting a ClassNotFoundException at the executor when trying to register a class for Kryo serialization: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstanc

Re: ClassNotFoundException for Kryo serialization

2015-05-01 Thread Akshat Aranya
bit more about Schema$MyRow ? > > On Fri, May 1, 2015 at 8:05 AM, Akshat Aranya wrote: >> >> Hi, >> >> I'm getting a ClassNotFoundException at the executor when trying to >> register a class for Kryo serialization

Re: ClassNotFoundException for Kryo serialization

2015-05-01 Thread Akshat Aranya
I cherry-picked the fix for SPARK-5470 and the problem has gone away. On Fri, May 1, 2015 at 9:15 AM, Akshat Aranya wrote: > Yes, this class is present in the jar that was loaded in the classpath > of the executor Java process -- it wasn't even lazily added as a part > of the

Re: ClassNotFoundException for Kryo serialization

2015-05-02 Thread Akshat Aranya
(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) I verified that the same configuration works without using Kryo serialization. On Fri, May 1, 2015 at 9:44 AM, Akshat Aranya wrote: > I cherry-picked the fix for SPARK-5470 and the problem has gone away. > > On Fri, May 1,

Re: Kryo serialization of classes in additional jars

2015-05-04 Thread Akshat Aranya
pushing the jars to the cluster manually, and then using > spark.executor.extraClassPath > > On Wed, Apr 29, 2015 at 6:42 PM, Akshat Aranya wrote: >> >> Hi, >> >> Is it possible to register kryo serialization for classes contained in >> jars that are added wit
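
The workaround mentioned here, sketched as configuration (the path is hypothetical; the jar must already be present at that location on every worker before the executors start):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraClassPath", "/opt/myapp/jars/extra-classes.jar")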

Re: Kryo serialization of classes in additional jars

2015-05-13 Thread Akshat Aranya
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) ... 21 more On Mon, May 4, 2015 at 5:43 PM, Akshat Aranya wrote

Indexed RDD

2014-09-16 Thread Akshat Aranya
Hi, I'm trying to implement a custom RDD that essentially works as a distributed hash table, i.e. the key space is split up into partitions and within a partition, an element can be looked up efficiently by the key. However, the RDD lookup() function (in PairRDDFunctions) is implemented in a way i

partitioned groupBy

2014-09-16 Thread Akshat Aranya
I have a use case where my RDD is set up like this: Partition 0: K1 -> [V1, V2] K2 -> [V2] Partition 1: K3 -> [V1] K4 -> [V3] I want to invert this RDD, but only within a partition, so that the operation does not require a shuffle. It doesn't matter if the partitions of the inverted RDD have non uni
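
A sketch of a shuffle-free inversion using mapPartitions (types are illustrative, String keys and values; preservesPartitioning only keeps any existing partitioner metadata -- no data moves between partitions):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    def invertWithinPartitions(rdd: RDD[(String, Seq[String])]): RDD[(String, Seq[String])] =
      rdd.mapPartitions({ iter =>
        val inverted = mutable.Map.empty[String, List[String]]
        for ((k, vs) <- iter; v <- vs)
          inverted(v) = k :: inverted.getOrElse(v, Nil)
        inverted.iterator.map { case (v, ks) => (v, ks: Seq[String]) }
      }, preservesPartitioning = true)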

Re: partitioned groupBy

2014-09-17 Thread Akshat Aranya
struct a hash map within each partition > yourself. > > On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya wrote: > > I have a use case where my RDD is set up such: > > > > Partition 0: > > K1 -> [V1, V2] > > K2 -> [V2] > > > > Partition 1: &

Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
Hi, What's the relationship between Spark worker and executor memory settings in standalone mode? Do they work independently or does the worker cap executor memory? Also, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker?
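
A sketch of how the two settings relate in standalone mode (values are illustrative): the worker-level caps come from spark-env.sh on each worker machine, and every executor the worker launches must fit inside them.

    // In spark-env.sh on each worker (illustrative values):
    //   SPARK_WORKER_MEMORY=32g   -- total memory this worker may grant to executors
    //   SPARK_WORKER_CORES=16     -- total cores this worker may grant to executors
    // On the application side:
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "8g")   // per-executor heap; must fit under SPARK_WORKER_MEMORY
      .set("spark.cores.max", "16")         // total cores this application may take across the cluster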

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
cution within a program? Does that also mean that two concurrent jobs will get one executor each at the same time? > > On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya wrote: > >> Hi, >> >> What's the relationship between Spark worker and executor memory settings >&g

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya wrote: > > > On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas wrote: > >> 1. worker memory caps executor. >> 2. With default config, every job gets one executor per worker. This >> executor runs with all cores availab

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
way to control the cores yet. This effectively limits > the cluster to a single application at a time. A subsequent application > shows in the 'WAITING' State on the dashboard. > > On Wed, Oct 1, 2014 at 2:49 PM, Akshat Aranya wrote: > >> >> >> On W

Determining number of executors within RDD

2014-10-01 Thread Akshat Aranya
Hi, I want to implement an RDD wherein the decision of the number of partitions is based on the number of executors that have been set up. Is there some way I can determine the number of executors within the getPartitions() call?
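
One rough possibility (a sketch, not an official API for this purpose): ask the SparkContext how many block managers have registered. getExecutorMemoryStatus includes the driver's own entry, hence the subtraction, and executors that have not registered yet will be missed.

    // Assumed: sc is the SparkContext, e.g. this.sparkContext inside an RDD's getPartitions().
    val executorCount = math.max(sc.getExecutorMemoryStatus.size - 1, 1)
    val numPartitions = executorCount * 4   // e.g. a few partitions per executor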

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Akshat Aranya
Using a var for RDDs in this way is not going to work. In this example, tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after that, you change what tx2 means, so you would end up having a circular dependency. On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung wrote: > My job

One pass compute() to produce multiple RDDs

2014-10-09 Thread Akshat Aranya
Hi, Is there a good way to materialize derivative RDDs from, say, a HadoopRDD while reading in the data only once? One way to do so would be to cache the HadoopRDD and then create derivative RDDs, but that would require enough RAM to cache the HadoopRDD, which is not an option in my case. Thanks, Ak

Re: Larger heap leads to perf degradation due to GC

2014-10-16 Thread Akshat Aranya
I just want to pitch in and say that I ran into the same problem with running with 64GB executors. For example, some of the tasks take 5 minutes to execute, out of which 4 minutes are spent in GC. I'll try out smaller executors. On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic wrote: > Hi, > >

TaskNotSerializableException when running through Spark shell

2014-10-16 Thread Akshat Aranya
Hi, Can anyone explain how things get captured in a closure when running through the REPL? For example: def foo(..) = { .. } rdd.map(foo) sometimes complains about classes not being serializable that are completely unrelated to foo. This happens even when I write it like this: object Foo { def f
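
One common workaround (a sketch; names are hypothetical, and rdd is assumed to be an RDD[String]): define the function on a standalone object so that the closure only references that object, rather than the REPL line wrapper and whatever non-serializable state it happens to hold.

    object Transforms extends Serializable {
      def parse(line: String): Array[String] = line.split(",")
    }

    val parsed = rdd.map(Transforms.parse)   // references Transforms, not the surrounding REPL object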

Attaching schema to RDD created from Parquet file

2014-10-17 Thread Akshat Aranya
Hi, How can I convert an RDD loaded from a Parquet file into its original type: case class Person(name: String, age: Int) val rdd: RDD[Person] = ... rdd.saveAsParquetFile("people") val rdd2 = sqlContext.parquetFile("people") How can I map rdd2 back into an RDD[Person]? All of the examples just
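
One way to get back to the case class is to map the rows by position (a sketch that assumes the schema really is (name: String, age: Int) in that order):

    val people = rdd2.map { row =>
      Person(row.getString(0), row.getInt(1))
    }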

Re: bug with MapPartitions?

2014-10-17 Thread Akshat Aranya
There seems to be some problem with what gets captured in the closure that's passed into the mapPartitions (myfunc in your case). I've had a similar problem before: http://apache-spark-user-list.1001560.n3.nabble.com/TaskNotSerializableException-when-running-through-Spark-shell-td16574.html Try

Primitive arrays in Spark

2014-10-21 Thread Akshat Aranya
This is as much of a Scala question as a Spark question. I have an RDD: val rdd1: RDD[(Long, Array[Long])] This RDD has duplicate keys that I can collapse like this: val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => a++b) If I start with an Array of primitive longs in rdd1, will rdd2 als

Re: Spark as key/value store?

2014-10-22 Thread Akshat Aranya
Spark, in general, is good for iterating through an entire dataset again and again. All operations are expressed in terms of iteration through all the records of at least one partition. You may want to look at IndexedRDD ( https://issues.apache.org/jira/browse/SPARK-2365) that aims to improve poi

Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Akshat Aranya
Hi, Is there a way to serialize Row objects to JSON? In the absence of such a way, is the right way to go: * get hold of the schema using SchemaRDD.schema * iterate through each individual Row as a Seq and use the schema to convert values in the row to JSON types. Thanks, Akshat
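
A rough sketch of that approach (flat schemas with strings and primitives only; a real implementation would need proper escaping and handling of nested types; schemaRDD is assumed to be an existing SchemaRDD):

    val fieldNames = schemaRDD.schema.fields.map(_.name)
    val jsonStrings = schemaRDD.map { row =>
      val pairs = fieldNames.zip(row.toSeq).map {
        case (name, s: String) => s""""$name": "$s""""   // quote string values
        case (name, other)     => s""""$name": $other""" // numbers, booleans, etc.
      }
      pairs.mkString("{", ", ", "}")
    }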

Spark and Play

2014-11-11 Thread Akshat Aranya
Hi, Sorry if this has been asked before; I didn't find a satisfactory answer when searching. How can I integrate a Play application with Spark? I'm running into issues with akka-actor versions. Play 2.2.x uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which works fine wi

Re: advantages of SparkSQL?

2014-11-24 Thread Akshat Aranya
Parquet is a column-oriented format, which means that you need to read in less data from the file system if you're only interested in a subset of your columns. Also, Parquet pushes down selection predicates, which can eliminate needless deserialization of rows that don't match a selection criterio
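
A small illustration of both points (hypothetical file and columns): only the name and age columns need to be read from the Parquet file, and the age predicate can be pushed down so non-matching rows are skipped before deserialization.

    // Spark 1.x SQL API of that era
    val people = sqlContext.parquetFile("hdfs:///data/people.parquet")
    people.registerTempTable("people")
    val names = sqlContext.sql("SELECT name FROM people WHERE age > 30")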

Building Yarn mode with sbt

2014-11-24 Thread Akshat Aranya
Is it possible to enable the Yarn profile while building Spark with sbt? It seems like the yarn project is strictly a Maven project and not something that's known to the sbt parent project. -Akshat

Executor failover

2014-11-26 Thread Akshat Aranya
Hi, I have a question regarding failure of executors: how does Spark reassign partitions or tasks when executors fail? Is it necessary that new executors have the same executor IDs as the ones that were lost, or are these IDs irrelevant for failover?

Stateful mapPartitions

2014-12-04 Thread Akshat Aranya
Is it possible to have some state across multiple calls to mapPartitions on each partition, for instance, if I want to keep a database connection open?
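
One common pattern (a sketch, not from the thread itself; the JDBC URL and lookup are hypothetical): keep the connection in a per-executor singleton, so it is created lazily once per JVM and then reused by every partition processed there.

    import org.apache.spark.rdd.RDD

    object ConnectionHolder {
      // Created at most once per executor JVM, on first use.
      lazy val conn: java.sql.Connection =
        java.sql.DriverManager.getConnection("jdbc:postgresql://db.example.com/mydb")
    }

    def enrich(lines: RDD[String]): RDD[String] =
      lines.mapPartitions { iter =>
        val conn = ConnectionHolder.conn        // same connection across partitions in this JVM
        iter.map(line => lookup(conn, line))    // hypothetical per-record lookup
      }

    def lookup(conn: java.sql.Connection, line: String): String = ???   // placeholder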

Re: Stateful mapPartitions

2014-12-04 Thread Akshat Aranya
do you need to do with the db connection? > > Paolo > > Sent from my Windows Phone > ------ > From: Akshat Aranya > Sent: 04/12/2014 18:57 > To: user@spark.apache.org > Subject: Stateful mapPartitions > > Is it possible to have some state

Standalone Spark program

2014-12-18 Thread Akshat Aranya
Hi, I am building a Spark-based service which requires initialization of a SparkContext in a main(): def main(args: Array[String]) { val conf = new SparkConf(false) .setMaster("spark://foo.example.com:7077") .setAppName("foobar") val sc = new SparkContext(conf) val rdd =
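
For completeness, a minimal self-contained version of that kind of driver (the master URL, app name, and RDD contents are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object FooBar {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf(false)
          .setMaster("spark://foo.example.com:7077")
          .setAppName("foobar")
        val sc = new SparkContext(conf)
        try {
          val rdd = sc.parallelize(1 to 1000)
          println(rdd.sum())
        } finally {
          sc.stop()
        }
      }
    }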