Hi,
I've been trying to track down some problems with Spark reads being very
slow with s3n:// URIs (NativeS3FileSystem). After some digging around, I
realized that this file system implementation fetches the entire file,
which isn't really a Spark problem, but it really slows things down when
try
Akhil Das wrote:
>
> Depends on which operation you are doing. If you are doing a .count() on a
> Parquet file, it might not download the entire file, I think, but if you do
> a .count() on a normal text file, it might pull the entire file.
>
> Thanks
> Best Regards
>
> On Sat, Aug 8,
This will also depend on the file format you are using.
A word of advice: you would be much better off with the s3a file system.
As I found out recently the hard way, s3n has some issues with reading
through entire files even when looking for headers.
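In case it's useful, switching is mostly a matter of configuration (a rough
sketch; the bucket, path, and credentials are placeholders, and s3a needs the
hadoop-aws jar and its AWS SDK dependency on the classpath):

sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  // placeholder
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  // placeholder
val lines = sc.textFile("s3a://my-bucket/path/to/data.txt")         // s3a:// instead of s3n://
println(lines.count())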
On Mon, Aug 17, 2015 at 2:10 AM, Akhil Das
w
I have Spark running in standalone mode with 4 executors, each executor
with 5 cores (spark.executor.cores=5). However, when I'm processing
an RDD with ~90,000 partitions, I only get 4 parallel tasks. Shouldn't I
be getting 4x5=20 parallel task executions?
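For reference, these are the settings I'm working with (a sketch with example
values; in standalone mode the total cores granted to an application are also
capped by spark.cores.max, so that's one thing worth checking):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master.example.com:7077")  // placeholder master URL
  .set("spark.executor.cores", "5")              // cores per executor
  .set("spark.cores.max", "20")                  // total cores this app may use across the cluster
val sc = new SparkContext(conf)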
I am seeing a problem with a Spark job in standalone mode. Spark master's
web interface shows a task RUNNING on a particular executor, but the logs
of the executor do not show the task being ever assigned to it, that is,
such a line is missing from the log:
15/02/25 16:53:36 INFO executor.CoarseG
My guess would be that you are packaging too many things in your job, which
is causing problems with the classpath. When your jar goes in first, you
get the correct version of protobuf, but some other version of something
else. When your jar goes in later, other things work, but protobuf
breaks.
Hi,
I'm curious as to how Spark does code generation for SQL queries.
Following through the code, I saw that an expression is parsed and compiled
into a class using Scala reflection toolbox. However, it's unclear to me
whether the actual byte code is generated on the master or on each of the
exe
, Michael Armbrust
wrote:
> It is generated and cached on each of the executors.
>
> On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya wrote:
>
>> Hi,
>>
>> I'm curious as to how Spark does code generation for SQL queries.
>>
>> Following through t
Hi,
I'm trying to figure out when TaskCompletionListeners are called -- are
they called at the end of the RDD's compute() method, or after iteration
through the iterator returned by compute() has completed?
To put it another way, is this OK:
class DatabaseRDD[T] extends RDD[T] {
def comp
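For what it's worth, this is roughly the shape of what I'm doing (a sketch
only; DatabaseRDD, the partition count, and the fetch function are made up, and
it assumes the listener fires once the task has fully consumed the iterator,
which is exactly what I'm trying to confirm):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

case class DbPartition(index: Int) extends Partition

class DatabaseRDD[T: ClassTag](sc: SparkContext,
                               numParts: Int,
                               fetch: Int => Iterator[T]) extends RDD[T](sc, Nil) {

  override def getPartitions: Array[Partition] =
    (0 until numParts).map(i => DbPartition(i): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    // val conn = openConnection(split.index)  // hypothetical per-task connection
    context.addTaskCompletionListener { _ =>
      // runs when the task finishes (success or failure); this is where the
      // per-task resource would get closed, e.g. conn.close()
      println(s"partition ${split.index} done")
    }
    fetch(split.index)  // lazy iterator backed by the connection
  }
}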
Hi,
Is it possible to register Kryo serialization for classes contained in jars
that are added with "spark.jars"? In my experiment it doesn't seem to
work, likely because the class registration happens before the jar is
shipped to the executor and added to the classloader. Here's the general
ide
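For concreteness, this is roughly the setup I mean (class and jar names are
made up; the registrator and the registered class both live in the jar that is
shipped via spark.jars):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Both of these are compiled into the jar added with spark.jars
case class MyRow(id: Long, name: String)

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRow])
  }
}

// Driver-side configuration
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")  // fully-qualified name of the class above
  .set("spark.jars", "/path/to/my-classes.jar")    // placeholder path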
Looking at your classpath, it looks like you've compiled Spark yourself.
Depending on which version of Hadoop you've compiled against (looks like
it's Hadoop 2.2 in your case), Spark will have its own version of
protobuf. You should try making sure that both your HBase and Spark are
compiled against
Hi,
I'm getting a ClassNotFoundException at the executor when trying to
register a class for Kryo serialization:
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstanc
bit more about Schema$MyRow ?
>
> On Fri, May 1, 2015 at 8:05 AM, Akshat Aranya wrote:
>>
>> Hi,
>>
>> I'm getting a ClassNotFoundException at the executor when trying to
>> register a class for Kryo serialization
I cherry-picked the fix for SPARK-5470 and the problem has gone away.
On Fri, May 1, 2015 at 9:15 AM, Akshat Aranya wrote:
> Yes, this class is present in the jar that was loaded in the classpath
> of the executor Java process -- it wasn't even lazily added as a part
> of the
(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
I verified that the same configuration works without using Kryo serialization.
On Fri, May 1, 2015 at 9:44 AM, Akshat Aranya wrote:
> I cherry-picked the fix for SPARK-5470 and the problem has gone away.
>
> On Fri, May 1,
pushing the jars to the cluster manually, and then using
> spark.executor.extraClassPath
>
> On Wed, Apr 29, 2015 at 6:42 PM, Akshat Aranya wrote:
>>
>> Hi,
>>
>> Is it possible to register kryo serialization for classes contained in
>> jars that are added wit
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
... 21 more
On Mon, May 4, 2015 at 5:43 PM, Akshat Aranya wrote
Hi,
I'm trying to implement a custom RDD that essentially works as a
distributed hash table, i.e. the key space is split up into partitions and
within a partition, an element can be looked up efficiently by the key.
However, the RDD lookup() function (in PairRDDFunctions) is implemented in
a way i
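Roughly, what I'd like to be able to do is this (a sketch; it assumes the RDD
has a partitioner, and uses PartitionPruningRDD so that only the one partition
that can contain the key is computed):

import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
import scala.reflect.ClassTag

def lookupInOnePartition[K, V: ClassTag](rdd: RDD[(K, V)], key: K): Seq[V] = {
  val partitioner = rdd.partitioner.getOrElse(
    sys.error("the RDD must be partitioned by key for this to make sense"))
  val target = partitioner.getPartition(key)
  // Only the matching partition is computed; all others are pruned out
  PartitionPruningRDD.create(rdd, _ == target)
    .flatMap { case (k, v) => if (k == key) Some(v) else None }
    .collect()
    .toSeq
}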
I have a use case where my RDD is set up like this:
Partition 0:
K1 -> [V1, V2]
K2 -> [V2]
Partition 1:
K3 -> [V1]
K4 -> [V3]
I want to invert this RDD, but only within a partition, so that the
operation does not require a shuffle. It doesn't matter if the partitions
of the inverted RDD have non uni
struct a hash map within each partition
> yourself.
>
> On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya wrote:
> > I have a use case where my RDD is set up such:
> >
> > Partition 0:
> > K1 -> [V1, V2]
> > K2 -> [V2]
> >
> > Partition 1:
> >
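For reference, this is roughly what that per-partition inversion looks like (a
sketch; it assumes each partition's inverted map fits in memory):

import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Invert K -> Seq[V] into V -> Seq[K] independently inside each partition (no shuffle)
def invertWithinPartitions[K, V](rdd: RDD[(K, Seq[V])]): RDD[(V, Seq[K])] =
  rdd.mapPartitions { iter =>
    val inverted = mutable.Map.empty[V, mutable.Buffer[K]]
    for ((k, vs) <- iter; v <- vs)
      inverted.getOrElseUpdate(v, mutable.Buffer.empty[K]) += k
    inverted.iterator.map { case (v, ks) => (v, ks.toSeq) }
  }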
Hi,
What's the relationship between Spark worker and executor memory settings
in standalone mode? Do they work independently or does the worker cap
executor memory?
Also, is the number of concurrent executors per worker capped by the number
of CPU cores configured for the worker?
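To make the question concrete, these are the knobs I'm looking at (values are
just examples):

// conf/spark-env.sh on each worker machine (what the worker has to hand out):
//   SPARK_WORKER_MEMORY=16g
//   SPARK_WORKER_CORES=8

// Per-application setting, requested out of the worker's pool:
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "4g")  // presumably has to fit within SPARK_WORKER_MEMORY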
cution within a
program? Does that also mean that two concurrent jobs will get one
executor each at the same time?
>
> On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya wrote:
>
>> Hi,
>>
>> What's the relationship between Spark worker and executor memory settings
>&g
On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya wrote:
>
>
> On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas wrote:
>
>> 1. worker memory caps executor.
>> 2. With default config, every job gets one executor per worker. This
>> executor runs with all cores availab
way to control the cores yet. This effectively limits
> the cluster to a single application at a time. A subsequent application
> shows in the 'WAITING' State on the dashboard.
>
> On Wed, Oct 1, 2014 at 2:49 PM, Akshat Aranya wrote:
>
>>
>>
>> On W
Hi,
I want to implement an RDD where the number of partitions is based on the
number of executors that have been set up. Is there some way I
can determine the number of executors within the getPartitions() call?
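The closest I've come up with is something like this (a sketch;
getExecutorMemoryStatus has one entry per block manager, including the driver,
hence the minus one, and the count can change as executors come and go):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

case class ExecPartition(index: Int) extends Partition

class ExecutorSizedRDD[T: ClassTag](sc: SparkContext,
                                    fetch: Int => Iterator[T]) extends RDD[T](sc, Nil) {

  // getPartitions runs on the driver, where the SparkContext is available
  override def getPartitions: Array[Partition] = {
    val numExecutors = math.max(1, sparkContext.getExecutorMemoryStatus.size - 1)
    (0 until numExecutors).map(i => ExecPartition(i): Partition).toArray
  }

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    fetch(split.index)  // hypothetical per-partition data source
}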
Using a var for RDDs in this way is not going to work. In this example,
tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after
that, you change what tx2 means, so you would end up with a circular
dependency.
On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung
wrote:
> My job
Hi,
Is there a good way to materialize derivative RDDs from, say, a HadoopRDD
while reading in the data only once. One way to do so would be to cache
the HadoopRDD and then create derivative RDDs, but that would require
enough RAM to cache the HadoopRDD which is not an option in my case.
Thanks,
Ak
I just want to pitch in and say that I ran into the same problem with
running with 64GB executors. For example, some of the tasks take 5 minutes
to execute, out of which 4 minutes are spent in GC. I'll try out smaller
executors.
On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic wrote:
> Hi,
>
>
Hi,
Can anyone explain how things get captured in a closure when running through
the REPL? For example:
def foo(..) = { .. }
rdd.map(foo)
sometimes complains that classes completely unrelated to foo are not
serializable. This happens even when I write it like this:
object Foo {
def f
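The workaround I've settled on for now is to copy whatever the closure needs
into a local val first, so that only that value is captured rather than the
enclosing object (or, in the shell, the REPL line object). A contrived sketch
with made-up names:

import org.apache.spark.{SparkConf, SparkContext}

object ClosureDemo {
  val factor = 3  // referencing this directly inside a closure drags in ClosureDemo itself

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10)

    val localFactor = factor                    // copy to a local val first
    val scaled = rdd.map(x => x * localFactor)  // the closure now captures only an Int
    println(scaled.collect().mkString(","))
    sc.stop()
  }
}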
Hi,
How can I convert an RDD loaded from a Parquet file into its original type:
case class Person(name: String, age: Int)
val rdd: RDD[Person] = ...
rdd.saveAsParquetFile("people")
val rdd2 = sqlContext.parquetFile("people")
How can I map rdd2 back into an RDD[Person]? All of the examples just
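The closest I've come up with is mapping each Row back by position (a sketch;
Person is the case class above, and it assumes the column order in the file
matches the case class):

val rdd2 = sqlContext.parquetFile("people")
val people = rdd2.map(row => Person(row.getString(0), row.getInt(1)))  // (name, age) order assumed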
There seems to be some problem with what gets captured in the closure
that's passed into the mapPartitions (myfunc in your case).
I've had a similar problem before:
http://apache-spark-user-list.1001560.n3.nabble.com/TaskNotSerializableException-when-running-through-Spark-shell-td16574.html
Try
This is as much of a Scala question as a Spark question.
I have an RDD:
val rdd1: RDD[(Long, Array[Long])]
This RDD has duplicate keys that I can collapse like so:
val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => a++b)
If I start with an Array of primitive longs in rdd1, will rdd2 als
Spark, in general, is good for iterating through an entire dataset again
and again. All operations are expressed in terms of iteration through all
the records of at least one partition. You may want to look at IndexedRDD (
https://issues.apache.org/jira/browse/SPARK-2365) that aims to improve
poi
Hi,
Is there a way to serialize Row objects to JSON? If not, is this the right
way to go about it:
* get hold of the schema using SchemaRDD.schema
* iterate through each Row as a Seq and use the schema to convert the values
in the row to JSON types (rough sketch below)
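A rough sketch of what I mean (assuming flat rows of simple types; a real JSON
library would need to handle escaping and nested structures properly, and
schemaRdd stands in for whatever SchemaRDD I start with):

val fieldNames = schemaRdd.schema.fields.map(_.name)
val jsonStrings = schemaRdd.map { row =>
  fieldNames.zip(row.toSeq).map { case (name, value) =>
    val v = value match {
      case null => "null"
      case s: String => "\"" + s.replace("\"", "\\\"") + "\""
      case other => other.toString
    }
    "\"" + name + "\": " + v
  }.mkString("{", ", ", "}")
}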
Thanks,
Akshat
Hi,
Sorry if this has been asked before; I didn't find a satisfactory answer
when searching. How can I integrate a Play application with Spark? I'm
getting into issues of akka-actor versions. Play 2.2.x uses akka-actor
2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which works well
wi
Parquet is a column-oriented format, which means that you need to read in
less data from the file system if you're only interested in a subset of
your columns. Also, Parquet pushes down selection predicates, which can
eliminate needless deserialization of rows that don't match a selection
criterio
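For example, a query like the following only needs to read the column chunks
for the referenced columns, and the age predicate can be pushed down into the
Parquet reader (the path, table, and column names are made up):

val people = sqlContext.parquetFile("people")  // hypothetical Parquet directory
people.registerTempTable("people")
val names = sqlContext.sql("SELECT name FROM people WHERE age > 21")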
Is it possible to enable the Yarn profile while building Spark with sbt?
It seems like the yarn project is strictly a Maven project and not something
that's known to the sbt parent project.
-Akshat
Hi,
I have a question regarding failure of executors: how does Spark reassign
partitions or tasks when executors fail? Is it necessary that new
executors have the same executor IDs as the ones that were lost, or are
these IDs irrelevant for failover?
Is it possible to have some state across multiple calls to mapPartitions on
each partition, for instance, if I want to keep a database connection open?
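What I have in mind is roughly this (a sketch; the JDBC URL, query, and input
RDD of keys are placeholders, and it only keeps the connection open for the
duration of a single mapPartitions task; keeping state across separate
mapPartitions calls would need something like a per-JVM singleton):

import java.sql.DriverManager

val enriched = rdd.mapPartitions { iter =>  // rdd is assumed to be an RDD[String] of keys
  // One connection per partition (i.e. per task), reused for every record in it
  val conn = DriverManager.getConnection("jdbc:postgresql://db.example.com/mydb")
  val stmt = conn.prepareStatement("SELECT value FROM kv WHERE key = ?")
  val out = iter.map { key =>
    stmt.setString(1, key)
    val rs = stmt.executeQuery()
    val value = if (rs.next()) rs.getString(1) else null
    rs.close()
    (key, value)
  }.toList  // force evaluation while the connection is still open
  stmt.close()
  conn.close()
  out.iterator
}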
What do you need to do with the db connection?
>
> Paolo
>
> Sent from my Windows Phone
> ------
> From: Akshat Aranya
> Sent: 04/12/2014 18:57
> To: user@spark.apache.org
> Subject: Stateful mapPartitions
>
> Is it possible to have some state
Hi,
I am building a Spark-based service which requires initialization of a
SparkContext in a main():
def main(args: Array[String]) {
val conf = new SparkConf(false)
.setMaster("spark://foo.example.com:7077")
.setAppName("foobar")
val sc = new SparkContext(conf)
val rdd =