Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nicholas Chammas
If you create your own Spark 2.x ML Transformer, there are multiple mix-ins
(is that the correct term?) that you can use to define its behavior, which
are in ml/param/shared.py.

Among them are the following mix-ins:

   - HasInputCol
   - HasInputCols
   - HasOutputCol

What’s *not* available is a HasOutputCols mix-in, and I assume that is
intentional.

Is there a design reason why Transformers should not be able to define
multiple output columns?

I’m guessing that if you are an ML beginner who thinks you need a Transformer
with multiple output columns, you’ve probably misunderstood something. 😅

Nick


Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nick Pentreath
It's not that a Transformer can't output multiple columns - it's simply that
none of the current ones do. It's true that it might be a relatively
uncommon use case in general.

But take StringIndexer for example. It turns strings (categorical features)
into ints (0-based indexes). It could (should) accept multiple input
columns for efficiency (see
https://issues.apache.org/jira/browse/SPARK-11215). This is a case where
multiple output columns would be required.
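
Nothing in the API prevents a custom Transformer from doing this today: it can
declare its own array-valued params rather than relying on the shared mix-ins.
A rough, hypothetical sketch (the class MultiColumnUpperCaser and its params are
made up for illustration; they are not an existing Spark API):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{ParamMap, StringArrayParam}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical Transformer that produces one new column per input column.
class MultiColumnUpperCaser(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("multiUpperCaser"))

  // Declared locally because ml.param.shared offers HasInputCols but no HasOutputCols.
  val inputCols = new StringArrayParam(this, "inputCols", "columns to transform")
  val outputCols = new StringArrayParam(this, "outputCols", "columns to produce")
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)

  override def transform(dataset: Dataset[_]): DataFrame =
    // Append one output column per (input, output) pair.
    $(inputCols).zip($(outputCols)).foldLeft(dataset.toDF()) {
      case (df, (in, out)) => df.withColumn(out, upper(col(in)))
    }

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields ++ $(outputCols).map(StructField(_, StringType, nullable = true)))

  override def copy(extra: ParamMap): MultiColumnUpperCaser = defaultCopy(extra)
}

So the missing HasOutputCols mix-in looks more like a gap in the shared params
than a restriction in the Transformer abstraction itself.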

N



Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nicholas Chammas
Thanks for the pointer! A linked issue from the one you shared also appears
to be relevant:

SPARK-8418: "Add single- and multi-value support to ML Transformers"



Fwd: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Michael Allman
FYI, I posted this to user@ and have followed up with a bug report:
https://issues.apache.org/jira/browse/SPARK-17204

Michael

> Begin forwarded message:
> 
> From: Michael Allman 
> Subject: Anyone else having trouble with replicated off heap RDD persistence?
> Date: August 16, 2016 at 3:45:14 PM PDT
> To: user 
> 
> Hello,
> 
> A coworker was having a problem with a big Spark job failing after several 
> hours when one of the executors would segfault. That problem aside, I 
> speculated that her job would be more robust against these kinds of executor 
> crashes if she used replicated RDD storage. She's using off heap storage (for 
> good reason), so I asked her to try running her job with the following 
> storage level: `StorageLevel(useDisk = true, useMemory = true, useOffHeap = 
> true, deserialized = false, replication = 2)`. The job would immediately fail 
> with a rather suspicious looking exception. For example:
> 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086
>   at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 
> or
> 
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collect

Serialization troubles with mutable.LinkedHashMap

2016-08-23 Thread Rahul Palamuttam
Hi,

I initially sent this to the user mailing list; however, I didn't get any
response. I figured this could be a bug, so it might be of more concern to
the dev list.

I recently switched to using Kryo serialization and I've been running into
errors with the mutable.LinkedHashMap class.

If I don't register the mutable.LinkedHashMap class, then I get the
ArrayStoreException shown below.
If I do register the class, then when the LinkedHashMap is collected on the
driver, it does not contain any elements.

Here is the snippet of code I used:

import scala.collection.mutable

import org.apache.spark.{SparkConf, SparkContext}

// Kryo serializer, with LinkedHashMap registered (removing the registration
// triggers the ArrayStoreException below instead).
val sc = new SparkContext(new SparkConf()
  .setMaster("local[*]")
  .setAppName("Sample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[mutable.LinkedHashMap[String, String]])))

// Each element is a LinkedHashMap holding two entries.
val collect = sc.parallelize(0 to 10)
  .map(p => new mutable.LinkedHashMap[String, String]() ++=
    Array(("hello", "bonjour"), ("good", "bueno")))

val mapSideSizes = collect.map(p => p.size).collect()(0)  // size computed on the executors
val driverSideSizes = collect.collect()(0).size           // size after collecting to the driver

println("The sizes before collect : " + mapSideSizes)   // prints 2
println("The sizes after collect : " + driverSideSizes)  // prints 0 when the class is registered


** The following only occurs if I did not register the
mutable.LinkedHashMap class **
16/08/20 18:10:38 ERROR TaskResultGetter: Exception while getting task result
java.lang.ArrayStoreException: scala.collection.mutable.HashMap
  at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
  at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
  at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
  at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:311)
  at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:97)
  at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60)
  at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
  at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
  at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

I hope this is a known issue and/or I'm missing something important in my
setup.
Appreciate any help or advice!

As a bit of background, this was encountered in the SciSpark project being
developed at NASA JPL.
The mutable.LinkedHashMap is necessary because it enables us to deal with
NetCDF attributes in the order they appear in the original NetCDF files.
The test case I posted above was just to show the error I'm seeing more
clearly.
Our actual use case is slightly different, but we see the same result
(empty LinkedHashMaps).
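
One possible workaround, sketched below under the assumption that the `sc` and
imports from the snippet above are in scope: ship the ordered entries as a plain
Seq of pairs and rebuild the LinkedHashMap on the driver after collect(). Whether
this avoids the Kryo issue would need to be verified, but it keeps the ordering
intact either way and sidesteps serializing LinkedHashMap itself.

// Hypothetical workaround sketch: carry the ordering as a Seq of pairs across the
// collect, then rebuild the LinkedHashMap driver-side.
val collectedEntries: Array[Seq[(String, String)]] = sc.parallelize(0 to 10)
  .map(_ => Seq("hello" -> "bonjour", "good" -> "bueno"))
  .collect()

val rebuilt: Array[mutable.LinkedHashMap[String, String]] =
  collectedEntries.map(entries => mutable.LinkedHashMap(entries: _*))

println(rebuilt(0).size)  // 2, with insertion order preserved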

Rahul Palamuttam


Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Reynold Xin
Does this problem still exist on today's master/branch-2.0?

SPARK-16550 was merged. It might be fixed already.
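
For anyone re-testing, a rough reproduction sketch along the lines of the storage
level from the original report (the configuration values here are illustrative, and
replication = 2 only has an effect on a multi-node cluster):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object OffHeapReplicationCheck {
  def main(args: Array[String]): Unit = {
    // Off-heap memory must be enabled for the off-heap storage level to take effect.
    val conf = new SparkConf()
      .setAppName("off-heap-replication-check")
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "1g")
    val sc = new SparkContext(conf)

    // The storage level from the original report: disk + memory + off-heap,
    // serialized, replicated twice.
    val level = StorageLevel(useDisk = true, useMemory = true, useOffHeap = true,
      deserialized = false, replication = 2)

    val rdd = sc.parallelize(1 to 1000000).map(i => (i, i.toString)).persist(level)
    println(rdd.count())  // force materialization (and replication) of the RDD
    sc.stop()
  }
}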


How do we process/scale variable size batches in Apache Spark Streaming

2016-08-23 Thread Rachana Srivastava
I am running a Spark Streaming process where I receive a batch of data every n
seconds. I am using repartition to scale the application. Since the repartition
count is fixed, we get lots of small files when the batch size is very small.
Is there any way I can change the partitioning logic based on the input batch
size in order to avoid lots of small files?
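
One way this is sometimes handled (a sketch only; targetRecordsPerPartition is an
illustrative knob, not a Spark setting): derive the partition count from each batch
inside DStream.transform, so small batches are coalesced into fewer, larger
partitions before being written out.

import scala.reflect.ClassTag

import org.apache.spark.streaming.dstream.DStream

// Coalesce each batch to a partition count proportional to its record count.
def sizeAwareRepartition[T: ClassTag](stream: DStream[T],
                                      targetRecordsPerPartition: Long = 100000L): DStream[T] =
  stream.transform { rdd =>
    // count() adds an extra pass over the batch; an approximate or upstream count
    // could be substituted if that cost is too high.
    val numPartitions = math.max(1L, rdd.count() / targetRecordsPerPartition).toInt
    rdd.coalesce(numPartitions)
  }

The resulting stream can then be written out as usual (for example via foreachRDD),
with the number of output files tracking the batch size instead of a fixed value.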


Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-23 Thread Michael Allman
I've replied on the issue's page, but in a word, "yes". See
https://issues.apache.org/jira/browse/SPARK-17204.

Michael



is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-23 Thread kant kodali

Hi guys,

I have had this question for a very long time, and after diving into the source
code (specifically via the links below) I have a feeling that the lineage of an
RDD (the transformations) is converted into byte code and stored in memory or on
disk. Or, to ask a related question: do we ever store JVM byte code or Python
byte code in memory or on disk? This makes sense to me because if we were to
reconstruct an RDD after a node failure, we would need to go through the lineage
and execute the respective transformations, so storing their byte code would make
sense. However, many people seem to disagree with me, so it would be great if
someone could clarify.

https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1452

https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1471

https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L229
https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241
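
For what it's worth, the lineage itself can be inspected from the driver: on the JVM
side it is a graph of RDD objects, each holding references to its parent RDDs and to
the user function for its transformation, kept in driver memory, and toDebugString
prints that dependency chain. A small sketch (the output shown in comments is
approximate):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("lineage-demo"))

val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// Prints the lineage as a chain of RDDs, roughly:
//   (N) MapPartitionsRDD[2] at filter ...
//    |  MapPartitionsRDD[1] at map ...
//    |  ParallelCollectionRDD[0] at parallelize ...
println(rdd.toDebugString)

When a task is shipped to an executor, the closures referenced by that lineage are
serialized and sent with the task (in PySpark they are pickled with cloudpickle, as
the links above show).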

Re: Spark dev-setup

2016-08-23 Thread Nishadi Kirielle
Hi,
I'm learning how the query execution flow works in Spark SQL. In order to
understand it, I'm attempting to run an example in debug mode with IntelliJ
IDEA. It would be great if anyone could help me with debug configurations.

Thanks & Regards
Nishadi
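
(One approach that often works, sketched below under the assumption of a Spark
checkout or spark-sql dependency opened in IntelliJ with the Scala plugin: run a
small driver with a local master and set breakpoints in classes such as
org.apache.spark.sql.execution.QueryExecution or SparkPlan#execute. If the job is
instead launched via spark-submit, the driver JVM can be made debuggable with
--driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
and attached to with a Remote debug configuration. explain(true) also prints every
plan stage without a debugger.)

import org.apache.spark.sql.SparkSession

// Minimal driver for stepping through Spark SQL query planning and execution.
object SqlDebugExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-debug-example")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "key")
    df.createOrReplaceTempView("t")

    // Breakpoints set in the planner/execution classes are hit while these run.
    val result = spark.sql("SELECT key, count(*) AS cnt FROM t GROUP BY key")
    result.explain(true)  // prints the parsed, analyzed, optimized and physical plans
    result.show()

    spark.stop()
  }
}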

On Tue, Jun 21, 2016 at 4:49 PM, Akhil Das  wrote:

> You can read this documentation to get started with the setup:
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>
> There was a pyspark setup discussion on SO over here:
> http://stackoverflow.com/questions/33478218/write-and-run-pyspark-in-intellij-idea
>
> On Mon, Jun 20, 2016 at 7:23 PM, Amit Rana 
> wrote:
>
>> Hi all,
>>
>> I am interested in figuring out how pyspark works at the core/internal
>> level, and I would like to understand the code flow as well.
>> For that I need to run a simple example in debug mode so that I can
>> trace the data flow for pyspark.
>> Can anyone please guide me on how to set up my development environment
>> for this in IntelliJ IDEA on Windows 7?
>>
>> Thanks,
>> Amit Rana
>>
>
>
>
> --
> Cheers!
>
>