SPARK-13843 Next steps

2016-03-22 Thread Kostas Sakellis
Hello all,

I'd like to close out the discussion on SPARK-13843 by polling the community
on which components we should seriously consider adding back to Apache Spark.
For reference, here are the modules that were removed as part of SPARK-13843
and pushed to: https://github.com/spark-packages

   - streaming-flume
   - streaming-akka
   - streaming-mqtt
   - streaming-zeromq
   - streaming-twitter

For our part, we'd like to see streaming-flume added back to Apache Spark.

Thanks,
Kostas


Re: SPARK-13843 Next steps

2016-03-22 Thread Jean-Baptiste Onofré

Thanks for the update Kostas,

For now, Kafka stays in Spark and Kinesis will be removed, right?

Regards
JB



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




Re: SPARK-13843 Next steps

2016-03-22 Thread Reynold Xin
Kinesis is still in it. I think it's OK to add Flume back.

On Tue, Mar 22, 2016 at 12:29 AM, Jean-Baptiste Onofré wrote:

> Thanks for the update Kostas,
>
> for now, kafka stays in Spark and Kinesis will be removed, right ?
>
> Regards
> JB


Re: SPARK-13843 Next steps

2016-03-22 Thread Jean-Baptiste Onofré

OK, so Kafka, Kinesis and Flume will stay in Spark.

Thanks,
Regards
JB

On 03/22/2016 08:30 AM, Reynold Xin wrote:

Kinesis is still in it. I think it's OK to add Flume back.

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-22 Thread james
Hi,
I also hit an 'Unable to acquire memory' issue using Spark 1.6.1 with dynamic
allocation on YARN. My case happened when setting spark.sql.shuffle.partitions
larger than 200. From the error stack, it differs from the issue reported by
Nezih, and I'm not sure whether the two have the same root cause.

Thanks 
James

16/03/17 16:02:11 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 1912805 bytes
16/03/17 16:02:12 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to hw-node3:55062
16/03/17 16:02:12 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 1912805 bytes
16/03/17 16:02:16 INFO scheduler.TaskSetManager: Starting task 280.0 in stage 153.0 (TID 9390, hw-node5, partition 280,PROCESS_LOCAL, 2432 bytes)
16/03/17 16:02:16 WARN scheduler.TaskSetManager: Lost task 170.0 in stage 153.0 (TID 9280, hw-node5): java.lang.OutOfMemoryError: Unable to acquire 1073741824 bytes of memory, got 1060110796
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:91)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:295)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:330)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
    at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
    at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)






Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-22 Thread Nezih Yigitbasi
Interesting. After experimenting with various parameters, increasing
spark.sql.shuffle.partitions and decreasing spark.buffer.pageSize helped my
job go through. BTW, I will be happy to help get this issue fixed.

Nezih
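
For reference, here is a minimal sketch (with made-up values) of where the two
settings mentioned above live in a Spark 1.6 application: spark.buffer.pageSize
on the SparkConf, and spark.sql.shuffle.partitions either on the conf or per
SQLContext.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Illustrative values only; the point is where each knob is set, not the numbers.
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      .set("spark.buffer.pageSize", "2m")            // smaller page size, as described above
      .set("spark.sql.shuffle.partitions", "400")    // more shuffle partitions

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // The shuffle partition count can also be changed at runtime:
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    // Equivalent at submit time:
    //   spark-submit --conf spark.buffer.pageSize=2m --conf spark.sql.shuffle.partitions=400 ...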



Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-03-22 Thread james
I guess a different workload causes a different result?






Re: Can we remove private[spark] from Metrics Source and Sink traits?

2016-03-22 Thread Steve Loughran

On 19 Mar 2016, at 16:16, Pete Robbins <robbin...@gmail.com> wrote:


There are several open Jiras to add new Sinks

OpenTSDB https://issues.apache.org/jira/browse/SPARK-12194
StatsD https://issues.apache.org/jira/browse/SPARK-11574


statsd is nicely easy to test: either listen in on a (localhost, port) or 
simply create a socket and force it into the sink for the test run


Kafka https://issues.apache.org/jira/browse/SPARK-13392

Some have PRs from 2015 so I'm assuming there is not the desire to integrate 
these into core Spark. Opening up the Sink/Source interfaces would at least 
allow these to exist somewhere such as spark-packages without having to pollute 
the o.a.s namespace


On Sat, 19 Mar 2016 at 13:05, Gerard Maas <gerard.m...@gmail.com> wrote:

+1

On Mar 19, 2016 08:33, "Pete Robbins" <robbin...@gmail.com> wrote:
This seems to me to be unnecessarily restrictive. These are very useful 
extension points for adding 3rd party sources and sinks.

I intend to make an Elasticsearch sink available on spark-packages but this 
will require a single class, the sink, to be in the org.apache.spark package 
tree. I could submit the package as a PR to the Spark codebase, and I'd be 
happy to do that but it could be a completely separate add-on.

There are similar issues with writing a 3rd party metrics source which may not 
be of interest to the community at large so would probably not warrant 
inclusion in the Spark codebase.

Any thoughts?
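
To make the constraint concrete, below is a rough sketch of a third-party sink
(the class name and its behaviour are made up). It assumes the Sink trait keeps
its current shape (start/stop/report) and that sinks are constructed
reflectively with (Properties, MetricRegistry, SecurityManager); because the
trait is private[spark], the class only compiles if it is declared inside the
org.apache.spark package tree, which is exactly the restriction under
discussion.

    // Hypothetical sink; it has to live under org.apache.spark only because
    // the Sink trait it extends is private[spark].
    package org.apache.spark.metrics.sink

    import java.util.Properties
    import scala.collection.JavaConverters._
    import com.codahale.metrics.MetricRegistry
    import org.apache.spark.SecurityManager

    class ConsoleEchoSink(
        val properties: Properties,     // populated from metrics.properties
        val registry: MetricRegistry,   // Spark's shared Codahale registry
        securityMgr: SecurityManager) extends Sink {

      private val pollPeriodSecs =
        Option(properties.getProperty("period")).map(_.toInt).getOrElse(10)

      override def start(): Unit =
        println(s"ConsoleEchoSink started, poll period = $pollPeriodSecs s")

      override def stop(): Unit =
        println("ConsoleEchoSink stopped")

      // Dump current gauge values; a real sink would push them to an external system.
      override def report(): Unit =
        registry.getGauges.asScala.foreach { case (name, gauge) =>
          println(s"$name = ${gauge.getValue}")
        }
    }

It would then be wired up through metrics.properties in the usual way, e.g. a
line such as *.sink.echo.class=org.apache.spark.metrics.sink.ConsoleEchoSink
(the sink name "echo" is also made up).

As a hedged illustration of the statsd testing approach mentioned above
(assuming the sink under test emits plain UDP datagrams in the usual
"name:value|type" form), a test can simply bind an ephemeral local port and
read one datagram back:

    import java.net.{DatagramPacket, DatagramSocket}

    val socket = new DatagramSocket(0)            // ephemeral localhost port
    val port = socket.getLocalPort                // point the sink under test here
    val buf = new Array[Byte](1024)
    val packet = new DatagramPacket(buf, buf.length)
    socket.receive(packet)                        // blocks until the sink reports
    val datagram = new String(packet.getData, 0, packet.getLength, "UTF-8")
    assert(datagram.matches(".+:.+\\|(c|g|ms)"))  // looks like a statsd metric
    socket.close()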



StatefulNetworkWordCount behaviour

2016-03-22 Thread Rishi Mishra
I am trying out StatefulNetworkWordCount from the latest Spark master branch.
When I run this example I see an odd behaviour: if a key is repeated within a
batch, the output stream prints a record for each repetition. For example, if
I key in "ab" five times as input, it shows:

(ab,1)
(ab,2)
(ab,3)
(ab,4)
(ab,5)

Is it the intended behaviour to show every occurrence of the word, or is it a
bug? As a user I would expect only the last entry, (ab,5); otherwise users
have to put logic in application code to get the latest value. I know we can
do this with a snapshot, but IMO the updated stream should give us similar
functionality.

Is there a reason for not doing this, i.e. for a given key, if multiple
outputs are generated, returning only the last one?

Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra
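
For what it is worth, here is a sketch of the distinction in question, assuming
the mapWithState API that the current StatefulNetworkWordCount example is built
on (the socket source, port, and checkpoint path below are illustrative): the
mapped stream emits one record per input record, while stateSnapshots() yields
only the latest value per key.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("MapWithStateSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/mapwithstate-sketch")        // mapWithState requires checkpointing

    val lines = ssc.socketTextStream("localhost", 9999)   // feed with e.g. `nc -lk 9999`
    val words = lines.flatMap(_.split(" ")).map(w => (w, 1))

    // Mapping function: add the new count to whatever is already in state.
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)   // emitted once per input record, hence (ab,1) ... (ab,5)
    }

    val mapped = words.mapWithState(StateSpec.function(mappingFunc))
    mapped.print()                    // the per-record updates observed above
    mapped.stateSnapshots().print()   // only the latest value per key, e.g. one (ab,5) per batch

    ssc.start()
    ssc.awaitTermination()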


Re: error occurs to compile spark 1.6.1 using scala 2.11.8

2016-03-22 Thread Ted Yu
From the error message, it seems some artifacts from Scala 2.10.4 were left
around.

FYI, Maven 3.3.9 is required for the master branch.

On Tue, Mar 22, 2016 at 3:07 AM, Allen  wrote:

> Hi,
>
> I am facing an error when doing compilation from IDEA, please see the
> attached. I fired the build process by clicking "Rebuild Project" in
> "Build" menu in IDEA IDE.
>
> more info here:
> Spark 1.6.1 + scala 2.11.8 + IDEA 15.0.3 + Maven 3.3.3
>
> I can build spark 1.6.1 with scala 2.10.4 successfully in the same way.
>
>
> Help!
> BR,
> Allen Zhang


toPandas very slow

2016-03-22 Thread Josh Levy-Kramer
Hi,

A common pattern in my work is querying large tables as Spark DataFrames
and then doing more detailed analysis locally once the data fits in memory.
However, I've hit a few blockers. In Scala no well-developed DataFrame
library exists, and in Python the `toPandas` function is very slow. As
Pandas is one of the best DataFrame libraries out there, it may be worth
spending some time making the `toPandas` method more efficient.

Having a quick look at the code, it looks like a lot of iteration is
happening on the Python side. Python is really slow at iterating over large
loops and this should be avoided; if iteration does have to occur, it's best
done in Cython. Has anyone looked at Cythonising the process? Or, even
better, serialising directly to NumPy arrays instead of the intermediate
lists of Rows?

Here are some links to the current code:

topandas:
https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L1342

collect:
https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L233

_load_from_socket:
https://github.com/apache/spark/blob/a60f91284ceee64de13f04559ec19c13a820a133/python/pyspark/rdd.py#L123

Josh Levy-Kramer
Data Scientist @ Starcount


Job description only visible after job finish

2016-03-22 Thread hansbogert
Hi, 

I'm trying to do some dynamic scheduling from an external application by
looking at the jobs in a Spark framework.

I need the job description to know which kind of query I'm dealing with. The
problem is that for a job with multiple stages, the job description (set with
sparkCtx.setJobDescription) is not shown in the UI and, more importantly in my
case, is not retrievable from the API endpoint
`host:api/v1/applications/x/jobs`.

Is there a reason why the job description is not shown for unfinished,
multi-stage jobs?

Further info: Spark 1.5.1; the jobs are for now submitted ad hoc through the
spark-shell.

Thanks, 

Hans
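
For reference, a rough sketch of the pattern being described (the description
string and input path are made up); the open question is whether the
description shows up in the UI and in the jobs REST endpoint while the
multi-stage job is still running.

    // In the spark-shell. The description applies to jobs submitted afterwards
    // from the same thread.
    sc.setJobDescription("ad-hoc query: count events per user")

    val counts = sc.textFile("hdfs:///data/events")      // illustrative input
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)                                // shuffle => multi-stage job
    counts.count()

    // While the job runs, poll the REST endpoint mentioned above, e.g.
    //   http://<driver>:4040/api/v1/applications/<app-id>/jobs
    // The report in this thread is that the description only becomes visible
    // once the job has finished.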






new object store driver for Spark

2016-03-22 Thread Gil Vernik
We recently released an object store connector for Spark:
https://github.com/SparkTC/stocator
Currently this connector contains a driver for Swift-based object stores
(such as SoftLayer or any other Swift cluster), but it can easily support
additional object stores. There is a pending patch to support the Amazon S3
object store.

The major highlight is that this connector doesn't create any temporary
files, so it achieves very fast response times when Spark persists data in
the object store. The new connector supports speculative execution and covers
various failure scenarios (such as two Spark tasks writing into the same
object, or partially corrupted data due to runtime exceptions in the Spark
master). It also covers https://issues.apache.org/jira/browse/SPARK-10063 and
other known issues.

The detailed fault-tolerance algorithm will be released very soon. For now,
those who are interested can view the implementation in the code itself.

https://github.com/SparkTC/stocator contains all the details on how to set it
up and use it with Spark.

A series of tests showed that the new connector obtains a 70% improvement for
write operations from Spark to Swift and about a 30% improvement for read
operations from Swift into Spark (compared to the existing driver that Spark
uses to integrate with objects stored in Swift).

There is ongoing work to add more coverage and fix some known bugs and
limitations.

All the best
Gil





Re: toPandas very slow

2016-03-22 Thread Mark Vervuurt
Hi Josh,

The workaround we figured out for the network-latency and out-of-memory
problems with the toPandas method was to create Pandas DataFrames or NumPy
arrays per partition using mapPartitions. Maybe a standard solution along this
line of thought could be built. The integration is quite tedious ;)

I hope this helps.

Regards,
Mark




Re: toPandas very slow

2016-03-22 Thread Wes McKinney
hi all,

I recently did an analysis of the performance of toPandas

summary: http://wesmckinney.com/blog/pandas-and-apache-arrow/
ipython notebook: https://gist.github.com/wesm/0cb5531b1c2e346a0007

One solution I'm planning for this is an alternate serializer for
Spark DataFrames, with an optimized (C++ / Cython) conversion to a
pandas.DataFrame on the Python side:

https://issues.apache.org/jira/browse/SPARK-13534

I'm happy to discuss in more detail with those interested. The basic
idea is that deserializing binary data directly into NumPy arrays is
what you want, but you need some array-oriented / columnar memory
representation to push over the wire. Apache Arrow is designed
specifically for this use case.

best,
Wes





Re: toPandas very slow

2016-03-22 Thread Josh Levy-Kramer
Hi all,

Wes, I read your thread earlier today after I sent this message, and it's
exciting to have someone of your caliber working on the issue :)

For a short-term solution, I've created a Gist that performs the toPandas
operation using the mapPartitions method suggested by Mark:
https://gist.github.com/joshlk/871d58e01417478176e7

Regards,
Josh




-- 


Josh Levy-Kramer
Data Scientist


+44 (0) 203 770 7554
+44 (0) 781 797 0736
Henry Wood House, 2 Riding House Street, W1W 7FA London

www.starcount.com



EclairJS for "Powered by Spark" Wiki page

2016-03-22 Thread DavidFallside
Can someone please post the following information on the "Powered by Spark"
wiki page? Thank you.

Organization: IBM www.ibm.com/spark
 
Project URL: https://github.com/EclairJS/eclairjs-node
 
Brief project description: EclairJS enables Node.js developers to code
against Spark, and data scientists to use Javascript in Jupyter notebooks.






Re: SPARK-13843 Next steps

2016-03-22 Thread Marcelo Vanzin
+1 for getting flume back.




-- 
Marcelo




Re: SPARK-13843 Next steps

2016-03-22 Thread Cody Koeninger
I'm in favor of everything in /extras and /external being removed, but
I'm more in favor of making a decision and moving on.

