Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
Hi, I have some operations on a DataFrame / Dataset. How can I see the source code produced by whole-stage codegen? Is there an API for this? Or should I configure log4j in some specific way? Regards, -- Maciek Bryński

Re: Result code of whole stage codegen

2016-08-05 Thread Herman van Hövell tot Westerflier
Do you want to see the code that whole-stage codegen produces? You can prepend a SQL statement with EXPLAIN CODEGEN ..., or you can add the following import to a DataFrame/Dataset program: import org.apache.spark.sql.execution.debug._ and then call the debugCodegen() command on a DataFrame/Dataset.
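For illustration, a minimal sketch of both routes in Scala (Spark 2.0-era API; the query itself is made up for the example):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Dataset

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Route 1: prepend EXPLAIN CODEGEN to a SQL statement.
    spark.range(100).createOrReplaceTempView("t")
    spark.sql("EXPLAIN CODEGEN SELECT sum(id) FROM t").show(truncate = false)

    // Route 2: call debugCodegen() on the DataFrame/Dataset itself.
    spark.range(100).selectExpr("sum(id)").debugCodegen()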

Re: Spark SQL and Kryo registration

2016-08-05 Thread Maciej Bryński
Hi Olivier, did you check the performance of Kryo? In my observations, Kryo is slightly slower than the Java serializer. Regards, Maciek 2016-08-04 17:41 GMT+02:00 Amit Sela: > It should. Codegen uses the SparkConf in SparkEnv when instantiating a new Serializer.
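For reference, a minimal sketch of the kind of registration being discussed (MyRecord is an illustrative class, not something from the thread):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, name: String)  // illustrative payload class

    val conf = new SparkConf()
      .setAppName("kryo-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional: fail fast when a class is serialized without being registered.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(classOf[MyRecord]))

Whether this beats the default Java serializer is workload-dependent, which is consistent with Maciej's observation.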

Re: Result code of whole stage codegen

2016-08-05 Thread Maciej Bryński
Thank you. That was it. Regards, Maciek

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Nicholas Chammas
Okie doke, I've filed a JIRA for this here: https://issues.apache.org/jira/browse/SPARK-16921 On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin wrote: > Sounds like a great idea!

Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread jpivar...@gmail.com
In a few earlier posts [1] [2] …

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Holden Karau
Spark does not currently support Apache Arrow. Probably a good place to chat would be the Arrow mailing list, where they are making progress towards unified JVM & Python/R support, which is sort of a precondition of a functioning Arrow interface between Spark and Python.

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Jeremy Smith
If you had a persistent, off-heap buffer of Arrow data on each executor, and you could get an iterator over that buffer from inside a task, then you could conceivably define an RDD over it by just extending RDD and returning the iterator from the compute method. If you want to make a Dataset or …
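A minimal sketch of that idea, under heavy assumptions: Record and ArrowBuffers below are hypothetical stand-ins for the buffer's row type and an executor-side accessor, not real APIs; only the RDD-subclassing pattern is the point.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    case class Record(id: Long, value: Double)  // hypothetical row type

    object ArrowBuffers {  // hypothetical executor-side accessor, stubbed out
      def iterator(partitionId: Int): Iterator[Record] = Iterator.empty
    }

    case class ArrowPartition(index: Int) extends Partition

    class ArrowBufferRDD(sc: SparkContext, numPartitions: Int)
        extends RDD[Record](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        (0 until numPartitions).map(i => ArrowPartition(i): Partition).toArray

      // The crux of the suggestion: compute() just hands back the iterator
      // over the executor-local buffer for this partition.
      override def compute(split: Partition, context: TaskContext): Iterator[Record] =
        ArrowBuffers.iterator(split.index)
    }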

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Jim Pivarski
I see. I've already started working with Arrow-C++ and talking to members of the Arrow community, so I'll keep doing that. As a follow-up question, is there an approximate timescale for when Spark will support Arrow? I'd just like to know that all the pieces will come together eventually.

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Holden Karau
I don't think there is an approximate timescale right now, and it's likely any implementation would depend on a solid Java implementation of Arrow being ready first (or even a guarantee that it necessarily will be - although I'm interested in making it happen in some places where it makes sense).

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Nicholas Chammas
Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Koert Kuipers
The tricky part is that the action needs to be inside the with block, not just the transformation that uses the persisted data.
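A minimal sketch of the usage being proposed (a hand-rolled wrapper to illustrate Koert's point, not the actual SPARK-16921 implementation):

    from contextlib import contextmanager
    from pyspark.sql import SparkSession

    @contextmanager
    def persisted(df):
        """Persist df on entry, unpersist it on exit."""
        df.persist()
        try:
            yield df
        finally:
            df.unpersist()

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    with persisted(spark.range(1000)) as df:
        # Actions must run here, inside the block; once the block exits the
        # data is unpersisted, and later jobs would recompute it from scratch.
        total = df.count()
        sample = df.limit(5).collect()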

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Jim Pivarski
On Fri, Aug 5, 2016 at 5:14 PM, Nicholas Chammas wrote: > Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534 Thank you. This ticket describes output from Spark to Arrow for flat (non-nested) tables. Are there no plans for input from Arrow to Spark for general types? Did I misunderstand?

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Nicholas Chammas
Good point. Do you think it's sufficient to note this somewhere in the documentation (or simply assume that user understanding of transformations vs. actions means they know this), or are there other implications that need to be considered?

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Nicholas Chammas
Don't know much about Spark + Arrow efforts myself; just wanted to share the reference.

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Micah Kornfield
Hi everyone, I'm an Arrow contributor, mostly on the C++ side of things, but I'll try to give a brief update of where I believe the project currently is (the views are my own, but hopefully fairly accurate :). I think in the long run the diagram mentioned by Jim is where we would like Arrow to …

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Koert Kuipers
I think it limits the usability of the with statement, and it could be somewhat confusing because of this, so I would mention it in the docs. I like the idea though.

Spark requires sysctl tuning? Servers unresponsive

2016-08-05 Thread Ruslan Dautkhanov
Hello, when I start a Spark notebook, some of the servers exhaust certain Linux kernel resources, to the point that I can't even ssh to those nodes. And it's not due to the servers being hammered: it happens when no Spark jobs/tasks are running. To reproduce this problem, it's enough to just start …