executor lost when running sparksql

2016-01-05 Thread qinggangwa...@gmail.com
Hi all, I am running Spark SQL in the HiveQL dialect; the SQL is like "select * from (select * from t1 order by t1.id desc) as ff". The SQL succeeds when it runs only once, but it fails when I run it five times at the same time. It seems that the thread is dumped and executors are lost. T

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Hamel Kothari
The "Too Many Files" part of the exception just indicates that, when that call was made, too many files were already open. It doesn't necessarily mean that that line is the source of all of the open files; that's just the point at which it hit its limit. What I would recommend is to
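To see how close a process is to that limit, the per-process file-descriptor ceiling can be inspected from Python's standard library. A minimal sketch (plain Python, independent of Spark):

```python
import resource

# Inspect the per-process limit on open file descriptors. When shuffle
# files (or unclosed application streams) push the process past the soft
# limit, the next open() fails with "Too many open files".
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

# A process may raise its own soft limit up to the hard limit without root:
# resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

On most cluster nodes the real fix is raising the OS-level `ulimit -n` for the user running the executors.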

Re: UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Tathagata Das
Both mapWithState and updateStateByKey use the HashPartitioner by default, and hash the key in the key-value DStream on which the state operation is applied. The new data and the state are partitioned with the exact same partitioner, so that the same keys from the new data (from the input DStream) get shuffl
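The co-location property described above can be sketched in plain Python (this is an illustration of the idea, not Spark's actual Scala HashPartitioner):

```python
def partition_for(key, num_partitions):
    # Hash partitioning: the partition is derived only from the key, as a
    # non-negative modulo of the key's hash, so a given key always maps
    # to the same partition for a fixed partition count.
    return hash(key) % num_partitions

# New batch data and existing state hashed the same way land together,
# so the state update for a key never needs to look at other partitions.
p1 = partition_for("user-42", 8)
p2 = partition_for("user-42", 8)
assert p1 == p2
```

Within one interpreter run the mapping is stable; across processes, Python string hashing is salted, which is one reason real systems use a deterministic hash for this.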

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Priya Ch
Yes, the FileInputStream is closed. Maybe I didn't show it in the screenshot. As Spark implements sort-based shuffle, there is a parameter called maximum merge factor which decides the number of files that can be merged at once, and this avoids too many open files. I am suspecting that it is somet
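One way to check whether descriptors are actually being released is to count them directly. A small sketch (Linux-specific, via `/proc`; elsewhere `lsof -p <pid>` serves the same purpose):

```python
import os

def open_fd_count(pid=None):
    # Count the file descriptors currently held by a process by listing
    # /proc/<pid>/fd (Linux-specific).
    pid = os.getpid() if pid is None else pid
    return len(os.listdir("/proc/%d/fd" % pid))

before = open_fd_count()
f = open("/tmp/fd-demo.txt", "w")  # each open stream consumes one descriptor
after = open_fd_count()
f.close()
print(after - before)  # grows by one per unclosed stream
```

Sampling this counter while the streaming job runs shows whether the count climbs steadily (a leak) or plateaus (the limit is simply too low).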

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jeff Zhang
+1 On Wed, Jan 6, 2016 at 9:18 AM, Juliet Hougland wrote: > Most admins I talk to about python and spark are already actively (or on > their way to) managing their cluster python installations. Even if people > begin using the system python with pyspark, there is eventually a user who > needs a

UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Soumitra Johri
Hi, I am relatively new to Spark and am using updateStateByKey() operation to maintain state in my Spark Streaming application. The input data is coming through a Kafka topic. 1. I want to understand how are DStreams partitioned? 2. How does the partitioning work with mapWithState() or u

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Juliet Hougland
Most admins I talk to about python and spark are already actively (or on their way to) managing their cluster python installations. Even if people begin using the system python with pyspark, there is eventually a user who needs a complex dependency (like pandas or sklearn) on the cluster. No admin

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
I don't think that we're planning to drop Java 7 support for Spark 2.0. Personally, I would recommend using Java 8 if you're running Spark 1.5.0+ and are using SQL/DataFrames so that you can benefit from improvements to code cache flushing in the Java 8 JVMs. Spark SQL's generated classes can fill

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
hey evil admin :) i think the bit about java was from me? if so, i meant to indicate that the reality for us is java 1.7 on most (all?) clusters. i do not believe spark prefers java 1.8. my point was that even though java 1.7 is getting old as well, it would be a major issue for me if spark drop

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
> > Note that you _can_ use a Python 2.7 `ipython` executable on the driver > while continuing to use a vanilla `python` executable on the executors Whoops, just to be clear, this should actually read "while continuing to use a vanilla `python` 2.7 executable". On Tue, Jan 5, 2016 at 3:07 PM, Jo
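The driver/executor split mentioned above comes down to two environment variables that PySpark reads at launch. A minimal sketch (the values shown are illustrative):

```python
import os

# PySpark picks its interpreters from these variables:
os.environ["PYSPARK_DRIVER_PYTHON"] = "ipython"  # driver frontend only
os.environ["PYSPARK_PYTHON"] = "python2.7"       # executors (and driver default)

# Only the driver gets the IPython shell; executors still run the plain
# interpreter. Both must be the same major.minor Python version, since
# pickled closures are exchanged between them.
print(os.environ["PYSPARK_DRIVER_PYTHON"], os.environ["PYSPARK_PYTHON"])
```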

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
Yep, the driver and executors need to have compatible Python versions. I think that there are some bytecode-level incompatibilities between 2.6 and 2.7 which would impact the deserialization of Python closures, so I think you need to be running the same 2.x version for all communicating Spark proce

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python installed since they run Python code in PySpark jobs natively. On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas < > nicholas.cham...@g

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
if python 2.7 only has to be present on the node that launches the app (does it?) then that could be important indeed. On Tue, Jan 5, 2016 at 6:02 PM, Koert Kuipers wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote:

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
interesting i didnt know that! On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas wrote: > even if python 2.7 was needed only on this one machine that launches the > app we can not ship it with our software because its gpl licensed > > Not to nitpick, but maybe this is important. The Python licens

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
even if python 2.7 was needed only on this one machine that launches the app we can not ship it with our software because its gpl licensed Not to nitpick, but maybe this is important. The Python license is GPL-compatible but not GPL: Note GPL-compatible do

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661 On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers wrote: > i do not think so. > > does the python 2.7 need to be installed on all slaves? if so, we do not > have direct access to those. > > also, spark is easy for us to ship with our sof

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
i do not think so. does the python 2.7 need to be installed on all slaves? if so, we do not have direct access to those. also, spark is easy for us to ship with our software since its apache 2 licensed, and it only needs to be present on the machine that launches the app (thanks to yarn). even if

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I imagine that they're also capable of installing a standalone Python alongside that Spark version (without changing Python systemwide). For instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x without impac
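The Anaconda/Miniconda route suggested above can be sketched in two commands (environment name and paths here are illustrative, not prescribed):

```shell
# Install an isolated Python 2.7 alongside the system python, no root needed:
conda create -n spark-py27 python=2.7

# Point PySpark at the new interpreter when launching jobs:
export PYSPARK_PYTHON="$HOME/miniconda/envs/spark-py27/bin/python"
```

The system python on the RHEL nodes stays untouched, which is usually the admin's main concern.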

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
yeah, the practical concern is that we have no control over java or python version on large company clusters. our current reality for the vast majority of them is java 7 and python 2.6, no matter how outdated that is. i dont like it either, but i cannot change it. we currently don't use pyspark s

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until 2020. So I'm assuming these large companies will have the option of riding out Python 2.6 until then. Are we seriously saying that Spark should likewise support Python 2.6 for the next several years? Even though the core Pyth

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Julio Antonio Soto de Vicente
Unfortunately, Koert is right. I've been in a couple of projects using Spark (banking industry) where CentOS + Python 2.6 is the toolbox available. That said, I believe it should not be a concern for Spark. Python 2.6 is old and busted, which is totally opposite to the Spark philosophy IMO.

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Ted Yu
+1 > On Jan 5, 2016, at 10:49 AM, Davies Liu wrote: > > +1 > > On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas > wrote: >> +1 >> >> Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python >> 2.6 is ancient history and the core Python developers stopped supporting it >> in

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
rhel/centos 6 ships with python 2.6, doesnt it? if so, i still know plenty of large companies where python 2.6 is the only option. asking them for python 2.7 is not going to work so i think its a bad idea On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland wrote: > I don't see a reason Spark 2.0 w

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Juliet Hougland
I don't see a reason Spark 2.0 would need to support Python 2.6. At this point, Python 3 should be the default that is encouraged. Most organizations acknowledge that 2.7 is common, but lagging behind the version they should theoretically use. Dropping python 2.6 support sounds very reasonable to me

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas wrote: > +1 > > Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python > 2.6 is ancient history and the core Python developers stopped supporting it > in 2013. RHEL 5 is not a good enough reason to continue support for Pytho

Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
I have a very simple two-line program. I am getting input from Kafka, saving the input in a file, and counting the input received. My code looks like this; when I run it I am getting two accumulator counts for each input. HashMap kafkaParams = new HashMap(); kafkaParams.put("metadata.
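Without the full listing it is hard to say, but a common cause of exactly-doubled accumulator values is the same batch being processed twice: two actions over the same lineage, or a task retry, replays side-effecting updates. A plain-Python sketch of the effect (no Spark involved; names are hypothetical):

```python
# Why accumulator-style counters can double count: if the task body is
# re-executed (a retry, or a second action that recomputes the same RDD),
# its side effects are applied again.
counter = {"records": 0}

def process_partition(records):
    for _ in records:
        counter["records"] += 1  # side effect inside the task body
    return records

batch = ["a", "b", "c"]
process_partition(batch)  # first action (e.g. saving to a file)
process_partition(batch)  # second action recomputes the same lineage
print(counter["records"])  # 6, not 3
```

Caching the RDD before running both actions, or counting via a transformation rather than a side effect, avoids the replay.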

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1 Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python 2.6 is ancient history and the core Python developers stopped supporting it in 2013. RHEL 5 is not a good enough reason to continue support for Python

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Allen Zhang
plus 1, we are currently using python 2.7.2 in our production environment. On 2016-01-05 18:11:45, "Meethu Mathew" wrote: +1 We use Python 2.7 Regards, Meethu Mathew On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin wrote: Does anybody here care about us dropping support for Python 2.6 in Spark

RE: Support off-loading computations to a GPU

2016-01-05 Thread Kazuaki Ishizaki
Hi Alexander, Thank you for your interest. We used an LR (logistic regression) derived from a Spark sample program https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkLR.scala (not from mllib or ml). Here are the Scala source files for the GPU and non-GPU versions. GPU: h

Re:Support off-loading computations to a GPU

2016-01-05 Thread Kazuaki Ishizaki
Hi Allen, Thank you for your interest. For a quick start, I prepared a new page "Quick Start" at https://github.com/kiszk/spark-gpu/wiki/Quick-Start. You can install the package with two lines and run a sample program with one line. By "off-loading" we mean exploiting the GPU for a task exe

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Meethu Mathew
+1 We use Python 2.7 Regards, Meethu Mathew On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin wrote: > Does anybody here care about us dropping support for Python 2.6 in Spark > 2.0? > > Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json > parsing) when compared with Python 2.7. S

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Sean Owen
+juliet for an additional opinion, but FWIW I think it's safe to say that future CDH will have a more consistent Python story and that story will support 2.7 rather than 2.6. On Tue, Jan 5, 2016 at 7:17 AM, Reynold Xin wrote: > Does anybody here care about us dropping support for Python 2.6 in Sp

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread yash datta
+1 On Tue, Jan 5, 2016 at 1:57 PM, Jian Feng Zhang wrote: > +1 > > We use Python 2.7+ and 3.4+ to call PySpark. > > 2016-01-05 15:58 GMT+08:00 Kushal Datta : > >> +1 >> >> >> Dr. Kushal Datta >> Senior Research Scientist >> Big Data Research & Pathfinding >> Intel Corporation, USA. >> >> On

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jian Feng Zhang
+1 We use Python 2.7+ and 3.4+ to call PySpark. 2016-01-05 15:58 GMT+08:00 Kushal Datta : > +1 > > > Dr. Kushal Datta > Senior Research Scientist > Big Data Research & Pathfinding > Intel Corporation, USA. > > On Mon, Jan 4, 2016 at 11:52 PM, Jean-Baptiste Onofré > wrote: > >> +1 >> >> no