Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Reynold Xin
You can look into its source code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana wrote: > Hi all, > > Did anyone get a chance to look into it?? > Any sort of guidance will be much appreciated

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
Hi all, Did anyone get a chance to look into it?? Any sort of guidance will be much appreciated. Thanks, Amit Rana On 7 Jul 2016 14:28, "Amit Rana" wrote: > As mentioned in the documentation: > PythonRDD objects launch Python subprocesses and communicate with them > using pipes, sending the user's code and the data to be processed

Re: spark1.6.2 ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread Sun Rui
Maybe related to "parquet-provided"? Remove the "parquet-provided" profile when making the distribution, or add the parquet jar to the classpath when running Spark. > On Jul 8, 2016, at 09:25, kevin wrote: > > parquet-provided

spark1.6.2 ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread kevin
hi, all: I built spark 1.6.2 from source with: ./make-distribution.sh --name "hadoop2.7.1" --tgz "-Pyarn,hadoop-2.6,parquet-provided,hive,hive-thriftserver" -DskipTests -Dhadoop.version=2.7.1 When I try to run ./bin/run-example sql.RDDRelation or ./spark-shell, I met the error (but I can r

Re: Stopping Spark executors

2016-07-07 Thread Mr rty ff
Hi, I am sorry but it's still not clear. Do you mean ./bin/spark-shell --master local? And what do I do after that? Killing the org.apache.spark.deploy.SparkSubmit --master local --class org.apache.spark.repl.Main --name Spark shell spark-shell will kill the shell, so I couldn't send the commands. Thanks

Re: Bad JIRA components

2016-07-07 Thread Nicholas Chammas
Thanks Reynold. On Thu, Jul 7, 2016 at 5:03 PM Reynold Xin wrote: > I deleted those. > > > On Thu, Jul 7, 2016 at 1:27 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> >> https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-p

Re: Stopping Spark executors

2016-07-07 Thread Jacek Laskowski
Hi, Then use --master with spark standalone, yarn, or mesos. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Jul 7, 2016 at 10:35 PM, Mr rty ff wrote: > I do

Re: Bad JIRA components

2016-07-07 Thread Reynold Xin
I deleted those. On Thu, Jul 7, 2016 at 1:27 PM, Nicholas Chammas wrote: > > https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-panel > > There are several bad components in there, like docs, MLilb, and sq;. > I’ve updated the issues that

Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2016-07-07 Thread Tom Graves
I think the problem comes in with your definition, as well as people's interpretation of it. I don't agree with your statement of "where the "how" is different from the "what"". This could apply to a lot of things. I could easily file a jira that says remove synchronization on routine x, th

Re: Stopping Spark executors

2016-07-07 Thread Mr rty ff
I don't think it's the proper way to recreate the bug, because I should continue to send commands to the shell. They are talking about killing the CoarseGrainedExecutorBackend. On Thursday, July 7, 2016 11:32 PM, Jacek Laskowski wrote: Hi, It appears you're running local mode (local[*] assumed)

Re: Stopping Spark executors

2016-07-07 Thread Jacek Laskowski
Hi, It appears you're running local mode (local[*] assumed) so killing spark-shell *will* kill the one and only executor -- the driver :) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.c

Re: Stopping Spark executors

2016-07-07 Thread Mr rty ff
This is what I get when I run the command: 946 sun.tools.jps.Jps -lm 7443 org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell I don't think that should kill the SparkSubmit process. On Thursday, July 7, 2016 9:58 PM, Jacek Laskowski wrote: Hi,

Bad JIRA components

2016-07-07 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-panel There are several bad components in there, like docs, MLilb, and sq;. I’ve updated the issues that were assigned to them, but I don’t know if there is a way to delete these components

Re: Expanded docs for the various storage levels

2016-07-07 Thread Nicholas Chammas
JIRA is here: https://issues.apache.org/jira/browse/SPARK-16427 On Thu, Jul 7, 2016 at 3:18 PM Reynold Xin wrote: > Please create a patch. Thanks! > > > On Thu, Jul 7, 2016 at 12:07 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I’m looking at the docs here: >> >> >> http://spa

Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2016-07-07 Thread Sean Owen
I don't agree that every change needs a JIRA, myself. Really, we didn't choose to have this system split across JIRA and Github PRs. It's necessitated by how the ASF works (and with some good reasons). But while we have this dual system, I figure, let's try to make some sense of it. I think it mak

Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Michael Allman
FYI if you just want to look at the source code, there are source jars for those binary versions in maven central. I was just looking at the metastore source code last night. Michael > On Jul 7, 2016, at 12:13 PM, Jonathan Kelly wrote: > > I'm not sure, but I think it's > https://github.com/

Re: Expanded docs for the various storage levels

2016-07-07 Thread Reynold Xin
Please create a patch. Thanks! On Thu, Jul 7, 2016 at 12:07 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I’m looking at the docs here: > > > http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel >

Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Jonathan Kelly
I'm not sure, but I think it's https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2. It would be really nice though to have this whole process better documented and more "official" than just building from somebody's personal fork of Hive. Or is there some way that the Spark community could

Expanded docs for the various storage levels

2016-07-07 Thread Nicholas Chammas
I’m looking at the docs here: http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel A newcomer to Spark won’t understand the meaning of _2, or the meaning of _SER (or its value), and won’t
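The suffixes being asked about can be illustrated with a plain-Python sketch mirroring the flags pyspark's StorageLevel constructor takes (useDisk, useMemory, useOffHeap, deserialized, replication); the field values below are illustrative of the naming convention, not taken from Spark's source:

```python
from collections import namedtuple

# Sketch of the flags a pyspark StorageLevel encodes. "_SER" means the data
# is kept as serialized bytes (deserialized=False); "_2" means each partition
# is replicated on two nodes (replication=2).
StorageLevel = namedtuple(
    "StorageLevel",
    ["use_disk", "use_memory", "use_off_heap", "deserialized", "replication"])

MEMORY_ONLY = StorageLevel(False, True, False, True, 1)
MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)  # serialized form
MEMORY_ONLY_2 = StorageLevel(False, True, False, True, 2)     # 2x replication

assert MEMORY_ONLY_SER.deserialized is False
assert MEMORY_ONLY_2.replication == 2
```

Expanded docs could spell out exactly this kind of decomposition for each level.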

Re: Stopping Spark executors

2016-07-07 Thread Jacek Laskowski
Hi, Use jps -lm and see the processes on the machine(s) to kill. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Jul 6, 2016 at 9:49 PM, Mr rty ff wrote: > H

Re: SPARK-8813 - combining small files in spark sql

2016-07-07 Thread Reynold Xin
When using native data sources (e.g. Parquet, ORC, JSON, ...), partitions are automatically merged so they would add up to a specific size, configurable by spark.sql.files.maxPartitionBytes. spark.sql.files.openCostInBytes is used to specify the cost of each "file". That is, an empty file will be
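The effect of the two settings can be sketched in plain Python: each file is "charged" openCostInBytes on top of its size, and files are greedily packed into a partition until maxPartitionBytes is reached. This is an illustrative model, not Spark's actual implementation:

```python
def pack_files(file_sizes, max_partition_bytes, open_cost_in_bytes):
    """Greedily coalesce small files into partitions. Each file costs its
    size plus open_cost_in_bytes, so many tiny files still get merged
    rather than producing one partition per file. Illustrative sketch only."""
    partitions, current, current_bytes = [], [], 0
    for size in file_sizes:
        cost = size + open_cost_in_bytes
        if current and current_bytes + cost > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += cost
    if current:
        partitions.append(current)
    return partitions

# Ten 1 MB files with a 128 MB target and 4 MB open cost merge into
# a single partition instead of ten.
print(len(pack_files([1 << 20] * 10, 128 << 20, 4 << 20)))  # -> 1
```

A higher open cost biases the packing toward fewer, larger partitions, which is why an empty file is not free.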

Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2016-07-07 Thread Tom Graves
Popping this back up to the dev list again.  I see a bunch of checkins with minor or hotfix.   It seems to me we shouldn't be doing this, but I would like to hear thoughts from others.  I see no reason we can't have a jira for each of those issues, it only takes a few seconds to file one and it

Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Marcelo Vanzin
(Actually that's "spark" and not "spark2", so yeah, that doesn't really answer the question.) On Thu, Jul 7, 2016 at 11:38 AM, Marcelo Vanzin wrote: > My guess would be https://github.com/pwendell/hive/tree/release-1.2.1-spark > > On Thu, Jul 7, 2016 at 11:37 AM, Zhan Zhang wrote: >> I saw the p

Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Marcelo Vanzin
My guess would be https://github.com/pwendell/hive/tree/release-1.2.1-spark On Thu, Jul 7, 2016 at 11:37 AM, Zhan Zhang wrote: > I saw the pom file having hive version as > 1.2.1.spark2. But I cannot find the branch in > https://github.com/pwendell/ > > Does anyone know where the repo is? > > Tha

Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Zhan Zhang
I saw the pom file having hive version as 1.2.1.spark2. But I cannot find the branch in https://github.com/pwendell/ Does anyone know where the repo is? Thanks. Zhan Zhang -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-r

Why the org.apache.spark.sql.catalyst.expressions.SortArray is with CodegenFallback?

2016-07-07 Thread 楊閔富
I found that CollapseCodegenStages.supportCodegen(e: Expression) determines that the SortArray expression is not codegen-supported, since SortArray is marked with CodegenFallback. Can I ask why SortArray does not support codegen?
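The fallback mechanism being asked about can be sketched in Python (class and function names here only mimic the Scala ones for illustration): an expression marked with the CodegenFallback trait advertises that it has no generated-code path, so the planner must evaluate it through its interpreted eval method instead of compiling it:

```python
class Expression:
    # By default an expression participates in whole-stage codegen.
    codegen_supported = True

    def eval(self, row):
        raise NotImplementedError


class CodegenFallback(Expression):
    # Marker mixin: no generated-code path; the planner must call the
    # interpreted eval() instead of compiling this expression.
    codegen_supported = False


class SortArray(CodegenFallback):
    def eval(self, row):
        return sorted(row)


def supports_codegen(expr):
    # Sketch of the check: a stage can only be compiled if every
    # expression in it supports codegen.
    return expr.codegen_supported


print(supports_codegen(SortArray()))  # -> False, so eval() is used
```

In this model, one CodegenFallback expression is enough to keep its stage on the interpreted path.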

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
As mentioned in the documentation: PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed. I am trying to understand the implementation of how this data transfer is happening using pipes. Can anyone please guide me
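The pipe-based handoff described above can be sketched with the standard library. This is a toy model of the idea only, not Spark's actual framed protocol: a parent process (standing in for the JVM executor) launches a Python worker subprocess, streams data to it over a pipe, and reads the results back:

```python
import subprocess
import sys

# Toy "worker": reads numbers from stdin, writes doubled values to stdout.
worker_code = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(str(int(line) * 2) + '\\n')\n"
)

# Parent launches the Python subprocess with pipes on stdin/stdout,
# the same basic plumbing PythonRDD sets up between JVM and Python.
proc = subprocess.Popen(
    [sys.executable, "-c", worker_code],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

# Send the data to be processed down the pipe and collect the results.
out, _ = proc.communicate("\n".join(str(i) for i in range(5)))
print(out.split())  # -> ['0', '2', '4', '6', '8']
```

The real implementation adds a length-prefixed binary framing, serialized closures, and worker reuse on top of this basic pipe setup.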

Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Niranda Perera
Hi Mark, I agree. :-) We already have a product released with Spark 1.4.1 with some custom extensions and now we are doing a patch release. We will update Spark to the latest 2.x version in the next release. Best On Thu, Jul 7, 2016 at 1:12 PM, Mark Hamstra wrote: > You've got to satisfy my cu

Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Sun Rui
You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals For pySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Python worker processes communicate with Spark executors

Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
Hi all, I am trying to trace the data flow in pyspark. I am using IntelliJ IDEA on Windows 7. I had submitted a python job as follows: --master local[4] I have made the following insights after running the above command in debug mode: -> Locally, when pyspark's interpreter starts, it also s

Re: SPARK-8813 - combining small files in spark sql

2016-07-07 Thread Sean Owen
-user Reynold made the comment that he thinks this was resolved by another change; maybe he can comment. On Thu, Jul 7, 2016 at 7:53 AM, Ajay Srivastava wrote: > Hi, > > This jira https://issues.apache.org/jira/browse/SPARK-8813 is fixed in spark > 2.0. > But resolution is not mentioned there. >

Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Mark Hamstra
You've got to satisfy my curiosity, though. Why would you want to run such a badly out-of-date version in production? I mean, 2.0.0 is just about ready for release, and lagging three full releases behind, with one of them being a major version release, is a long way from where Spark is now. On W