Re: Does anyone meet the issue that jars under lib_managed are never downloaded?

2015-11-16 Thread Jeff Zhang
BTW, after I revert SPARK-784, I can see all the jars under lib_managed/jars. On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang wrote: > Hi Josh, > > I notice the comments in https://github.com/apache/spark/pull/9575 said > that Datanucleus-related jars will still be copied to lib_managed/jars. > But

Re: Does anyone meet the issue that jars under lib_managed are never downloaded?

2015-11-16 Thread Jeff Zhang
Hi Josh, I noticed the comments in https://github.com/apache/spark/pull/9575 said that Datanucleus-related jars will still be copied to lib_managed/jars. But I don't see any jars under lib_managed/jars. The weird thing is that I see the jars on another machine, but could not see jars on my laptop e

Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Saisai Shao
Kafka now has built-in support for managing metadata itself besides ZK; it is easy to use and to switch to from the current ZK implementation. I think the problem here is whether we need to manage offsets at the Spark Streaming level or leave that question to the user. If you want to manage offsets at the user level, letting Spark t
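
A minimal sketch of the consumer-side switch this refers to, assuming Kafka 0.8.2's consumer configs; the group id and property values below are illustrative and are consumer settings, not any Spark API:

    import java.util.Properties

    // Hedged sketch: Kafka 0.8.2's high-level consumer can commit offsets either to
    // ZooKeeper or to Kafka's internal offsets topic, selected by consumer config.
    val consumerProps = new Properties()
    consumerProps.put("group.id", "my-streaming-group")  // illustrative group id
    consumerProps.put("offsets.storage", "kafka")        // store offsets in Kafka, not ZK
    consumerProps.put("dual.commit.enabled", "false")    // don't also commit to ZooKeeper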

Re: let spark streaming sample come to stop

2015-11-16 Thread Bryan Cutler
Hi Renyi, This is the intended behavior of the streaming HdfsWordCount example. It makes use of a 'textFileStream', which will monitor an HDFS directory for any newly created files and push them into a DStream. It is meant to be run indefinitely, unless interrupted by ctrl-c, for example. -bryan
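
A minimal sketch of the pattern the HdfsWordCount example follows; the directory path and batch interval below are made up for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object HdfsWordCountSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("HdfsWordCountSketch")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        // textFileStream only picks up files newly created in the directory after
        // the stream starts, and keeps polling until the context is stopped.
        val lines = ssc.textFileStream("hdfs:///tmp/streaming-input")
        val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()  // runs until interrupted, e.g. with ctrl-c
      }
    }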

Mesos cluster dispatcher doesn't respect most args from the submit req

2015-11-16 Thread Jo Voordeckers
Hi all, I'm running the Mesos cluster dispatcher; however, when I submit jobs, things like JVM args, classpath order and UI port aren't added to the command line executed by the Mesos scheduler. In fact it only cares about the class, jar and num cores/mem. https://github.com/jayv/spark/blob/mes

Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Nick Evans
The only dependency on Zookeeper I see is here: https://github.com/apache/spark/blob/1c5475f1401d2233f4c61f213d1e2c2ee9673067/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ReliableKafkaReceiver.scala#L244-L247 If that's the only line that depends on Zookeeper, we could probably tr

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Jeff Zhang
+1 On Tue, Nov 17, 2015 at 7:43 AM, Joseph Bradley wrote: > That sounds useful; would you mind submitting a JIRA (and a PR if you're > willing)? > Thanks, > Joseph > > On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier > wrote: > >> Hi, >> >> MLUtils.loadLibSVMFile verifies that indices are 1-base

Re: Spark 1.4.2 release and votes conversation?

2015-11-16 Thread Andrew Lee
I did, and it passes all of our test cases, so I'm wondering what I missed. I know there is the memory leak spill JIRA SPARK-11293, but not sure if that will go in 1.4.2 or 1.4.3, etc. From: Reynold Xin Sent: Friday, November 13, 2015 1:31 PM To: Andrew Lee C

Re: Unchecked contribution (JIRA and PR)

2015-11-16 Thread Joseph Bradley
Hi Sergio, Apart from apologies about limited review bandwidth (from me too!), I wanted to add: It would be interesting to hear what feedback you've gotten from users of your package. Perhaps you could collect feedback by (a) emailing the user list and (b) adding a note in the Spark Packages poin

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree based methods.
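
A hedged sketch of the sorting-and-encoding idea for binary classification, outside any Transformer API; the DataFrame `df` and the column names `category` and `label` are assumptions made for illustration, and a real Spark ML Transformer would wrap this logic in transform/transformSchema:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Order categories by the fraction of positive labels, then replace each category
    // with its rank; a tree learner can then treat the feature as ordered, so the
    // split search is linear in the number of categories instead of exponential.
    def encodeByLabelRate(df: DataFrame): DataFrame = {
      val ranks: Map[String, Int] = df.groupBy("category")
        .agg(avg("label").as("posRate"))
        .orderBy("posRate")
        .collect()
        .map(_.getString(0))
        .zipWithIndex
        .toMap
      val toRank = udf((c: String) => ranks(c))
      df.withColumn("categoryIndex", toRank(col("category")))
    }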

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Joseph Bradley
That sounds useful; would you mind submitting a JIRA (and a PR if you're willing)? Thanks, Joseph On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier wrote: > Hi, > > MLUtils.loadLibSVMFile verifies that indices are 1-based and > increasing, and otherwise triggers an error. I'd like to suggest that
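
A hedged sketch of the kind of check and message being suggested; the method and variable names are illustrative, not the actual MLUtils code:

    // For each parsed libSVM line, verify indices are one-based and strictly increasing,
    // and include the offending values and the raw line in the error message.
    def checkIndices(indices: Array[Int], line: String): Unit = {
      var previous = 0  // indices are one-based, so anything <= 0 is invalid
      var i = 0
      while (i < indices.length) {
        val current = indices(i)
        require(current > previous,
          s"indices should be one-based and in ascending order; " +
          s"found current=$current, previous=$previous in line: $line")
        previous = current
        i += 1
      }
    }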

Re: Persisting DStreams

2015-11-16 Thread Jean-Baptiste Onofré
Hi Fernando, the "persistence" of a DStream is defined by its StorageLevel. Window is not related to persistence: it's the processing of multiple DStreams as one, a kind of "gathering" of DStreams. The transformation is applied on a sliding window. For instance, you define a window of 3
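
A minimal sketch of the windowing described here, with illustrative durations; `lines` stands in for any existing DStream, and persistence remains a separate, explicit call:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds

    // Every 10 seconds, process the last 30 seconds of data. The window is a
    // transformation over several batches; persist() only sets the StorageLevel
    // used when the windowed RDDs are materialized.
    val windowedLines = lines
      .window(Seconds(30), Seconds(10))       // window length, slide interval
      .persist(StorageLevel.MEMORY_ONLY_SER)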

Persisting DStreams

2015-11-16 Thread Fernando O.
Hi all, I was wondering if someone could give me a brief explanation or point me in the right direction in the code for where DStream persistence is done. I'm looking at DStream.java but all it does is set the StorageLevel, and neither WindowedDStream nor ReducedWindowedDStream seem to chang

Re: Sort Merge Join from the filesystem

2015-11-16 Thread Alex Nastetsky
Done, thanks. On Mon, Nov 9, 2015 at 7:23 PM, Cheng, Hao wrote: > Yes, we definitely need to think about how to handle this case, probably even > more common than the both-sorted/partitioned tables case; can you jump to the > JIRA and leave a comment there? > > > > *From:* Alex Nastetsky [mailto:alex.nastet

Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Cody Koeninger
There are already private methods in the code for interacting with Kafka's offset management API. There's a JIRA for making those methods public, but TD has been reluctant to merge it: https://issues.apache.org/jira/browse/SPARK-10963 I think adding any ZK-specific behavior to Spark is a bad idea

Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Nick Evans
I really like the Streaming receiverless API for Kafka streaming jobs, but I'm finding the manual offset management adds a fair bit of complexity. I'm sure that others feel the same way, so I'm proposing that we add the ability to have consumer offsets managed via an easy-to-use API. This would be
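
For context, a minimal sketch of the manual bookkeeping the direct (receiverless) API requires today, assuming an existing StreamingContext `ssc`; the broker list, topic set, and the processBatch/saveOffsets helpers are hypothetical, and where the offsets actually get stored (ZK, Kafka, a database) is left to the application:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

    stream.foreachRDD { rdd =>
      // The direct stream exposes the Kafka offsets it read for this batch...
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ...but committing them after processing is the application's job.
      processBatch(rdd)          // hypothetical processing function
      saveOffsets(offsetRanges)  // hypothetical persistence of offsets (ZK, Kafka, DB, ...)
    }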

Re: releasing Spark 1.4.2

2015-11-16 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtLKc2ctNPcq&subj=Re+Spark+1+4+2+release+and+votes+conversation+ > On Nov 15, 2015, at 10:53 PM, Niranda Perera wrote: > > Hi, > > I am wondering when spark 1.4.2 will be released? > > is it in the voting stage at the moment? > > rgds > > --

Re: Does anyone meet the issue that jars under lib_managed are never downloaded?

2015-11-16 Thread Jeff Zhang
This is the exception I got: 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default database after error: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not

Re: Does anyone meet the issue that jars under lib_managed are never downloaded?

2015-11-16 Thread Jeff Zhang
It's about the datanucleus-related jars which are needed by Spark SQL. Without these jars, I could not call DataFrame-related APIs (I have HiveContext enabled). On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen wrote: > As of https://github.com/apache/spark/pull/9575, Spark's build will no > longer p

Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Mark Hamstra
FiloDB is also closely related. https://github.com/tuplejump/FiloDB On Mon, Nov 16, 2015 at 12:24 AM, Nick Pentreath wrote: > Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop > input/output format support: > https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/ma

Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Nick Pentreath
Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop input/output format support: https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableInputFormat.java On Mon, Nov 16, 2015 at 7:52 AM, Reynold Xin wrote: > This (updates) is som

Re: Does anyone meet the issue that jars under lib_managed are never downloaded?

2015-11-16 Thread Josh Rosen
As of https://github.com/apache/spark/pull/9575, Spark's build will no longer place every dependency JAR into lib_managed. Can you say more about how this affected spark-shell for you (maybe share a stacktrace)? On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang wrote: > > Sometimes, the jars under li

Does anyone meet the issue that jars under lib_managed are never downloaded?

2015-11-16 Thread Jeff Zhang
Sometimes, the jars under lib_managed are missing. And after I rebuild Spark, the jars under lib_managed are still not downloaded. This would cause spark-shell to fail due to missing jars. Has anyone hit this weird issue? -- Best Regards Jeff Zhang