Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
Let's move the discussion to JIRA. Thanks! On Fri, Oct 7, 2016 at 8:43 PM, 王磊(安全部) wrote: > https://issues.apache.org/jira/browse/SPARK-17825 > > Actually I had already created a JIRA. Could you let me know your progress to avoid > duplicated work? > > Thanks! > > From: didi > Date: Saturday, October 8, 2016, 12:21 AM > To:

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread 安全部
https://issues.apache.org/jira/browse/SPARK-17825 Actually I had already created a JIRA. Could you let me know your progress to avoid duplicated work? Thanks! From: didi <wangleikidd...@didichuxing.com> Date: Saturday, October 8, 2016, 12:21 AM To: Yanbo Liang <yblia...@gmail.com> Cc: "d...@spark.apache.org

Fw: Issue with Spark Streaming with checkpointing in Spark 2.0

2016-10-07 Thread Arijit
Resending; not sure if I had sent this to user@spark.apache.org earlier. Thanks, Arijit From: Arijit Sent: Friday, October 7, 2016 6:06 PM To: user@spark.apache.org Subject: Issue with Spark Streaming with checkpointing in Spark 2.0 In a Spark Streaming sample code
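
The original message is truncated here; for context, a minimal checkpoint-recovery sketch for Spark 2.0 Streaming (the checkpoint path and batch interval below are placeholders, not taken from the thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Hypothetical durable checkpoint location (HDFS, S3, etc.).
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // Define the DStream graph here, before returning the context.
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}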

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
Without a hell of a lot more work, Assign would be the only strategy usable. On Fri, Oct 7, 2016 at 3:25 PM, Michael Armbrust wrote: >> The implementation is totally and completely different however, in ways >> that leak to the end user. > > > Can you elaborate? Especially in the context of the i

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Scott Reynolds
It is always the case that 0.8 and 0.9 will work with a 0.10 broker. On Fri, Oct 7, 2016 at 1:28 PM Michael Armbrust wrote: > > 0.10 consumers won't work on an earlier broker. > Earlier consumers will (should?) work on a 0.10 broker. > > > This lines up with my testing. Is there a page I'm mis

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> 0.10 consumers won't work on an earlier broker. > Earlier consumers will (should?) work on a 0.10 broker. > This lines up with my testing. Is there a page I'm missing that describes this? Like does a 0.9 client work with 0.8 broker? Is it always old clients can talk to new brokers but not vi

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> > The implementation is totally and completely different however, in ways > that leak to the end user. Can you elaborate? Especially in the context of the interface provided by structured streaming.

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
0.10 consumers won't work on an earlier broker. Earlier consumers will (should?) work on a 0.10 broker. The main things earlier consumers lack from a user perspective is support for SSL, and pre-fetching messages. The implementation is totally and completely different however, in ways that leak

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Reynold Xin
Does Kafka 0.10 work on a Kafka 0.8/0.9 cluster? On Fri, Oct 7, 2016 at 1:14 PM, Jeremy Smith wrote: > +1 > > We're on CDH, and it will probably be a while before they support Kafka > 0.10. At the same time, we don't use their Spark and we're looking forward > to upgrading to 2.0.x and using st

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Jeremy Smith
+1 We're on CDH, and it will probably be a while before they support Kafka 0.10. At the same time, we don't use their Spark and we're looking forward to upgrading to 2.0.x and using structured streaming. I was just going to write our own Kafka Source implementation which uses the existing KafkaRD

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-07 Thread kant kodali
got it! Thanks! On Fri, Oct 7, 2016 12:41 PM, Jakob Odersky ja...@odersky.com wrote: Hi Kant, job submission through the command line is not strictly required, although it is the most common way (it's flexible and easy to use) in which applications that depend on spark are run. The shell

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-07 Thread Jakob Odersky
Hi Kant, job submission through the command line is not strictly required, although it is the most common way (it's flexible and easy to use) in which applications that depend on spark are run. The shell script "spark-submit" ends up doing similar things to what your code snippet shows. I asked if
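
For reference, a small sketch of the usual ways to quiet driver-side logging (the logger names and levels are common choices, not the only option):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

// Quiet the noisy Spark and Akka loggers before the SparkContext is created.
Logger.getLogger("org").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)

val sc = new SparkContext(new SparkConf().setAppName("quiet-driver"))
// Log level can also be adjusted at runtime on the context itself.
sc.setLogLevel("WARN")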

Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
We recently merged support for Kafka 0.10.0 in Structured Streaming, but I've been hearing a few people tell me that they are stuck on an older version of Kafka and cannot upgrade. I'm considering revisiting SPARK-17344, but it would be good to h
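
For reference, a minimal sketch of reading Kafka 0.10 with the newly merged source (this assumes the spark-sql-kafka-0-10 artifact is on the classpath; broker address and topic name are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-structured").getOrCreate()

// The source exposes key and value as binary columns.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

val values = stream.selectExpr("CAST(value AS STRING)")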

Map with state keys serialization

2016-10-07 Thread Joey Echeverria
Looking at the source code for StateMap[1], which is used by JavaPairDStream#mapWithState(), it looks like state keys are serialized using an ObjectOutputStream. I couldn't find a reference to this restriction in the documentation. Did I miss that? Unless I'm mistaken, I'm guessing there isn't a w
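
For readers following along, a minimal Scala sketch of mapWithState (the thread refers to the Java JavaPairDStream API; the key/value/state types here are hypothetical). Per the observation above, the keys presumably need to be java.io.Serializable since StateMap writes them with an ObjectOutputStream:

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Running count per key, with per-key state managed by mapWithState.
def countEvents(events: DStream[(String, Int)]): DStream[(String, Long)] = {
  val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Long]) =>
    val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
    state.update(newCount)
    (key, newCount)
  }
  // stateSnapshots() emits the current state for every key after each batch.
  events.mapWithState(spec).stateSnapshots()
}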

Re: Writing/Saving RDD to HDFS using saveAsTextFile

2016-10-07 Thread Deepak Sharma
Hi Mahendra Did you try mapping the X case class members to a String and then saving the RDD[String]? Thanks Deepak On Oct 7, 2016 23:04, "Mahendra Kutare" wrote: > Hi, > > I am facing an issue writing RDD[X] to an HDFS file path. X is a simple > case class with variable time
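
A small sketch of that suggestion (the fields of X are hypothetical; the question only mentions a long time field):

import org.apache.spark.rdd.RDD

// Hypothetical shape of the case class from the question.
case class X(time: Long, name: String)

// Render each record as a line of text, then save as plain text on HDFS.
def save(records: RDD[X], path: String): Unit = {
  records
    .map(x => s"${x.time},${x.name}")
    .saveAsTextFile(path)
}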

Writing/Saving RDD to HDFS using saveAsTextFile

2016-10-07 Thread Mahendra Kutare
Hi, I am facing an issue writing RDD[X] to an HDFS file path. X is a simple case class with a variable time as a primitive long. When I run the driver program with --master spark://:7077 I get this: Caused by: java.io.EOFException at java.io.ObjectInputStream$BlockDataInputStream.readFully(Ob

Executor errors out connecting to external shuffle service when using dynamic allocation

2016-10-07 Thread Manoj Samel
Resending with more clear subject. Any feedback ? On Tue, Oct 4, 2016 at 4:43 PM, Manoj Samel wrote: > Hi, > > On a secure hadoop cluster, spark shuffle is enabled (spark 1.6.0, shuffle > jar is spark-1.6.0-yarn-shuffle.jar). A client connecting using > spark-assembly_2.11-1.6.1.jar gets errors
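
For readers hitting the same thing, a minimal client-side sketch of the settings involved (the spark_shuffle aux-service on the NodeManagers, and matching shuffle-jar/client versions, are separate server-side requirements; all values below are placeholders):

import org.apache.spark.SparkConf

// Client-side configuration for dynamic allocation backed by the external shuffle service.
val conf = new SparkConf()
  .setAppName("dyn-alloc-app")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")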

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread 安全部
Thanks for replying. When could you send out the PR? From: Yanbo Liang <yblia...@gmail.com> Date: Friday, October 7, 2016, 11:35 PM To: didi <wangleikidd...@didichuxing.com> Cc: "d...@spark.apache.org" <d...@spark.apache.org>, "user@spark.apache.org

Is there a way to pause spark job

2016-10-07 Thread Evgenii Morozov
Hi! We’re training a few RandomForest models and wonder if there is a way to pause one particular model (we use our own web service and train them from different application threads)? I could do that manually in the middle of my own RDD processing (between the actions) if that would be req

Spark 2.0 Encoder().schema() is sorting StructFields

2016-10-07 Thread Paul Stewart
When using the Encoder(Bean.class).schema() method to generate the StructType array of StructFields the class attributes are returned as a sorted list and not in the defined order within the Bean.class. This makes the schema unusable for reading from a CSV file where the ordering of the attribute
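
For context, a sketch of the behavior and one possible workaround (the bean and column names are hypothetical; the workaround simply declares the CSV column order explicitly instead of deriving it from the bean):

import scala.beans.BeanProperty
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical bean standing in for the Bean.class from the question.
class SomeBean extends Serializable {
  @BeanProperty var id: Long = 0L
  @BeanProperty var name: String = ""
}

val spark = SparkSession.builder.appName("csv-schema").getOrCreate()

// Schema derived from the bean; the field order follows the encoder, not the declaration.
val beanSchema = Encoders.bean(classOf[SomeBean]).schema
println(beanSchema)

// Workaround: state the CSV column order explicitly.
val csvSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))
val df = spark.read.schema(csvSchema).csv("/path/to/input.csv")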

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
It's a good question and I had a similar requirement in my work. I'm copying the implementation from mllib to ml currently, and then exposing the maximum log likelihood. I will send this PR soon. Thanks. Yanbo On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) wrote: > > Hi, > > Do you guys sometimes need t
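
Until such a PR lands, one workaround is to compute the quantity by hand. The following is my own sketch of the standard GMM log-likelihood sum against the spark.mllib GaussianMixtureModel (not the implementation being ported):

import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sum over all points of log( sum_k w_k * N(x | mu_k, sigma_k) ).
def logLikelihood(model: GaussianMixtureModel, data: RDD[Vector]): Double = {
  val weights = model.weights
  val gaussians = model.gaussians
  data.map { x =>
    val likelihood = weights.zip(gaussians).map { case (w, g) => w * g.pdf(x) }.sum
    math.log(likelihood)
  }.sum()
}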

SaveToCassandra - how to handle failed inserts?

2016-10-07 Thread Pablo Federigi
Hello In the following example I'm using the method saveToCassandra from the spark-cassandra connector RDDJavaFunctions<...> dsJF1 = CassandraJavaUtil.javaFunctions(result); dsJF1.writerBuilder("test_keyspace", "test", CassandraJavaUtil.mapTupleToRow(String.class, Integer.class))

Spark SQL Thriftserver with HBase

2016-10-07 Thread Benjamin Kim
Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data
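
Spark SQL can query such tables through the Hive metastore. A sketch, assuming the HBase-backed table was already created in Hive with STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' (table and column names here are hypothetical, and the Hive HBase handler plus HBase client jars must be on the Spark classpath):

import org.apache.spark.sql.SparkSession

// Hive support gives access to the metastore where the HBase-backed table is registered.
val spark = SparkSession.builder
  .appName("hbase-via-hive")
  .enableHiveSupport()
  .getOrCreate()

// 'hbase_events' is a hypothetical Hive table mapped onto an HBase table beforehand.
val df = spark.sql("SELECT rowkey, payload FROM hbase_events LIMIT 10")
df.show()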

Re: When will spark 2.0.1 be available in maven repo?

2016-10-07 Thread Luciano Resende
There seems to be an issue with the 2.0.1 artifacts and the Apache Infrastructure is investigating. For details, follow https://issues.apache.org/jira/browse/INFRA-12717 On Fri, Oct 7, 2016 at 7:39 AM, Sushrut Ikhar wrote: > > Regards, > > Sushrut Ikhar > about.me/sushrutikhar

When will spark 2.0.1 be available in maven repo?

2016-10-07 Thread Sushrut Ikhar
Regards, Sushrut Ikhar about.me/sushrutikhar

Re: RESTful Endpoint and Spark

2016-10-07 Thread Benjamin Kim
It would appear the simple answer is to use the JDBC thriftserver in Spark. Thanks, Ben > On Oct 6, 2016, at 9:38 PM, Matei Zaharia wrote: > > This is exactly what the Spark SQL Thrift server does, if you just want to > access it using JDBC. > > Matei > >> On Oct 6, 2016, at 4:27 PM, Benjami

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-07 Thread Kazuaki Ishizaki
Hi Chin Wei, Yes, since you force the cache to be created by executing df.count, Spark starts to read data from the cache for the following task: val res = sqlContext.sql("table1 union table2 union table3") res.collect() If you insert 'res.explain', you can confirm which resource is used to get the data, cache
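
A minimal sketch of that check (assuming the three tables are already registered and cached, e.g. via df.cache() followed by df.count()):

import org.apache.spark.sql.SQLContext

def checkCacheUsage(sqlContext: SQLContext): Unit = {
  val res = sqlContext.sql(
    "SELECT * FROM table1 UNION ALL SELECT * FROM table2 UNION ALL SELECT * FROM table3")
  // The physical plan shows in-memory scan nodes when data is read from the cache.
  res.explain(true)
  res.collect()
}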

SqlContext in below code

2016-10-07 Thread Mich Talebzadeh
What is the equivalent of this code in Spark 2? import org.apache.spark.SparkContext import org.apache.spark.sql.SQLContext import org.apache.phoenix.spark._ val sc = new SparkContext("local", "phoenix-test") val sqlContext = new SQLContext(sc) val df = sqlContext.load( "org.apache.phoenix.spa
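
One way to write this in Spark 2 is via SparkSession, which subsumes SQLContext, and DataFrameReader in place of the deprecated sqlContext.load. A sketch, assuming the phoenix-spark data source takes the table/zkUrl options described in the Phoenix documentation (values are placeholders):

import org.apache.spark.sql.SparkSession

// SparkSession replaces the SparkContext/SQLContext pair in Spark 2.x.
val spark = SparkSession.builder
  .master("local")
  .appName("phoenix-test")
  .getOrCreate()

val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "TABLE1")
  .option("zkUrl", "localhost:2181")
  .load()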

Re: Spark REST API YARN client mode is not full?

2016-10-07 Thread Vladimir Tretyakov
Thanks for the answer, Vadim. Started the application as: spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/lib/spark-examples-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar 3 Performed a few requests: curl http://localhost:4040/api/v1/applications

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-07 Thread Chin Wei Low
Hi Ishizaki-san, So there is a gap between res.collect and when I see this log: spark.SparkContext: Starting job: collect at :26 Do you mean that during this time Spark already starts to get data from the cache? Shouldn't it only get the data after the job has started and the tasks are distributed?

How to resubmit the job after it is done?

2016-10-07 Thread kant kodali
I am currently not using Spark Streaming. I have an ETL pipeline and I want to resubmit the job after it is done, like a typical cron job. Is that possible?

Re: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

2016-10-07 Thread Aditya
Hi Saurav, Please share the spark-submit command you used. On Friday 07 October 2016 02:41 PM, Saurav Sinha wrote: I am submitting the job with spark-submit but it still gives the message "Please use spark-submit." Can anyone give me the reason for this error? Thanks, Saurav Sinha On Thu, Oct 6, 20
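
For context, one common cause of this message (not something the truncated thread confirms) is a master hard-coded inside the application; a sketch of the pattern that avoids it:

import org.apache.spark.{SparkConf, SparkContext}

// Do not call setMaster("yarn-cluster") here; let spark-submit
// (--master yarn --deploy-mode cluster) supply the master instead.
val conf = new SparkConf().setAppName("my-yarn-app")
val sc = new SparkContext(conf)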

Re: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

2016-10-07 Thread Saurav Sinha
I am submitting the job with spark-submit but it still gives the message "Please use spark-submit." Can anyone give me the reason for this error? Thanks, Saurav Sinha On Thu, Oct 6, 2016 at 3:38 PM, Saurav Sinha wrote: > I did not get you. I am submitting the job with spark-submit but it is still > giving the mes

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-07 Thread kant kodali
Perfect! That fixes it all! On Fri, Oct 7, 2016 1:29 AM, Jakob Odersky wrote — correction: Denis Bolshakov bolshakov.de...@gmail.com wrote: You need to have spark-sql; right now you are missing it. On Oct 7, 2016, at 11:12, "kant kodali" wrote: Here are the jar files on my classpath after doing a grep for spark jars. o

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-07 Thread Denis Bolshakov
You need to have spark-sql; right now you are missing it. On Oct 7, 2016, at 11:12, "kant kodali" wrote: > Here are the jar files on my classpath after doing a grep for spark jars. > > org.apache.spark/spark-core_2.11/2.0.0/c4d04336c142f10eb7e172155f022f86b6d11dd3/spark-core_2.11-2.0.0.jar
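
For reference, a sketch of the fix assuming an sbt build (Maven or Gradle users add the equivalent spark-sql_2.11 artifact):

// build.sbt: spark-core alone does not contain org.apache.spark.sql.Dataset;
// the spark-sql module does.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.0.0",
  "org.apache.spark" %% "spark-streaming" % "2.0.0",
  "org.apache.spark" %% "spark-sql"       % "2.0.0"
)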

issue accessing Phoenix table from Spark

2016-10-07 Thread Mich Talebzadeh
Hi, my code is trying to load a Phoenix table built on an HBase table. import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.hadoop.conf.Configuration import org.apache.hadoop.hbase.HBaseConfiguration import org.apache.hadoop.hbase.HColumnDescriptor import org.a

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-07 Thread kant kodali
Here are the jar files on my classpath after doing a grep for spark jars. org.apache.spark/spark-core_2.11/2.0.0/c4d04336c142f10eb7e172155f022f86b6d11dd3/spark-core_2.11-2.0.0.jar org.apache.spark/spark-streaming_2.11/2.0.0/7227cbd39f5952b0ed3579bc78463bcc318ecd2b/spark-streaming_2.11-2.0.0.jar co

Re: MLlib: word2vec - words vectors into feature vector

2016-10-07 Thread Sean Owen
It's just the average of the word vectors, for all words in the text. On Fri, Oct 7, 2016 at 9:04 AM kaching wrote: > Hi. How exactly does the MLlib implementation of word2vec convert word vectors > into one feature vector per row? > >TEXT > [Hi, I, heard, ab..] > [I, wish, Java, c..] > [Logi
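
For illustration, the spark.ml Word2Vec usage (adapted from the Spark documentation example; the vector size and sample sentences are arbitrary) shows the per-row averaging:

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("word2vec-example").getOrCreate()

val docs = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)

val model = word2Vec.fit(docs)
// transform() produces one vector per row: the element-wise average of the
// vectors of all words appearing in that row's text.
model.transform(docs).show(truncate = false)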

MLlib: word2vec - words vectors into feature vector

2016-10-07 Thread kaching
Hi. How exactly does the MLlib implementation of word2vec convert word vectors into one feature vector per row? TEXT: [Hi, I, heard, ab..] [I, wish, Java, c..] [Logistic, regres.] | word2vec V WORD VECTOR: heard [0.14950960874557...| are

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-07 Thread Kazuaki Ishizaki
Hi, I think that the result looks correct. The current Spark spends extra time getting data from a cache, for two reasons. One is the complicated path taken to get the data. The other is decompression in the case of primitive types. The new implementation (https://github.com/apache/spar

RE: spark standalone with multiple workers gives a warning

2016-10-07 Thread Mendelson, Assaf
I am using the script in sbin to set it up (spark/sbin/start-all.sh). It works fine. The problem is how to configure more than one worker per node (the default is one worker only). The documentation for 1.6.1 suggested SPARK_WORKER_INSTANCES as the way to do it but the latest documentation has n