NoSuchElementException

2016-11-04 Thread Lev Tsentsiper
My code throws an exception when I am trying to create a new DataSet from within a StreamWriter sink. Simplified version of the code: val df = sparkSession.readStream .format("json") .option("nullValue", " ") .option("headerFlag", "true") .option("spark.sql.shuffle.partitions", 1)

Upgrading to Spark 2.0.1 broke array in parquet DataFrame

2016-11-04 Thread Sam Goodwin
I have a table with a few columns, some of which are arrays. Since upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always null when reading in a DataFrame. When writing the Parquet files, the schema of the column is specified as StructField("packageIds",ArrayType(StringType)) The s
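
A minimal repro sketch based on the description above (the column name is from the post; paths and data are placeholders), in case it helps anyone reproduce the regression:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(StructField("packageIds", ArrayType(StringType))))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(Seq("a", "b")))), schema)
    df.write.parquet("/tmp/packages")           // placeholder path
    spark.read.parquet("/tmp/packages").show()  // array reportedly reads back null in 2.0.1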

Re: sandboxing spark executors

2016-11-04 Thread Michael Gummelt
Mesos will let you run in docker containers, so you get filesystem isolation, and we're about to merge CNI support: https://github.com/apache/spark/pull/15740, which would allow you to set up network policies. Though you might be able to achieve whatever network isolation you need without CNI, dep

Re: sandboxing spark executors

2016-11-04 Thread blazespinnaker
In particular, we need to make sure the RDDs execute the lambda functions securely as they are provided by user code.

java.util.NoSuchElementException when trying to use dataset from worker

2016-11-04 Thread levt
My code throws an exception when I am trying to create a new DataSet from within a StreamWriter sink. Simplified version of the code: val df = sparkSession.readStream .format("json") .option("nullValue", " ") .option("headerFlag", "true") .option("spark.sql.shuffle.partitions", 1
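
A hedged reconstruction of the truncated snippet, to make the failing pattern concrete. The post's "StreamWriter" is taken here to mean a ForeachWriter-style sink (an assumption); the schema, paths, and sink body are likewise assumptions, not the poster's actual code:

    import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[2]").appName("stream").getOrCreate()
    val schema = StructType(Seq(StructField("id", StringType)))

    val df = spark.readStream
      .format("json")
      .schema(schema)
      .option("nullValue", " ")
      .load("/tmp/in")

    df.writeStream
      .option("checkpointLocation", "/tmp/cp")
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, version: Long): Boolean = true
        def process(value: Row): Unit = {
          // Creating a new Dataset here runs on a worker, where the
          // SparkSession's driver-side state is unavailable -- a plausible
          // trigger for the NoSuchElementException.
        }
        def close(errorOrNull: Throwable): Unit = ()
      })
      .start()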

Re: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$ . Please Help!!!!!!!

2016-11-04 Thread shyla deshpande
I feel so good that Holden replied. Yes, that was the problem. I was running from IntelliJ; I removed the provided scope and it works great. Thanks a lot. On Fri, Nov 4, 2016 at 2:05 PM, Holden Karau wrote: > It seems like you've marked the spark jars as provided, in this case they > would only b

Re: sandboxing spark executors

2016-11-04 Thread Calvin Jia
Hi, If you are using the latest Alluxio release (1.3.0), authorization is enabled, preventing users from accessing data they do not have permissions to. For older versions, you will need to enable the security flag. The documentation on security

Re: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$ . Please Help!!!!!!!

2016-11-04 Thread Holden Karau
It seems like you've marked the Spark jars as provided; in this case they would only be provided when you run your application with spark-submit or otherwise have Spark's JARs on your class path. How are you launching your application? On Fri, Nov 4, 2016 at 2:00 PM, shyla deshpande wrote: > object A
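
A minimal sbt sketch of the situation described (version numbers are assumptions): Spark marked "provided" works under spark-submit, which supplies the JARs, but leaves them missing when launching from an IDE:

    libraryDependencies ++= Seq(
      // add % "provided" only for builds launched via spark-submit;
      // for in-IDE runs, keep the default compile scope
      "org.apache.spark" %% "spark-core" % "2.0.1",
      "org.apache.spark" %% "spark-sql"  % "2.0.1"
    )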

java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$ . Please Help!!!!!!!

2016-11-04 Thread shyla deshpande
object App {
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.SparkSession

  def main(args: Array[String]) {
    println("Hello World!")
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("spark session example")
      .getOrCreate()
  }
}

NoSuchElementException when trying to use dataset

2016-11-04 Thread levt
My code throws an exception when I am trying to create a new DataSet from within a StreamWriter sink. Simplified version of the code: val df = sparkSession.readStream .format("json") .option("nullValue", " ") .option("headerFlag", "true") .option("spark.sql.shuffle.partitions", 1)

Spark Float to VectorUDT for ML evaluator lib

2016-11-04 Thread Manish Tripathi
Hi, I am trying to run the ML binary classification evaluator metrics to compare the rating with predicted values and get the area under ROC. My dataframe has two columns: rating as int (I have binarized it) and predictions, which is a float. When I pass it to the ML evaluator method I get an error as
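
A hedged sketch, assuming the error is the evaluator rejecting a FloatType column (column names are taken from the description, and the toy dataframe is a placeholder): casting the prediction to double should satisfy the evaluator, which accepts Double or Vector input:

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.sql.functions.col
    import spark.implicits._

    val df = Seq((1, 0.8f), (0, 0.3f)).toDF("rating", "prediction")

    val prepared = df
      .withColumn("label", col("rating").cast("double"))
      .withColumn("rawPrediction", col("prediction").cast("double"))

    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")
      .evaluate(prepared)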

Re: GenericRowWithSchema cannot be cast to java.lang.Double : UDAF error

2016-11-04 Thread Manjunath, Kiran
Just to add more clarity on where the issue occurs – org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
So basically what I am saying is: increase poll.ms, use a separate group id everywhere, and stop committing offsets under the covers. That should eliminate all of those as possible causes, and then we can see if there are still issues. As far as 0.8 vs 0.10, Spark doesn't require you to assign or
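
A sketch of the settings being suggested, with a placeholder broker address and timeout value (the poll.ms config name appears later in this thread):

    import org.apache.spark.SparkConf
    import org.apache.kafka.common.serialization.StringDeserializer

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",                        // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> s"storage-group-${java.util.UUID.randomUUID}", // separate group id everywhere
      "enable.auto.commit" -> (false: java.lang.Boolean)           // no commits under the covers
    )
    val conf = new SparkConf()
      .set("spark.streaming.kafka.consumer.poll.ms", "30000")      // assumed value; tune upwards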

GenericRowWithSchema cannot be cast to java.lang.Double : UDAF error

2016-11-04 Thread Manjunath, Kiran
I am trying to implement a sample “sum” functionality over a rolling window. The code below may not make sense (and may not be efficient), but in the course of a larger implementation I have stumbled on the error below, which is blocking. Error obtained: “GenericRowWithSchema cannot be cast to java.lang.
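
For comparison, a minimal sum UDAF sketch that avoids the cast in question by reading buffer and input fields with the typed getters (an assumption about the cause, since the original code is not shown):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class SumUdaf extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
      def dataType: DataType = DoubleType
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
        b1(0) = b1.getDouble(0) + b2.getDouble(0)
      def evaluate(buffer: Row): Any = buffer.getDouble(0)
    }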

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Ivan von Nagy
Yes, the parallel KafkaRDD uses the same consumer group, but each RDD uses a single distinct topic. For example, the group would be something like "storage-group", and the topics would be "storage-channel1", and "storage-channel2". In each thread a KafkaConsumer is started, assigned the partitions

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
So just to be clear, the answers to my questions are - you are not using different group ids, you're using the same group id everywhere - you are committing offsets manually Right? If you want to eliminate network or kafka misbehavior as a source, tune poll.ms upwards even higher. You must use

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
okay i see the partition local sort. got it. i would expect that pushing the partition local sort into the shuffle would give a significant boost, but that's just a guess. On Fri, Nov 4, 2016 at 2:39 PM, Michael Armbrust wrote: > sure, but then my values are not sorted per key, right? > > > It does

Re: Clustering Webpages using KMean and Spark Apis : GC limit exceed.

2016-11-04 Thread Reth RM
I called the repartition method at line 2 in the above code, and the error is no longer reported: JavaRDD> terms = getContent(input).repartition(10); But I am curious whether this is the correct approach, and would appreciate any inputs/suggestions on optimizing the above code. On Fri, Nov 4, 2016 at 11:13 AM,

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread vonnagy
Here are some examples and details of the scenarios. The KafkaRDD is the most error-prone to polling timeouts and concurrent modification errors. *Using KafkaRDD* - This takes a list of channels and processes them in parallel using the KafkaRDD directly. They all use the same consumer group ('st
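
A sketch of the batch pattern described, one KafkaRDD per channel (topic names, offsets, and broker address are placeholders):

    import scala.collection.JavaConverters._
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "storage-group"   // the shared group id under discussion
    ).asJava

    val offsets = Array(OffsetRange("storage-channel1", 0, 0L, 1000L)) // placeholder offsets
    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams, offsets, LocationStrategies.PreferConsistent)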

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Michael Armbrust
> > sure, but then my values are not sorted per key, right? It does do a partition local sort. Look at the query plan in my example. The

Clustering Webpages using KMean and Spark Apis : GC limit exceed.

2016-11-04 Thread Reth RM
Hi, Can you please guide me through parallelizing the task of extracting webpage text, converting text to doc vectors, and finally applying k-means? I get a "GC overhead limit exceeded at java.util.Arrays.copyOfRange" at task 3 below. Detailed stack trace: https://jpst.it/P33P Right now webpage
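
A rough sketch of the pipeline shape being described (paths, feature size, and k are placeholders; repartitioning early spreads the copying and GC pressure across more, smaller tasks):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.HashingTF

    val pages = sc.wholeTextFiles("/data/pages").repartition(10)
    val tf = new HashingTF(1 << 18)
    val vectors = pages.map { case (_, text) => tf.transform(text.split("\\s+").toSeq) }.cache()
    val model = KMeans.train(vectors, 10, 20) // k = 10, 20 iterations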

SAXParseException while writing to parquet on s3

2016-11-04 Thread lminer
I'm trying to read in some json, infer a schema, and write it out again as parquet to s3 (s3a). For some reason, about a third of the way through the writing portion of the run, Spark always errors out with the error included below. I can't find any obvious reasons for the issue: it isn't out of me

Re: Error creating SparkSession, in IntelliJ

2016-11-04 Thread shyla deshpande
I have built many projects using IntelliJ and Maven with prior versions of Spark. If anyone has a successful project with Kafka, Spark 2.0.1, and Cassandra, please share the pom.xml file. Thanks -S On Thu, Nov 3, 2016 at 10:03 PM, Hyukjin Kwon wrote: > Hi Shyla, > > there is the documentation for settin

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread Cody Koeninger
- are you using different group ids for the different streams? - are you manually committing offsets? - what are the values of your kafka-related settings? On Fri, Nov 4, 2016 at 12:20 PM, vonnagy wrote: > I am getting the issues using Spark 2.0.1 and Kafka 0.10. I have two jobs, > one that uses

Re: Delegation Token renewal in yarn-cluster

2016-11-04 Thread Marcelo Vanzin
On Fri, Nov 4, 2016 at 1:57 AM, Zsolt Tóth wrote: > This was what confused me in the first place. Why does Spark ask for new > tokens based on the renew-interval instead of the max-lifetime? It could be just a harmless bug, since tokens have a "getMaxDate()" method which I assume returns the toke

Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-04 Thread vonnagy
I am getting the issues using Spark 2.0.1 and Kafka 0.10. I have two jobs, one that uses a Kafka stream and one that uses just the KafkaRDD. With the KafkaRDD, I continually get the "Failed to get records .. after polling". I have adjusted the polling with `spark.streaming.kafka.consumer.poll.ms`

Re: Deep learning libraries for scala

2016-11-04 Thread Masood Krohy
If you need ConvNets and RNNs and want to stay in Scala/Java, then Deep Learning for Java (DL4J) might be the most mature option. If you want ConvNets and RNNs, as implemented in TensorFlow, along with all the bells and whistles, then you might want to switch to PySpark + TensorFlow and write

Re: expected behavior of Kafka dynamic topic subscription

2016-11-04 Thread Cody Koeninger
That's not what I would expect from the underlying kafka consumer, no. But this particular case (no matching topics, then add a topic after SubscribePattern stream starts) actually isn't part of unit tests for either the DStream or the structured stream. I'll make a jira ticket. On Thu, Nov 3, 2
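
A sketch of the scenario in question, assuming the 0.10 integration API: a SubscribePattern stream started before any matching topic exists (broker address and pattern are placeholders):

    import java.util.regex.Pattern
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val ssc = new StreamingContext(sc, Seconds(5))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "pattern-test"
    )
    // no topic matches "topic-.*" yet; one is created after the stream starts
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("topic-.*"), kafkaParams))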

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
i just noticed Sort for Dataset has a global flag. and Dataset also has sortWithinPartitions. how about: repartition + sortWithinPartitions + mapPartitions? the plan looks ok, but it is not clear to me if the sort is done as part of the shuffle (which is the important optimization). scala> val d
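
A sketch of the combination being proposed, with an assumed case class and column names:

    import spark.implicits._

    case class Rec(key: String, value: Int)
    val ds = Seq(Rec("a", 2), Rec("a", 1), Rec("b", 3)).toDS()

    val processed = ds
      .repartition($"key")
      .sortWithinPartitions($"key", $"value")
      .mapPartitions { it =>
        // rows arrive clustered by key and sorted by value within each partition
        it
      }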

Re: How do I convert a data frame to broadcast variable?

2016-11-04 Thread Jain, Nishit
Awesome, thanks Silvio! From: Silvio Fiorito Date: Thursday, November 3, 2016 at 12:26 PM To: "Jain, Nishit", Denny Lee, user@spark.apache.org

Re: sandboxing spark executors

2016-11-04 Thread Andrew Holway
I think running it on a Mesos cluster could give you better control over this kinda stuff. On Fri, Nov 4, 2016 at 7:41 AM, blazespinnaker wrote: > Is there a good method / discussion / documentation on how to sandbox a > spark > executor? Assume the code is untrusted and you don't want it to

Re: Aggregation Calculation

2016-11-04 Thread Andrés Ivaldi
Ok, so I've read that rollup is just syntactic sugar for GROUPING SETS(...); in that case I just need to use GROUPING SETS. But in the documentation's examples GROUPING SETS is used with SQL syntax, and I am doing it programmatically, so I need the DataSet API, like ds.rollup(..) but for groupin
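
A hedged sketch of both routes (column names are assumptions; as far as I know, in Spark 2.x arbitrary grouping sets remain SQL-only, while the Dataset API exposes only rollup and cube):

    import spark.implicits._
    import org.apache.spark.sql.functions.sum

    val df = Seq(("US", "NYC", 10), ("US", "SF", 5)).toDF("country", "city", "amount")

    // programmatic rollup
    df.rollup($"country", $"city").agg(sum($"amount")).show()

    // arbitrary grouping sets via SQL
    df.createOrReplaceTempView("t")
    spark.sql("""
      SELECT country, city, sum(amount)
      FROM t
      GROUP BY country, city GROUPING SETS ((country, city), (country), ())
      """).show()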

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
sure, but then my values are not sorted per key, right? so a group by key with values sorted according to some ordering is an operation that can be done efficiently in a single shuffle without first figuring out range boundaries. and it is needed for quite a few algos, including Window and lots

WARN 1 block locks were not released with MLlib ALS

2016-11-04 Thread Mikael Ståldal
I get a few warnings like this in Spark 2.0.1 when using org.apache.spark.mllib.recommendation.ALS: WARN org.apache.spark.executor.Executor - 1 block locks were not released by TID = 1448: [rdd_239_0] What can be the reason for that?

Re: example LDA code ClassCastException

2016-11-04 Thread Tamas Jambor
thanks for the reply. Asher, have you experienced this problem when checkpoints are not enabled as well? If we have a large number of iterations (over 150) and checkpoints are not enabled, the process just hangs (with no error) at around iteration 120-140 (on Spark 2.0.0). I could not reproduce this o

Re: Delegation Token renewal in yarn-cluster

2016-11-04 Thread Steve Loughran
On 4 Nov 2016, at 01:37, Marcelo Vanzin wrote: On Thu, Nov 3, 2016 at 3:47 PM, Zsolt Tóth wrote: What is the purpose of the delegation token renewal (the one that is done automatically by Hadoop libraries, after 1 day by default)? I

Re: sandboxing spark executors

2016-11-04 Thread Steve Loughran
> On 4 Nov 2016, at 06:41, blazespinnaker wrote: > > Is there a good method / discussion / documentation on how to sandbox a spark > executor? Assume the code is untrusted and you don't want it to be able to > make unvalidated network connections or do unvalidated alluxio/hdfs/file use Kerb

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-04 Thread Carlo . Allocca
Hi Robin, On 4 Nov 2016, at 09:19, Robin East wrote: Hi, do you mean the test of significance that you usually get with R output? Yes, exactly. I don't think there is anything implemented in the standard MLlib libraries; however, I believe that the SparkR version

Re: Is Spark launcher's listener API considered production ready?

2016-11-04 Thread Aseem Bansal
Does anyone have any idea about this? On Thu, Nov 3, 2016 at 12:52 PM, Aseem Bansal wrote: > While using Spark launcher's listener we came across a few cases where the > failures were not being reported correctly. > > >- https://issues.apache.org/jira/browse/SPARK-17742 >- https://issues.apache.

InvalidClassException when load KafkaDirectStream from checkpoint (Spark 2.0.0)

2016-11-04 Thread Haopu Wang
When I load a Spark checkpoint, I get the error below. Do you have any idea? Many thanks! 2016-11-04 17:12:19,582 INFO [org.apache.spark.streaming.CheckpointReader] (main;) Checkpoint files found: file:/d:/temp/checkpoint/checkpoint-147825070,file:/d:/temp/checkpoi
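
In case it helps isolate the failure, a sketch of the standard recovery pattern (paths and batch interval are assumptions; the stream must be built inside the creating function so the checkpoint can restore it):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sc, Seconds(10))
      ssc.checkpoint("d:/temp/checkpoint")
      // ... build the Kafka direct stream here ...
      ssc
    }
    val ssc = StreamingContext.getOrCreate("d:/temp/checkpoint", createContext _)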

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-04 Thread Robin East
Hi, do you mean the test of significance that you usually get with R output? I don't think there is anything implemented in the standard MLlib libraries; however, I believe that the SparkR version provides that. See http://spark.apache.org/docs/1.6.2/sparkr.html#gaussian-glm-model

Re: Delegation Token renewal in yarn-cluster

2016-11-04 Thread Zsolt Tóth
I checked the logs of my tests and found that Spark schedules the token refresh based on the renew-interval property, not the max-lifetime. The settings in my tests: dfs.namenode.delegation.key.update-interval=52 dfs.namenode.delegation.token.max-lifetime=102 dfs.namenode.delegation.t

Vector is not found in case class after import

2016-11-04 Thread Yan Facai
Hi, my spark-shell version is 2.0.1. I import Vector and hope to use it in a case class, but spark-shell throws a "not found" error: scala> import org.apache.spark.ml.linalg.{Vector => OldVector} import org.apache.spark.ml.linalg.{Vector=>OldVector} scala> case class Movie(vec: OldVector) :
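
A hedged workaround sketch, based on similar REPL issues with renamed imports (an assumption, not a confirmed diagnosis): refer to the type by its fully qualified name, or define the class inside a :paste block:

    scala> case class Movie(vec: org.apache.spark.ml.linalg.Vector)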

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-04 Thread Carlo . Allocca
Hi Mohit, thank you for your reply. OK, it means coefficients with a high score are more important than others with a low score… Many thanks. Best regards, Carlo > On 3 Nov 2016, at 20:41, Mohit Jaggi wrote: > > For linear regression, it should be fairly easy. Just sort the co-efficients > :) >
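
A tiny sketch of the suggestion, assuming an mllib LinearRegressionModel and standardized features (otherwise raw weights are not comparable across features); the toy weights are placeholders:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LinearRegressionModel

    val model = new LinearRegressionModel(Vectors.dense(0.3, -1.2, 0.05), 0.0)
    // pairs of (weight, featureIndex), largest magnitude first
    val ranked = model.weights.toArray.zipWithIndex
      .sortBy { case (w, _) => -math.abs(w) }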