ClassCastException while running a simple wordCount

2016-10-10 Thread vaibhav thapliyal
Dear All, I am getting a ClassCastException when using the Java API to run the word count example from the docs. Here is the log that I got: 16/10/10 11:52:12 ERROR Executor: Exception in task 0.2 in stage 0.0 (TID 4) java.lang.ClassCastException: cannot assign instance of scala.collection.

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread vaibhav thapliyal
Here is the code that I am using: public class SparkTest { public static void main(String[] args) { SparkConf conf = new SparkConf().setMaster("spark://192.168.10.174:7077").setAppName("TestSpark"); JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD textFile
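
For reference, a minimal, self-contained sketch of the word count being discussed, written against the Spark 2.0 Java API (the snippet above is truncated; the input and output paths are hypothetical):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkTest {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setMaster("spark://192.168.10.174:7077")
            .setAppName("TestSpark");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // hypothetical input path
        JavaRDD<String> textFile = sc.textFile("hdfs:///path/to/input.txt");
        JavaPairRDD<String, Integer> counts = textFile
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // 2.0 API: FlatMapFunction returns an Iterator
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///path/to/output"); // hypothetical output path
        sc.stop();
      }
    }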

Spark Streaming Custom Receivers - How to use metadata store API during processing

2016-10-10 Thread Manjunath, Kiran
Hello, This is a reposting of the question which is available in stackOverflow (posted by another user). http://stackoverflow.com/questions/35271270/how-to-access-metadata-stored-by-spark-streaming-custom-receiver Question: I need to store meta information (basically certain properties of the re

What happens when an executor crashes?

2016-10-10 Thread Samy Dindane
Hi, I am writing a streaming job that reads a Kafka topic. As far as I understand, Spark does a 1:1 mapping between its executors and Kafka partitions. In order to correctly implement my checkpoint logic, I'd like to know what exactly happens when an executors crashes. Also, is it possible to

Re: This Exception has been really hard to trace

2016-10-10 Thread kant kodali
Hi, I use Gradle and I don't think it really has "provided", but I was able to google around and create the following file; however, the same error still persists. group 'com.company' version '1.0-SNAPSHOT' apply plugin: 'java' apply plugin: 'idea' repositories { mavenCentral() mavenLocal() } configurations {

Re: Best approach for processing all files parallelly

2016-10-10 Thread Arun Patel
Ayan, which version of Python are you using? I am using 2.6.9 and I don't find generateFileType and getSchemaFor functions. Thanks for your help. On Fri, Oct 7, 2016 at 1:17 AM, ayan guha wrote: > Hi > > generateFileType (filename) returns FileType > > getSchemaFor(FileType) returns schema for

Re: Best approach for processing all files parallelly

2016-10-10 Thread ayan guha
Hi, sorry for the confusion, but I meant those functions to be written by you. They are your business logic or ETL logic. On 10 Oct 2016 21:06, "Arun Patel" wrote: > Ayan, which version of Python are you using? I am using 2.6.9 and I don't > find generateFileType and getSchemaFor functions. Thanks

Re: What happens when an executor crashes?

2016-10-10 Thread Samy Dindane
I managed to make a specific executor crash by using TaskContext.get.partitionId and throwing an exception for a specific executor. The issue I have now is that the whole job stops when a single executor crashes. Do I need to explicitly tell Spark to start a new executor and keep the other ones

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread vaibhav thapliyal
Hi, If I change the parameter inside the setMaster() to "local", the program runs. Is there something wrong with the cluster installation? I used the spark-2.0.1-bin-hadoop2.7.tgz package to install on my cluster with default configuration. Thanks Vaibhav On 10 Oct 2016 12:49, "vaibhav thapliya

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread kant kodali
+1 Woohoo, I have the same problem. I have been trying hard to fix this. On Mon, Oct 10, 2016 3:23 AM, vaibhav thapliyal vaibhav.thapliyal...@gmail.com wrote: Hi, If I change the parameter inside the setMaster()  to "local", the program runs. Is there something wrong with the cluster installa

spark using two different versions of netty?

2016-10-10 Thread Paweł Szulc
Hi, quick question, why is Spark using two different versions of netty?: - io.netty:netty-all:4.0.29.Final:jar - io.netty:netty:3.8.0.Final:jar ? -- Regards, Paul Szulc twitter: @rabbitonweb blog: www.rabbitonweb.com

Re: spark using two different versions of netty?

2016-10-10 Thread Sean Owen
Usually this sort of thing happens because the two versions are in different namespaces in different major versions and both are needed. That is true of Netty: http://netty.io/wiki/new-and-noteworthy-in-4.0.html However, I see that Spark declares a direct dependency on both, when it does not use 3.

Re: Map with state keys serialization

2016-10-10 Thread Joey Echeverria
Hi Ryan! Do you know where I need to configure Kryo for this? I already have spark.serializer=org.apache.spark.serializer.KryoSerializer in my SparkConf and I registered the class. Is there a different configuration setting for the state map keys? Thanks! -Joey On Sun, Oct 9, 2016 at 10:58 PM,
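
For context, the registration described here is done on the SparkConf itself; a minimal sketch (the state key/value class names are hypothetical stand-ins for your own classes), and per the reply below there is no separate setting specific to the state map keys:

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // registerKryoClasses covers every class you pass it, including the state map key type
        .registerKryoClasses(new Class<?>[]{ MyStateKey.class, MyStateValue.class });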

Logistic Regression Standardization in ML

2016-10-10 Thread Cesar
I have a question regarding how the default standardization in the ML version of the Logistic Regression (Spark 1.6) works. Specifically about the next comments in the Spark Code: /** * Whether to standardize the training features before fitting the model. * The coefficients of models will be alw

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
Whilst working on this application, I found a setting that drastically improved the performance of my particular Spark Streaming application. I'm sharing the details in hopes that it may help somebody in a similar situation. As my program ingested information into HDFS (as parquet files), I notice

Re: Logistic Regression Standardization in ML

2016-10-10 Thread Sean Owen
(BTW I think it means "when no standardization is applied", which is how you interpreted it, yes.) I think it just means that if feature i is divided by s_i, then its coefficients in the resulting model will end up larger by a factor of s_i. They have to be divided by s_i to put them back on the sa

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
What is it you're actually trying to accomplish? On Mon, Oct 10, 2016 at 5:26 AM, Samy Dindane wrote: > I managed to make a specific executor crash by using > TaskContext.get.partitionId and throwing an exception for a specific > executor. > > The issue I have now is that the whole job stops when

Re: spark using two different versions of netty?

2016-10-10 Thread Paweł Szulc
Yeah, I should be more precise. Those are two direct dependencies. On Mon, Oct 10, 2016 at 1:15 PM, Sean Owen wrote: > Usually this sort of thing happens because the two versions are in > different namespaces in different major versions and both are needed. That > is true of Netty: http://netty.

Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-10 Thread static-max
Hi, by following this article I managed to consume messages from Kafka 0.10 in Spark 2.0: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html However, the Java examples are missing and I would like to commit the offset myself after processing the RDD. Does anybody have a wor

Re: Inserting New Primary Keys

2016-10-10 Thread Jean Georges Perrin
Is there only one process adding rows? because this seems a little risky if you have multiple threads doing that… > On Oct 8, 2016, at 1:43 PM, Benjamin Kim wrote: > > Mich, > > After much searching, I found and am trying to use “SELECT ROW_NUMBER() > OVER() + b.id_max AS id, a.* FROM source

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-10 Thread Matthias Niehoff
Yes, without committing the data the consumer rebalances. The job consumes 3 streams and processes them. When consuming only one stream it runs fine. But when consuming three streams, even without joining them, just deserializing the payload and triggering an output action, it fails. I will prepare code sample

Re: Logistic Regression Standardization in ML

2016-10-10 Thread Yanbo Liang
AFAIK, we can guarantee that with/without standardization, the models always converge to the same solution if there is no regularization. You can refer to the test cases at: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L
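
To make the comparison concrete, a minimal sketch with the ML API in Java (the training DataFrame is assumed to exist):

    import org.apache.spark.ml.classification.LogisticRegression;

    // with regParam = 0.0 (no regularization), fits with and without standardization
    // should converge to the same coefficients, per the note above
    LogisticRegression withStd = new LogisticRegression().setStandardization(true).setRegParam(0.0);
    LogisticRegression noStd   = new LogisticRegression().setStandardization(false).setRegParam(0.0);
    // withStd.fit(training) and noStd.fit(training) on the same data should then agree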

Re: Spark Streaming Advice

2016-10-10 Thread Mich Talebzadeh
Hi Kevin, What is the streaming interval (batch interval) above? I do analytics on streaming trade data, but after manipulation of individual messages I store the selected ones in HBase. Very fast. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6

Re: Map with state keys serialization

2016-10-10 Thread Shixiong(Ryan) Zhu
That's enough. Did you see any error? On Mon, Oct 10, 2016 at 5:08 AM, Joey Echeverria wrote: > Hi Ryan! > > Do you know where I need to configure Kryo for this? I already have > spark.serializer=org.apache.spark.serializer.KryoSerializer in my > SparkConf and I registered the class. Is there a

JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Hi folks, I am trying to parse JSON arrays and it’s getting a little crazy (for me at least)… 1) If my JSON is: {"vals":[100,500,600,700,800,200,900,300]} I get: ++ |vals| ++ |[100, 500, 600, 7...| ++ root |-- vals: a

Re: Inserting New Primary Keys

2016-10-10 Thread Benjamin Kim
Jean, I see your point. For the incremental data, which is very small, I should make sure that the PARTITION BY in the OVER(PARTITION BY ...) is left out so that all the data will be in one partition when assigned a row number. The query below should avoid any problems. “SELECT ROW_NUMBER() OV

Re: What happens when an executor crashes?

2016-10-10 Thread Samy Dindane
I just noticed that you're the author of the code I linked in my previous email. :) It's helpful. When using `foreachPartition` or `mapPartitions`, I noticed I can't ask Spark to write the data on the disk using `df.write()` but I need to use the iterator to do so, which means losing the abili

Error: PartitioningCollection requires all of its partitionings have the same numPartitions.

2016-10-10 Thread cuevasclemente
Hello, I am having some interesting issues with a consistent error in spark that occurs when I'm working with dataframes that are the result of some amounts of joining and other transformations. PartitioningCollection requires all of its partitionings have the same numPartitions. It seems t
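
Not from this thread, but one workaround sometimes suggested for this assertion is to align the partition counts of the inputs explicitly before joining; a sketch with Spark 2.x types and hypothetical DataFrames and column names:

    // df1 and df2 are hypothetical inputs; 200 is an arbitrary common partition count
    Dataset<Row> left   = df1.repartition(200, df1.col("key"));
    Dataset<Row> right  = df2.repartition(200, df2.col("key"));
    Dataset<Row> joined = left.join(right, "key"); // both sides now share numPartitions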

Large variation in spark in Task Deserialization Time

2016-10-10 Thread Pulasthi Supun Wickramasinghe
Hi All, I am seeing a huge variation in Spark Task Deserialization Time for my collect and reduce operations. While most tasks complete within 100ms, a few take more than a couple of seconds, which slows the entire program down. I have attached a screenshot of the web UI where you can see the varia

converting hBaseRDD to DataFrame

2016-10-10 Thread Mich Talebzadeh
Hi, I am trying to do some operation on an Hbase table that is being populated by Spark Streaming. Now this is just Spark on Hbase as opposed to Spark on Hive -> view on Hbase etc. I also have Phoenix view on this Hbase table. This is sample code scala> val tableName = "marketDataHbaseTest"

Re: JSON Arrays and Spark

2016-10-10 Thread Luciano Resende
Please take a look at http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets Particularly the note about the required format: Note that the file that is offered as *a json file* is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As
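
A small illustration of the line-delimited format the guide describes, read with the Java API (the file path is hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("json-example").getOrCreate();

    // vals.json (hypothetical path): one self-contained JSON object per line, no enclosing array
    //   {"vals":[100,500,600,700]}
    //   {"vals":[800,200,900,300]}
    Dataset<Row> df = spark.read().json("vals.json");
    df.printSchema(); // vals: array of longs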

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
Glad it was helpful :) As far as executors, my expectation is that if you have multiple executors running, and one of them crashes, the failed task will be submitted on a different executor. That is typically what I observe in spark apps, if that's not what you're seeing I'd try to get help on th
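
Not spelled out in the thread, but the related knob is spark.task.maxFailures: a failed task is retried (on another executor when one is available) up to that many times before the whole job is failed. A minimal sketch:

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .setAppName("streaming-job") // hypothetical app name
        .set("spark.task.maxFailures", "8"); // default is 4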

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Thanks! I am ok with strict rules (despite being French), but even: [{ "red": "#f00", "green": "#0f0" },{ "red": "#f01", "green": "#0f1" }] is not going through… Is there a way to see what it does not like? The JSON parser has been pretty good to me until recen

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-10 Thread Cody Koeninger
This should give you hints on the necessary cast: http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#tab_java_2 The main ugly thing there is that the java rdd is wrapping the scala rdd, so you need to unwrap one layer via rdd.rdd() If anyone wants to work on a PR to update
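
For the 0-10 integration, a sketch of that cast in Java (assuming `stream` is the JavaInputDStream<ConsumerRecord<String, String>> returned by KafkaUtils.createDirectStream):

    import org.apache.spark.streaming.kafka010.CanCommitOffsets;
    import org.apache.spark.streaming.kafka010.HasOffsetRanges;
    import org.apache.spark.streaming.kafka010.OffsetRange;

    stream.foreachRDD(rdd -> {
      // unwrap the Java RDD to reach the underlying Scala RDD, then cast to read the offsets
      OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

      // ... process the records in rdd here ...

      // commit the offsets back to Kafka only after processing has succeeded
      ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
    });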

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Thanks Luciano - I think this is my issue :( > On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote: > > Please take a look at > http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets > > > Part

Re: Map with state keys serialization

2016-10-10 Thread Joey Echeverria
I do, I get the stack trace in this gist: https://gist.github.com/joey/d3bf040af31e854b3be374e2c016d7e1 The class it references, com.rocana.data.Tuple, is registered with Kryo. Also, this is with 1.6.0 so if this behavior changed/got fixed in a later release let me know. -Joey On Mon, Oct 10, 2

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
The batch interval was set to 30 seconds; however, after getting the parquet files to save faster I lowered the interval to 10 seconds. The number of log messages contained in each batch varied from just a few up to around 3500, with the number of partitions ranging from 1 to around 15. I will hav

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
How do you submit the application? A version mismatch between the launcher, driver and workers could lead to the bug you're seeing. A common reason for a mismatch is if the SPARK_HOME environment variable is set. This will cause the spark-submit script to use the launcher determined by that environm

Spark S3

2016-10-10 Thread Selvam Raman
Hi, How does Spark read data from S3 and run parallel tasks? Assume I have an S3 bucket of 35 GB (Parquet files). How will the SparkSession read the data and process it in parallel? How does it split the S3 data and assign it to each executor task? Please share your points. Note: if we have RDD

Re: Spark Streaming Advice

2016-10-10 Thread Jörn Franke
Your file size is too small; this has a significant impact on the NameNode. Use HBase or maybe HAWQ to store small writes. > On 10 Oct 2016, at 16:25, Kevin Mellott wrote: > > Whilst working on this application, I found a setting that drastically > improved the performance of my particular Spar

Re: Spark S3

2016-10-10 Thread ayan guha
It really depends on the input format used. On 11 Oct 2016 08:46, "Selvam Raman" wrote: > Hi, > > How spark reads data from s3 and runs parallel task. > > Assume I have a s3 bucket size of 35 GB( parquet file). > > How the sparksession will read the data and process the data parallel. How > it sp

[Spark] RDDs are not persisting in memory

2016-10-10 Thread diplomatic Guru
Hello team, Spark version: 1.6.0 I'm trying to persist some data in memory for reuse. However, when I call rdd.cache() OR rdd.persist(StorageLevel.MEMORY_ONLY()) it does not store the data, as I cannot see any RDD information under the WebUI (Storage tab). Therefore I tried rdd.persist(St

Re: What happens when an executor crashes?

2016-10-10 Thread Samy Dindane
On 10/10/2016 8:14 PM, Cody Koeninger wrote: Glad it was helpful :) As far as executors, my expectation is that if you have multiple executors running, and one of them crashes, the failed task will be submitted on a different executor. That is typically what I observe in spark apps, if that's

Design consideration for a trading System

2016-10-10 Thread Mich Talebzadeh
Hi, I have been working on a Lambda Architecture for trading systems. I think I have completed the dry runs for testing the modules. For the batch layer the criterion is a day's lag (one-day-old data). This is acceptable for the users who come from a BI background using Tableau, but I think we can do

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
Just thought of another potential issue: you should use the "provided" scope when depending on Spark. I.e. in your project's pom: org.apache.spark spark-core_2.11 2.0.1 provided On Mon, Oct 10, 2016 at 2:00 PM, Jakob Odersky wrote: > How do you sub

Re: What happens when an executor crashes?

2016-10-10 Thread Cody Koeninger
Repartition almost always involves a shuffle. Let me see if I can explain the recovery stuff... Say you start with two Kafka partitions, topic-0 and topic-1. You shuffle those across 3 Spark partitions; we'll label them A, B and C. Your job has written fileA: results for A, offset ranges for

Re: JSON Arrays and Spark

2016-10-10 Thread Hyukjin Kwon
FYI, it supports [{...}, {...} ...] or {...} format as input. On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin" wrote: > Thanks Luciano - I think this is my issue :( > > On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote: > > Please take a look at > http://spark.apache.org/docs/latest/sql-pro

Re: [Spark] RDDs are not persisting in memory

2016-10-10 Thread Chin Wei Low
Hi, Your RDD is 5GB; perhaps it is too large to fit into the executor's storage memory. You can refer to the Executors tab in the Spark UI to check the available storage memory for each executor. Regards, Chin Wei On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru wrote: > Hello team, > > Spa
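
Two things worth checking here (neither confirmed from the truncated thread): persist()/cache() are lazy, so nothing appears under the Storage tab until an action has materialized the RDD, and a serialized disk-backed level eases memory pressure. A small sketch with a hypothetical input path:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;

    JavaRDD<String> rdd = sc.textFile("hdfs:///path/to/data");
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER());
    long count = rdd.count(); // the action forces evaluation; the RDD should now show up in the UI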

Kryo on Zeppelin

2016-10-10 Thread Fei Hu
Hi All, I am running some Spark Scala code on Zeppelin on CDH 5.5.1 (Spark version 1.5.0). I customized the Spark interpreter to use org.apache.spark.serializer.KryoSerializer as spark.serializer. And in the dependency I added Kryo 3.0.3 as follows: com.esotericsoftware:kryo:3.0.3 When I wro

GraphFrame BFS

2016-10-10 Thread cashinpj
Hello, I have a set of data representing various network connections. Each vertex is represented by a single id, while the edges have a source id, destination id, and a relationship (peer to peer, customer to provider, or provider to customer). I am trying to create a subgraph built around a s
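
A hedged sketch against the GraphFrames BFS builder (method names should be checked against your GraphFrames version); `g` is assumed to be an already-constructed GraphFrame, and the id/relationship literals are hypothetical:

    // start from a seed vertex and follow only one relationship type for up to two hops
    g.bfs()
        .fromExpr("id = 'AS1'")
        .toExpr("id != 'AS1'")
        .edgeFilter("relationship = 'peer-to-peer'")
        .maxPathLength(2)
        .run()
        .show();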

Re: Spark S3

2016-10-10 Thread Selvam Raman
I mentioned parquet as input format. On Oct 10, 2016 11:06 PM, "ayan guha" wrote: > It really depends on the input format used. > On 11 Oct 2016 08:46, "Selvam Raman" wrote: > >> Hi, >> >> How spark reads data from s3 and runs parallel task. >> >> Assume I have a s3 bucket size of 35 GB( parquet
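
A sketch of what that looks like in practice (the s3a filesystem and bucket path are hypothetical; `spark` is an existing SparkSession). In Spark 2.x the Parquet files are split into tasks of at most spark.sql.files.maxPartitionBytes (128 MB by default), so a 35 GB dataset would land in roughly 280 input partitions, each read by one task:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> df = spark.read().parquet("s3a://my-bucket/trades/");
    System.out.println(df.javaRDD().getNumPartitions()); // number of input splits / tasks for the scan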

Re: converting hBaseRDD to DataFrame

2016-10-10 Thread Divya Gehlot
Hi Mich, you can also create a DataFrame from an RDD in the manner below: val df = sqlContext.createDataFrame(rdd,schema) val df = sqlContext.createDataFrame(rdd) The article below may also help you: http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/ On 11 October 2016 at 02:02
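
A Java sketch of the first form with an explicit schema; `sqlContext` and `rowRDD` (a JavaRDD<Row> already mapped from the HBase results) are assumed, and the column names are hypothetical:

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    StructType schema = new StructType(new StructField[]{
        new StructField("rowkey", DataTypes.StringType, false, Metadata.empty()),
        new StructField("price", DataTypes.DoubleType, true, Metadata.empty())
    });
    DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
    df.registerTempTable("marketData"); // hypothetical temp table name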