Re: Python vs. Scala

2017-09-06 Thread Conconscious
Just run this test yourself and check the results; during the run, also check a worker with top. Python: import random def inside(p): x, y = random.random(), random.random() return x * x + y * y < 1 def estimate_pi(num_samples): count = sc.parallelize(xrange(0, num_samples)).filter…
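
For comparison, a minimal Scala sketch of the same Monte Carlo Pi estimate (assuming an existing SparkContext named sc, as in the Python snippet above):

    import scala.util.Random

    // True if a random point in the unit square falls inside the quarter circle.
    def inside(i: Int): Boolean = {
      val x = Random.nextDouble()
      val y = Random.nextDouble()
      x * x + y * y < 1
    }

    // Monte Carlo estimate of Pi: 4 * (fraction of points inside the quarter circle).
    def estimatePi(numSamples: Int): Double = {
      val count = sc.parallelize(0 until numSamples).filter(inside).count()
      4.0 * count / numSamples
    }

    println(estimatePi(10000000))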

Will an input event older than watermark be dropped?

2017-09-06 Thread 张万新
Hi, I've been investigating watermarks for some time. According to the guide, if we specify a watermark on an event-time column, the watermark will be used to drop old state data. Then, taking a window-based count for example, if an event whose time is older than the watermark arrives, it will simply be dropped…
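
For reference, a minimal sketch of the kind of watermarked, window-based count being discussed (events is assumed to be a streaming Dataset; the eventTime and word column names are illustrative, not from the original mail):

    import org.apache.spark.sql.functions.{col, window}

    // State for windows that fall entirely behind the watermark is dropped, so a
    // late event older than the watermark may be discarded rather than counted.
    val windowedCounts = events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
      .count()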

Multiple Kafka topics processing in Spark 2.2

2017-09-06 Thread Dan Dong
Hi, all, I have one issue about how to process multiple Kafka topics in a Spark 2.x program. My question is: how do I get the topic name from a message received from Kafka? E.g.: … val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, k…

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-06 Thread Alonso Isidoro Roman
Hi, reading the official doc, I think you can do it this way: import org.apache.spark.streaming.kafka._ val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, ka…
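
With the kafka-0-10 integration in Spark 2.x, each ConsumerRecord carries its topic, so a hedged sketch of getting the topic name per message (ssc is an existing StreamingContext; broker, group and topic names are placeholders) might look like:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group"                       // placeholder
    )
    val topics = Array("topicA", "topicB")                // placeholder topic names

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    // Each record knows which topic it came from.
    val topicAndValue = stream.map(record => (record.topic, record.value))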

Spark Structured Streaming and compacted topic in Kafka

2017-09-06 Thread Olivier Girardot
Hi everyone, I'm aware of the issue regarding the direct stream 0.10 consumer in Spark and compacted topics (cf. https://issues.apache.org/jira/browse/SPARK-17147). Is there any chance that the Spark Structured Streaming Kafka source is compatible with compacted topics? Regards, -- Olivier Girardot
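
For context, a minimal sketch of the Structured Streaming Kafka source the question refers to (broker and topic names are placeholders; whether it behaves correctly on a compacted topic is exactly the open question here):

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
      .option("subscribe", "some-compacted-topic")          // placeholder
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")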

Non-time-based windows are not supported on streaming DataFrames/Datasets;;

2017-09-06 Thread kant kodali
Hi all, I get the following exception when I run the query below, and I'm not sure what the cause is: "Non-time-based windows are not supported on streaming DataFrames/Datasets;" My TIME_STAMP field is of string type. dataset .sqlContext() .sql("select count(distinct ID), sum(AMOUNT) from (s…
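
Since TIME_STAMP is a string, one common way to make the window time-based is to cast it to a real timestamp first. A minimal sketch, where dataset stands for the streaming Dataset from the original mail and the event_time column name and format pattern are assumptions:

    import org.apache.spark.sql.functions.{col, to_timestamp, window}

    // Turn the string column into a proper timestamp so a time-based window applies.
    val withEventTime = dataset
      .withColumn("event_time", to_timestamp(col("TIME_STAMP"), "yyyy-MM-dd HH:mm:ss"))

    val counts = withEventTime
      .withWatermark("event_time", "10 minutes")
      .groupBy(window(col("event_time"), "5 minutes"))
      .count()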

spark metrics prefix in Graphite is duplicated

2017-09-06 Thread Mikhailau, Alex
Hi guys, when I set up my EMR cluster with Spark I add "*.sink.graphite.prefix": "$env.$namespace.$team.$app" to metrics.properties. The cluster comes up with the correct metrics.properties. Then I simply add a step to EMR with spark-submit, without any metrics namespace parameter. In my Graphite, Spark…

[Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-06 Thread Denis Magda
Folks, those who are craving mind food this weekend, come over to the meetup - Santa Clara, Sept 9, 9:30 AM: https://www.meetup.com/datariders/events/242523245/?a=socialmedia — Denis

[ANNOUNCE] Apache Bahir 2.2.0 Released

2017-09-06 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.2.0, which provides the following extensions for Apache Spark…

Is it OK to have multiple SparkSessions in one Spark Structured Streaming app?

2017-09-06 Thread kant kodali
Hi all, I am wondering whether it is OK to have multiple SparkSessions in one Spark Structured Streaming app. Basically, I want to create 1) a SparkSession for reading from Kafka and 2) another SparkSession for storing the mutations of a DataFrame/Dataset to a persistent table as I get the mutations f…
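
For what it's worth, a minimal sketch of how a second session is usually obtained: SparkSession.newSession() shares the same SparkContext but keeps its own SQL configuration, temporary views and UDF registrations (whether two sessions are actually needed here is a separate question; the app name is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("structured-streaming-app")   // placeholder app name
      .getOrCreate()

    // Second session sharing the underlying SparkContext, with isolated SQL state.
    val spark2 = spark.newSession()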

sessionState could not be accessed in spark-shell command line

2017-09-06 Thread ChenJun Zou
Hi, when I use spark-shell to get the logical plan of a SQL query, an error occurs: scala> spark.sessionState :30: error: lazy value sessionState in class SparkSession cannot be accessed in org.apache.spark.sql.SparkSession spark.sessionState ^ But if I use spark-submit to access the…

CSV write to S3 failing silently with partial completion

2017-09-06 Thread abbim
Hi all, my team has been experiencing a recurring, unpredictable bug where only a partial write to CSV in S3 on one partition of our Dataset is performed. For example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 of the partitions as 2.8 GB in size, but one of them as 1.6 GB. H…

Re: sessionState could not be accessed in spark-shell command line

2017-09-06 Thread ChenJun Zou
I use spark-2.1.1. 2017-09-07 14:00 GMT+08:00 sujith chacko : > Hi, > may I know which version of Spark you are using? In 2.2 I tried with the > below query in spark-shell for viewing the logical plan and it's working > fine: > > spark.sql("explain extended select * from table1") > > The above qu…
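
As a sketch of the workaround suggested above (table1 is a placeholder table name), the plan can also be inspected without touching sessionState:

    // Extended plan (parsed, analyzed, optimized, physical) via SQL:
    spark.sql("explain extended select * from table1").show(false)

    // Or directly on a Dataset/DataFrame:
    spark.table("table1").explain(true)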

Re: sessionState could not be accessed in spark-shell command line

2017-09-06 Thread ChenJun Zou
Thanks, my mistake. 2017-09-07 14:21 GMT+08:00 sujith chacko : > If your intention is to just view the logical plan in spark-shell then I > think you can follow the query which I mentioned in the previous mail. In > Spark 2.1.0 sessionState is a private member which you cannot access. > > Thanks. >

Re: sessionState could not be accessed in spark-shell command line

2017-09-06 Thread ChenJun Zou
I examined the code and found that the lazy val was added recently, in 2.2.0. 2017-09-07 14:34 GMT+08:00 ChenJun Zou : > thanks, > my mistake > > 2017-09-07 14:21 GMT+08:00 sujith chacko : > >> If your intention is to just view the logical plan in spark shell then I >> think you can follow the query which…