Streaming Reduction on a stream of Arrays

2019-10-21 Thread vibhatha
Hi, I have an issue with reduce in streaming. I don't get a reduced stream when I use a custom object. Here is the code snippet that I used to test this. The issue is that the reduction clearly works for a simple sequence, but what I really want to do is send an array by adding such an array at a t
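The original snippet is cut off above; a minimal PySpark Streaming sketch of the idea it describes — reducing a stream whose records are arrays, element-wise — might look like the following. The socket source, port, and comma-separated input format are assumptions for illustration, not the poster's code.

```python
# Hedged sketch: element-wise reduction over a DStream of arrays
# (not the poster's original snippet, which is truncated above).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "ArrayReduceSketch")
ssc = StreamingContext(sc, batchDuration=1)

# Assume each incoming line is a comma-separated array such as "1,2,3".
lines = ssc.socketTextStream("localhost", 9999)
arrays = lines.map(lambda line: [float(x) for x in line.split(",")])

# Element-wise sum of all arrays in a batch; assumes every array has the same length.
reduced = arrays.reduce(lambda a, b: [x + y for x, y in zip(a, b)])
reduced.pprint()

ssc.start()
ssc.awaitTermination()
```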

Re: [pyspark 2.4.3] nested windows function performance

2019-10-21 Thread Georg Heiler
No, as you shuffle each time again (you always partition by different windows). Instead: could you choose a single window (w3, with more columns = finer granularity) and then filter out records to achieve the same result? Or instead: df.groupBy(a,b,c).agg(sort_array(collect_list(foo,bar,baz)) and then an
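A hedged sketch of the alternative being suggested — one groupBy with sort_array(collect_list(struct(...))) instead of several differently-partitioned windows. The column names a, b, c, foo, bar, baz come from the reply and stand in for the real schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder data using the column names from the reply.
df = spark.createDataFrame(
    [("a1", "b1", "c1", 1, 2, 3), ("a1", "b1", "c1", 4, 5, 6)],
    ["a", "b", "c", "foo", "bar", "baz"],
)

# One shuffle: group on the finest granularity, collect and sort the rows,
# then derive any per-group values from the collected array.
grouped = df.groupBy("a", "b", "c").agg(
    F.sort_array(F.collect_list(F.struct("foo", "bar", "baz"))).alias("rows")
)
grouped.show(truncate=False)
```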

Re: Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey

2019-10-21 Thread Gabor Somogyi
With the mentioned parameter the capacity can be increased, but the main question is why this is happening in the first place. Even on a really beefy machine, having more than 64 consumers is quite extreme. Maybe horizontal scaling (more executors) would be a better option to reach maximum performance.
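A hedged illustration of that suggestion: scale out executors so the Kafka partitions are spread across more JVMs, rather than growing the per-executor consumer cache. The numbers are placeholders, not recommendations.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-stream")
    # More executors means each JVM handles fewer partitions and caches fewer consumers.
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```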

Re: Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey

2019-10-21 Thread peter
You can try increasing the setting spark.streaming.kafka.consumer.cache.maxCapacity. From: Shyam P [mailto:shyamabigd...@gmail.com] Sent: October 21, 2019 20:43 To: kafka-clie...@googlegroups.com; spark users Subject: Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey H
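A sketch of how that property could be set, assuming the job uses the DStream Kafka integration (Structured Streaming keys its consumer cache off a different setting). The value 128 is a placeholder.

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("kafka-dstream")
    # Raise the per-executor Kafka consumer cache above the default of 64.
    .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")
)
sc = SparkContext(conf=conf)
```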

Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey

2019-10-21 Thread Shyam P
Hi, I am using spark-sql-2.4.1v with Kafka and I am facing a slow consumer issue. I see the warning "KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey(spark-kafka-source-33321dde-bfad-49f3-bdf7-09f95883b6e9--1249540122-executor)" in the logs; more on the same https://stackoverfl

Re: Monitor executor and task memory getting used

2019-10-21 Thread Sriram Ganesh
I looked into this, and I found it is possible like this: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L229 (line no 230). This is for executors. I just want to cross-verify: is that right? On Mon, 21 Oct 2019, 17:24 Alonso Isidoro Rom
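The linked listener code is Scala; as a hedged alternative sketch, the same per-executor memory figures are exposed over Spark's monitoring REST API and can be polled from outside the job. The driver host and port below are assumptions for a local run.

```python
import json
import urllib.request

base = "http://localhost:4040/api/v1"  # driver UI address; adjust for your deployment

# Look up the running application, then list its executors with memory usage.
apps = json.load(urllib.request.urlopen(f"{base}/applications"))
app_id = apps[0]["id"]

executors = json.load(urllib.request.urlopen(f"{base}/applications/{app_id}/executors"))
for ex in executors:
    print(ex["id"], ex["memoryUsed"], ex["maxMemory"])
```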

Re: [pyspark 2.4.3] nested windows function performance

2019-10-21 Thread Rishi Shah
Hi All, Any suggestions? Thanks, -Rishi On Sun, Oct 20, 2019 at 12:56 AM Rishi Shah wrote: > Hi All, > > I have a use case where I need to perform nested windowing functions on a > data frame to get the final set of columns. Example: > > w1 = Window.partitionBy('col1') > df = df.withColumn('sum1',
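The quoted example is cut off; below is a minimal sketch of the nested-window pattern it describes. The second window w2 and the columns beyond col1 are assumptions, since only the first lines of the original example survive.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("x", "p", 1), ("x", "p", 2), ("x", "q", 3)],
    ["col1", "col2", "val"],
)

w1 = Window.partitionBy("col1")
w2 = Window.partitionBy("col1", "col2")

# Each differently-partitioned window adds its own shuffle and sort,
# which is where the performance concern in this thread comes from.
df = df.withColumn("sum1", F.sum("val").over(w1))
df = df.withColumn("sum2", F.sum("val").over(w2))
df.show()
```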

Monitor executor and task memory getting used

2019-10-21 Thread Sriram Ganesh
Hi, I want to monitor how much memory each executor and task uses for a given job. Is there any direct method available that can be used to track this metric? -- *Sriram G* *Tech*

Re: Spark 3.0 yarn does not support cdh5

2019-10-21 Thread melin li
Many clusters still use cdh5 and want continued support for it; cdh5 is based on hadoop 2.6. melin li wrote on Mon, Oct 21, 2019 at 3:02 PM: > Many clusters still use cdh5 and hope cdh5 will continue to be supported; cdh5 is based on hadoop 2.6 > > dev/make-distribution.sh --tgz -Pkubernetes -Pyarn -Phive-thriftserver > -Phive -Dhadoop.version=2.6.0-cdh5.15.0 -DskipTes

Spark 3.0 yarn does not support cdh5

2019-10-21 Thread melin li
Many clusters still use cdh5 and hope cdh5 will continue to be supported; cdh5 is based on hadoop 2.6. dev/make-distribution.sh --tgz -Pkubernetes -Pyarn -Phive-thriftserver -Phive -Dhadoop.version=2.6.0-cdh5.15.0 -DskipTest ``` [INFO] Compiling 25 Scala sources to /Users/libinsong/Documents/codes/tongdun/spark-3.0/resource-managers/yarn/target/scala-2.12/