Regression of external shuffle service spark 2.3 vs spark 2.2

2018-11-19 Thread igor.berman
Hi, any input regarding the issue below is welcome. We are running with the external shuffle service on a Mesos cluster (1.5.1). After upgrading our production workload to Spark 2.3, we started to see OOM failures of the external shuffle services (running on each node). Has anybody experienced the same problem? Any dir

Driver doesn't respect the request to abort itself by Mesos

2018-06-24 Thread igor.berman
Hi, any input regarding the following situation will be appreciated: We are running with dynamic allocation (Spark v2.2.0), i.e. with the external shuffle service, on a Mesos cluster (1.1.0). Sometimes, due to network failures and/or the order of offers accepted by different frameworks, the application framework s

Re: Driver aborts on Mesos when unable to connect to one of external shuffle services

2018-04-16 Thread igor.berman
Hi Szuromi, We manage the external shuffle service with Marathon, not manually. Sometimes, though, e.g. when adding a new node to the cluster, there is some delay between Mesos scheduling tasks on a slave and Marathon scheduling the external shuffle service task on that node. -- Sent from: http://apache-spark-u

Driver aborts on Mesos when unable to connect to one of external shuffle services

2018-04-12 Thread igor.berman
Hi, any input on whether the following is expected: the driver starts and is unable to connect to the external shuffle service on one of the nodes (whatever the reason). This makes the framework go to Inactive mode in the Mesos UI. However, it seems that the driver doesn't exit and continues to execute tasks (or tries to). T

Re: external shuffle service in mesos

2018-01-23 Thread igor.berman
Hi Susan, yes, I agree with you regarding resource accounting. IMHO, in this case the shuffle service must run on the node no matter what resources are available (just as we don't account for resources that the "system" takes: the Mesos agent, the OS itself, and any other process running on the same machine). One add

Re: external shuffle service in mesos

2018-01-21 Thread igor.berman
Hi Susan, In general I can get what I need without Marathon, by configuring the external shuffle service with puppet/ansible/chef, plus maybe some alerts for checks. I mean that in companies that don't have strong DevOps teams and want to install services as simply as possible, just by config, Marathon might

external shuffle service in mesos

2018-01-20 Thread igor.berman
Hi, I wanted to get some advice on managing the external shuffle service in Mesos environments. The Spark documentation mentions Marathon, but that documentation is very limited. I've tried to search for some documentation, and it seems not too difficult to configure it under Marathon

Hive api vs Dataset api

2016-09-16 Thread igor.berman
Hi, I wanted to understand whether there is any advantage besides API syntax to using the Hive/table API vs. the Dataset API in Spark SQL (v2.0). Any additional optimizations, maybe? I'm most interested in partitioned Parquet tables stored on S3. Is there any difference if I'm comfortable with the Dataset API

DirectFileOutputCommiter

2016-02-22 Thread igor.berman
Hi, I wanted to understand whether anybody uses DirectFileOutputCommitter or the like, especially when working with S3. I know that there is one implementation in the Spark distribution for the Parquet format, but not for files. Why? IMHO, it can bring a huge performance boost. Using the default FileOutputCommitter with S3 has big ov

our spark gotchas report while creating batch pipeline

2015-10-18 Thread igor.berman
Maybe somebody will find it useful: goo.gl/0yfvBd -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/our-spark-gotchas-report-while-creating-batch-pipeline-tp25112.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

log4j.xml bundled in jar vs log4.properties in spark/conf

2015-07-21 Thread igor.berman
Hi, I have log4j.xml in my jar. From 1.4.1 it seems that log4j.properties in spark/conf is placed first on the classpath, so spark/conf/log4j.properties "wins". Before that (in v1.3.0), the log4j.xml bundled in the jar defined the configuration. If I manually add my jar to be strictly first on the classpath (by a

1.4.1 in production

2015-07-20 Thread igor.berman
Hi, is anybody already using version 1.4.1 in production? Any problems? Thanks in advance -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-4-1-in-production-tp23909.html

upload to s3, UI Total Duration and Sum of Job Durations

2015-07-01 Thread igor.berman
Hi, Our job reads files from S3, transforms/aggregates them, and writes them back to S3. While investigating performance problems I noticed that there is a big difference between the sum of job durations and the Total Duration that appears in the UI. After investigating it a bit the difference caused

spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-26 Thread igor.berman
Hi, I wanted to get some advice on tuning a Spark application. For some of the tasks I see many log entries like this: Executor task launch worker-38 ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 5.1 MB to disk (272 times so far) (especially when inputs are considerable). I understan

missing part of the file while using newHadoopApi

2015-06-15 Thread igor.berman
Hi, has anyone experienced a problem with uploading to S3 via the s3n protocol with the Spark newHadoopApi, where the job completes successfully (there is a _SUCCESS marker) but in reality one of the parts of the file is missing? Thanks in advance. PS: we are trying s3a now (which needs an upgrade to Hadoop 2.7)

Re: Jobs aborted due to EventLoggingListener Filesystem closed

2015-06-08 Thread igor.berman
For the record: DON'T call System.exit within Spark code -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Jobs-aborted-due-to-EventLoggingListener-Filesystem-closed-tp23202p23205.html

Jobs aborted due to EventLoggingListener Filesystem closed

2015-06-08 Thread igor.berman
I sometimes get errors like the one below (Spark 1.3.1, history enabled to HDFS). I've found a few JIRAs, but they seem to be resolved, e.g. https://issues.apache.org/jira/browse/SPARK-1475 Any ideas? 2015-06-08 08:33:06.426 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.l

Re: union and reduceByKey wrong shuffle?

2015-05-31 Thread igor.berman
After investigation, the problem is somehow connected to Avro serialization with Kryo + chill-avro (mapping the Avro object to a simple Scala case class and running reduce on these case class objects solves the problem) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.c

union and reduceByKey wrong shuffle?

2015-05-31 Thread igor.berman
I've encountered a very strange problem: after a union of 2 RDDs, reduceByKey works incorrectly (unless I'm missing something very basic) and passes 2 objects with different keys to the reduce function! I've rewritten the Java class in Scala to test it in spark-shell, and I see the same problem. I have Sin

Re: spark java.io.FileNotFoundException: /user/spark/applicationHistory/application

2015-05-29 Thread igor.berman
On YARN your executors might run on any node in your cluster, so you need to configure Spark history to be on HDFS (so it will be accessible to every executor). You probably switched from local to YARN mode when submitting -- View this message in context: http://apache-spark-user-list.100156

Batch aggregation by sliding window + join

2015-05-28 Thread igor.berman
Hi, I have a daily batch job that computes a daily aggregate of several counters represented by some object. After the daily aggregation is done, I want to compute a block aggregation over several days (3, 7, 30, etc.). To do so I need to add the new daily aggregation to the current block and then subtract from current bloc
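
The add-then-subtract update described in this last message can be sketched in a few lines of plain Python. This is only an illustration, not code from the thread: the function name `slide_block`, the counter names, and the dict representation are all hypothetical, and it assumes the counters are simple additive sums (so a day's contribution can be removed by subtraction), which would not hold for aggregates such as max or distinct counts.

```python
from collections import Counter


def slide_block(block, new_day, expired_day):
    """Return the block aggregate after the window moves forward by one day.

    block       -- current aggregate over the window (hypothetical dict of counters)
    new_day     -- daily aggregate entering the window
    expired_day -- daily aggregate leaving the window
    Assumes every counter is an additive sum.
    """
    updated = Counter(block)
    updated.update(new_day)        # add the newest daily aggregate
    updated.subtract(expired_day)  # remove the day that left the window
    return dict(updated)


# Three-day block covering day1..day3, then slide to day2..day4.
day1 = {"clicks": 10, "views": 100}
day2 = {"clicks": 20, "views": 200}
day3 = {"clicks": 30, "views": 300}
day4 = {"clicks": 40, "views": 400}

block_1_3 = {"clicks": 60, "views": 600}
block_2_4 = slide_block(block_1_3, day4, day1)
```

The appeal of this approach is that each daily run touches only two daily aggregates plus the previous block, rather than re-reading every day in the window; the cost is that the per-day aggregates must be kept around until they expire.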