Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-12 Thread Serega Sheypak
mething like this > > import org.apache.spark.TaskContext > ds.map(r => { > val taskContext = TaskContext.get() > if (taskContext.partitionId == 1000) { > throw new RuntimeException > } > r > }) > > On Mon, Feb 11, 2019 at 8:41 AM Serega Sheypak > wrote:
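The suggestion quoted above, written out as a hedged sketch (assuming ds is an existing Dataset with an encoder in scope, and that partition 1000 is just the arbitrary partition chosen to fail):

  import org.apache.spark.TaskContext

  // Fail only the task that owns the chosen partition; all other tasks run normally.
  // Spark retries the failed task up to its configured retry limit, which is what
  // makes this useful for reproducing retry-related behaviour.
  val poisoned = ds.map { r =>
    if (TaskContext.get().partitionId() == 1000) {
      throw new RuntimeException("simulated task failure")
    }
    r
  }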

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Serega Sheypak
I need to crash a task which does the repartition. Mon, Feb 11, 2019 at 10:37, Gabor Somogyi: > What blocks you from putting if conditions inside the mentioned map function? > > On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak > wrote: > >> Yeah, but I don't need to crash

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Serega Sheypak
= input.toDS.map(_ / 0).writeStream.format("console").start() > > G > > > On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak > wrote: > >> Hi BR, >> thanks for your reply. I want to mimic the issue and kill tasks at a >> certain stage. Killing executor

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Serega Sheypak
Happy killing... > > BR, > G > > > On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke wrote: > >> yarn application -kill applicationid ? >> >> > On 10.02.2019 at 13:30 Serega Sheypak > > wrote: >> > >> > Hi there! >> > I have weir

Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Serega Sheypak
Hi there! I have a weird issue that appears only when tasks fail at a specific stage. I would like to imitate the failure on my own. The plan is to run the problematic app and then kill an entire executor or some tasks when execution reaches a certain stage. Is it doable?
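One hedged option for the "kill an executor" half, not something suggested in this thread: SparkContext.killExecutors is a developer API that asks the cluster manager to tear an executor down. The executor id "1" below is purely illustrative (real ids are visible in the Spark UI), and timing it against a particular stage still has to be done by hand, e.g. from a second shell while the job runs.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("executor-kill-demo").getOrCreate()
  val sc = spark.sparkContext

  // Ask the cluster manager to kill executor "1"; returns true if the request
  // was acknowledged.
  sc.killExecutors(Seq("1"))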

Spark 2.x duplicates output when task fails at "repartition" stage. Checkpointing is enabled before repartition.

2019-02-05 Thread Serega Sheypak
Hi, I have a spark job that produces duplicates when one or more tasks from the repartition stage fail. Here is simplified code. sparkContext.setCheckpointDir("hdfs://path-to-checkpoint-dir") val inputRDDs: List[RDD[String]] = List.empty // an RDD per input dir val updatedRDDs = inputRDDs.map{ in
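The snippet is cut off by the archive; a hedged reconstruction of the job's shape (transform and the output path are placeholders, not the original code) looks roughly like this:

  import org.apache.spark.rdd.RDD

  sparkContext.setCheckpointDir("hdfs://path-to-checkpoint-dir")

  // One RDD per input dir, as in the post; each is checkpointed before the
  // union + repartition that later produces duplicates on task retry.
  val updatedRDDs: List[RDD[String]] = inputRDDs.map { rdd =>
    val updated = rdd.map(line => transform(line))
    updated.checkpoint() // materialized the first time an action computes this RDD
    updated
  }

  val result = sparkContext.union(updatedRDDs).repartition(10)
  result.saveAsTextFile("hdfs://path-to-output")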

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-23 Thread Serega Sheypak
I'm responding here in case you aren't watching that issue) > > On Tue, Jan 22, 2019 at 6:09 AM Jörn Franke wrote: > >> You can try with Yarn node labels: >> >> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html >>

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-21 Thread Serega Sheypak
Hi Apiros, thanks for your reply. Is it this one: https://github.com/apache/spark/pull/23223 ? Can I try to reach you through the Cloudera Support portal? Mon, Jan 21, 2019 at 20:06, attilapiros: > Hello, I was working on this area last year (I have developed the > YarnAllocatorBlacklistTracker) a

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-20 Thread Serega Sheypak
late such a > blacklist. > > If you can change yarn config, the equivalent is node label: > https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeLabel.html > > > > -- > *From:* Li Gao > *Sent:* Saturday, January 19, 2019 8:43 AM

Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-18 Thread Serega Sheypak
Hi, is there any possibility to tell the scheduler to blacklist specific nodes in advance?
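The node-label route suggested in the replies above, as a hedged spark-submit sketch: the label good_nodes, the class name, and the jar are all illustrative, and the labels must already be configured on the YARN side.

  spark-submit \
    --master yarn --deploy-mode cluster \
    --conf spark.yarn.am.nodeLabelExpression=good_nodes \
    --conf spark.yarn.executor.nodeLabelExpression=good_nodes \
    --class com.example.MyJob \
    my-job.jar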

Kill spark executor when spark runs specific stage

2018-07-04 Thread Serega Sheypak
Hi, I'm running spark on YARN. My code is very simple. I want to kill one executor when "data.repartition(10)" is executed. How can I do it in an easy way? val data = sc.sequenceFile[NullWritable, BytesWritable](inputPath) .map { case (key, value) => Data.fromBytes(value) } process = data.repartitio
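A hedged way to get this effect without any external kill command is to exit the executor JVM from inside a task of that stage. SparkEnv.get.executorId and System.exit are real APIs; the target id "1" is an arbitrary assumption, and Data.fromBytes/inputPath are taken as-is from the post.

  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.spark.SparkEnv

  val data = sc.sequenceFile[NullWritable, BytesWritable](inputPath)
    .map { case (_, value) => Data.fromBytes(value) } // as in the original job
    .map { d =>
      // Kill the whole executor JVM (not just the task) when executor "1"
      // runs a task of this stage.
      if (SparkEnv.get.executorId == "1") System.exit(1)
      d
    }

  val process = data.repartition(10)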

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-20 Thread Serega Sheypak
Ok, this one works: .withColumn("hour", hour(from_unixtime(typedDataset.col("ts") / 1000))) 2018-03-20 22:43 GMT+01:00 Serega Sheypak : > Hi, any updates? Looks like some API inconsistency or bug..? > > 2018-03-17 13:09 GMT+01:00 Serega Sheypak : > >>

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-20 Thread Serega Sheypak
Hi, any updates? Looks like some API inconsistency or bug..? 2018-03-17 13:09 GMT+01:00 Serega Sheypak : > > Not sure why you are dividing by 1000. from_unixtime expects a long type > It expects seconds, I have milliseconds. > > > > 2018-03-12 6:16 GMT+01:00 vermanurag :

Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Serega Sheypak
2018-03-19 13:41 GMT+01:00 Jörn Franke : > Maybe you should better run it in yarn cluster mode. Yarn client would > start the driver on the oozie server. > > On 19. Mar 2018, at 12:58, Serega Sheypak > wrote: > > I'm trying to run it as Oozie java action and reduce env d

Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Serega Sheypak
local mode but throws an exception when I try to run spark as yarn-client. Mon, Mar 19, 2018 at 7:16, Jacek Laskowski: > Hi, > > What's the deployment process then (if not using spark-submit)? How is the > AM deployed? Why would you want to skip spark-submit? > > Jacek >

Run spark 2.2 on yarn as usual java application

2018-03-18 Thread Serega Sheypak
Hi, Is it even possible to run spark on yarn as a usual java application? I've built a jar using maven with the spark-yarn dependency and I manually populate SparkConf with all hadoop properties. SparkContext fails to start with exception: 1. Caused by: java.lang.IllegalStateException: Library director
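A hedged sketch of one common workaround, not a confirmed fix for this exact stack trace: the "Library directory ... does not exist" error comes from the YARN client looking for a local Spark jars directory, and setting spark.yarn.jars explicitly is one way to bypass that lookup. The HDFS path below is an assumption.

  import org.apache.spark.{SparkConf, SparkContext}

  // Point the YARN client at pre-uploaded Spark jars instead of a local SPARK_HOME.
  val conf = new SparkConf()
    .setMaster("yarn")
    .setAppName("programmatic-yarn-app")
    .set("spark.yarn.jars", "hdfs:///apps/spark/jars/*")
  val sc = new SparkContext(conf)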

Re: Append more files to existing partitioned data

2018-03-18 Thread Serega Sheypak
anaged by > > job.getConfiguration.set(DATASOURCE_WRITEJOBUUID, uniqueWriteJobId.toString) > > > On 17 March 2018 at 20:46, Serega Sheypak > wrote: > >> Hi Denis, great to see you here :) >> It works, thanks! >> >> Do you know how spark generates datafile name

Re: Append more files to existing partitioned data

2018-03-17 Thread Serega Sheypak
> https://spark.apache.org/docs/latest/sql-programming-guide.html > Please try the SaveMode.Append option. Does it work for you? > > Sat, Mar 17, 2018, 15:19 Serega Sheypak: > >> Hi, I'm using spark-sql to process my data and store result as parquet >> p

Append more files to existing partitioned data

2018-03-17 Thread Serega Sheypak
Hi, I'm using spark-sql to process my data and store the result as parquet partitioned by several columns ds.write .partitionBy("year", "month", "day", "hour", "workflowId") .parquet("/here/is/my/dir") I want to run more jobs that will produce new partitions or add more files to existing partiti
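The SaveMode.Append suggestion from the reply above (confirmed as working later in the thread), written out against the same columns and path:

  import org.apache.spark.sql.SaveMode

  ds.write
    .mode(SaveMode.Append) // add new files/partitions instead of failing on the existing dir
    .partitionBy("year", "month", "day", "hour", "workflowId")
    .parquet("/here/is/my/dir")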

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-17 Thread Serega Sheypak
> Not sure why you are dividing by 1000. from_unixtime expects a long type It expects seconds, I have milliseconds. 2018-03-12 6:16 GMT+01:00 vermanurag : > Not sure why you are dividing by 1000. from_unixtime expects a long type > which is time in milliseconds from reference date. > > The foll

how "hour" function in Spark SQL is supposed to work?

2018-03-11 Thread Serega Sheypak
hi, desperately trying to extract hour from unix seconds. year, month, dayofmonth functions work as expected; the hour function always returns 0. val ds = dataset .withColumn("year", year(to_date(from_unixtime(dataset.col("ts") / 1000 .withColumn("month", month(to_date(from_unixtime(dataset.c
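For context on why hour() returned 0: to_date truncates the time-of-day, so hour(to_date(...)) is always 0; the fix posted later in the thread drops to_date for the hour column and divides ts by 1000 because from_unixtime expects seconds, not milliseconds. A sketch, assuming ts holds epoch milliseconds:

  import org.apache.spark.sql.functions.{from_unixtime, hour, month, to_date, year}

  // ts is epoch milliseconds; from_unixtime expects epoch seconds, hence / 1000.
  // to_date works for year/month, but it drops the time-of-day, which is why
  // hour(to_date(...)) always came back as 0.
  val ds = dataset
    .withColumn("year",  year(to_date(from_unixtime(dataset.col("ts") / 1000))))
    .withColumn("month", month(to_date(from_unixtime(dataset.col("ts") / 1000))))
    .withColumn("hour",  hour(from_unixtime(dataset.col("ts") / 1000)))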

Implement Dataset reader from SEQ file with protobuf to Dataset

2017-10-08 Thread Serega Sheypak
Hi, did anyone try to implement Spark SQL dataset reader from SEQ file with protobuf inside to Dataset? Imagine I have protobuf def Person - name: String - lastName: String - phones: List[String] and generated scala case class: case class Person(name:String, lastName: String, phones: List[Strin
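As far as I know there is no out-of-the-box reader for this combination; a hedged sketch of the usual manual route — read the sequence file as bytes, parse the protobuf, map to the case class — where PersonProto, its getters, and the path are all assumptions about the poster's setup:

  import scala.collection.JavaConverters._
  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.spark.sql.SparkSession

  case class Person(name: String, lastName: String, phones: List[String])

  val spark = SparkSession.builder().appName("seq-protobuf-to-dataset").getOrCreate()
  import spark.implicits._

  val people = spark.sparkContext
    .sequenceFile[NullWritable, BytesWritable]("/path/to/person.seq") // path is illustrative
    .map { case (_, bytes) =>
      // PersonProto.Person and its getters are assumptions about the generated code.
      val proto = PersonProto.Person.parseFrom(bytes.copyBytes())
      Person(proto.getName, proto.getLastName, proto.getPhonesList.asScala.toList)
    }
    .toDS() // Dataset[Person]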

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
-comparison-of > > performance of snappy and lzf were on-par to each other. > > Maybe lzf has lower memory requirement. > > On Wed, May 18, 2016 at 7:22 AM, Serega Sheypak > wrote: > >> Switching from snappy to lzf helped me: >> >> *spark.io.compression

Re: Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Switching from snappy to lzf helped me: *spark.io.compression.codec=lzf* Do you know why? :) I can't find exact explanation... 2016-05-18 15:41 GMT+02:00 Ted Yu : > Please increase the number of partitions. > > Cheers > > On Wed, May 18, 2016 at 4:17 AM, Serega Sheyp

Managed memory leak detected.SPARK-11293 ?

2016-05-18 Thread Serega Sheypak
Hi, please have a look at log snippet: 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128) 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output locations 16/05/18 03:27:16 INFO st
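For reference, the codec switch reported above to help, as a hedged spark-submit sketch (class and jar names are illustrative). The other suggestion in the thread, increasing the number of partitions, depends on how the shuffle is produced: spark.sql.shuffle.partitions for SQL shuffles, or an explicit repartition() otherwise.

  spark-submit \
    --conf spark.io.compression.codec=lzf \
    --class com.example.MyJob \
    my-job.jar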

Re: Why can't spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
ion of your main classname > --master yarn \ > --deploy-mode cluster \ > /home/hadoop/SparkSampleProgram.jar //location of your jar file > > Thanks > Raj > > > > Sent from Yahoo Mail. Get the app <https://yho.com/148vdq> > > > On Tuesday, May 17, 2016 6:03 P

Re: Why can't spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
spark-submit --conf "spark.driver.userClassPathFirst=true" --class com.MyClass --master yarn --deploy-mode client --jars hdfs:///my-lib.jar,hdfs:///my-seocnd-lib.jar jar-wth-com-MyClass.jar job_params 2016-05-17 15:41 GMT+02:00 Serega Sheypak : > https://issues.apache.org/jira

Re: Why can't spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
https://issues.apache.org/jira/browse/SPARK-10643 Looks like it's the reason... 2016-05-17 15:31 GMT+02:00 Serega Sheypak : > No, and it looks like a problem. > > 2.2. --master yarn --deploy-mode client > means: > 1. submit spark as yarn app, but spark-driver is started on l

Re: Why can't spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
"MySuperSparkJob" and MySuperSparkJob fails because it doesn't get jars, thay are all in HDFS and not accessible from local machine... 2016-05-17 15:18 GMT+02:00 Jeff Zhang : > Do you put your app jar on hdfs ? The app jar must be on your local > machine. > > On Tue, May

Why can't spark 1.6.0 use jar files stored on HDFS

2016-05-17 Thread Serega Sheypak
hi, I'm trying to: 1. upload my app jar files to HDFS 2. run spark-submit with: 2.1. --master yarn --deploy-mode cluster or 2.2. --master yarn --deploy-mode client specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar When spark job is submitted, SparkSubmit client outputs: Warn
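A hedged sketch of the combination that generally works: with --deploy-mode cluster the driver runs inside YARN and the hdfs:// paths are localized for it, whereas in client mode the driver on the submitting machine cannot load a main jar from HDFS (which is what the SPARK-10643 issue cited above is about). Paths and class name are the ones from the thread; placing the application jar itself on HDFS is my assumption.

  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.MyClass \
    --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar \
    hdfs:///my/home/jar-wth-com-MyClass.jar \
    job_params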

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
ht way to do a read). >> >> What version of Spark and Cassandra-connector are you using? >> Also, what do you get for "select count(*) from foo" -- is that just as >> bad? >> >> On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak > > wrote: >> >

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
path and this particular table is causing issues or are you trying to > figure out the right way to do a read). > > What version of Spark and Cassandra-connector are you using? > Also, what do you get for "select count(*) from foo" -- is that just as > bad? > > On We

spark-sql estimates Cassandra table with 3 rows as 8 TB of data

2015-06-17 Thread Serega Sheypak
Hi, spark-sql estimated the input for a Cassandra table with 3 rows as 8 TB; sometimes it's estimated as -167B. I run it on a laptop, I don't have 8 TB of space for the data.

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
Hi, can somebody suggest a way to reduce the number of tasks? 2015-06-15 18:26 GMT+02:00 Serega Sheypak : > Hi, I'm running spark sql against a Cassandra table. I have 3 C* nodes, each > of them has a spark worker. > The problem is that spark runs 869 tasks to read 3 lines: select

spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-15 Thread Serega Sheypak
Hi, I'm running spark sql against a Cassandra table. I have 3 C* nodes, each of them has a spark worker. The problem is that spark runs 869 tasks to read 3 lines: select bar from foo. I've tried these properties: #try to avoid 769 tasks per dummy select foo from bar query spark.cassandra.input.split.si

Re: Driver memory leak?

2015-04-29 Thread Serega Sheypak
>The memory leak could be related to this defect that was resolved in Spark 1.2.2 and 1.3.0. @Sean Will it be backported to CDH? I didn't find that bug in the CDH 5.4 release notes. 2015-04-29 14:51 GMT+02:00 Conor Fennell : > The memory leak could be

Re: history-server doesn't read logs which are on FS

2015-04-20 Thread Serega Sheypak
call sc.stop(), it doesn't know that those applications have been stopped. > > Note that in spark 1.3, the history server can also display running > applications (including completed applications, but that it thinks are > still running), which improves things a little bit. > >

history-server doesn't read logs which are on FS

2015-04-17 Thread Serega Sheypak
Hi, started the history-server. Here is the UI output: - *Event log directory:* file:/var/log/spark/applicationHistory/ No completed applications found! Did you specify the correct logging directory? Please verify your setting of spark.history.fs.logDirectory and whether you have the permissions to a
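For completeness, the settings that have to line up for applications to appear here (a sketch using the directory from the post); note also the reply above: applications only show as completed once they call sc.stop().

  # in the applications' spark-defaults.conf
  spark.eventLog.enabled           true
  spark.eventLog.dir               file:/var/log/spark/applicationHistory

  # on the history server
  spark.history.fs.logDirectory    file:/var/log/spark/applicationHistory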

Spark 1.2, trying to run spark-history as a service, spark-defaults.conf are ignored

2015-04-14 Thread Serega Sheypak
Here is a related problem: http://apache-spark-user-list.1001560.n3.nabble.com/Launching-history-server-problem-td12574.html but no answer. What I'm trying to do: wrap spark-history with an /etc/init.d script. Problems I have: can't make it read spark-defaults.conf. I've put this file here: /etc/spark/co
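A hedged workaround for the 1.2-era history server, rather than relying on spark-defaults.conf being picked up by the init script: export SPARK_HISTORY_OPTS from spark-env.sh, which the daemon start script passes through as -D system properties. The paths below are assumptions matching the history-server threads above.

  # /etc/spark/conf/spark-env.sh
  export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:/var/log/spark/applicationHistory -Dspark.history.ui.port=18080"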