How to calculate percentiles in Scala Spark 2.4.x

2021-04-27 Thread Ivan Petrov
Hi, I have billions, potentially dozens of billions of observations. Each observation is a decimal number. I need to calculate percentiles 1, 25, 50, 75, 95 for these observations using Scala Spark. I can use both RDD and Dataset API. Whatever would work better. What I can do in terms of perf opti
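
A common way to do this in Spark 2.x (a sketch, not taken from the thread; the input path and column name are assumptions) is DataFrameStatFunctions.approxQuantile, which computes all five percentiles in one pass and trades accuracy for speed via the relativeError parameter:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("percentiles").getOrCreate()

    // Hypothetical input: one decimal observation per row, in column "value".
    val observations = spark.read.parquet("/path/to/observations")

    // relativeError = 0.001 keeps memory bounded even for tens of billions
    // of rows; pass 0.0 for an exact (much more expensive) computation.
    val Array(p1, p25, p50, p75, p95) = observations.stat.approxQuantile(
      "value", Array(0.01, 0.25, 0.50, 0.75, 0.95), 0.001)

For an RDD pipeline, converting to a DataFrame and using the percentile_approx SQL aggregate is the equivalent route.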

Re: How to control count / size of output files for

2021-02-25 Thread Ivan Petrov
; you reverse them you'll see 5 output files. > Regards, > Pietro > On Wed, Feb 24, 2021 at 16:43 Ivan Petrov wrote: >> Hi, I'm trying to control the size and/or count of spark output. >> Here is my code. I expect to
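
If the goal is exactly 5 output files that still keep repeated values together for snappy (a sketch under that assumption, not the thread's confirmed resolution): sort() after repartition(5) triggers a fresh range-partitioning shuffle (spark.sql.shuffle.partitions outputs, hence the dozens of small files), whereas sortWithinPartitions sorts each of the 5 partitions in place:

    dataset
      .repartition(5)
      .sortWithinPartitions("long_repeated_string_in_this_column")
      .write
      .parquet(outputPath)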

How to control count / size of output files for

2021-02-24 Thread Ivan Petrov
Hi, I'm trying to control the size and/or count of spark output. Here is my code. I expect to get 5 files but I get dozens of small files. Why?

    dataset
      .repartition(5)
      .sort("long_repeated_string_in_this_column") // should be better compressed with snappy
      .write
      .parquet(outputPath)

Re: Spark SQL Dataset and BigDecimal

2021-02-18 Thread Ivan Petrov
's needed for interoperability between > Scala/Java. > If it returns a Scala decimal, Java code cannot handle it. > If you want a Scala decimal, you need to convert it yourself. > Bests, > Takeshi > On Wed, Feb 17, 2021 at 9:48 PM Ivan Petrov wrote: >>
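
A minimal sketch of the conversion Takeshi describes (class and field names are illustrative; assumes spark: SparkSession is in scope):

    import java.math.{BigDecimal => JBigDecimal}
    import spark.implicits._

    case class Payment(id: String, amount: BigDecimal) // scala.math.BigDecimal
    val ds = Seq(Payment("a", BigDecimal("9.99"))).toDS()

    // Row-based access hands back java.math.BigDecimal for DecimalType:
    val j: JBigDecimal = ds.toDF().head().getAs[JBigDecimal]("amount")

    // Wrap it back into the Scala type yourself:
    val s: BigDecimal = BigDecimal(j)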

Spark SQL Dataset and BigDecimal

2021-02-17 Thread Ivan Petrov
Hi, I'm using the Spark Scala Dataset API to write Spark SQL jobs. I've noticed that a Spark Dataset accepts scala.math.BigDecimal as the value but always returns java.math.BigDecimal when you read it back. Is it by design? Should I use java.math.BigDecimal everywhere instead? Is there any performance pen

Re: Apache Spark

2021-01-26 Thread Ivan Petrov
Hello Andrey, you can try to reach Beeline (beeline.ru), they use Databricks as far as I know. Tue, Jan 26, 2021 at 15:01, Sean Owen: > To clarify: Apache projects and the ASF do not provide paid support. > However there are many vendors who provide distributions of Apache Spark > who will provi

Re: Dynamic Spark metrics creation

2021-01-17 Thread Ivan Petrov
Would a custom accumulator work for you? It should be doable for Map[String, Long] too: https://stackoverflow.com/questions/42293798/how-to-create-custom-set-accumulator-i-e-setstring Sun, Jan 17, 2021 at 15:16, "Yuri Oleynikov (יורי אולייניקוב)" <yur...@gmail.com>: > Hey Jacek, I'll clar
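
For reference, a minimal sketch of such an accumulator via AccumulatorV2, adapted from the approach in the linked answer (class and metric names are assumptions):

    import org.apache.spark.util.AccumulatorV2
    import scala.collection.mutable

    class MapAccumulator extends AccumulatorV2[(String, Long), Map[String, Long]] {
      private val underlying = mutable.Map.empty[String, Long].withDefaultValue(0L)

      override def isZero: Boolean = underlying.isEmpty
      override def copy(): MapAccumulator = {
        val acc = new MapAccumulator
        underlying.foreach { case (k, v) => acc.underlying(k) = v }
        acc
      }
      override def reset(): Unit = underlying.clear()
      // Missing keys default to 0L, so add/merge can blindly increment.
      override def add(kv: (String, Long)): Unit = underlying(kv._1) += kv._2
      override def merge(other: AccumulatorV2[(String, Long), Map[String, Long]]): Unit =
        other.value.foreach { case (k, v) => underlying(k) += v }
      override def value: Map[String, Long] = underlying.toMap
    }

    // Usage: val acc = new MapAccumulator
    //        sc.register(acc, "dynamicMetrics")
    //        rdd.foreach(x => acc.add(("someMetric", 1L)))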

Is there any good Docker container / compose with spark 2.4+ and YARN 2.8.2+

2020-09-16 Thread Ivan Petrov
Hi, looking for a ready-to-use Docker container that has inside:
- Spark 2.4 or higher
- YARN 2.8.2 or higher
I'm looking for a way to submit Spark jobs on YARN.

Re: Spark Application REST API, looking for a way to kill specific task or executor

2020-09-05 Thread Ivan Petrov
Nice, thanks! Sat, Sep 5, 2020 at 17:42, Sandeep Patra: > See if this helps: https://spark.apache.org/docs/latest/monitoring.html > On Sat, Sep 5, 2020 at 8:11 PM Ivan Petrov wrote: >> Hi, is there any API to: >> - get running tasks for a given Spark Applic

Spark Application REST API, looking for a way to kill specific task or executor

2020-09-05 Thread Ivan Petrov
Hi, is there any API to:
- get running tasks for a given Spark Application
- get available executors of a given Spark Application
- kill a task or executor?
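
The monitoring REST API is read-only; killing is only exposed programmatically on the driver. A sketch of what SparkContext offers (the task id and executor ids below are placeholders):

    // assuming sc: SparkContext is in scope

    // Read-only status:
    val activeStages = sc.statusTracker.getActiveStageIds()

    // Kill one task attempt (id taken from the UI or a SparkListener);
    // the scheduler will normally reschedule it:
    sc.killTaskAttempt(123L, interruptThread = true, reason = "manual kill")

    // Kill executors by id (developer API; under dynamic allocation the
    // cluster manager may request replacements):
    sc.killExecutors(Seq("1", "2"))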

Some sort of chaos monkey for spark jobs, do we have it?

2020-08-27 Thread Ivan Petrov
Hi, I'm feeling pain while trying to insert 2-3 million records into Mongo using plain Spark RDD. There were so many hidden problems. I would like to avoid this in the future and am looking for a way to kill individual Spark tasks at a specific stage and verify the expected behaviour of my Spark job. idea
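
There is no built-in chaos monkey for Spark tasks; one improvised approach (a sketch, with the target stage id as a placeholder) is a SparkListener that kills the first task attempt it sees start in a chosen stage:

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

    // Kills tasks of one stage as they start, to exercise retry behaviour.
    class ChaosListener(sc: SparkContext, targetStage: Int) extends SparkListener {
      override def onTaskStart(taskStart: SparkListenerTaskStart): Unit =
        if (taskStart.stageId == targetStage) {
          sc.killTaskAttempt(taskStart.taskInfo.taskId,
            interruptThread = true, reason = "chaos monkey")
        }
    }

    // Registered on the driver before running the job:
    // sc.addSparkListener(new ChaosListener(sc, targetStage = 3))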

Re: RDD which was checkpointed is not checkpointed

2020-08-19 Thread Ivan Petrov
a> y.count > res13: Long = 2 > Notice that we were able to skip the first stage because when Stage 11 looked for its dependencies it found a checkpointed version of the partitioned data, so it didn't need to repartition again. This

Re: RDD which was checkpointed is not checkpointed

2020-08-19 Thread Ivan Petrov
> Call an action twice. The second run should use the checkpoint. > On Wed, Aug 19, 2020, 8:49 AM Ivan Petrov wrote: >> I think it returns Unit... it won't work >> I found another way to make it work. Called action aft

RDD which was checkpointed is not checkpointed

2020-08-19 Thread Ivan Petrov
Hi! Seems like I'm doing something wrong. I call .checkpoint() on an RDD, but it's not checkpointed... What do I do wrong?

    val recordsRDD = convertToRecords(anotherRDD)
    recordsRDD.checkpoint()
    logger.info("checkpoint done")
    logger.info(s"isCheckpointed? ${recordsRDD.isCheckpointed}, getCheckpointFile: ${recor
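
For context: checkpoint() only marks the RDD; the data is written on the next action, so isCheckpointed stays false until one runs. A minimal sketch reusing the names above (checkpoint directory is an assumption):

    sc.setCheckpointDir("/tmp/checkpoints") // required once per context

    val recordsRDD = convertToRecords(anotherRDD)
    recordsRDD.checkpoint()  // only marks the RDD; returns Unit
    recordsRDD.count()       // first action materializes the checkpoint
    assert(recordsRDD.isCheckpointed)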

Is there any possibility to avoid double computation in case of RDD checkpointing

2020-08-16 Thread Ivan Petrov
Hi! I use RDD checkpoint before writing to Mongo to avoid duplicate records in the DB. Seems like the driver writes the same data twice in case of task failure:
- data calculated
- mongo _id created
- spark mongo connector writes data to Mongo
- task crashes
- (BOOM!) spark recomputes partition and gets ne
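
One way to freeze the generated _ids before the side-effecting write (a sketch; attachMongoId and saveToMongo are hypothetical helpers, and the checkpoint path is an assumption) is to checkpoint and materialize the RDD first, so a retried write task replays the checkpointed partition instead of recomputing it:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // path is an assumption

    val withIds = records.map(attachMongoId) // hypothetical: assigns _id per record
    withIds.checkpoint()
    withIds.count()                          // materialize before the write
    saveToMongo(withIds)                     // hypothetical writer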

Re: scala RDD[MyCaseClass] to Dataset[MyCaseClass] performance

2020-07-13 Thread Ivan Petrov
't toDS() do this without conversion? > On Mon, Jul 13, 2020 at 5:25 PM Ivan Petrov wrote: > > Hi! > I'm trying to understand the cost of RDD to Dataset conversion. > It takes me 60 minutes to create an RDD[MyCaseClass] with 500.000.000.000 record

scala RDD[MyCaseClass] to Dataset[MyCaseClass] performance

2020-07-13 Thread Ivan Petrov
Hi! I'm trying to understand the cost of RDD to Dataset conversion. It takes me 60 minutes to create an RDD[MyCaseClass] with 500.000.000.000 records. It takes around 15 minutes to convert them to Dataset[MyCaseClass]. The schema of MyCaseClass is str01: String, str02: String, str03: String, str04: Strin
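
For reference, both conversion routes below go through the implicit Encoder[MyCaseClass], which serializes every object into Spark's internal row format; that per-record encoding is the bulk of the conversion cost (a sketch; rdd stands for the RDD[MyCaseClass] in question and spark: SparkSession is assumed in scope):

    import spark.implicits._

    val ds1 = rdd.toDS()               // implicit conversion + implicit encoder
    val ds2 = spark.createDataset(rdd) // equivalent, explicit form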

sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
Hi there! I'm seeing this exception in Spark Driver log. Executor log stays empty. No exceptions, nothing. 8 tasks out of 402 failed with this exception. What is the right way to debug it? Thank you. I see that spark/jars -> minlog-1.3.0.jar is in driver classpath at least... java.lang.NoClas

Re: sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
spark/jars -> minlog-1.3.0.jar. I see that the jar is there. What do I do wrong? Thu, Jul 9, 2020 at 20:43, Ivan Petrov: > Hi there! > I'm seeing this exception in Spark Driver log. > Executor log stays empty. No exceptions, nothing. > 8 tasks out of 402 failed with this exce

sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
Hi there! I'm seeing this exception in Spark Driver log. Executor log stays empty. No exceptions, nothing. 8 tasks out of 402 failed with this exception. What is the right way to debug it? Thank you. java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log at com.esotericsoftware.kryo.seria
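
When only some tasks fail with NoClassDefFoundError, a quick diagnostic (a sketch, not from the thread) is to probe every executor for the class; hosts reporting "missing" point at an uneven classpath, typically fixed by shipping the jar with --jars or spark.executor.extraClassPath:

    // Run one small task per partition and try to load the class there.
    val report = sc.parallelize(1 to 402, 402).mapPartitions { _ =>
      val status =
        try { Class.forName("com.esotericsoftware.minlog.Log"); "ok" }
        catch { case _: LinkageError | _: ClassNotFoundException => "missing" }
      Iterator(java.net.InetAddress.getLocalHost.getHostName -> status)
    }.collect().toMap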