Hi, I have billions, potentially tens of billions of observations. Each
observation is a decimal number.
I need to calculate percentiles 1, 25, 50, 75 and 95 for these observations
using Scala Spark. I can use either the RDD or the Dataset API, whichever works
better.
What can I do in terms of performance optimization?
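One possible approach, sketched here rather than taken from the thread: DataFrame
approxQuantile computes Greenwald-Khanna approximate percentiles in a single pass
instead of a global sort, which matters at tens of billions of rows. The column
name "value", the input path and the 0.001 relative error are assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("percentiles").getOrCreate()

// hypothetical input: one decimal observation per row in a column named "value"
val observations = spark.read.parquet("/path/to/observations")

// approximate quantiles; the last argument is the allowed relative error
// (passing 0.0 forces an exact, far more expensive computation)
val Array(p01, p25, p50, p75, p95) =
  observations.stat.approxQuantile("value", Array(0.01, 0.25, 0.50, 0.75, 0.95), 0.001)

println(s"p1=$p01 p25=$p25 p50=$p50 p75=$p75 p95=$p95")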
> ... if you reverse them you'll see 5 output files.
>
> Regards,
> Pietro
>
> On Wed, Feb 24, 2021 at 16:43 Ivan Petrov wrote:
>
>> Hi, I'm trying to control the size and/or count of spark output.
>>
>> Here is my code. I expect to get 5 files but I get dozens of small files. Why?
Hi, I'm trying to control the size and/or count of spark output.
Here is my code. I expect to get 5 files but I get dozens of small files.
Why?
dataset
  .repartition(5)
  .sort("long_repeated_string_in_this_column") // should be better compressed with snappy
  .write
  .parquet(outputPath)
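The likely cause is that the global sort() shuffles the data again into
spark.sql.shuffle.partitions partitions (200 by default), discarding the
repartition(5). A hedged sketch of one way to keep exactly 5 files while still
sorting rows inside each file, reusing the same column and output path:

import org.apache.spark.sql.functions.col

dataset
  .repartition(5, col("long_repeated_string_in_this_column")) // exactly 5 output partitions
  .sortWithinPartitions("long_repeated_string_in_this_column") // sorts without another shuffle
  .write
  .parquet(outputPath)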
> It's needed for interoperability between
> Scala and Java.
> If it returns a Scala decimal, Java code cannot handle it.
>
> If you want a Scala decimal, you need to convert it yourself.
>
> Bests,
> Takeshi
>
> On Wed, Feb 17, 2021 at 9:48 PM Ivan Petrov wrote:
>
>>
Hi, I'm using the Spark Scala Dataset API to write Spark SQL jobs.
I've noticed that a Spark Dataset accepts scala.math.BigDecimal as a value, but
it always returns java.math.BigDecimal when you read it back.
Is it by design?
Should I use java.math.BigDecimal everywhere instead?
Is there any performance penalty?
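For what it's worth, scala.math.BigDecimal is a thin wrapper around
java.math.BigDecimal, so converting at the boundary is cheap. A small
illustration with made-up values:

import java.math.{BigDecimal => JBigDecimal}

val fromSpark: JBigDecimal = new JBigDecimal("42.42") // what the Dataset hands back
val asScala: BigDecimal = BigDecimal(fromSpark)       // wrap into scala.math.BigDecimal
val backToJava: JBigDecimal = asScala.bigDecimal      // unwrap again, no value copy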
Hello Andrey,
you can try to reach out to Beeline (beeline.ru); they use Databricks as far as
I know.
Tue, Jan 26, 2021 at 15:01, Sean Owen:
> To clarify: Apache projects and the ASF do not provide paid support.
> However, there are many vendors who provide distributions of Apache Spark
> and who will provide support.
Would a custom accumulator work for you? It should be doable for
Map[String, Long] too; see for example:
https://stackoverflow.com/questions/42293798/how-to-create-custom-set-accumulator-i-e-setstring
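A rough sketch of what such an accumulator could look like with AccumulatorV2
(class name and usage are illustrative, not from the thread):

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Counts occurrences per key; partial maps from different tasks are merged by summing.
class MapAccumulator extends AccumulatorV2[(String, Long), Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  override def isZero: Boolean = counts.isEmpty
  override def copy(): MapAccumulator = {
    val acc = new MapAccumulator
    acc.counts ++= counts
    acc
  }
  override def reset(): Unit = counts.clear()
  override def add(kv: (String, Long)): Unit = counts(kv._1) += kv._2
  override def merge(other: AccumulatorV2[(String, Long), Map[String, Long]]): Unit =
    other.value.foreach { case (k, v) => counts(k) += v }
  override def value: Map[String, Long] = counts.toMap
}

// usage (names are hypothetical):
// val acc = new MapAccumulator
// spark.sparkContext.register(acc, "perKeyCounts")
// rdd.foreach(record => acc.add(record.key -> 1L))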
Sun, Jan 17, 2021 at 15:16, "Yuri Oleynikov (יורי אולייניקוב)" <
yur...@gmail.com>:
> Hey Jacek, I'll clarify...
Hi,
looking for a ready-to-use Docker container that includes:
- spark 2.4 or higher
- yarn 2.8.2 or higher
I'm looking for a way to submit spark jobs on yarn.
Nice, thanks!
Sat, Sep 5, 2020 at 17:42, Sandeep Patra:
> See if this helps: https://spark.apache.org/docs/latest/monitoring.html .
>
> On Sat, Sep 5, 2020 at 8:11 PM Ivan Petrov wrote:
>
>> Hi, is there any API to:
>> - get running tasks for a given Spark Application
Hi, is there any API to:
- get running tasks for a given Spark Application
- get available executors of a given Spark Application
- kill task or executor?
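A hedged sketch of the in-application options available on SparkContext (the
monitoring REST API linked above covers similar read-only information from
outside the application); the task and executor ids below are placeholders:

val sc = spark.sparkContext // assuming an active SparkSession named `spark`

// running tasks per active stage
sc.statusTracker.getActiveStageIds().foreach { stageId =>
  sc.statusTracker.getStageInfo(stageId).foreach { info =>
    println(s"stage $stageId: ${info.numActiveTasks()} active tasks")
  }
}

// known executors
sc.statusTracker.getExecutorInfos.foreach(e => println(s"executor at ${e.host()}:${e.port()}"))

// kill a specific task attempt or executor (ids are made up)
sc.killTaskAttempt(taskId = 123L, interruptThread = true, reason = "killed manually")
sc.killExecutor("2")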
Hi, I'm feeling pain while trying to insert 2-3 million records into
Mongo using a plain Spark RDD. There were so many hidden problems.
I would like to avoid this in the future and am looking for a way to kill
individual Spark tasks at a specific stage and verify the expected behaviour of
my Spark job.
Any ideas?
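One possible approach for that kind of test, sketched under assumptions: a
SparkListener that kills the first task attempt of a chosen stage via
SparkContext.killTaskAttempt, letting Spark's retry run so the job's behaviour
can be verified afterwards. The class name and target stage id are made up:

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

// Kills exactly one task attempt of the target stage to exercise retry behaviour.
class KillOneTaskListener(sc: SparkContext, targetStageId: Int) extends SparkListener {
  @volatile private var killed = false

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    if (!killed && taskStart.stageId == targetStageId) {
      killed = true
      sc.killTaskAttempt(taskStart.taskInfo.taskId, interruptThread = true,
        reason = "fault-injection test")
    }
  }
}

// sc.addSparkListener(new KillOneTaskListener(sc, targetStageId = 3))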
> y.count
> res13: Long = 2
>
> [image: image.png]
>
> Notice that we were able to skip the first stage because when Stage 11
> looked for its dependencies it
> found a checkpointed version of the partitioned data so it didn't need to
> repartition again. This ...
> Call an action twice. The second run should use the checkpoint.
>
>
>
> On Wed, Aug 19, 2020, 8:49 AM Ivan Petrov wrote:
>
>> I think it returns Unit... it won't work
>> [image: image.png]
>>
>> I found another way to make it work. Called an action after the checkpoint.
Hi!
Seems like I'm doing something wrong. I call .checkpoint() on an RDD, but it's
not checkpointed...
What am I doing wrong?
val recordsRDD = convertToRecords(anotherRDD)
recordsRDD.checkpoint()
logger.info("checkpoint done")
logger.info(s"isCheckpointed? ${recordsRDD.isCheckpointed}, " +
  s"getCheckpointFile: ${recordsRDD.getCheckpointFile}")
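A sketch of the usual fix, assuming the cause is that checkpoint() is lazy: a
checkpoint directory has to be set and an action has to run before anything is
written. It reuses the names from the snippet above; sc, logger and the HDFS
path are assumptions:

sc.setCheckpointDir("hdfs:///tmp/checkpoints") // must be set before checkpointing

val recordsRDD = convertToRecords(anotherRDD)
recordsRDD.cache()      // avoids recomputing the lineage twice (job + checkpoint)
recordsRDD.checkpoint() // only marks the RDD; nothing is written yet
recordsRDD.count()      // an action triggers the actual checkpoint

logger.info(s"isCheckpointed? ${recordsRDD.isCheckpointed}") // true after the action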
Hi!
I use RDD checkpoint before writing to Mongo to avoid duplicate records in the
DB. Seems like the driver writes the same data twice in case of a task failure:
- data calculated
- mongo _id created
- spark mongo connector writes data to Mongo
- task crashes
- (BOOM!) spark recomputes the partition and gets new _ids ...
> Doesn't toDS() do this without conversion?
>
> On Mon, Jul 13, 2020 at 5:25 PM Ivan Petrov wrote:
> >
> > Hi!
> > I'm trying to understand the cost of RDD to Dataset conversion
> > It takes me 60 minutes to create RDD [MyCaseClass] with 500.000.000.000
> > records
Hi!
I'm trying to understand the cost of RDD to Dataset conversion.
It takes me 60 minutes to create an RDD[MyCaseClass] with 500.000.000.000
records.
It takes around 15 minutes to convert them to Dataset[MyCaseClass].
The schema of MyCaseClass is:
str01: String,
str02: String,
str03: String,
str04: String ...
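For reference, a rough sketch of the conversion itself: toDS() is not free
because every record is run through the implicit Encoder and re-encoded into
Spark's internal binary row format. The case class mirrors the first four
string fields shown; the sample data and session setup are illustrative:

import org.apache.spark.sql.{Dataset, SparkSession}

case class MyCaseClass(str01: String, str02: String, str03: String, str04: String)

val spark = SparkSession.builder().appName("rdd-to-ds").getOrCreate()
import spark.implicits._

// tiny stand-in for the real 500.000.000.000-record RDD
val rdd = spark.sparkContext.parallelize(Seq(MyCaseClass("a", "b", "c", "d")))

// each element is encoded into an internal row via Encoder[MyCaseClass]
val ds: Dataset[MyCaseClass] = rdd.toDS()
// equivalent: spark.createDataset(rdd)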
Hi there!
I'm seeing this exception in the Spark driver log.
Executor log stays empty. No exceptions, nothing.
8 tasks out of 402 failed with this exception.
What is the right way to debug it?
Thank you.
I see that
spark/jars -> minlog-1.3.0.jar
is in the driver classpath at least...
java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log
spark/jars -> minlog-1.3.0.jar
I see that the jar is there. What am I doing wrong?
Thu, Jul 9, 2020 at 20:43, Ivan Petrov:
> Hi there!
> I'm seeing this exception in the Spark driver log.
> Executor log stays empty. No exceptions, nothing.
> 8 tasks out of 402 failed with this exception.
Hi there!
I'm seeing this exception in the Spark driver log.
Executor log stays empty. No exceptions, nothing.
8 tasks out of 402 failed with this exception.
What is the right way to debug it?
Thank you.
java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log
at com.esotericsoftware.kryo.seria...