In your first example, the root RDD has 1000 partitions. You then do a
shuffle (with repartitionAndSortWithinPartitions), which shuffles the data to
1000 reducers. Then you call coalesce, which asks Spark to launch only 20
reducers to process the data that was prepared for 1000 reducers. Since
the reduc
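A minimal sketch of that behaviour, assuming a toy pair RDD (the data,
partitioner, and output path here are illustrative, not from the original job):

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-demo").getOrCreate()
val sc = spark.sparkContext

// root RDD with 1000 partitions (toy data)
val rdd = sc.parallelize(1 to 1000000, 1000).map(i => (i, i.toString))

// shuffle to 1000 reducers, sorting within each partition
val shuffled = rdd.repartitionAndSortWithinPartitions(new HashPartitioner(1000))

// coalesce does not introduce a stage boundary, so the reduce side of the
// shuffle above actually runs with only 20 tasks; each task has to handle
// the data that was prepared for ~50 reducers
shuffled.coalesce(20).saveAsTextFile("hdfs:///tmp/coalesce-demo")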
How can I get a shuffle with 2048 partitions and 2048 tasks, and then a map
phase with 10 partitions and 10 tasks that writes to HDFS?
Every time I try to do this using coalesce, the shuffle ends up having 10
tasks, which is unacceptable because it leads to OOM. This makes coalesce
somewhat useless.
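A hedged sketch of the usual workaround: if the 2048 reduce tasks really have
to run, replace coalesce(10) with repartition(10) and accept one extra shuffle
(the input RDD, partitioner, and output path below are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def writeWithStableShuffle(rdd: RDD[(Long, String)]): Unit = {
  rdd
    // first shuffle: because repartition() below adds a stage boundary,
    // this stage really runs with 2048 tasks
    .repartitionAndSortWithinPartitions(new HashPartitioner(2048))
    // second shuffle: redistributes into 10 partitions, so 10 write tasks and
    // 10 output files (note: it does not preserve the per-partition sort order)
    .repartition(10)
    .saveAsTextFile("hdfs:///tmp/out-repartition")
}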
On Wed, Oct
Hi, Holden.
Since that's a performance regression at 2.4.0, I marked it as `Blocker` four days ago.
Bests,
Dongjoon.
On Fri, Oct 12, 2018 at 11:45 AM Holden Karau wrote:
> Following up I just wanted to make sure this new blocker that Dongjoon
> designated is surfaced -
> https://jira.apache.org/jira/browse
Following up I just wanted to make sure this new blocker that Dongjoon
designated is surfaced -
https://jira.apache.org/jira/browse/SPARK-25579?filter=12340409&jql=affectedVersion%20%3D%202.4.0%20AND%20cf%5B12310320%5D%20is%20EMPTY%20AND%20project%20%3D%20spark%20AND%20(status%20%3D%20%22In%20Progr
... Sorry for that, but there is a mistake in the second sample; here is the
right one:
// fails with either OOM or 'Container killed by YARN for exceeding
// memory limits ... spark.yarn.executor.memoryOverhead'
rdd
  .map(item => item._1.toString -> item._2.toString)
  .repartitionAndSortWithinParti
I'd like to reduce the number of files written to HDFS without introducing
additional shuffles while preserving the stability of the job, and I'd also
like to understand why the samples below work in one case and fail in the
other.
Consider the following example, which does the sa
Yeah, the "-" operator does not seem to be supported; however, you can use
the "datediff" function:
In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP),
                        CAST('2000-01-01 00:00:00' AS TIMESTAMP))
Out[9]: 31
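datediff only gives whole days; if sub-day precision is needed, one common
approach (not from this thread) is to compare epoch seconds with
unix_timestamp. A minimal sketch, using the same literal timestamps:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ts-diff").getOrCreate()

// difference in whole days, as in the reply above
spark.sql(
  """SELECT datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP),
    |                CAST('2000-01-01 00:00:00' AS TIMESTAMP)) AS days""".stripMargin).show()

// difference in seconds: unix_timestamp() converts a timestamp to epoch
// seconds, so subtracting the two values gives the interval length in seconds
spark.sql(
  """SELECT unix_timestamp(CAST('2000-02-01 12:34:34' AS TIMESTAMP))
    |     - unix_timestamp(CAST('2000-01-01 00:00:00' AS TIMESTAMP)) AS seconds""".stripMargin).show()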
I’ll be doing my regular weekly code review at 10am Pacific today -
https://youtu.be/IlH-EGiWXK8 with a look at the current RC, and in the
afternoon at 3pm Pacific I’ll be doing some live coding around a WIP graceful
decommissioning PR -
https://youtu.be/4FKuYk2sbQ8
--
Twitter: https://twitter.com/h
Hello Spark Community,
Currently in Hive we can do operations on timestamps, like:
CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS
TIMESTAMP)
It seems this is not supported in Spark.
Is there any way available?
Kindly provide some insight on this.
Paras
9130006036
Yeah, I thought we were already going to remove this. +1 for removing it
anyway.
On Fri, Oct 12, 2018 at 1:44 AM, Wenchen Fan wrote:
> Note that, it was deprecated in 2.3.0 already:
> https://spark.apache.org/docs/2.3.0/streaming-flume-integration.html
>
> On Fri, Oct 12, 2018 at 12:46 AM Reynold Xin
Hi,
I use the HCatalog Streaming Mutation API to write data to a Hive
transactional table, and then I use Spark SQL to read data from that
transactional table. I get the right result.
However, Spark SQL takes more time to read the Hive ORC bucketed
transactional table, because Spark SQL reads all columns
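For reference, a minimal sketch of the read side being described, with a
hypothetical database, table, and column names (the observation above is that
even a narrow projection like this ends up scanning all columns of the ORC
transactional table):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("acid-read")     // hypothetical app name
  .enableHiveSupport()      // needed to read Hive metastore tables
  .getOrCreate()

// hypothetical bucketed, transactional (ORC) Hive table that was populated
// through the HCatalog Streaming Mutation API
val df = spark.sql("SELECT id, amount FROM acid_db.orders WHERE ds = '2018-10-12'")
df.show()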