There is a very good reason for this. When using k8s it is recommended that you set
the memory request and limit to the same value, set a CPU request, but not a CPU
limit. More info here: https://home.robusta.dev/blog/kubernetes-memory-limit
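For illustration only, here is a minimal fabric8-style sketch of what that recommendation looks like for an executor pod's resources section; the 4Gi / 2-core values are arbitrary assumptions, not defaults from this thread:
```
import io.fabric8.kubernetes.api.model.{Quantity, ResourceRequirementsBuilder}

// Memory request == memory limit; CPU request set; no CPU limit added.
val resources = new ResourceRequirementsBuilder()
  .addToRequests("memory", new Quantity("4Gi"))
  .addToLimits("memory", new Quantity("4Gi"))
  .addToRequests("cpu", new Quantity("2"))
  .build()
```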
BR, Martin
Hi Mich,
The issue here is that there is no parameter to set the executor pod memory request
value.
Currently we have only one parameter, spark.executor.memory, and it sets
both the pod resource limit and request.
Mich Talebzadeh wrote on Fri, 10 Mar 2023 at 22:04:
> Yes, both EKS and
and If you look at the code
https://github.com/apache/spark/blob/e64262f417bf381bdc664dfd1cbcfaa5aa7221fe/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L194

    .editOrNewResources()
      .addToRequests("memory", executorMemoryQuantity)
      .addToLimits("memory", executorMemoryQuantity)
      .addToRequests("cpu", executorCpuQuantity)
      .addToLimits(executorResourceQuantities.asJava)
Hi,
There are CPU parameters for the Spark executor on k8s,
spark.kubernetes.executor.limit.cores and
spark.kubernetes.executor.request.cores,
but there is no parameter to set a memory request different from the memory limit
(such as a spark.kubernetes.executor.request.memory).
For that reason
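To make the asymmetry concrete, a minimal sketch using the configuration keys named in this thread; the values are arbitrary, and the request.memory key in the comment is the missing, hypothetical one:
```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")                    // becomes both the pod memory request and limit
  .set("spark.kubernetes.executor.request.cores", "2")   // pod CPU request
  .set("spark.kubernetes.executor.limit.cores", "4")     // pod CPU limit (optional)
// There is no analogous "spark.kubernetes.executor.request.memory" key today,
// so the memory request cannot be set lower than the memory limit.
```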
Hi, vtygoss,
As I recall, the memoryOverhead in Spark 2.3 includes all memory that is
not executor on-heap memory, including the memory used by Spark's
off-heap memory pool (executorOffHeapMemory, a concept that also exists in Spark
2.3), the PySpark worker, PipeRDD usage, the Netty memory pool, and the JVM
Hi, community!
I noticed a change in how the YARN container memory is computed between spark-2.3.0
and spark-3.2.1 when requesting containers from YARN.
org.apache.spark.deploy.yarn.Client.scala # verifyClusterResources
```
// spark-2.3.0
val executorMem = executorMemory + executorMemoryOverhead
```
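For reference, a back-of-the-envelope sketch of the 2.3-style sum, assuming the documented default overhead of max(384 MB, 10% of executor memory); the 10 GB heap is an arbitrary example:
```
// Hypothetical example: spark.executor.memory = 10g, no explicit memoryOverhead set.
val executorMemoryMb         = 10 * 1024                                         // 10240 MB heap
val executorMemoryOverheadMb = math.max(384L, (executorMemoryMb * 0.10).toLong)  // 1024 MB
val executorMemMb            = executorMemoryMb + executorMemoryOverheadMb       // 11264 MB asked of YARN
```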
Following on from the above, I did some tests on this, leaving 1 VCPU out
of the 4 VCPUS to the OS on each node (container & executor) in a three-node GKE
cluster. The RAM allocated to each node was 16GB. I then set the initial
container AND executor memory to 10% of RAM and incremented these
Thanks for the suggestions. I suppose I should share a bit more about what
I tried/learned, so others who come later can understand why a
memory-efficient, exact median is not in Spark.
Spark's own ApproximatePercentile also uses QuantileSummaries internally
<https://github.com/apache/sp
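For anyone landing here later, a minimal sketch of using that approximate implementation directly (it is exposed in SQL as percentile_approx); the table and column names are made up, and `spark` is the usual spark-shell SparkSession:
```
// Approximate median without holding the whole column in memory.
val medianDf = spark.sql(
  "SELECT percentile_approx(latency_ms, 0.5) AS approx_median FROM events")
medianDf.show()
```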
> ... trying to create a new aggregate function. It's my first time
> working with Catalyst, so it's exciting---but I'm also in a bit over my
> head.
>
> My goal is to create a function to calculate the median
> <https://issues.apache.org/jira/browse/SPARK-26589>.
>
> As a very simple solution, I could just define median to be an alias of
> `Percentile(col, 0.5)`. However, the leading comment on the Percentile
> expression
> <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39>
> highlights that it's very memory-intensive and can easily lead to
> OutOfMemory errors.
Hi,
I have a three-node k8s cluster (GKE) in Google Cloud with E2
standard machines that have 4 GB of system memory per VCPU, giving 4 VCPUs
and 16,384 MB of RAM per node.
An optimum sizing of the number of executors, CPU and memory allocation is
important here (a rough sizing sketch follows below). These are the assumptions:
1. You want to
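A back-of-the-envelope sketch of the per-node arithmetic used in this kind of sizing; the reservation fractions are assumptions for illustration only:
```
// One E2 node: 4 vCPUs, 16,384 MB RAM.
val totalCores      = 4
val osReservedCores = 1                              // left to the OS / kubelet, as in the tests above
val executorCores   = totalCores - osReservedCores   // 3 cores for the executor pod

val nodeRamMb       = 16384
val osReservedMb    = (nodeRamMb * 0.20).toInt       // assume ~20% held back for OS and k8s daemons
val executorPodMb   = nodeRamMb - osReservedMb       // ~13,100 MB left to request for the pod
```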
So instead of using Percentile, I'm trying to create an Expression that
calculates the median without needing to hold everything in memory at once.
I'm considering t
Hello all,
We've been working with PySpark and Pandas, and have found that to
convert a dataset using N bytes of memory to Pandas, we need to have
2N bytes free, even with the Arrow optimization enabled. The
fundamental reason is ARROW-3789[1]: Arrow does not free the Arrow
table until conve
Subject: Re: [Spark Core] Merging PR #23340 for New Executor Memory Metrics
Hi, Alex and Michel.
I removed the `Stale` label and reopened it for now. You may want to ping the
original author, because the last update of that PR was one year ago and it has many
conflicts as of today.
Bests,
Dongjoon
To: dev@spark.apache.org ; Alex Scammon
Subject: Re: [Spark Core] Merging PR #23340 for New Executor Memory Metrics
Hey Dev team,
I agreed with Alex, these metrics can be really useful to tune jobs.
Any chance someone can have a look at it?
Thanks,
Michel
On Monday, 22 June 2020 at 22:48:23
Hi there devs,
Congrats on Spark 3.0.0, that's great to see.
I'm hoping to get some eyes on something old, however:
* https://github.com/apache/spark/pull/23340
I'm really just trying to get some eyes on this PR and see if we can still move
it forward. I reached out to the reviewers of th
Hi all,
I am new to the Spark community. Please ignore if this question doesn't make
sense.
My PySpark DataFrame spends only a fraction of the time (in ms) in 'Sorting',
but moving the data is much more expensive (> 14 sec).
Explanation:
I have a huge Arrow RecordBatches collection which is equally
Hi all:
I want to ask a question about the metrics that show whether an executor's memory
is fully used. In the log I always see the following, which I guess means I did not
fully use the executor's memory.
But I don't want to have to open the log to check; is there any metric to sho
Hi All,
I am getting an Out Of Memory error due to GC overhead while reading a Hive table
from Spark like:
spark.sql("SELECT * FROM some.table WHERE date='2019-05-14' LIMIT 10").show()
When I run the above command in spark-shell, it starts processing *1780
tasks*
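Not a definitive fix, but a sketch of an equivalent DataFrame form that can avoid planning a LIMIT over the whole table, since show(n) fetches rows incrementally; the table and filter are reused from the message above, and `spark` is the spark-shell session:
```
val df = spark.table("some.table").where("date = '2019-05-14'")
df.show(10)   // pulls a handful of rows instead of an explicit SQL LIMIT over all partitions
```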
Hi all,
I've been hitting this issue, and hoping to get some traction going at:
https://issues.apache.org/jira/browse/SPARK-21492
and PR: https://github.com/apache/spark/pull/23762
If SortMergeJoinScanner doesn't consume the iterator from
UnsafeExternalRowSorter entirely, the m
Hello,
I am sorry, my first explanation was not concrete. I will
explain further about TaskMemoryManager. TaskMemoryManager manages
the execution memory of each task in the application as follows:
1. MemoryConsumer is the entry point for a Spark task's memory use.
MemoryConsumer requests
What do you mean by "Tungsten Consumer"?
On Fri, Feb 8, 2019 at 6:11 PM Jack Kolokasis
wrote:
Hello all,
I am studying the Tungsten project and I am wondering when Spark
creates a Tungsten consumer. While running some applications, I see
that Spark creates a Tungsten consumer, while in other applications it does not
(using the same configuration). When does this happen?
I am looking
I believe I have uncovered a strange interaction between pySpark, Numpy and
Python which produces a memory leak. I wonder if anyone has any ideas of
what the issue could be?
I have the following minimal working example ( gist of code
<https://gist.github.com/jos
Hey Peter, in the SparkRDMA shuffle plugin (
https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file to
do Remote Direct Memory Access. If the shuffle data is bigger than RAM,
the Mellanox NIC supports On Demand Paging, where the OS invalidates translations
which are no longer valid d
I would be very interested in the initial question here:
is there a production-level, configurable implementation of memory-only shuffle
(similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels),
as mentioned in this ticket,
https://github.com/apache/spark/pull/5403 ?
It would be
Hello,
I recently started studying Spark's memory management system.
More specifically, I want to understand how Spark uses off-heap memory.
Internally I saw that there are two types of off-heap memory
(offHeapExecutionMemoryPool and offHeapStorageMemoryPool).
How Spark us
s._2.split("[^A-Za-z']+".replaceAll("""\n""","
")))
Thanks
On Sat, Aug 25, 2018 at 3:38 PM Chetan Khatri
wrote:
Hello Spark Dev Community,
A friend of mine is facing an issue while reading 20 GB of log files from a
directory on the cluster.
The approaches are as below:
*1. This gives an out of memory error.*
val logRDD =
sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val
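Not from the original thread, but a commonly suggested lighter-weight sketch: sc.textFile streams individual lines instead of materialising each whole file as one record, which usually reduces per-record memory pressure (path reused from above, token split mirroring the snippet quoted earlier in the thread):
```
// Each record is one line, not one whole file.
val lineRDD = sc.textFile("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val wordCounts = lineRDD
  .flatMap(_.split("[^A-Za-z']+"))
  .map(w => (w, 1L))
  .reduceByKey(_ + _)
```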
Hello,
I recently started studying Spark's memory management system. My
question is about the offHeapExecutionMemoryPool and
offHeapStorageMemoryPool.
1. How does Spark use the offHeapExecutionMemoryPool?
2. How is the off-heap memory used (I understand the allocation side),
but
*I’ve been looking at where untracked memory is getting used in spark,
especially offheap memory, and I’ve discovered some things I’d like to
share with the community. Most of what I’ve learned has been about the way
spark is using netty -- I’ll go into some more detail about that below. I’m
also
In fact, not all tasks belong to the same stage. Thus, each task may have a
different memory requirement. For example, suppose the executor
is running two tasks (A and B) and the ExecutionMemoryPool owns 1000M. We
might want task A to occupy 900M and task B to occupy 100M, because task A
needs much
it was helpful.
Then, the OS needs to feel some pressure from the applications
requesting memory before it frees some of the memory cache?
Exactly under which circumstances does the OS free that memory to give it to
the applications requesting it?
I mean, if the total memory is 16GB and 10GB are used for the OS cache
Hi,
When I issue a "free -m" command on a host, I see a lot of memory used
for the OS cache; however, Spark Streaming is not able to request that
memory for its usage, and it fails the execution due to not being able to
launch executors.
What I understand of the OS memory cache (
Hi All,
We are running spark 2.1.1 on Hadoop YARN 2.6.5.
We found that the pyspark.daemon process consumes more than 300GB of memory.
However, according to
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals, the
daemon process shouldn't have this problem.
Also, we find the d
Hi there,
A while ago, while running GraphX jobs, I discovered that
PeriodicRDDCheckpointer fails with FileNotFoundExceptions in case of
insufficient memory resources.
I believe that any iterative job which uses PeriodicRDDCheckpointer
(like ML) suffers from the same issue (but it
implemented for this too. I agree that serializing the
data to a pandas dataframe or numpy array is faster and less memory
intensive.
Please send a PR. Thanks for looking at this.
On Thu, Nov 16, 2017 at 7:27 AM Andrew Andrade
wrote:
Hello devs,
I know a lot of great work has been done recently with pandas to spark
dataframes and vice versa using Apache Arrow, but I faced a specific pain
point on a low memory setup without Arrow.
Specifically I was finding a driver OOM running a toPandas on a small
dataset (<100
Hi all,
could someone please help me understand the broadcast life cycle in detail,
especially with regard to memory management?
After reading through the TorrentBroadcast implementation, it seems that
for every broadcast object, the driver holds a strong reference to a
shallow copy (in
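For orientation, a minimal sketch of the user-facing side of that life cycle; the data and names are toy examples, and this only shows the explicit release calls rather than the internal reference handling being asked about (sc is the spark-shell SparkContext):
```
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))        // driver registers the value with its BlockManager
val rdd    = sc.parallelize(Seq("a", "b", "c"))
val counts = rdd.map(x => lookup.value.getOrElse(x, 0))   // executors lazily fetch and cache the blocks
counts.collect()

lookup.unpersist()  // drop cached copies on the executors; re-fetched if used again
lookup.destroy()    // drop every copy, including the driver's; further use will fail
```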
Thanks. This is an important direction to explore and my apologies for the
late reply.
One thing that is really hard about this is that with different layers of
abstraction, we often use other libraries that might allocate large amounts
of memory (e.g. the snappy library, Parquet itself), which makes
Thanks Holden !
On Thu, Aug 3, 2017 at 4:02 AM, Holden Karau wrote:
The memory overhead is based less on the total amount of data and more on
what you end up doing with the data (e.g. if you're doing a lot of off-heap
processing or using Python you need to increase it). Honestly most people
find this number for their job "experimentally" (e.g. they
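In practice that experiment usually just means bumping the overhead setting and resubmitting; a minimal sketch with arbitrary values (the 2048 MB figure is only an example, not a recommendation from this thread):
```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.yarn.executor.memoryOverhead", "2048")  // MB; newer releases use spark.executor.memoryOverhead
```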
spark.memory.fraction setting.
number of partitions = 674
Cluster: 455 GB total memory, VCores: 288, Nodes: 17
Given / tried memory config: executor-mem = 16g, num-executor=10, executor
cores=6, driver mem=4g
spark.default.parallelism=1000
spark.sql.shuffle.partitions=1000
spark.yarn.executor.memoryOverhead
en columns you
normally use to filter when reading the table. I generally recommend the
second approach because it handles skew and prepares the data for more
efficient reads.
If that doesn't help, then you should look at your memory settings. When
you're getting killed by YARN, you sho
> Container killed by YARN for exceeding memory limits. 14.0 GB of 14 GB
> physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>
> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8
>
> Do you think below setting can help me to
1. For executor memory, we have spark.executor.memory for the heap size and
spark.memory.offHeap.size for the off-heap size, and these two together are the total
memory consumption of each executor process.
From the user side, what they always care about is the total memory consumption, no
matter whether it is on
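A minimal sketch of setting the two knobs together; the sizes are arbitrary examples:
```
import org.apache.spark.SparkConf

// Total executor footprint is roughly heap + off-heap (plus overhead).
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")           // on-heap (JVM heap)
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")       // off-heap managed by the unified memory manager
```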
https://issues.apache.org/jira/browse/SPARK-21157
Hi - often times, Spark applications are killed for overrunning available
memory by YARN, Mesos, or the OS. In SPARK-21157, I propose a design for
grabbing and reporting "total memory" usage for Spark executors - that is,
memory usage
Hi Naga,
Is it failing because the driver memory is full or the executor memory is full?
Can you please try setting the spark.cleaner.ttl property, so that
older RDDs/metadata also get cleared automatically?
Can you please provide the complete error stacktrace and a code snippet?
Regards
Hi,
I am trying to load a 1.6 MB Excel file which has 16 tabs. We converted the Excel
file to CSV and loaded the 16 CSV files into 8 tables. The job ran successfully on
the 1st run in PySpark. When trying to run the same job a 2nd time, the container
gets killed due to memory issues.
I am using unpersist and clearCache
ocess(--executor-memory 30G), as follow:
test@test Online ~ $ ps aux | grep CoarseGrainedExecutorBackend
test 105371 106 21.5 67325492 42621992 ? Sl 15:20 55:14
/home/test/service/jdk/bin/java -cp
/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/serv
This isn't related to the progress bar, it just happened while in that
section of code. Something else is taking memory in the driver, usually a
broadcast table or something else that requires a lot of memory and happens
on the driver.
You should check your driver memory settings and the
the spark version is 2.1.0
From: 方孝健(玄弟)
Sent: Friday, 10 February 2017, 12:35
To: spark-dev; spark-user
Subject: Driver hung and ran out of memory while writing to console progress bar
[Stage 172
[Stage 172:==> (10328 + 93) / 16144]
[Stage 172:==> (10329 + 93) / 16144]
[Stage 172:==> (10330 + 93) / 16144]
[Stage 172:==>
ate like this:
<http://apache-spark-developers-list.1001551.n3.nabble.com/file/n20881/QQ20170207-212340.png>
The excess off-heap memory may be caused by these abnormal threads.
This problem occurs only when writing data to Hadoop (tasks may be killed
by the executor during writing).
Could
Hi,
Just to throw a few zlotys into the conversation, I believe that Spark
Standalone does not enforce any memory checks to limit or even kill
executors that go beyond their requested memory (as YARN does). I also found that
memory does not have much use while scheduling tasks, and only CPU
matters.
My
> BTW, we still can create the regular data source tables and insert the
> data into the tables. The major difference is whether the metadata is
> persistently stored or not.
>
> Thanks,
> Xiao Li
Hi Stan,
What OS/version are you using?
Michael
I'm using Parallel GC.
could this be related to SPARK-18787?
On Sun, Jan 22, 2017 at 1:45 PM, Reynold Xin wrote:
I think this is something we are going to change to completely decouple the
Hive support and catalog.
Are you using G1 GC? G1 sometimes uses a lot more memory than the size
allocated.
On Sun, Jan 22, 2017 at 12:58 AM StanZhai wrote:
Hi all,
Currently when the in-memory catalog is used, e.g. through `--conf
spark.sql.catalogImplementation=in-memory`, we can create a persistent
table, but inserting into this table would fail with the error message "Hive
support is required to insert into the following tables..".
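A minimal sketch of reproducing what is described above in spark-shell (as reported for the 2.1-era in-memory catalog); the table name and schema are made up:
```
// Started with: spark-shell --conf spark.sql.catalogImplementation=in-memory
spark.sql("CREATE TABLE t_demo (id INT) USING parquet")  // creating the table works
spark.sql("INSERT INTO t_demo VALUES (1)")               // this is the step reported to fail
```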
Hi all,
We just upgraded our Spark from 1.6.2 to 2.1.0.
Our Spark application is started by spark-submit with a config of
`--executor-memory 35G` in standalone mode, but the actual memory use goes up
to 65G after a full GC (jmap -histo:live $pid), as follows:
test@c6 ~ $ ps aux | grep
Hi Nick,
The scope of the PR I submitted is reduced because we can't make sure if it
is really the root cause of the error you faced. You can check out the
discussion on the PR. So I can just change the assert in the code as shown
in the PR.
If you can have a repro, we can go back to see if it i
Subject: Re: Reduce memory usage of UnsafeInMemorySorter
Unfortunately, I don't have a repro, and I'm only seeing this at scale.
But I was able to get around the issue by fiddling with the distribution
of my data before asking GraphFrames to process it. (I think that's where
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L156
Regards,
Kazuaki Ishizaki
From: Reynold Xin
To: Nicholas Chammas
Cc: Spark dev list
Date: 2016/12/07 14:27
Subject: Re: Reduce memory usage of UnsafeInMemorySorter
This is not supposed to happen. Do
My refined question now: How can I ensure that
UnsafeInMemorySorter has room to insert new records? In other words, how
can I ensure that hasSpaceForAnotherRecord() returns a true value?
Do I need:
- More, smaller partitions?
- More memory per executor?
- Some Java or Spark option enabled?
- etc.
I'm running Spark 2.0.2 on Java 7 and YARN. Would Java 8 help here?
+Cheng
Hi Reynold,
I think you are referring to bucketing in in-memory columnar cache.
I am proposing that if we have a parquet structure like the following:
//file1/id=1/
//file1/id=2/
and if we read and cache it, it should create 2 RDD[CachedBatch] (one per
value of "id").
Is thi
It's already there isn't it? The in-memory columnar cache format.
On Thu, Nov 24, 2016 at 9:06 PM, Nitin Goyal wrote:
Hi,
Do we have any plan to support parquet-like partitioning in the
Spark SQL in-memory cache? Something like one RDD[CachedBatch] per
in-memory cache partition.
-Nitin
I'm also curious about this. Is there something we can do to help
troubleshoot these leaks and file useful bug reports?
On Wed, Oct 12, 2016 at 4:33 PM vonnagy wrote:
> I am getting excessive memory leak warnings when running multiple mapping
> and aggregations and using DataSets. Is there anything I should be looking
> for
With predError.zip(input) we get RDD data, so we could just do a sample on
predError or input; but if we did, we couldn't use zip (the number of elements
must be the same in each partition). Thank you!
Sent: Wednesday, 16 November 2016, 3:54 AM
To: "WangJianfei"
Subject: Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0
Thanks for the suggestion. That would be faster, but less accurate in
most cases. It's generally bet
When we train the model, we will use the data with a subSampleRate, so if the
subSampleRate < 1.0, we can do a sample first to reduce the memory usage.
See the code below in GradientBoostedTrees.boost():
while (m < numIterations && !doneLearning) {
  // Update data with pseudo-
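For illustration, a minimal sketch of the "sample first" idea using the plain RDD API; the toy input, rate and seed are placeholders, not the actual GradientBoostedTrees code:
```
// Toy stand-in for the training data; in GradientBoostedTrees this would be an RDD[LabeledPoint].
val input = sc.parallelize(1 to 1000000)
val subsamplingRate = 0.5

// Draw the subsample once, up front, so every boosting iteration only touches the smaller RDD.
val sampled =
  if (subsamplingRate < 1.0) input.sample(withReplacement = false, fraction = subsamplingRate, seed = 42L)
  else input
sampled.cache()
```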