Re: spark executor pod has same memory value for request and limit

2023-03-14 Thread Martin Andersson
There is a very good reason for this. With k8s it is recommended that you set the memory request and limit to the same value, set a CPU request, but no CPU limit. More info here https://home.robusta.dev/blog/kubernetes-memory-limit BR, Martin From: Mich
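A minimal sketch of that pattern using the configs discussed in this thread (values are illustrative, not a recommendation):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: spark.executor.memory (plus overhead) drives both the pod's memory
// request and limit; setting only request.cores and leaving
// spark.kubernetes.executor.limit.cores unset yields a CPU request with no CPU limit.
val spark = SparkSession.builder()
  .appName("k8s-resource-sketch")
  .config("spark.executor.memory", "4g")                   // memory request == limit
  .config("spark.kubernetes.executor.request.cores", "1")  // CPU request only, no limit
  .getOrCreate()
```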

Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Mich Talebzadeh
> The issue here is that there is no parameter to set the executor pod memory request value. > Currently we have only one parameter, spark.executor.memory, and it sets both the pod resource limits and requests. > Mich Talebzadeh wrote on Fri, 10 Mar 2023 at 22:04:

Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Ismail Yenigul
Hi Mich, The issue here is that there is no parameter to set the executor pod memory request value. Currently we have only one parameter, spark.executor.memory, and it sets both the pod resource limit and request. Mich Talebzadeh wrote on Fri, 10 Mar 2023 at 22:04: > Yes, both EKS and

Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Ismail Yenigul
BasicExecutorFeatureStep.scala#L194 >> .editOrNewResources() >> .addToRequests("memory", executorMemoryQuantity) >> .addToLimits("memory", executorMemoryQuantity) >> .addToRequests("cpu", executorCpuQuantity) >> .addToLimits(executorResourceQuantities.asJava)

Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Mich Talebzadeh
Ismail Yenigul wrote: > and if you look at the code > https://github.com/apache/spark/blob/e64262f417bf381bdc664dfd1cbcfaa5aa7221fe/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L194 > .editOrNewResourc

Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Bjørn Jørgensen
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L194 > .editOrNewResources() > .addToRequests("memory", executorMemoryQuantity) > .addToLimits("memory", executorMemoryQuantity) > .addToRequ

Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Ismail Yenigul
and If you look at the code https://github.com/apache/spark/blob/e64262f417bf381bdc664dfd1cbcfaa5aa7221fe/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L194 .editOrNewResources() .addToRequests("m
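For readers unfamiliar with the builder calls quoted above, here is a standalone sketch of the fabric8 Kubernetes client API they come from; the values and variable names are illustrative, not Spark's:

```scala
import io.fabric8.kubernetes.api.model.{Quantity, ResourceRequirementsBuilder}

// The same Quantity is added to both requests and limits for memory, which is why
// the executor pod ends up with identical values; CPU only gets a request here.
val memory = new Quantity("4Gi")
val cpu    = new Quantity("1")

val resources = new ResourceRequirementsBuilder()
  .addToRequests("memory", memory)
  .addToLimits("memory", memory)
  .addToRequests("cpu", cpu)
  .build()
```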

Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Ismail Yenigul
On Fri, 10 Mar 2023 at 17:39, Ismail Yenigul wrote: > Hi, > There is a CPU parameter to set for the Spark executor on k8s, spark.kubernetes.executor.limit.cores and spark.kubernetes.executor.request.cores, but there is no parameter to set the memory request differen

Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Mich Talebzadeh
monetary damages arising from such loss, damage or destruction. On Fri, 10 Mar 2023 at 17:39, Ismail Yenigul wrote: > Hi, > There is a CPU parameter to set for the Spark executor on k8s, spark.kubernetes.executor.limit.cores and spark.kubernetes.executor.request.cores, but there

spark executor pod has same memory value for request and limit

2023-03-10 Thread Ismail Yenigul
Hi, There is a CPU parameter to set for the Spark executor on k8s, spark.kubernetes.executor.limit.cores and spark.kubernetes.executor.request.cores, but there is no parameter to set a memory request different from the memory limit (such as spark.kubernetes.executor.request.memory). For that reason

Re: memory module of yarn container

2022-08-25 Thread Yang,Jie(INF)
Hi vtygoss, As I recall, memoryOverhead in Spark 2.3 includes all memory that is not executor on-heap memory, including the memory used by Spark's offHeapMemoryPool (executorOffHeapMemory; this concept also exists in Spark 2.3), the PySpark worker, PipeRDD, the Netty memory pool, JVM

memory module of yarn container

2022-08-25 Thread vtygoss
Hi, community! I noticed a change in the YARN container memory module between spark-2.3.0 and spark-3.2.1 when requesting containers from YARN. org.apache.spark.deploy.yarn.Client.scala # verifyClusterResources ``` spark-2.3.0 val executorMem = executorMemory + executorMemoryOverhead
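A sketch of the 2.3.0 arithmetic quoted above; wrapping it in a check against the cluster's maximum container allocation is an assumption about what verifyClusterResources does, not a quote from the file:

```scala
// spark-2.3.0 style: the requested container size is heap plus overhead only.
def verifyExecutorFits(executorMemoryMiB: Long,
                       executorMemoryOverheadMiB: Long,
                       maxContainerMiB: Long): Unit = {
  val executorMem = executorMemoryMiB + executorMemoryOverheadMiB
  require(executorMem <= maxContainerMiB,
    s"Required executor memory ($executorMemoryMiB MB) plus overhead " +
    s"($executorMemoryOverheadMiB MB) exceeds the cluster's max container size ($maxContainerMiB MB)")
}
```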

Re: Sizing the driver & executor cores and memory in Kubernetes cluster

2021-12-16 Thread Mich Talebzadeh
Following on from the above, I did some tests, leaving 1 vCPU out of 4 vCPUs to the OS on each node (container & executor) in a three-node GKE cluster. The RAM allocated to each node was 16GB. I then set the initial container AND executor memory to 10% of RAM and incremented thes

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas
Thanks for the suggestions. I suppose I should share a bit more about what I tried/learned, so others who come later can understand why a memory-efficient, exact median is not in Spark. Spark's own ApproximatePercentile also uses QuantileSummaries internally <https://github.com/apache/sp

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Fitch, Simeon
Percentile expression <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39> highlights that it's very memory-intensive and can easily lead to OutOfM

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Sean Owen
trying to create a new aggregate function. It's my first time working with Catalyst, so it's exciting---but I'm also in a bit over my head. My goal is to create a function to calculate the median

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Pol Santamaria
the median <https://issues.apache.org/jira/browse/SPARK-26589>. As a very simple solution, I could just define median to be an alias of `Percentile(col, 0.5)`. However, the leading comment on the Percentile expression

Sizing the driver & executor cores and memory in Kubernetes cluster

2021-12-14 Thread Mich Talebzadeh
Hi, I have a three-node k8s cluster (GKE) in Google Cloud with E2 standard machines that have 4 GB of system memory per vCPU, giving 4 vCPUs and 16,384 MB of RAM. Optimum sizing of the number of executors, CPU and memory allocation is important here. These are the assumptions: 1. You want to

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
alias of `Percentile(col, 0.5)`. However, the leading comment on the Percentile expression <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentil

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Reynold Xin
leading comment on the Percentile expression (https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39) highlights that it's very memory-intensive and

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
`Percentile(col, 0.5)`. However, the leading comment on the Percentile expression <https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala#L37-L39> highlight

Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas
aggregate/Percentile.scala#L37-L39> highlights that it's very memory-intensive and can easily lead to OutOfMemory errors. So instead of using Percentile, I'm trying to create an Expression that calculates the median without needing to hold everything in memory at once. I'm considering t
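As the thread concludes, an exact distributed median is inherently memory-hungry; for comparison, a minimal sketch of the approximate route already available in Spark SQL (assuming a local session):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("median-sketch").getOrCreate()
import spark.implicits._

// percentile_approx uses bounded-size quantile summaries instead of collecting
// all values, trading exactness for memory safety.
val df = (1 to 1000).toDF("x")
df.selectExpr("percentile_approx(x, 0.5) AS approx_median").show()
```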

[DISCUSS] Reducing memory usage of toPandas with Arrow "self_destruct" option

2020-09-10 Thread David Li
Hello all, We've been working with PySpark and Pandas, and have found that to convert a dataset using N bytes of memory to Pandas, we need to have 2N bytes free, even with the Arrow optimization enabled. The fundamental reason is ARROW-3789[1]: Arrow does not free the Arrow table until conve

Re: [Spark Core] Merging PR #23340 for New Executor Memory Metrics

2020-06-30 Thread Alex Scammon
Subject: Re: [Spark Core] Merging PR #23340 for New Executor Memory Metrics Hi, Alex and Michel. I removed the `Stale` label and reopened it for now. You may want to ping the original author because the last update of that PR was one year ago and it has many conflicts as of today. Bests, Dongjoo

Re: [Spark Core] Merging PR #23340 for New Executor Memory Metrics

2020-06-30 Thread Dongjoon Hyun
From: Michel Sumbul Sent: Thursday, June 25, 2020 11:48 AM To: dev@spark.apache.org; Alex Scammon <alex.scam...@ext.gresearch.co.uk> Subject: Re: [Spark Core] Merging PR #23340 for New Executor Memory Metrics Hey Dev team, I agreed wit

Re: [Spark Core] Merging PR #23340 for New Executor Memory Metrics

2020-06-30 Thread Alex Scammon
AM To: dev@spark.apache.org; Alex Scammon Subject: Re: [Spark Core] Merging PR #23340 for New Executor Memory Metrics Hey Dev team, I agreed with Alex, these metrics can be really useful to tune jobs. Any chance someone can have a look at it? Thanks, Michel On Monday, 22 June 2020 at 22:48:23

[Spark Core] Merging PR #23340 for New Executor Memory Metrics

2020-06-22 Thread Alex Scammon
Hi there devs, Congrats on Spark 3.0.0, that's great to see. I'm hoping to get some eyes on something old, however: * https://github.com/apache/spark/pull/23340 I'm really just trying to get some eyes on this PR and see if we can still move it forward. I reached out to the reviewers of th

Spark dataframe creation through already distributed in-memory data sets

2020-06-16 Thread Tanveer Ahmad - EWI
Hi all, I am new to the Spark community. Please ignore if this question doesn't make sense. My PySpark DataFrame takes just a fraction of the time (in ms) for 'Sorting', but moving the data is much more expensive (> 14 sec). Explanation: I have a huge Arrow RecordBatches collection which is equally

is there any mentrics to show the usage of executor on memory or CPU

2020-05-15 Thread zhangliyun
Hi all: I want to ask a question about metrics that show whether an executor is fully using its memory. I always see the following in the log, which I guess means I did not fully use the executor's memory, but I don't want to open the log to check. Is there any metric to sho

Out Of Memory while reading a table partition from HIVE

2019-05-17 Thread Shivam Sharma
Hi All, I am getting Out Of Memory due to GC overhead while reading a table from Hive in Spark like: spark.sql("SELECT * FROM some.table where date='2019-05-14' LIMIT 10").show() When I run the above command in spark-shell it starts processing *1780 tasks*

Memory leak in SortMergeJoin

2019-04-26 Thread Tao L
Hi all, I've been hitting this issue, and hoping to get some traction going at: https://issues.apache.org/jira/browse/SPARK-21492 and PR: https://github.com/apache/spark/pull/23762 If SortMergeJoinScanner doesn't consume the iterator from UnsafeExternalRowSorter entirely, the m

Re: Tungsten Memory Consumer

2019-02-12 Thread Jack Kolokasis
Hello, I am sorry my first explanation was not concrete; I will explain further about TaskMemoryManager. TaskMemoryManager manages the execution memory of each task as follows: 1. MemoryConsumer is the entry point for the Spark task to run. MemoryConsumer requests

Re: Tungsten Memory Consumer

2019-02-11 Thread Wenchen Fan
What do you mean by "Tungsten Consumer"? On Fri, Feb 8, 2019 at 6:11 PM Jack Kolokasis wrote: > Hello all, > I am studying the Tungsten project and I am wondering when Spark creates a Tungsten consumer. While I am running some applications, I see that Spark creates a Tungsten Consumer

Tungsten Memory Consumer

2019-02-08 Thread Jack Kolokasis
Hello all, I am studying the Tungsten project and I am wondering when Spark creates a Tungsten consumer. While running some applications, I see that Spark creates a Tungsten Consumer, while in other applications it does not (using the same configuration). When does this happen? I am looking

Numpy memory not being released in executor map-partition function (memory leak)

2018-11-20 Thread joshlk_
I believe I have uncovered a strange interaction between pySpark, Numpy and Python which produces a memory leak. I wonder if anyone has any ideas of what the issue could be? I have the following minimal working example ( gist of code <https://gist.github.com/jos

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
appreciated! Peter. On Fri, Oct 19, 2018 at 9:38 AM Peter Rudenko wrote: > Hey Peter, in the SparkRDMA shuffle plugin (https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file, to do Remote Dir

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Oct 19, 2018 at 9:38 AM Peter Rudenko wrote: > Hey Peter, in the SparkRDMA shuffle plugin (https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file, to do Remote Direct Memory Access. If the shuffle data is bigger than RAM, Mellanox NIC

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
> Hey Peter, in the SparkRDMA shuffle plugin (https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file, to do Remote Direct Memory Access. If the shuffle data is bigger than RAM, Mellanox NICs support On Demand Paging, where the OS invalidates translations which are

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hey Peter, in the SparkRDMA shuffle plugin (https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file, to do Remote Direct Memory Access. If the shuffle data is bigger than RAM, Mellanox NICs support On Demand Paging, where the OS invalidates translations which are no longer valid d

Re: Spark In Memory Shuffle / 5403

2018-10-18 Thread Peter Liu
I would be very interested in the initial question here: is there a production-level, configurable implementation of memory-only shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels) as mentioned in this ticket, https://github.com/apache/spark/pull/5403? It would be

Off Heap Memory

2018-09-11 Thread Jack Kolokasis
Hello, I recently started studying Spark's memory management system. More specifically, I want to understand how Spark uses off-heap memory. Internally I saw that there are two types of off-heap memory (offHeapExecutionMemoryPool and offHeapStorageMemoryPool). How Spark us

Re: Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
s._2.split("[^A-Za-z']+".replaceAll("""\n"""," "))) Thanks On Sat, Aug 25, 2018 at 3:38 PM Chetan Khatri wrote: > Hello Spark Dev Community, > > Friend of mine is facing issue while reading 20 GB of log files from > Directory on Cl

Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
Hello Spark Dev Community, A friend of mine is facing an issue while reading 20 GB of log files from a directory on the cluster. The approaches are as below: *1. This gives an out of memory error.* val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*") val
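One common workaround for that first approach (a sketch under the assumption that per-line processing is acceptable; sc is an existing SparkContext):

```scala
// wholeTextFiles materializes each file as a single (path, contents) record, so a
// large log file becomes one huge string on one executor. textFile reads line by
// line and spreads the records across partitions.
val lines = sc.textFile("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val words = lines.flatMap(_.split("[^A-Za-z']+"))
println(words.count())
```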

Off Heap Memory

2018-08-24 Thread Jack Kolokasis
Hello, I recently started studying Spark's memory management system. My question is about the offHeapExecutionMemoryPool and offHeapStorageMemoryPool. 1. How does Spark use the offHeapExecutionMemoryPool? 2. How is the off-heap memory used (I understand the allocation side), but

offheap memory usage & netty configuration

2018-07-26 Thread Imran Rashid
I've been looking at where untracked memory is getting used in Spark, especially off-heap memory, and I've discovered some things I'd like to share with the community. Most of what I've learned has been about the way Spark is using Netty -- I'll go into some more detail about that below. I'm also

Why can per-task memory only reach 1 / numTasks, and not more than 1 / numTasks, in ExecutionMemoryPool?

2018-06-05 Thread John Fang
In fact, not all tasks belong to the same stage, so the memory each task needs may be different. For example, the executor is running two tasks (A and B), and the ExecutionMemoryPool owns 1000M. We could hope that task-A occupies 900M and task-B occupies 100M, because task-A needs much
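For reference, a small sketch of the per-task bound that ExecutionMemoryPool enforces (as described in its code comments), applied to the 1000M / two-task example above:

```scala
// With N active tasks, each task can get at most poolSize / N of execution memory
// and is guaranteed at least poolSize / (2 * N) before it is forced to spill.
val poolSize   = 1000L // MB
val n          = 2     // active tasks A and B
val maxPerTask = poolSize / n       // 500 MB -> task-A cannot reach 900 MB
val minPerTask = poolSize / (2 * n) // 250 MB guaranteed before spilling
println(s"max=$maxPerTask MB, min=$minPerTask MB")
```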

Re: cache OS memory and spark usage of it

2018-04-10 Thread Jose Raul Perez Rodriguez
It was helpful. So, the OS needs to feel some pressure from applications requesting memory before it frees some of the memory cache? Under exactly which circumstances does the OS free that memory to give it to the applications requesting it? I mean, if the total memory is 16GB and 10GB are used for the OS cache

cache OS memory and spark usage of it

2018-04-10 Thread José Raúl Pérez Rodríguez
Hi, When I issue a "free -m" command on a host, I see a lot of memory used by the OS for cache; however, Spark Streaming is not able to request that memory for its own use, and execution fails because it is not able to launch executors. What I understand of the OS memory cache (

pyspark.daemon exhaust a lot of memory

2018-04-09 Thread Niu Zhaojie
Hi All, We are running Spark 2.1.1 on Hadoop YARN 2.6.5. We found the pyspark.daemon process consumes more than 300GB of memory. However, according to https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals, the daemon process shouldn't have this problem. Also, we find the d

RDD checkpoint failures in case of insufficient memory

2018-03-14 Thread Sergey Zhemzhitsky
Hi there, A while ago, running GraphX jobs, I discovered that PeriodicRDDCheckpointer fails with FileNotFoundExceptions in case of insufficient memory resources. I believe that any iterative job which uses PeriodicRDDCheckpointer (like ML) suffers from the same issue (but it

Re: Faster and Lower memory implementation toPandas

2017-11-20 Thread gmcrosh
implemented for this too. I agree that serializing the data to a pandas dataframe or numpy array is faster and less memory intensive. -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

Re: Faster and Lower memory implementation toPandas

2017-11-16 Thread Reynold Xin
Please send a PR. Thanks for looking at this. On Thu, Nov 16, 2017 at 7:27 AM Andrew Andrade wrote: > Hello devs, > > I know a lot of great work has been done recently with pandas to spark > dataframes and vice versa using Apache Arrow, but I faced a specific pain > point on a l

Faster and Lower memory implementation toPandas

2017-11-16 Thread Andrew Andrade
Hello devs, I know a lot of great work has been done recently with pandas to spark dataframes and vice versa using Apache Arrow, but I faced a specific pain point on a low memory setup without Arrow. Specifically I was finding a driver OOM running a toPandas on a small dataset (<100

Broadcast Memory Management

2017-09-20 Thread Matthias Boehm
Hi all, could someone please help me understand the broadcast life cycle in detail, especially with regard to memory management? After reading through the TorrentBroadcast implementation, it seems that for every broadcast object, the driver holds a strong reference to a shallow copy (in
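A minimal sketch of the user-facing side of that life cycle (assuming an existing SparkContext sc); the internal torrent-block bookkeeping the question is about sits behind these calls:

```scala
// Executors fetch torrent blocks lazily on first access of bc.value; the driver
// keeps its own copy of the broadcast value.
val bc = sc.broadcast(Array.fill(1 << 20)(1))
val total = sc.parallelize(1 to 8, 8).map(_ => bc.value.length).reduce(_ + _)

bc.unpersist(blocking = true) // drop executor-side copies; re-fetched if used again
bc.destroy()                  // release driver-side state too; bc is unusable afterwards
```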

Re: Total memory tracking: request for comments

2017-09-20 Thread Reynold Xin
Thanks. This is an important direction to explore and my apologies for the late reply. One thing that is really hard about this is that with different layers of abstractions, we often use other libraries that might allocate large amount of memory (e.g. snappy library, Parquet itself), which makes

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-03 Thread Chetan Khatri
Thanks Holden ! On Thu, Aug 3, 2017 at 4:02 AM, Holden Karau wrote: > The memory overhead is based less on the total amount of data and more on > what you end up doing with the data (e.g. if your doing a lot of off-heap > processing or using Python you need to increase it). Hone

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Holden Karau
The memory overhead is based less on the total amount of data and more on what you end up doing with the data (e.g. if you're doing a lot of off-heap processing or using Python you need to increase it). Honestly most people find this number for their job "experimentally" (e.g. they

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
spark.memory.fraction setting. number of partitions = 674 Cluster: 455 GB total memory, VCores: 288, Nodes: 17 Given / tried memory config: executor-mem = 16g, num-executor=10, executor cores=6, driver mem=4g spark.default.parallelism=1000 spark.sql.shuffle.partitions=1000 spark.yarn.executor.memoryOverhead

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ryan Blue
the columns you normally use to filter when reading the table. I generally recommend the second approach because it handles skew and prepares the data for more efficient reads. If that doesn't help, then you should look at your memory settings. When you're getting killed by YARN, you sho

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
Container killed by YARN for exceeding memory limits. 14.0 GB of 14 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > Driver memory=4g, executor mem=12g, num-executors=8, executor cores=8 > Do you think the settings below can help me to

Improvement for memory config.

2017-06-30 Thread jinxing
1. For executor memory, we have spark.executor.memory for the heap size and spark.memory.offHeap.size for the off-heap size, and these two together are the total memory consumption for each executor process. From the user side, what they care about is the total memory consumption, no matter whether it is on
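A sketch of the two settings being discussed, with illustrative values:

```scala
import org.apache.spark.SparkConf

// The process's total usage is roughly the sum of on-heap and off-heap (plus
// overhead), which is the number users actually care about.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")          // on-heap size
  .set("spark.memory.offHeap.enabled", "true") // required for the off-heap pool to be used
  .set("spark.memory.offHeap.size", "2g")      // off-heap size
```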

Total memory tracking: request for comments

2017-06-20 Thread Jose Soltren
https://issues.apache.org/jira/browse/SPARK-21157 Hi - often times, Spark applications are killed for overrunning available memory by YARN, Mesos, or the OS. In SPARK-21157, I propose a design for grabbing and reporting "total memory" usage for Spark executors - that is, memory usage

Re: Memory issue in pyspark for 1.6 mb file

2017-06-17 Thread Pralabh Kumar
Hi Naga, Is it failing because driver memory is full or executor memory is full? Can you please try setting the property spark.cleaner.ttl, so that older RDDs/metadata also get cleared automatically? Can you please provide the complete error stacktrace and a code snippet? Regards

Memory issue in pyspark for 1.6 mb file

2017-06-17 Thread Naga Guduru
Hi, I am trying to load a 1.6 MB Excel file which has 16 tabs. We converted the Excel file to CSV and loaded 16 CSV files into 8 tables. The job ran successfully on the 1st run in PySpark. When trying to run the same job a 2nd time, the container gets killed due to memory issues. I am using unpersist and clearCache

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-13 Thread StanZhai
process (--executor-memory 30G), as follows: test@test Online ~ $ ps aux | grep CoarseGrainedExecutorBackend test 105371 106 21.5 67325492 42621992 ? Sl 15:20 55:14 /home/test/service/jdk/bin/java -cp /home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/serv

Re: Driver hung and happened out of memory while writing to console progress bar

2017-02-10 Thread Ryan Blue
This isn't related to the progress bar, it just happened while in that section of code. Something else is taking memory in the driver, usually a broadcast table or something else that requires a lot of memory and happens on the driver. You should check your driver memory settings and the

Re: Driver hung and happened out of memory while writing to console progress bar

2017-02-09 Thread John Fang
The Spark version is 2.1.0. -- From: 方孝健(玄弟) Sent: Friday, 10 February 2017, 12:35 To: spark-dev; spark-user Subject: Driver hung and happened out of memory while writing to console progress bar [Stage 172

Driver hung and happened out of memory while writing to console progress bar

2017-02-09 Thread John Fang
[Stage 172:==> (10328 + 93) / 16144] [Stage 172:==> (10329 + 93) / 16144] [Stage 172:==> (10330 + 93) / 16144] [Stage 172:==>

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-07 Thread StanZhai
state like this: <http://apache-spark-developers-list.1001551.n3.nabble.com/file/n20881/QQ20170207-212340.png> The excess off-heap memory may be caused by these abnormal threads. This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing). Could

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-03 Thread Jacek Laskowski
Hi, Just to throw a few zlotys into the conversation: I believe that Spark Standalone does not enforce any memory checks to limit or even kill executors that go beyond requested memory (the way YARN does). I also found that memory does not have much use while scheduling tasks and only CPU matters. My

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-02 Thread StanZhai
On Jan 22, 2017, at 11:36 PM, StanZhai wrote: >> I'm using Parallel GC. >> rxin wrote: >>> Are you using G1 GC? G1 sometimes uses a lot more memory than the size allocated. >>> On Sun, Jan 22, 201

Re: A question about creating persistent table when in-memory catalog is used

2017-01-26 Thread Shuai Lin
> BTW, we still can create the regular data source tables and insert the data into the tables. The major difference is whether the metadata is persistently stored or not. > Thanks,

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-23 Thread Michael Allman
Hi Stan, What OS/version are you using? Michael > On Jan 22, 2017, at 11:36 PM, StanZhai wrote: > > I'm using Parallel GC. > rxin wrote >> Are you using G1 GC? G1 sometimes uses a lot more memory than the size >> allocated. >> >> >> On Sun,

Re: A question about creating persistent table when in-memory catalog is used

2017-01-23 Thread Xiao Li
Xiao Li. 2017-01-22 11:14 GMT-08:00 Reynold Xin: > I think this is something we are going to change to completely decouple the Hive support and catalog.

Re: A question about creating persistent table when in-memory catalog is used

2017-01-23 Thread Shuai Lin
difference is whether the metadata is persistently stored or not. Thanks, Xiao Li. 2017-01-22 11:14 GMT-08:00 Reynold Xin: I think this is something we are going to ch

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-22 Thread StanZhai
I'm using Parallel GC. rxin wrote: > Are you using G1 GC? G1 sometimes uses a lot more memory than the size allocated. > On Sun, Jan 22, 2017 at 12:58 AM StanZhai wrote: >> Hi all, >> We j

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-22 Thread Koert Kuipers
could this be related to SPARK-18787? On Sun, Jan 22, 2017 at 1:45 PM, Reynold Xin wrote: > Are you using G1 GC? G1 sometimes uses a lot more memory than the size > allocated. > > > On Sun, Jan 22, 2017 at 12:58 AM StanZhai wrote: > >> Hi all, >> >> >>

Re: A question about creating persistent table when in-memory catalog is used

2017-01-22 Thread Xiao Li
2017-01-22 11:14 GMT-08:00 Reynold Xin: >> I think this is something we are going to change to completely decouple the Hive support and catalog. >> On Sun, Jan 22, 2017 at 4:51 AM Shuai Lin wrote: >> Hi all,

Re: A question about creating persistent table when in-memory catalog is used

2017-01-22 Thread Reynold Xin
Hive support and catalog. > > > On Sun, Jan 22, 2017 at 4:51 AM Shuai Lin wrote: > > Hi all, > > Currently when the in-memory catalog is used, e.g. through `--conf > spark.sql.catalogImplementation=in-memory`, we can create a persistent > table, but inserting into th

Re: A question about creating persistent table when in-memory catalog is used

2017-01-22 Thread Xiao Li
wrote: > Hi all, > Currently when the in-memory catalog is used, e.g. through `--conf spark.sql.catalogImplementation=in-memory`, we can create a persistent table, but inserting into this table would fail with the error message "Hive support is required to inser

Re: A question about creating persistent table when in-memory catalog is used

2017-01-22 Thread Reynold Xin
I think this is something we are going to change to completely decouple the Hive support and catalog. On Sun, Jan 22, 2017 at 4:51 AM Shuai Lin wrote: > Hi all, > > Currently when the in-memory catalog is used, e.g. through `--conf > spark.sql.catalogImplementation=in-memory`, we

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-22 Thread Reynold Xin
Are you using G1 GC? G1 sometimes uses a lot more memory than the size allocated. On Sun, Jan 22, 2017 at 12:58 AM StanZhai wrote: > Hi all, > > > > We just upgraded our Spark from 1.6.2 to 2.1.0. > > > > Our Spark application is started by spark-submit with confi

A question about creating persistent table when in-memory catalog is used

2017-01-22 Thread Shuai Lin
Hi all, Currently when the in-memory catalog is used, e.g. through `--conf spark.sql.catalogImplementation=in-memory`, we can create a persistent table, but inserting into this table would fail with error message "Hive support is required to insert into the following tables..". s
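A sketch of the scenario being described, assuming a session started with --conf spark.sql.catalogImplementation=in-memory:

```scala
// Creating the table succeeds with the in-memory catalog...
spark.sql("CREATE TABLE t (id INT) USING parquet")
// ...but the insert fails with "Hive support is required to insert into the
// following tables...", which is the behaviour this thread is asking about.
spark.sql("INSERT INTO t VALUES (1)")
```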

Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-22 Thread StanZhai
Hi all, We just upgraded our Spark from 1.6.2 to 2.1.0. Our Spark application is started by spark-submit with a config of `--executor-memory 35G` in standalone mode, but the actual memory use is up to 65G after a full GC (jmap -histo:live $pid), as follows: test@c6 ~ $ ps aux | grep

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-20 Thread Liang-Chi Hsieh
Hi Nick, The scope of the PR I submitted is reduced because we can't make sure if it is really the root cause of the error you faced. You can check out the discussion on the PR. So I can just change the assert in the code as shown in the PR. If you can have a repro, we can go back to see if it i

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-08 Thread Kazuaki Ishizaki
Subject:Re: Reduce memory usage of UnsafeInMemorySorter Unfortunately, I don't have a repro, and I'm only seeing this at scale. But I was able to get around the issue by fiddling with the distribution of my data before asking GraphFrames to process it. (I think that's where

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-07 Thread Nicholas Chammas
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L156 > Regards, > Kazuaki Ishizaki > From: Reynold Xin > To: Nicholas Chammas > Cc: Spark dev list > D

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-07 Thread Kazuaki Ishizaki
org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L156 Regards, Kazuaki Ishizaki From: Reynold Xin To: Nicholas Chammas Cc: Spark dev list Date: 2016/12/07 14:27 Subject:Re: Reduce memory usage of UnsafeInMemorySorter This is not supposed to happen. Do

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-06 Thread Reynold Xin
How can I ensure that hasSpaceForAnotherRecord() returns a true value? > Do I need: > - More, smaller partitions? > - More memory per executor? > - Some Java or Spark option enabled? > - etc. > I’m running Spark 2.0.2 on Java 7 and YARN. Would Java 8 help here? > (

Reduce memory usage of UnsafeInMemorySorter

2016-12-06 Thread Nicholas Chammas
refined question now: How can I ensure that UnsafeInMemorySorter has room to insert new records? In other words, how can I ensure that hasSpaceForAnotherRecord() returns a true value? Do I need: - More, smaller partitions? - More memory per executor? - Some Java or Spark option enabled

Re: Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-28 Thread Nitin Goyal
+Cheng Hi Reynold, I think you are referring to bucketing in in-memory columnar cache. I am proposing that if we have a parquet structure like following :- //file1/id=1/ //file1/id=2/ and if we read and cache it, it should create 2 RDD[CachedBatch] (each per value of "id") Is thi

Re: Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-24 Thread Reynold Xin
It's already there isn't it? The in-memory columnar cache format. On Thu, Nov 24, 2016 at 9:06 PM, Nitin Goyal wrote: > Hi, > > Do we have any plan of supporting parquet-like partitioning support in > Spark SQL in-memory cache? Something like one RDD[CachedBatch]

Parquet-like partitioning support in spark SQL's in-memory columnar cache

2016-11-24 Thread Nitin Goyal
Hi, Do we have any plan of supporting parquet-like partitioning support in Spark SQL in-memory cache? Something like one RDD[CachedBatch] per in-memory cache partition. -Nitin

Re: Memory leak warnings in Spark 2.0.1

2016-11-23 Thread Nicholas Chammas
cham...@gmail.com> wrote: > > I'm also curious about this. Is there something we can do to help > troubleshoot these leaks and file useful bug reports? > > On Wed, Oct 12, 2016 at 4:33 PM vonnagy wrote: > > I am getting excessive memory leak warnings when running multipl

Re: Memory leak warnings in Spark 2.0.1

2016-11-22 Thread Reynold Xin
troubleshoot these leaks and file useful bug reports? > On Wed, Oct 12, 2016 at 4:33 PM vonnagy wrote: >> I am getting excessive memory leak warnings when running multiple mapping and aggregations and using DataSets. Is there anything I should be looking for

Re: Memory leak warnings in Spark 2.0.1

2016-11-21 Thread Nicholas Chammas
I'm also curious about this. Is there something we can do to help troubleshoot these leaks and file useful bug reports? On Wed, Oct 12, 2016 at 4:33 PM vonnagy wrote: > I am getting excessive memory leak warnings when running multiple mapping > and > aggregations and using Data

Re: Reduce the memory usage if we do a sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread WangJianfei
With predError.zip(input) we get an RDD, so we could just do a sample on predError or input; but if so, we can't use zip (the number of elements must be the same in each partition). Thank you!
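A tiny sketch of the zip constraint mentioned above (sc is an existing SparkContext):

```scala
// RDD.zip requires the same number of partitions AND the same number of elements
// per partition, so independently sampling only one side breaks that invariant.
val predError = sc.parallelize(1 to 100, 4)
val input     = sc.parallelize(101 to 200, 4)

predError.zip(input).count()                        // fine: counts and partitioning match
// predError.sample(false, 0.5).zip(input).count()  // would usually fail at runtime
```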

Re: Reduce the memory usage if we do a sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread WangJianfei
Sent: Wednesday, 16 November 2016, 3:54 AM To: "WangJianfei" Subject: Re: Reduce the memory usage if we do a sample first in GradientBoostedTrees if subsamplingRate < 1.0 -- Thanks for the suggestion. That would be faster, but less accurate in most cases. It's generally bet

Re: Reduce the memory usage if we do a sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread Joseph Bradley
wrote: > When we train the model, we will use the data with a subSampleRate, so if the subSampleRate < 1.0, we can do a sample first to reduce the memory usage. > See the code below in GradientBoostedTrees.boost(): > while (m < numIterations && !doneLearning)

Reduce the memory usage if we do a sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-11 Thread WangJianfei
When we train the model, we will use the data with a subSampleRate, so if the subSampleRate < 1.0, we can do a sample first to reduce the memory usage. See the code below in GradientBoostedTrees.boost(): while (m < numIterations && !doneLearning) { // Update data with pseudo-
