Github user jerryshao commented on the pull request:
https://github.com/apache/spark/pull/5064#issuecomment-119508679
Hi @tdas and @andrewor14 , I ran extensive tests on the memory consumption of
`TaskMetrics` and its related `_hostname` string.
Here I used `DirectKafkaWordCount` as the test workload with the task number set
to 1 as a minimal setting, on a standalone cluster with 1 master + 1 slave.
According to my profiling of the driver process with JProfiler, the number of
live `TaskMetrics` instances is at least around 2000 (even with full GC
triggered); you can refer to this chart:

So if we linearly scale up the task number, say to 1000 (for a medium-scale
cluster), we get at least 1000 * 2000 = 2M `TaskMetrics` objects, and with the
previous code also 2M `_hostname` objects. If each `_hostname` occupies 64
bytes, that is 128M of memory taken by `_hostname` objects alone, proportional
to the number of tasks and `TaskMetrics` instances.
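A quick back-of-the-envelope check of the arithmetic above (a throwaway Java snippet for illustration; the 1000-task, 2000-instance, and 64-byte figures are the assumed values from this comment, not measured constants):

```java
// Sanity-check the memory estimate: tasks * TaskMetrics instances per
// task * bytes per _hostname String.
public class HostnameMemory {
    public static void main(String[] args) {
        long tasks = 1000L;           // assumed medium-scale task count
        long metricsPerTask = 2000L;  // observed TaskMetrics instances at task number 1
        long bytesPerHostname = 64L;  // assumed per-String footprint

        long instances = tasks * metricsPerTask;
        long totalBytes = instances * bytesPerHostname;

        System.out.println(instances);  // prints 2000000
        System.out.println(totalBytes); // prints 128000000, i.e. the 128M above
    }
}
```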
With my implementation, the memory occupied by `_hostname` is instead
proportional to the cluster size, independent of the task number and the number
of `TaskMetrics` instances: if we have 1000 nodes in the cluster, the total
`_hostname` footprint is 1000 * 64 bytes plus one additional hashmap.
So this change does reduce memory consumption (though not by a huge amount),
and the effect is more evident in long-running and large-scale cases.
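The caching idea can be sketched roughly like this (a hypothetical `HostnameCache` in Java purely for illustration, not the actual PR code): every `TaskMetrics` asks the cache for a canonical `String` per host, so duplicate hostname strings collapse to one instance per node.

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of hostname interning: all TaskMetrics-like objects
// share one canonical String per host, so _hostname memory scales with
// the number of distinct hosts, not the number of metrics objects.
public class HostnameCache {
    private static final ConcurrentHashMap<String, String> cache =
        new ConcurrentHashMap<>();

    // Return the canonical instance for this hostname, inserting it on
    // first sight. Thread-safe via putIfAbsent.
    public static String get(String hostname) {
        String canonical = cache.putIfAbsent(hostname, hostname);
        return canonical == null ? hostname : canonical;
    }

    public static void main(String[] args) {
        // Two distinct String objects with equal contents...
        String a = HostnameCache.get(new String("worker-1"));
        String b = HostnameCache.get(new String("worker-1"));
        // ...resolve to the same cached reference.
        System.out.println(a == b); // prints "true"
    }
}
```

The one extra hashmap mentioned above corresponds to the `cache` here: its size is bounded by the number of distinct hostnames, i.e. the cluster size.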