Github user jerryshao commented on the pull request:
https://github.com/apache/spark/pull/5064#issuecomment-119508679
Hi @tdas and @andrewor14 , I ran extensive tests on the memory consumption of
`TaskMetrics` and its related `_hostname` string.
Here I used `DirectKafkaWordCount` as the test workload with the task number set
to 1 as a minimal setting, on a standalone cluster with 1 master + 1 slave.
According to my profiling of the driver process with JProfiler, the number of
live `TaskMetrics` instances is at least around 2000 (even with full GC
triggered); you can refer to this chart:

So if we linearly scale up the task number, say to 1000 (for a medium-scale
cluster), we get at least 1000 * 2000 = 2M `TaskMetrics` objects, and with the
previous code also 2M `_hostname` objects. If each `_hostname` occupies 64
bytes, that is 128M of memory taken by `_hostname` objects alone, proportional
to the number of tasks and `TaskMetrics` instances.
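A quick back-of-the-envelope check of the arithmetic above (a throwaway Java snippet for illustration; the 1000-task, 2000-instance, and 64-byte figures are the assumed values from this comment, not measured constants):

```java
// Sanity-check the memory estimate: tasks * TaskMetrics instances per
// task * bytes per _hostname String.
public class HostnameMemory {
    public static void main(String[] args) {
        long tasks = 1000L;           // assumed medium-scale task count
        long metricsPerTask = 2000L;  // observed TaskMetrics instances at task number 1
        long bytesPerHostname = 64L;  // assumed per-String footprint

        long instances = tasks * metricsPerTask;
        long totalBytes = instances * bytesPerHostname;

        System.out.println(instances);  // prints 2000000
        System.out.println(totalBytes); // prints 128000000, i.e. the 128M above
    }
}
```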
With my implementation, the memory occupied by `_hostname` is instead
proportional to the cluster size, independent of the task number and the number
of `TaskMetrics` instances: if we have 1000 nodes in the cluster, the total
`_hostname` footprint is 1000 * 64 bytes plus one additional hashmap.
So this change does reduce memory consumption (though not by a huge amount),
and the effect is more evident in long-running and large-scale cases.
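The caching idea can be sketched roughly like this (a hypothetical `HostnameCache` in Java purely for illustration, not the actual PR code): every `TaskMetrics` asks the cache for a canonical `String` per host, so duplicate hostname strings collapse to one instance per node.

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of hostname interning: all TaskMetrics-like objects
// share one canonical String per host, so _hostname memory scales with
// the number of distinct hosts, not the number of metrics objects.
public class HostnameCache {
    private static final ConcurrentHashMap<String, String> cache =
        new ConcurrentHashMap<>();

    // Return the canonical instance for this hostname, inserting it on
    // first sight. Thread-safe via putIfAbsent.
    public static String get(String hostname) {
        String canonical = cache.putIfAbsent(hostname, hostname);
        return canonical == null ? hostname : canonical;
    }

    public static void main(String[] args) {
        // Two distinct String objects with equal contents...
        String a = HostnameCache.get(new String("worker-1"));
        String b = HostnameCache.get(new String("worker-1"));
        // ...resolve to the same cached reference.
        System.out.println(a == b); // prints "true"
    }
}
```

The one extra hashmap mentioned above corresponds to the `cache` here: its size is bounded by the number of distinct hostnames, i.e. the cluster size.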