On Thu, Apr 19, 2018 at 04:37:23PM +0100, Bruce Richardson wrote: > On Thu, Apr 19, 2018 at 08:48:33PM +0530, Pavan Nikhilesh wrote: > > On Thu, Apr 19, 2018 at 01:09:58PM +0100, Bruce Richardson wrote: > > > On Thu, Apr 19, 2018 at 02:50:52PM +0530, Pavan Nikhilesh wrote: > > > > On Wed, Apr 18, 2018 at 07:03:06PM +0100, Ferruh Yigit wrote: > > > > > On 4/18/2018 6:55 PM, Pavan Nikhilesh wrote: > > > > > > On Wed, Apr 18, 2018 at 06:43:11PM +0100, Ferruh Yigit wrote: > > > > > >> On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote: > > > > > >>> Add macro to mark a variable to be mostly read only and place it > > > > > >>> in a > > > > > >>> separate section. > > > > > >>> > > > > > >>> Signed-off-by: Pavan Nikhilesh <pbhagavat...@caviumnetworks.com> > > > > > >>> --- > > > > > >>> > > > > > >>> Group together mostly read only data to avoid cacheline > > > > > >>> bouncing, also > > > > > >>> useful for auditing purposes. > > > > > >>> > > > > > >>> lib/librte_eal/common/include/rte_common.h | 5 +++++ > > > > > >>> 1 file changed, 5 insertions(+) > > > > > >>> > > > > > >>> diff --git a/lib/librte_eal/common/include/rte_common.h > > > > > >>> b/lib/librte_eal/common/include/rte_common.h > > > > > >>> index 6c5bc5a76..f2ff2e9e6 100644 > > > > > >>> --- a/lib/librte_eal/common/include/rte_common.h > > > > > >>> +++ b/lib/librte_eal/common/include/rte_common.h > > > > > >>> @@ -114,6 +114,11 @@ static void > > > > > >>> __attribute__((constructor(prio), used)) func(void) > > > > > >>> */ > > > > > >>> #define __rte_noinline __attribute__((noinline)) > > > > > >>> > > > > > >>> +/** > > > > > >>> + * Mark a variable to be mostly read only and place it in a > > > > > >>> separate section. > > > > > >>> + */ > > > > > >>> +#define __rte_read_mostly > > > > > >>> __attribute__((__section__(".read_mostly"))) > > > > > >> > > > > > > > > > > > > Hi Ferruh, > > > > > > > > > > > >> Hi Pavan, > > > > > >> > > > > > >> Is the section ".read_mostly" treated specially [1] or is this > > > > > >> just for grouping > > > > > >> symbols together (to reduce cacheline bouncing)? > > > > > > > > > > > > The section .read_mostly is not treated specially it's just for > > > > > > grouping > > > > > > symbols. > > > > > > > > > > I have encounter with a blog post claiming this is not working: > > > > > > > > > > " > > > > > The problem with the above approach is that once all the > > > > > __read_mostly variables > > > > > are grouped into one section, the remaining "non-read-mostly" > > > > > variables end-up > > > > > together too. This increases the chances that two frequently used > > > > > elements (in > > > > > the "non-read-mostly" region) will end-up competing for the same > > > > > position (or > > > > > cache-line, the basic fixed-sized block for memory<-->cache > > > > > transfers) in the > > > > > cache. Thus frequent accesses will cause excessive cache thrashing on > > > > > that > > > > > particular cache-line thereby degrading the overall system > > > > > performance. > > > > > " > > > > > > > > > > https://thecodeartist.blogspot.com/2011/12/why-readmostly-does-not-work-as-it.html > > > > > > > > > > > > > The author is concerned about processors with less cache > > > > set-associativity, > > > > almost all modern processors have >= 16 way set associativity. And the > > > > above > > > > issue can happen even now when two frequently written global variables > > > > are > > > > placed next to each other. > > > > > > > > Currently, we don't have much control over how the global variables are > > > > arranged and a single addition/deletion to the global variables causes > > > > change > > > > in alignment and in some cases minor performance regression. > > > > Tagging them as __read_mostly we can easily identify the alignment > > > > changes > > > > across builds by comparing map files global variable section. > > > > > > > > I have verified the patch-set on arm64 (16-way set-associative) and > > > > didn't > > > > notice any performance regression. > > > > Did you have a chance to verify if there is any performance regression? > > > > > > > Is there a performance improvement? It's seems a relatively strange change > > > to me, so I'd like to know that it really improves performance in test > > > cases. > > > > We had a performance regression of ~200k between 17.11 and 18.02 due > > enabling > > dpaa/dpaa2 in default config this was due to new global variables being > > added > > and changing the alignment. > > Moving read mostly global variables (logtypes/device arrays) to a separate > > section helps when tracking performance regression between builds. > > > So it's of use when debugging, rather than providing a performance boost in > and of itself, right? > > If performance regressions are appearing, should we then see about marking > globals with __rte_cache_align to force them all onto difference > cachelines?
I think that would be a bit of overkill considering the number of logtype variables. Currently there are 29 global variables not found in any map file List of globals ("['fw_file', 'mode_8023ad_ports', 'ecore_mz_count', 'igb_filter_rss_list', " "'crc16_ccitt_pmull', 'igb_filter_flex_list', 'rte_crypto_devices', " "'igb_flow_list', 'qman_clk', 'bman_ccsr_map', 'dpaa_portal_key', " "'bman_ip_rev', 'cons_filter', 'rte_rawdevices', 'internal_config', " "'bman_pool_max', 'qman_version', 'qman_ccsr_map', 'rte_event_devices', " "'qman_ip_rev', 'netcfg_interface', 'igb_filter_ntuple_list', " "'igb_filter_ethertype_list', 'skeldev_init_once', 'igb_filter_syn_list', " "'crc32_eth_pmull', 'global_portals_used', 'virtio_hw_internal', " "'vhost_devices']") These are stats without including the logtype variables across all drivers. With logtype variables included there are 70+.