Raymond Hettinger <raymond.hettin...@gmail.com> added the comment:

Thanks for the link.  This is a worthwhile experiment.  However, the potential 
gains will be hard to come by.

The workload of LOAD_CONST is very small.  After paying the usual dispatch 
logic overhead, all it does is one indexed load through a struct member and an 
incref.  Both the co_consts table and the popular constant objects are already 
likely to be in the L1 data cache.


        ##DEBUG_LABEL: TARGET_LOAD_CONST
        movslq  %r15d, %rax             ## OpArg fetch, typically a zero-cost register rename
        movq    -368(%rbp), %rcx        ## 8-byte Reload to access co_consts
        movq    24(%rcx,%rax,8), %rax   ## The actual indexing operation (3 cycles)
        incq    (%rax)                  ## The incref

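To make the memory traffic concrete, here is a toy C model of what that inner 
body boils down to.  The Obj and Code structs are made-up stand-ins for 
PyObject and the code object, pared down to just the fields involved; this is 
a sketch, not the real ceval code:

        #include <stdint.h>

        /* Made-up stand-ins for PyObject and the code object, pared
           down to the fields the opcode actually touches. */
        typedef struct { intptr_t ob_refcnt; } Obj;
        typedef struct { Obj **co_consts; } Code;

        static Obj *load_const(const Code *code, int oparg)
        {
            Obj *value = code->co_consts[oparg];  /* two dependent loads:
                                                     the table pointer,
                                                     then the element */
            value->ob_refcnt++;                   /* the incref */
            return value;
        }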

A specialized opcode for a specific constant like None can 1) eliminate the 
oparg fetch (likely saving nothing), and 2) eliminate the two sequentially 
dependent memory accesses (this is a win):

        ##DEBUG_LABEL: TARGET_LOAD_NONE
        movq    __Py_NoneStruct@GOTPCREL(%rip), %rax
        incq    (%rax)                  ## The incref

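In the same toy model, a hypothetical LOAD_NONE body would reduce to the 
following (none_singleton is a made-up stand-in for _Py_NoneStruct):

        /* Hypothetical LOAD_NONE, reusing the toy Obj struct above.
           The operand is baked into the opcode, so there is no oparg
           fetch and no walk through co_consts. */
        static Obj none_singleton;         /* stand-in for _Py_NoneStruct */

        static Obj *load_none(void)
        {
            Obj *value = &none_singleton;  /* PC-relative/GOT address, no
                                              dependent load chain */
            value->ob_refcnt++;            /* the incref */
            return value;
        }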

Any more general opcode for loading small ints would still need the oparg fetch 
and the incref.  To win, it would need to convert the oparg into the int object 
more efficiently than the two movq steps above.  If the small int table is at a 
fixed location (not per-subinterpreter), then you can save 2 cycles with a 
simpler address computation:

        ##DEBUG_LABEL: TARGET_SMALLINT
        movslq  %r15d, %rax             ## OpArg fetch, typically a zero-cost register rename
        movq    __Py_SmallInt@GOTPCREL(%rip), %rcx   ## Find an array of ints
        movq    (%rcx,%rax,8), %rax     ## Cheaper address computation takes 1 cycle
        incq    (%rax)                  ## The incref

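In the toy model, that hypothetical opcode body would look something like this 
(small_int_table is a made-up name; 262 entries would cover CPython's usual 
-5..256 range of preallocated ints):

        /* Hypothetical LOAD_SMALLINT, reusing the toy Obj struct above.
           The table base is a link-time constant, so no per-frame
           pointer has to be reloaded before the indexed load. */
        static Obj *small_int_table[262];  /* made-up fixed-address table */

        static Obj *load_smallint(int oparg)
        {
            Obj *value = small_int_table[oparg];  /* single indexed load */
            value->ob_refcnt++;                   /* the incref */
            return value;
        }
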
The 2-cycle win (Intel-only) will be partially offset by the additional 
pressure on the L1 data cache.  Right now, co_consts is almost certainly in 
cache, holding only the constants that actually get used (at 8 pointers per 
64-byte cache line).  Accesses into a small int array will push other data out 
of L1.

IIRC, Serhiy already experimented with a LOAD_NONE opcode and couldn't get a 
measurable win.

----------
nosy: +serhiy.storchaka

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45152>
_______________________________________