
Tail-call interpreter dispatch #120

Merged · 15 commits · Oct 4, 2021
Conversation

malcolmstill (Owner) commented Oct 3, 2021

Description

See #114 and ziglang/zig#8220

In comparison to #118, this PR:
a) works!
b) retains the original 3-stack system

#118 attempts to use more parameters in the tail-call functions in an effort to get the compiler to keep them in registers. To minimise the number of parameters, I was trying to unify the 3 stacks into a single stack that mixes control flow and operands (much like the native stack). While that works to an extent (fib runs), the testsuite isn't passing, and I need to spend more time figuring out how to implement the failing cases in the single-stack worldview.

#118 also tends to degrade the code quality, mostly through the contortions needed to get the stack pointer into the function parameters. Admittedly #118 is a bit faster than this PR, but I prefer the code of this PR.

To reiterate:

  • we now have tail-call interpreter dispatch. This gives us a nice speed boost on fib(39): over 2x, from 26 seconds to 12 seconds. I think this is mostly unrelated to branch prediction (on which I'll say more below) and boils down to the sheer quantity of code: the compiler does better with many small, separate functions than with one giant switch statement
  • the code quality mostly remains. Due to ziglang/zig#5692 (broken LLVM module found: musttail call must precede a ret with an optional bitcast), we lose our try and resort to passing an optional pointer to a custom error union WasmError. Hopefully that issue gets fixed and we can go back to try, though there may be some reason why that won't be possible. On balance I can live with the optional pointer.

Performance

This PR

perf stat -d -d -d ./fib:

fib(39) = 63245986

 Performance counter stats for './fib':

         12,397.63 msec task-clock                #    1.000 CPUs utilized          
               131      context-switches          #   10.567 /sec                   
                 2      cpu-migrations            #    0.161 /sec                   
                24      page-faults               #    1.936 /sec                   
    39,424,853,935      cycles                    #    3.180 GHz                      (28.56%)
    10,592,820,823      stalled-cycles-frontend   #   26.87% frontend cycles idle     (28.56%)
    78,865,250,519      instructions              #    2.00  insn per cycle         
                                                  #    0.13  stalled cycles per insn  (35.71%)
     7,680,828,125      branches                  #  619.540 M/sec                    (35.72%)
       358,732,112      branch-misses             #    4.67% of all branches          (35.72%)
    35,630,774,927      L1-dcache-loads           #    2.874 G/sec                    (28.57%)
         2,375,000      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (14.29%)
           393,521      LLC-loads                 #   31.742 K/sec                    (14.29%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
         2,973,606      L1-icache-load-misses                                         (21.44%)
    35,648,628,162      dTLB-loads                #    2.875 G/sec                    (21.42%)
           242,155      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (14.28%)
           144,986      iTLB-loads                #   11.695 K/sec                    (14.28%)
           145,246      iTLB-load-misses          #  100.18% of all iTLB cache accesses  (21.41%)
   <not supported>      L1-dcache-prefetches                                        
           385,581      L1-dcache-prefetch-misses #   31.101 K/sec                    (28.55%)

      12.399759040 seconds time elapsed

      12.304415000 seconds user
       0.002956000 seconds sys

Previous master

fib(39) = 63245986

 Performance counter stats for './fib':

         25,520.93 msec task-clock                #    1.000 CPUs utilized          
               113      context-switches          #    4.428 /sec                   
                10      cpu-migrations            #    0.392 /sec                   
                26      page-faults               #    1.019 /sec                   
    81,344,938,138      cycles                    #    3.187 GHz                      (28.57%)
    31,119,585,302      stalled-cycles-frontend   #   38.26% frontend cycles idle     (28.57%)
   128,751,238,578      instructions              #    1.58  insn per cycle         
                                                  #    0.24  stalled cycles per insn  (35.71%)
    16,658,200,203      branches                  #  652.727 M/sec                    (35.71%)
       809,936,529      branch-misses             #    4.86% of all branches          (35.71%)
    60,359,256,646      L1-dcache-loads           #    2.365 G/sec                    (28.56%)
         4,012,985      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (14.29%)
           200,827      LLC-loads                 #    7.869 K/sec                    (14.29%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
         4,631,769      L1-icache-load-misses                                         (21.43%)
    60,383,842,949      dTLB-loads                #    2.366 G/sec                    (21.43%)
           356,086      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (14.29%)
           449,370      iTLB-loads                #   17.608 K/sec                    (14.29%)
       255,509,966      iTLB-load-misses          # 56859.60% of all iTLB cache accesses  (21.43%)
   <not supported>      L1-dcache-prefetches                                        
           422,578      L1-dcache-prefetch-misses #   16.558 K/sec                    (28.57%)

      25.522272275 seconds time elapsed

      25.356742000 seconds user
       0.000960000 seconds sys

PR for #118:

fib(39) = 63245986

 Performance counter stats for './fib':

         11,623.40 msec task-clock                #    1.000 CPUs utilized          
                51      context-switches          #    4.388 /sec                   
                 6      cpu-migrations            #    0.516 /sec                   
                28      page-faults               #    2.409 /sec                   
    37,004,194,045      cycles                    #    3.184 GHz                      (28.57%)
     9,688,790,344      stalled-cycles-frontend   #   26.18% frontend cycles idle     (28.57%)
    67,639,813,775      instructions              #    1.83  insn per cycle         
                                                  #    0.14  stalled cycles per insn  (35.72%)
     6,928,175,897      branches                  #  596.054 M/sec                    (35.73%)
       355,266,757      branch-misses             #    5.13% of all branches          (35.74%)
    32,694,252,761      L1-dcache-loads           #    2.813 G/sec                    (28.56%)
         1,729,924      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (14.28%)
           249,298      LLC-loads                 #   21.448 K/sec                    (14.28%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
         2,342,573      L1-icache-load-misses                                         (21.42%)
    32,659,282,307      dTLB-loads                #    2.810 G/sec                    (21.41%)
           195,095      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (14.28%)
           144,515      iTLB-loads                #   12.433 K/sec                    (14.28%)
            96,345      iTLB-load-misses          #   66.67% of all iTLB cache accesses  (21.42%)
   <not supported>      L1-dcache-prefetches                                        
           177,433      L1-dcache-prefetch-misses #   15.265 K/sec                    (28.56%)

      11.624573685 seconds time elapsed

      11.543878000 seconds user
       0.000997000 seconds sys

My understanding of the above perfs:

  • This PR is twice as fast as master. Tail call interpreter dispatch + unified stack #118 is faster still (but not by a huge amount)
  • master doesn't seem to suffer from branch misprediction, despite having a single dispatch point; i.e. this machine's branch predictor isn't stupid.
  • None of the variants appear to suffer much from L1 cache misses.
  • So more than anything else, I think it boils down to the amount of machine code executed by each variant: the instruction counts are roughly in proportion to the relative speeds.
  • This implies we can get more performance by optimising the generated machine code (i.e. trying to shrink the function bodies)

Future work

  • can we trim down the existing instruction functions? I feel like we're doing more work than is required for a validated wasm module (e.g. inside call or call_indirect)
  • experiment with different parameters in the tail-call functions; this may improve the quality of the generated code (i.e. make the function bodies smaller)
  • experiment with splitting the opcodes out from the opcode metadata. At the moment, as part of the parsing stage, we take the opcodes and add metadata to speed up execution (i.e. we're not directly executing bytecode; rather we're iterating over an in-memory representation of that bytecode). I've been wondering whether storing the opcodes separately from the metadata might have different cache performance characteristics (though possibly in a negative direction).
  • with this PR a jump table is created, so we indirectly jump to the instruction listed in the table. We should experiment with storing a stream of function addresses instead of a stream of opcodes, and jumping directly to those addresses.
  • iterate on Tail call interpreter dispatch + unified stack #118
  • try out the language features proposed in ziglang/zig#8220 (labeled continue syntax inside a switch expression) when they land

@malcolmstill malcolmstill changed the title Malcolm/tail call 2 Tail-call interpreter dispatch Oct 4, 2021
@malcolmstill malcolmstill merged commit 4abcadc into master Oct 4, 2021
@malcolmstill malcolmstill deleted the malcolm/tail-call-2 branch October 4, 2021 20:44
@malcolmstill malcolmstill mentioned this pull request Nov 14, 2021