
Tail-call interpreter dispatch #120

Merged · 15 commits · Oct 4, 2021
Conversation

malcolmstill (Owner) commented Oct 3, 2021

Description

See #114 and ziglang/zig#8220

In comparison to #118, this PR:
a) works!
b) retains the original 3-stack system

#118 attempts to use more parameters in the tail-call functions in an effort to get the compiler to keep them in registers. To minimise the number of parameters, I was trying to unify the 3 stacks into a single stack that mixes control flow and operands (much like the native stack). While that works to an extent (fib runs), the testsuite isn't passing, and I need to spend more time figuring out how to implement the failing cases in the single-stack worldview.

#118 also tends to degrade the code quality, mostly through the contortions needed to get the stack pointer into the function parameters. Admittedly #118 is a bit faster than this PR, but I prefer the code of this PR.

To reiterate:

  • we now have tail-call interpreter dispatch. This gives us a nice speed boost on fib(39): over 2x, from 26 seconds to 12 seconds. I think this is mostly unrelated to branch prediction (on which I'll say more below) and boils down to the sheer quantity of code: the compiler does better with many small, separate functions than with one giant switch statement
  • the code quality mostly remains. Due to ziglang/zig#5692 (broken LLVM module found: musttail call must precede a ret with an optional bitcast), we lose our try and resort to passing an optional pointer to a custom error union WasmError. Hopefully that issue gets fixed and we can go back to try, though there may be some reason why that won't be possible. On balance I can live with the optional pointer.

Performance

This PR

perf stat -d -d -d ./fib:

fib(39) = 63245986

 Performance counter stats for './fib':

         12,397.63 msec task-clock                #    1.000 CPUs utilized          
               131      context-switches          #   10.567 /sec                   
                 2      cpu-migrations            #    0.161 /sec                   
                24      page-faults               #    1.936 /sec                   
    39,424,853,935      cycles                    #    3.180 GHz                      (28.56%)
    10,592,820,823      stalled-cycles-frontend   #   26.87% frontend cycles idle     (28.56%)
    78,865,250,519      instructions              #    2.00  insn per cycle         
                                                  #    0.13  stalled cycles per insn  (35.71%)
     7,680,828,125      branches                  #  619.540 M/sec                    (35.72%)
       358,732,112      branch-misses             #    4.67% of all branches          (35.72%)
    35,630,774,927      L1-dcache-loads           #    2.874 G/sec                    (28.57%)
         2,375,000      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (14.29%)
           393,521      LLC-loads                 #   31.742 K/sec                    (14.29%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
         2,973,606      L1-icache-load-misses                                         (21.44%)
    35,648,628,162      dTLB-loads                #    2.875 G/sec                    (21.42%)
           242,155      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (14.28%)
           144,986      iTLB-loads                #   11.695 K/sec                    (14.28%)
           145,246      iTLB-load-misses          #  100.18% of all iTLB cache accesses  (21.41%)
   <not supported>      L1-dcache-prefetches                                        
           385,581      L1-dcache-prefetch-misses #   31.101 K/sec                    (28.55%)

      12.399759040 seconds time elapsed

      12.304415000 seconds user
       0.002956000 seconds sys

Previous master

fib(39) = 63245986

 Performance counter stats for './fib':

         25,520.93 msec task-clock                #    1.000 CPUs utilized          
               113      context-switches          #    4.428 /sec                   
                10      cpu-migrations            #    0.392 /sec                   
                26      page-faults               #    1.019 /sec                   
    81,344,938,138      cycles                    #    3.187 GHz                      (28.57%)
    31,119,585,302      stalled-cycles-frontend   #   38.26% frontend cycles idle     (28.57%)
   128,751,238,578      instructions              #    1.58  insn per cycle         
                                                  #    0.24  stalled cycles per insn  (35.71%)
    16,658,200,203      branches                  #  652.727 M/sec                    (35.71%)
       809,936,529      branch-misses             #    4.86% of all branches          (35.71%)
    60,359,256,646      L1-dcache-loads           #    2.365 G/sec                    (28.56%)
         4,012,985      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (14.29%)
           200,827      LLC-loads                 #    7.869 K/sec                    (14.29%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
         4,631,769      L1-icache-load-misses                                         (21.43%)
    60,383,842,949      dTLB-loads                #    2.366 G/sec                    (21.43%)
           356,086      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (14.29%)
           449,370      iTLB-loads                #   17.608 K/sec                    (14.29%)
       255,509,966      iTLB-load-misses          # 56859.60% of all iTLB cache accesses  (21.43%)
   <not supported>      L1-dcache-prefetches                                        
           422,578      L1-dcache-prefetch-misses #   16.558 K/sec                    (28.57%)

      25.522272275 seconds time elapsed

      25.356742000 seconds user
       0.000960000 seconds sys

PR for #118:

fib(39) = 63245986

 Performance counter stats for './fib':

         11,623.40 msec task-clock                #    1.000 CPUs utilized          
                51      context-switches          #    4.388 /sec                   
                 6      cpu-migrations            #    0.516 /sec                   
                28      page-faults               #    2.409 /sec                   
    37,004,194,045      cycles                    #    3.184 GHz                      (28.57%)
     9,688,790,344      stalled-cycles-frontend   #   26.18% frontend cycles idle     (28.57%)
    67,639,813,775      instructions              #    1.83  insn per cycle         
                                                  #    0.14  stalled cycles per insn  (35.72%)
     6,928,175,897      branches                  #  596.054 M/sec                    (35.73%)
       355,266,757      branch-misses             #    5.13% of all branches          (35.74%)
    32,694,252,761      L1-dcache-loads           #    2.813 G/sec                    (28.56%)
         1,729,924      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (14.28%)
           249,298      LLC-loads                 #   21.448 K/sec                    (14.28%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
         2,342,573      L1-icache-load-misses                                         (21.42%)
    32,659,282,307      dTLB-loads                #    2.810 G/sec                    (21.41%)
           195,095      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (14.28%)
           144,515      iTLB-loads                #   12.433 K/sec                    (14.28%)
            96,345      iTLB-load-misses          #   66.67% of all iTLB cache accesses  (21.42%)
   <not supported>      L1-dcache-prefetches                                        
           177,433      L1-dcache-prefetch-misses #   15.265 K/sec                    (28.56%)

      11.624573685 seconds time elapsed

      11.543878000 seconds user
       0.000997000 seconds sys

My understanding of the above perfs:

  • This PR is twice as fast as master. Tail call interpreter dispatch + unified stack #118 is faster still (but not by a huge amount)
  • master doesn't seem to suffer from branch misprediction, despite having a single dispatch point; i.e. this machine's branch predictor isn't stupid.
  • None of the variants appear to suffer much from L1 cache misses.
  • So more than anything else, I think it boils down to the amount of machine code executed by each variant: the instruction counts are roughly in proportion to the relative speeds.
  • This implies we can get more performance by optimising the generated machine code (i.e. trying to shrink the function bodies)

Future work

  • can we trim down the existing instruction functions? I feel like we're doing more work than is required for a validated wasm module (e.g. inside call or call_indirect)
  • experiment with different parameters in the tail-call functions; this may improve the quality of the generated code (i.e. make the function bodies smaller)
  • experiment with splitting the opcodes out from the opcode metadata. At the moment, as part of the parsing stage, we take the opcodes and add metadata to speed up execution (i.e. we're not directly executing bytecode; rather we're iterating over an in-memory representation of that bytecode). I've been wondering whether storing the opcodes separately from the metadata might have different cache performance characteristics (though possibly in a negative direction).
  • with this PR a jump table is created, so we indirectly jump to the instruction listed in the table. We should experiment with storing a stream of function addresses instead of a stream of opcodes, and jumping directly to those addresses.
  • iterate on Tail call interpreter dispatch + unified stack #118
  • try out the language features proposed in ziglang/zig#8220 (labeled continue syntax inside a switch expression) when they land

@malcolmstill malcolmstill changed the title Malcolm/tail call 2 Tail-call interpreter dispatch Oct 4, 2021
@malcolmstill malcolmstill merged commit 4abcadc into master Oct 4, 2021
@malcolmstill malcolmstill deleted the malcolm/tail-call-2 branch October 4, 2021 20:44
@malcolmstill malcolmstill mentioned this pull request Nov 14, 2021