Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use AVX2 in minmax_element vectorization #4659

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

AlexGuteniev
Copy link
Contributor

Resolves #2803

This is not final optimization. At least, we should use AVX masks here too.
But this one is complex enough already, so the rest would be follow-up PR(s).

I also notice that Both_val 8-bit and 16-bit cases are slow.
The vectorization for them is not engaged, it is a separate issue from the AVX.

Benchmark results Before:
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>              248 ns         96.3 ns      7466667
bm<uint8_t, 8021, Op::Max>              238 ns         96.9 ns      6291661
bm<uint8_t, 8021, Op::Both>             371 ns          124 ns      8432941
bm<uint8_t, 8021, Op::Min_val>          131 ns         56.6 ns     16000000
bm<uint8_t, 8021, Op::Max_val>          128 ns         67.0 ns      7466667
bm<uint8_t, 8021, Op::Both_val>        4197 ns         1611 ns       407273
bm<uint16_t, 8021, Op::Min>             459 ns          226 ns      4072727
bm<uint16_t, 8021, Op::Max>             445 ns          200 ns      3521123
bm<uint16_t, 8021, Op::Both>            685 ns          331 ns      2357895
bm<uint16_t, 8021, Op::Min_val>         247 ns          122 ns      4480000
bm<uint16_t, 8021, Op::Max_val>         252 ns          105 ns      7466667
bm<uint16_t, 8021, Op::Both_val>       4239 ns         3299 ns       203636
bm<uint32_t, 8021, Op::Min>             979 ns          364 ns      1544828
bm<uint32_t, 8021, Op::Max>             932 ns          586 ns      1120000
bm<uint32_t, 8021, Op::Both>           1439 ns          921 ns       746667
bm<uint32_t, 8021, Op::Min_val>         501 ns          321 ns      1947826
bm<uint32_t, 8021, Op::Max_val>         494 ns          414 ns      2036364
bm<uint32_t, 8021, Op::Both_val>        673 ns          500 ns      1000000
bm<uint64_t, 8021, Op::Min>            4252 ns         3128 ns       344615
bm<uint64_t, 8021, Op::Max>            4360 ns         2567 ns       280000
bm<uint64_t, 8021, Op::Both>           4397 ns         2783 ns       213333
bm<uint64_t, 8021, Op::Min_val>        3844 ns         2441 ns       320000
bm<uint64_t, 8021, Op::Max_val>        3857 ns         2518 ns       235789
bm<uint64_t, 8021, Op::Both_val>       3862 ns         2319 ns       235789
bm<int8_t, 8021, Op::Min>               246 ns          176 ns      3733333
bm<int8_t, 8021, Op::Max>               235 ns          179 ns      4977778
bm<int8_t, 8021, Op::Both>              361 ns          239 ns      3200000
bm<int8_t, 8021, Op::Min_val>           126 ns         89.3 ns     11200000
bm<int8_t, 8021, Op::Max_val>           128 ns         94.2 ns      7466667
bm<int8_t, 8021, Op::Both_val>         3842 ns         2441 ns       224000
bm<int16_t, 8021, Op::Min>              460 ns          321 ns      2968994
bm<int16_t, 8021, Op::Max>              445 ns          308 ns      2133333
bm<int16_t, 8021, Op::Both>             683 ns          517 ns      1723077
bm<int16_t, 8021, Op::Min_val>          251 ns          176 ns      6400000
bm<int16_t, 8021, Op::Max_val>          249 ns          174 ns      5575111
bm<int16_t, 8021, Op::Both_val>        3318 ns         1709 ns       320000
bm<int32_t, 8021, Op::Min>              965 ns          719 ns      1000000
bm<int32_t, 8021, Op::Max>              903 ns          628 ns      1120000
bm<int32_t, 8021, Op::Both>            1405 ns          893 ns      1120000
bm<int32_t, 8021, Op::Min_val>          497 ns          307 ns      2036364
bm<int32_t, 8021, Op::Max_val>          505 ns          377 ns      1947826
bm<int32_t, 8021, Op::Both_val>         690 ns          401 ns      1792000
bm<int64_t, 8021, Op::Min>             4466 ns         3024 ns       263529
bm<int64_t, 8021, Op::Max>             4385 ns         2860 ns       224000
bm<int64_t, 8021, Op::Both>            4845 ns         2567 ns       280000
bm<int64_t, 8021, Op::Min_val>         5156 ns         1883 ns      1120000
bm<int64_t, 8021, Op::Max_val>         4003 ns         1664 ns       497778
bm<int64_t, 8021, Op::Both_val>        3847 ns         2052 ns       464593
bm<float, 8021, Op::Min>               1965 ns          928 ns       640000
bm<float, 8021, Op::Max>               2014 ns          949 ns       560000
bm<float, 8021, Op::Both>              2254 ns         1067 ns       746667
bm<float, 8021, Op::Min_val>           1870 ns          767 ns      1120000
bm<float, 8021, Op::Max_val>           1838 ns          984 ns       746667
bm<float, 8021, Op::Both_val>          1886 ns          894 ns       768981
bm<double, 8021, Op::Min>              3931 ns         2009 ns       373333
bm<double, 8021, Op::Max>              4004 ns         1646 ns       560000
bm<double, 8021, Op::Both>             4809 ns         1842 ns       280000
bm<double, 8021, Op::Min_val>          3776 ns         1855 ns       320000
bm<double, 8021, Op::Max_val>          3769 ns         1807 ns       320000
bm<double, 8021, Op::Both_val>         3850 ns         1674 ns       448000

After:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>              170 ns         80.2 ns      8960000
bm<uint8_t, 8021, Op::Max>              170 ns         82.3 ns     11200000
bm<uint8_t, 8021, Op::Both>             295 ns         85.0 ns      7719385
bm<uint8_t, 8021, Op::Min_val>         76.7 ns         26.2 ns     29866667
bm<uint8_t, 8021, Op::Max_val>         75.4 ns         28.0 ns     40727273
bm<uint8_t, 8021, Op::Both_val>        4312 ns         1283 ns       560000
bm<uint16_t, 8021, Op::Min>             336 ns          157 ns      8960000
bm<uint16_t, 8021, Op::Max>             318 ns          146 ns      7466667
bm<uint16_t, 8021, Op::Both>            553 ns          231 ns      3446154
bm<uint16_t, 8021, Op::Min_val>         142 ns         69.8 ns     22400000
bm<uint16_t, 8021, Op::Max_val>         139 ns         68.8 ns     10000000
bm<uint16_t, 8021, Op::Both_val>       4306 ns         2574 ns       248889
bm<uint32_t, 8021, Op::Min>             615 ns          363 ns      3009055
bm<uint32_t, 8021, Op::Max>             623 ns          325 ns      2357895
bm<uint32_t, 8021, Op::Both>           1071 ns          609 ns      1000000
bm<uint32_t, 8021, Op::Min_val>         258 ns         64.3 ns     12389136
bm<uint32_t, 8021, Op::Max_val>         258 ns         58.6 ns     11200000
bm<uint32_t, 8021, Op::Both_val>        374 ns          100 ns     10000000
bm<uint64_t, 8021, Op::Min>            3540 ns          663 ns       896000
bm<uint64_t, 8021, Op::Max>            3468 ns          922 ns      1000000
bm<uint64_t, 8021, Op::Both>           4271 ns          907 ns      1120000
bm<uint64_t, 8021, Op::Min_val>        2917 ns          680 ns       896000
bm<uint64_t, 8021, Op::Max_val>        2974 ns          797 ns      1000000
bm<uint64_t, 8021, Op::Both_val>       3090 ns          893 ns      1120000
bm<int8_t, 8021, Op::Min>               177 ns         38.1 ns     16000000
bm<int8_t, 8021, Op::Max>               177 ns         48.1 ns     14933333
bm<int8_t, 8021, Op::Both>              288 ns         85.4 ns      6400000
bm<int8_t, 8021, Op::Min_val>          77.4 ns         29.9 ns     20363636
bm<int8_t, 8021, Op::Max_val>          74.1 ns         22.6 ns     37333333
bm<int8_t, 8021, Op::Both_val>         3843 ns         1086 ns       719369
bm<int16_t, 8021, Op::Min>              339 ns         69.8 ns      8960000
bm<int16_t, 8021, Op::Max>              321 ns         73.4 ns     10000000
bm<int16_t, 8021, Op::Both>             553 ns          141 ns     11200000
bm<int16_t, 8021, Op::Min_val>          140 ns         49.8 ns     16000000
bm<int16_t, 8021, Op::Max_val>          139 ns         45.6 ns     20906668
bm<int16_t, 8021, Op::Both_val>        3377 ns          628 ns       746667
bm<int32_t, 8021, Op::Min>              620 ns          178 ns      7466667
bm<int32_t, 8021, Op::Max>              625 ns          159 ns      5120000
bm<int32_t, 8021, Op::Both>            1059 ns          261 ns      2635294
bm<int32_t, 8021, Op::Min_val>          254 ns         62.5 ns     10000000
bm<int32_t, 8021, Op::Max_val>          254 ns         64.5 ns      8960000
bm<int32_t, 8021, Op::Both_val>         379 ns          100 ns     11200000
bm<int64_t, 8021, Op::Min>             3532 ns         1287 ns       497778
bm<int64_t, 8021, Op::Max>             3462 ns         1221 ns       448000
bm<int64_t, 8021, Op::Both>            4130 ns         1953 ns       640000
bm<int64_t, 8021, Op::Min_val>         3011 ns         1726 ns       298667
bm<int64_t, 8021, Op::Max_val>         2945 ns         1632 ns       497778
bm<int64_t, 8021, Op::Both_val>        3264 ns         1378 ns       669014
bm<float, 8021, Op::Min>               1176 ns          401 ns      1947826
bm<float, 8021, Op::Max>               1208 ns          750 ns      1000000
bm<float, 8021, Op::Both>              1358 ns         1151 ns       746667
bm<float, 8021, Op::Min_val>            894 ns          802 ns       896000
bm<float, 8021, Op::Max_val>            891 ns          310 ns      4683093
bm<float, 8021, Op::Both_val>           949 ns          844 ns      1000000
bm<double, 8021, Op::Min>              2230 ns         1918 ns       448000
bm<double, 8021, Op::Max>              2421 ns         2134 ns       373333
bm<double, 8021, Op::Both>             2765 ns         2532 ns       407273
bm<double, 8021, Op::Min_val>          1860 ns         1674 ns       448000
bm<double, 8021, Op::Max_val>          1869 ns         1500 ns       448000
bm<double, 8021, Op::Both_val>         1939 ns         1381 ns       407273

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner May 7, 2024 07:24
@AlexGuteniev
Copy link
Contributor Author

I also dropped the bool _Unused.

With extra dispatcher, the inlining decisions are different. Now, the dispatcher is inlined into the exported functions, along with the scalar implementation. The vector implementations are tail called, and signature variations are not likely to prevent that.

@StephanTLavavej StephanTLavavej added the performance Must go faster label May 7, 2024
@StephanTLavavej StephanTLavavej self-assigned this May 7, 2024
@AlexGuteniev
Copy link
Contributor Author

Results as a table

Benchmark Before After
bm<uint8_t, 8021, Op::Min> 248 ns 170 ns
bm<uint8_t, 8021, Op::Max> 238 ns 170 ns
bm<uint8_t, 8021, Op::Both> 371 ns 295 ns
bm<uint8_t, 8021, Op::Min_val> 131 ns 76.7 ns
bm<uint8_t, 8021, Op::Max_val> 128 ns 75.4 ns
bm<uint8_t, 8021, Op::Both_val> 4197 ns 4312 ns
bm<uint16_t, 8021, Op::Min> 459 ns 336 ns
bm<uint16_t, 8021, Op::Max> 445 ns 318 ns
bm<uint16_t, 8021, Op::Both> 685 ns 553 ns
bm<uint16_t, 8021, Op::Min_val> 247 ns 142 ns
bm<uint16_t, 8021, Op::Max_val> 252 ns 139 ns
bm<uint16_t, 8021, Op::Both_val> 4239 ns 4306 ns
bm<uint32_t, 8021, Op::Min> 979 ns 615 ns
bm<uint32_t, 8021, Op::Max> 932 ns 623 ns
bm<uint32_t, 8021, Op::Both> 1439 ns 1071 ns
bm<uint32_t, 8021, Op::Min_val> 501 ns 258 ns
bm<uint32_t, 8021, Op::Max_val> 494 ns 258 ns
bm<uint32_t, 8021, Op::Both_val> 673 ns 374 ns
bm<uint64_t, 8021, Op::Min> 4252 ns 3540 ns
bm<uint64_t, 8021, Op::Max> 4360 ns 3468 ns
bm<uint64_t, 8021, Op::Both> 4397 ns 4271 ns
bm<uint64_t, 8021, Op::Min_val> 3844 ns 2917 ns
bm<uint64_t, 8021, Op::Max_val> 3857 ns 2974 ns
bm<uint64_t, 8021, Op::Both_val> 3862 ns 3090 ns
bm<int8_t, 8021, Op::Min> 246 ns 177 ns
bm<int8_t, 8021, Op::Max> 235 ns 177 ns
bm<int8_t, 8021, Op::Both> 361 ns 288 ns
bm<int8_t, 8021, Op::Min_val> 126 ns 77.4 ns
bm<int8_t, 8021, Op::Max_val> 128 ns 74.1 ns
bm<int8_t, 8021, Op::Both_val> 3842 ns 3843 ns
bm<int16_t, 8021, Op::Min> 460 ns 339 ns
bm<int16_t, 8021, Op::Max> 445 ns 321 ns
bm<int16_t, 8021, Op::Both> 683 ns 553 ns
bm<int16_t, 8021, Op::Min_val> 251 ns 140 ns
bm<int16_t, 8021, Op::Max_val> 249 ns 139 ns
bm<int16_t, 8021, Op::Both_val> 3318 ns 3377 ns
bm<int32_t, 8021, Op::Min> 965 ns 620 ns
bm<int32_t, 8021, Op::Max> 903 ns 625 ns
bm<int32_t, 8021, Op::Both> 1405 ns 1059 ns
bm<int32_t, 8021, Op::Min_val> 497 ns 254 ns
bm<int32_t, 8021, Op::Max_val> 505 ns 254 ns
bm<int32_t, 8021, Op::Both_val> 690 ns 379 ns
bm<int64_t, 8021, Op::Min> 4466 ns 3532 ns
bm<int64_t, 8021, Op::Max> 4385 ns 3462 ns
bm<int64_t, 8021, Op::Both> 4845 ns 4130 ns
bm<int64_t, 8021, Op::Min_val> 5156 ns 3011 ns
bm<int64_t, 8021, Op::Max_val> 4003 ns 2945 ns
bm<int64_t, 8021, Op::Both_val> 3847 ns 3264 ns
bm<float, 8021, Op::Min> 1965 ns 1176 ns
bm<float, 8021, Op::Max> 2014 ns 1208 ns
bm<float, 8021, Op::Both> 2254 ns 1358 ns
bm<float, 8021, Op::Min_val> 1870 ns 894 ns
bm<float, 8021, Op::Max_val> 1838 ns 891 ns
bm<float, 8021, Op::Both_val> 1886 ns 949 ns
bm<double, 8021, Op::Min> 3931 ns 2230 ns
bm<double, 8021, Op::Max> 4004 ns 2421 ns
bm<double, 8021, Op::Both> 4809 ns 2765 ns
bm<double, 8021, Op::Min_val> 3776 ns 1860 ns
bm<double, 8021, Op::Max_val> 3769 ns 1869 ns
bm<double, 8021, Op::Both_val> 3850 ns 1939 ns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Status: Initial Review
Development

Successfully merging this pull request may close these issues.

More minmax vectorization
2 participants