
Matmul tests don't seem to be representative (Julia vs Python experiments) #312

Open
no-trick-pony opened this issue Mar 15, 2021 · 7 comments

no-trick-pony commented Mar 15, 2021

The following tests were performed with:

Python 3.8.2 + numpy 1.20.1
Julia 1.5.3
Windows 10
Ryzen 7 1700

I am currently trying to figure out which language to use for my project, which will involve a lot of smaller matrix multiplications. I looked at Rust, Julia and Python. According to the Matmul benchmark here, Julia-native is twice as slow as Python-numpy, with very little variance. That seemed kind of odd, since Julia was designed for scientific computing, where one would assume linear algebra is a foundation for many things - if the Julia developers had found their implementation to be that much slower, they surely would have taken a look at numpy and improved their own. It's probably the one thing Julia should be somewhat good at. ^^ I tested the two scripts (which I modified to output on stdout) and got about the same results - but only for a very small matrix size and a mismatched thread count.

Experiments:

So I tried it with 10000 as the matrix size parameter instead:

  • Julia used 50% CPU, 8 threads and took 17.5 seconds on average
  • Numpy used 100% CPU, (I assume) 16 threads and took 15.5 seconds on average

Hm. Why Julia uses only 8 threads on my 16-thread CPU, I have no idea - maybe something for the Julia team to change, along with the fact that BLAS ignores the --threads option and the JULIA_NUM_THREADS environment variable and instead has to be set manually in code or via the BLAS-specific environment variable.
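
To illustrate for anyone reproducing this - the two settings are independent, roughly like so (a minimal sketch, assuming the default OpenBLAS backend that Julia ships with):

using LinearAlgebra

# --threads / JULIA_NUM_THREADS only control Julia's own task threads:
println("Julia threads: ", Threads.nthreads())

# The BLAS pool used by the * operator is configured separately, in code ...
BLAS.set_num_threads(16)

# ... or via the backend-specific environment variable before startup, e.g.
#   OPENBLAS_NUM_THREADS=16 julia matmul-native.jl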

I changed the code to allow specifying the number of threads and tested again with 4, 8 and 16 threads (10000 again as the size parameter). Note that I ran every test only once. After the change to calc(n + (i % 2)) (see below) I reran the tests, but the outcome was roughly the same in terms of relative time difference. The commands in brackets show how you would invoke the final code down below - the parameters are matrix size, number of iterations and number of threads to use:

  • Julia, 4 threads: 30.9s (julia matmul-native.jl 10000 1 4)
  • Numpy, 4 threads: 32.67s (python matmul-numpy.py 10000 1 4)
  • Julia, 8 threads: 17.7s (julia matmul-native.jl 10000 1 8)
  • Numpy, 8 threads: 18.6s (python matmul-numpy.py 10000 1 8)
  • Julia, 16 threads: 13.8s (julia matmul-native.jl 10000 1 16)
  • Numpy, 16 threads: 15.7s (python matmul-numpy.py 10000 1 16)

Then I looked into the Matmul Makefile and found out that 1500 was used there:

  • Julia, 4 threads: 0.11s (julia matmul-native.jl 1500 1 4)
  • Numpy, 4 threads: 0.14s (python matmul-numpy.py 1500 1 4)
  • Julia, 8 threads: 0.11s (julia matmul-native.jl 1500 1 8)
  • Numpy, 8 threads: 0.12s (python matmul-numpy.py 1500 1 8)
  • Julia, 16 threads: 0.084s (julia matmul-native.jl 1500 1 16)
  • Numpy, 16 threads: 0.095s (python matmul-numpy.py 1500 1 16)

So, for larger matrices, Julia appears to be faster given the same thread count. Even for the size tested here, I get slightly faster computation with Julia. Weirdly enough, my Python numbers weren't far off from the ones on the front page, but Julia on my machine performed far better than the roughly 200 ms with 8 threads in the benchmarks here - for me it only took 110 ms. Probably something for the Julia people to look into(?)

But what I need for my project is lots of small multiplications. So I added yet another parameter that sets the number of times calc() should be called in a loop. I changed calc(n) to calc(n + (i % 2)), with the results getting summed up, to prevent the calls from being optimized away or unrolled. I tested 100 as the matrix size:

  • Numpy, 1 thread, 100000 repetitions: 24.7s (python matmul-numpy.py 100 100000 1)
  • Julia, 1 thread, 100000 repetitions: 23.2s (julia matmul-native.jl 100 100000 1)
  • Numpy, 2 threads, 100000 repetitions: 22.1s (python matmul-numpy.py 100 100000 2)
  • Julia, 2 threads, 100000 repetitions: 17.1s (julia matmul-native.jl 100 100000 2)
  • Numpy, 4 threads, 100000 repetitions: 22.9s (python matmul-numpy.py 100 100000 4)
  • Julia, 4 threads, 100000 repetitions: 15.2s (julia matmul-native.jl 100 100000 4)
  • Numpy, 8 threads, 100000 repetitions: 26.3s (python matmul-numpy.py 100 100000 8)
  • Julia, 8 threads, 100000 repetitions: 20.4s (julia matmul-native.jl 100 100000 8)
  • Numpy, 16 threads, 100000 repetitions: 34.2s (python matmul-numpy.py 100 100000 16)
  • Julia, 16 threads, 100000 repetitions: 22.96s (julia matmul-native.jl 100 100000 16)

Yes, numpy and Julia with more threads were both slower than with fewer threads for me - my guess is that this is due to the thread creation/sleep overhead on such a small task. Julia peaked at 4 threads, numpy at 2.

Conclusion

I am unsure how valuable the current benchmark for matrix multiplication is. There seems to be a small overhead for the first call in Julia, and I would assume this to be the case for many other languages as well. The question I want to raise is: is a single small-ish matrix multiplication in the realm of 100 ms a benchmark that is of value to anyone, when successive calls and bigger multiplications are all much faster? From my point of view, no, and it seems a bit misleading. I would thus suggest performing multiple sub-benchmarks and posting their results (and maybe a summed-up number for the "overall duration"): bigger matrices with no repetition, and smaller matrices with and without repetition. I think that would give better insight into which language performs how, and why.

One thing I am not sure about: according to the main README, the CPU used in the benchmarks is an Intel(R) Core(TM) i7-10710U - that CPU has 12 threads. Why was Julia tested with 8 and not the full 12 threads (which e.g. numpy will use by default)? My guess is that Julia simply uses 8 threads by default and was never manually set to 12, but I don't know for certain.
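
Rather than guessing, the defaults can be queried directly (a minimal sketch; BLAS.get_num_threads needs a newer Julia than the 1.5.3 I tested with - if I remember correctly it was added in 1.6):

using LinearAlgebra

# Print what the current session actually uses:
println("Julia threads: ", Threads.nthreads())
println("BLAS threads: ", BLAS.get_num_threads())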

Code:

import os
import sys
from timeit import default_timer as timer

# The thread count must be exported before NumPy is imported,
# otherwise the BLAS backend won't pick it up.
num_threads = int(sys.argv[3]) if len(sys.argv) > 3 else 8
os.environ["OMP_NUM_THREADS"] = str(num_threads)
import numpy as np

def build_matrix_np(n, seed):
    # Vectorized equivalent of the Julia matgen below:
    # element (i, j) = (i - j) * (i + j) * seed / n / n for 0-based i, j
    i_idxs = np.atleast_2d(np.arange(n)).T
    j_idxs = np.atleast_2d(np.arange(n))
    return (i_idxs - j_idxs) * (i_idxs + j_idxs) * (seed / n / n)


def matmul(a, b):
    return np.dot(a, b)


def calc(n):
    n = n // 2 * 2  # force an even size so the center element is well-defined
    a = build_matrix_np(n, 1.0)
    b = build_matrix_np(n, 2.0)
    d = matmul(a, b)
    return d[n // 2, n // 2]

def testing():
    if __name__ == "__main__":
        n = int(sys.argv[1]) if len(sys.argv) > 1 else 100
        t = int(sys.argv[2]) if len(sys.argv) > 2 else 1
        
        # Warm-up call that also acts as a sanity check against a known value
        left = calc(101)
        right = -18.67
        if abs(left - right) > 0.1:
            print("%f != %f" % (left, right), file=sys.stderr)
            quit(1)

        print(f"Python (NumPy, {num_threads} threads)\t{os.getpid()}")
        results = 0
        start = timer()
        for i in range(t):
            results += calc(n + (i % 2))
        elapsed = timer() - start
        print("stop")
        print(results)
        print(elapsed)

testing()

and

using LinearAlgebra

# Same matrix as build_matrix_np in the Python version, shifted for 1-based
# indices: element (i, j) = tmp * (i - j) * (i + j - 2)
function matgen(n, seed)
    tmp = seed / n / n
    [tmp * (i - j) * (i + j - 2) for i = 1:n, j = 1:n]
end

function calc(n)
    n = div(n, 2) * 2  # force an even size so the center element is well-defined
    a = matgen(n, 1.0)
    b = matgen(n, 2.0)
    c = a * b
    c[div(n, 2)+1, div(n, 2)+1]
end


function testing()
    if abspath(PROGRAM_FILE) == @__FILE__
        n = length(ARGS) > 0 ? parse(Int, ARGS[1]) : 100
        t = length(ARGS) > 1 ? parse(Int, ARGS[2]) : 1
        num_threads = length(ARGS) > 2 ? parse(Int, ARGS[3]) : 8
        
        # Assuming openblas64; may not work for all environments
        BLAS.set_num_threads(num_threads)
        
        # Warm-up call that also acts as a sanity check against a known value
        left = calc(101)
        right = -18.67
        if abs(left - right) > 0.1
            println(stderr, "$(left) != $(right)")
            exit(1)
        end

        println("Julia (threads: $(num_threads))\t$(getpid())")

        results = 0
        start = time()
        for i in 1:t
            results += calc(n + (i % 2))
        end
        elapsed = time() - start

        println("stop")
        println(results)
        println("time: $(elapsed) s")
    end
end


testing()
kostya (Owner) commented Mar 15, 2021

Hi, these are very interesting results; you should post this as an article somewhere (on reddit, for example). I think multithreaded comparison is outside the scope of this project, and the multithreaded versions were added just as additional comparisons.

nuald (Collaborator) commented Mar 15, 2021

You've raised a lot of questions here, but I'm not sure we want to go deeper yet because there is some confusion about the benchmarks. After we resolve it, we can discuss these matters in detail.

We have the reference implementations there (without any library usage) - those are the real benchmarks of the languages (and you can easily see that Julia, even without BLAS, is much faster than Python). However, in real life we use libraries, which is why they're included, but only as samples. And a proper comparison would cover not only Python+NumPy vs Julia+OpenBLAS with different inputs and thread counts, but other library and configuration combinations as well.

As you can see, one would need a special benchmark just for the linear algebra comparisons.

However, I think you're wondering why Julia+OpenBLAS was slower than Python+NumPy. First, let me note that the number of threads doesn't matter (I've tested both examples with OPENBLAS_NUM_THREADS=X, and Python was always faster). But you may have missed that the benchmark consists of two parts:

  1. Generating matrices.
  2. Multiplying them.

Python uses the special NumPy API, while Julia doesn't use anything special - the matrices are created as plain arrays. That means Julia has overhead for communicating with OpenBLAS, as it needs to transform Julia's arrays into OpenBLAS input data. If you use the proper API, I think Julia will have better numbers (that's my educated guess though, and if you're willing to look into that, we'd gladly accept a PR with optimized matmul-native.jl code). If my guess is wrong, we can look further (but first, we need to fix that).
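
If someone wants to check this guess, here is a minimal sketch for timing the two parts separately (split_timing is a hypothetical helper, reusing matgen from the code above):

using LinearAlgebra

function matgen(n, seed)
    tmp = seed / n / n
    [tmp * (i - j) * (i + j - 2) for i = 1:n, j = 1:n]
end

# Time matrix generation and multiplication separately to see
# how much each part contributes to the total:
function split_timing(n)
    t0 = time()
    a = matgen(n, 1.0)
    b = matgen(n, 2.0)
    t1 = time()
    c = a * b
    t2 = time()
    println("generation: $(t1 - t0) s, multiplication: $(t2 - t1) s")
    c[div(n, 2)+1, div(n, 2)+1]
end

split_timing(1500)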

machineko (Contributor) commented Mar 26, 2021

> Also, Julia uses a JIT compiler, so it might be good to run the code once before starting the actual benchmark so that it can be compiled.

@jtrakk You should check the code before commenting; it's already using a warm start:

    left = calc(101)
    right = -18.67
    if abs(left - right) > 0.1
        println(stderr, "$(left) != $(right)")
        exit(1)
    end

    notify("Julia (no BLAS)\t$(getpid())")
    t = time()
    results = calc(n)
    elapsed = time() - t
    notify("stop")

dumblob commented Sep 11, 2021

Not exactly to the discussion, but closely related to the topic.

I've just now spotted that, unlike many other languages (Rust, V, Nim, etc.) which do bounds checking, the solution in matmul.jl turns bounds checking off. I think that's unfair, especially considering that the timing results for matmul.jl are much faster than those of these other languages 😉.

Is there any reason bounds checking is turned off in Julia?

(Other languages also have directives and other means to suppress bounds checking, but those are not considered "standard practice" in their ecosystems, even for scientific calculations. I'd guess the Julia ecosystem doesn't encourage turning off bounds checking either.)
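
For context, this is the difference I mean - a minimal sketch of a naive matmul inner loop with and without the annotation (the function names are mine, not from matmul.jl):

function mul_checked!(c, a, b)
    n = size(a, 1)
    for j = 1:n, k = 1:n, i = 1:n
        c[i, j] += a[i, k] * b[k, j]  # every access is bounds-checked
    end
    c
end

function mul_unchecked!(c, a, b)
    n = size(a, 1)
    @inbounds for j = 1:n, k = 1:n, i = 1:n
        c[i, j] += a[i, k] * b[k, j]  # checks are elided inside @inbounds
    end
    c
end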

kostya (Owner) commented Sep 11, 2021

I think bounds checking should be enabled for all languages, as it would likely be enabled in production. Instead, we should just use language constructs that avoid the checks, like iterators.
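
A hedged example of what such a construct could look like in Julia: iterating with eachindex lets the compiler prove the accesses are in bounds, so the checks can be elided without any explicit @inbounds annotation.

# Elementwise scaling; eachindex only yields valid indices for a,
# so the bounds checks can be optimized away automatically.
function scale!(a, s)
    for i in eachindex(a)
        a[i] *= s
    end
    a
end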

nuald (Collaborator) commented Sep 13, 2021

There is a reason why we don't have consistency here - the "cheating" underlying libraries used by some languages. We may have safe bounds-checked and/or iterator-based code, but it will never beat NumPy, OpenBLAS, etc., as they contain quite unsafe and aggressively optimized C and Fortran code.

Therefore, there is no way to have a fully fair comparison here, so I allowed skipping bounds checking where possible. If we introduce it, we would have to rewrite the existing code (for example, use C++'s at() instead of []), and that may involve a lot of work (and would contradict the requirement of having standard code, not code specially adapted for the tests). An iterator-based approach could work, though I won't say it would be easy to comprehend (https://www.reidatcheson.com/matrix%20multiplication/rust/iterators/2021/02/26/gemm-iterators.html), and I doubt it could be considered standard code written by an average developer.

dumblob commented Sep 13, 2021

> We may have safe bounds-checked and/or iterator-based code, but it will never beat NumPy, OpenBLAS, etc., as they contain quite unsafe and aggressively optimized C and Fortran code.

This is IMHO totally fine and understandable. I was just surprised by the explicit disabling of bounds checking in Julia (an average developer, myself included, wouldn't turn off bounds checking in this case IMHO), nothing more, nothing less 😉.
