Tidy Android Instructions README.md #7016

Merged
merged 9 commits into ggerganov:master from Jeximo:patch-2 on May 4, 2024

Conversation

Jeximo
Contributor

@Jeximo Jeximo commented Apr 30, 2024

It's better to tidy the README regarding the CLBlast instructions for Android.

Removed the CLBlast instructions (outdated) and simplified the Android CPU build instructions.

Remove CLBlast instructions (outdated), added OpenBLAS.
Added apt install git, so that git clone works
@slaren
Collaborator

slaren commented Apr 30, 2024

Is OpenBLAS actually worth using in Android? For quantized models, it may be faster without it. Ultimately though, without the OpenCL instructions, this basically looks like "install termux and follow the normal build instructions for linux". So maybe it would be simpler that way.
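For readers following along, the "normal build" inside Termux at the time came down to something like the sketch below (package names assume the standard Termux repos; adjust to your setup):

```sh
# Inside Termux on the device (not a cross-compile)
pkg install clang git cmake make    # build tools from the Termux repos
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make                                # plain CPU build, same as the Linux instructions
```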

@Jeximo
Contributor Author

Jeximo commented May 1, 2024

> Is OpenBLAS actually worth using in Android?

I like leaving the decision to the user on whether OpenBLAS is worth it. I don't use it, but I also don't use large prompts (supposedly that's where it shines).

> this basically looks like "install termux and follow the normal build instructions for linux". So maybe it would be simpler that way.

Agreed.

Linked to Linux build instructions
Remove word "run"
@teleprint-me
Contributor

I build with OpenBLAS on Android, not that it matters. My chiming in is, unfortunately, anecdotal. Is it really negligible? It's more difficult to tell on the phone, if I'm being honest.

@slaren
Collaborator

slaren commented May 1, 2024

The easiest way to tell if OpenBLAS helps would be to run llama-bench and look at the pp performance. BLAS is only used for prompts with at least 32 tokens.
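A minimal example of such a check (the model path is a placeholder): build once with OpenBLAS and once without, then run the same benchmark against each build and compare the pp rows.

```sh
# pp (prompt processing) is where BLAS can matter; tg (token generation) normally does not use it
./llama-bench -m ~/model.gguf -t 4 -p 512 -n 128
```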

@Jeximo
Contributor Author

Jeximo commented May 1, 2024

CPU is definitely faster with quants on my device:
OpenBLAS:

| model                          |       size |     params | backend    |    threads |    n_batch | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 8B IQ4_XS - 4.25 bpw     |   3.64 GiB |     7.24 B | BLAS       |          4 |         32 | pp 512     |      1.00 ± 0.00 |
build: a8f9b076 (2775)

CPU:

| model                          |     size   |     params | backend    |    threads |    n_batch | test       |              t/s |                 
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |                 
| llama 8B IQ4_XS - 4.25 bpw     |   3.64 GiB |     7.24 B | CPU        |          4 |         32 | pp 512     |      3.12 ± 0.06 |                
build: a8f9b076 (2775)

@teleprint-me
Contributor

I had to update, fix the convert script by adding the hash, then upload the model I use, rebuild, and then download the quant. Plus, I have a bunch of other scripts running, so I'll post once it's all set.

@teleprint-me
Contributor

CPU is much faster! Why is that?

~ $ ./llama.cpp/llama-bench -m models/stablelm-2-zephyr-1_6b.gguf -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | BLAS       |          8 | pp 512     |     10.19 ± 1.87 |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | BLAS       |          8 | tg 128     |      2.17 ± 0.17 |

build: a8f9b076 (2775)
~ $ ./llama.cpp/llama-bench -m models/stablelm-2-zephyr-1_6b.gguf -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | CPU        |          8 | pp 512     |     32.35 ± 2.42 |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | CPU        |          8 | tg 128     |      3.72 ± 0.79 |

build: a8f9b076 (2775)

@Jeximo
Contributor Author

Jeximo commented May 2, 2024

> CPU is much faster! Why is that?

I think libopenblas is not a full backend. Vulkan is the way forward for mobile GPU: #6395 (comment)

Jeximo and others added 3 commits May 3, 2024 09:53
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Fdroid is not required

Co-authored-by: slaren <slarengh@gmail.com>
@Jeximo
Contributor Author

Jeximo commented May 4, 2024

> This should only affect the load time of the model though, but performance during inference should be the same.

Thank you. I'll try various options and post results later.

Co-authored-by: slaren <slarengh@gmail.com>
@slaren slaren merged commit cf768b7 into ggerganov:master May 4, 2024
21 checks passed
@Jeximo Jeximo deleted the patch-2 branch May 5, 2024 16:26
@Jeximo
Contributor Author

Jeximo commented May 5, 2024

Tested --no-mmap on a model loaded from ~/ vs. shared storage (Downloads). Performance is improved. It appears the performance reduction is due to the combination of the Android SAF API and mmap.

Here's some quick numbers, loading from shared:

llama_print_timings:        load time =   26232.51 ms
llama_print_timings:      sample time =      19.78 ms /    33 runs   (    0.60 ms per token,  1668.01 tokens per second)
llama_print_timings: prompt eval time =  186348.29 ms /    51 tokens ( 3653.89 ms per token,     0.27 tokens per second)
llama_print_timings:        eval time =  248449.20 ms /    32 runs   ( 7764.04 ms per token,     0.13 tokens per second)
llama_print_timings:       total time =  443161.08 ms /    83 tokens

load from shared & --no-mmap

llama_print_timings:        load time =   15297.21 ms
llama_print_timings:      sample time =      26.22 ms /    44 runs   (    0.60 ms per token,  1677.85 tokens per second)
llama_print_timings: prompt eval time =   54639.93 ms /    51 tokens ( 1071.37 ms per token,     0.93 tokens per second)
llama_print_timings:        eval time =   39760.87 ms /    43 runs   (  924.67 ms per token,     1.08 tokens per second)
llama_print_timings:       total time =   96297.49 ms /    94 tokens

load from ~/:

llama_print_timings:        load time =    6302.93 ms
llama_print_timings:      sample time =      32.26 ms /    54 runs   (    0.60 ms per token,  1673.85 tokens per second)
llama_print_timings: prompt eval time =   58406.42 ms /    51 tokens ( 1145.22 ms per token,     0.87 tokens per second)
llama_print_timings:        eval time =   48915.58 ms /    53 runs   (  922.94 ms per token,     1.08 tokens per second)
llama_print_timings:       total time =  108573.70 ms /   104 tokens

load from ~/ & --no-mmap:

llama_print_timings:        load time =    5184.56 ms
llama_print_timings:      sample time =      28.71 ms /    49 runs   (    0.59 ms per token,  1706.54 tokens per second)
llama_print_timings: prompt eval time =   46939.36 ms /    51 tokens (  920.38 ms per token,     1.09 tokens per second)
llama_print_timings:        eval time =   44217.39 ms /    48 runs   (  921.20 ms per token,     1.09 tokens per second)
llama_print_timings:       total time =   92946.78 ms /    99 tokens

Based on these figures, --no-mmap with a model in ~/ is the best combination. I used Meta-Llama-3-8B-Instruct-IQ3_M.gguf. I'll get a small model and run llama-bench later.
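For completeness, a sketch of how such a run might be invoked with the main binary of that era (model name, prompt, and token count are illustrative):

```sh
# Copy the model into the Termux home directory first, then disable mmap
./main -m ~/Meta-Llama-3-8B-Instruct-IQ3_M.gguf --no-mmap -p "Hello" -n 48
```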

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* Tidy Android Instructions README.md

Remove CLBlast instructions(outdated), added OpenBlas.

* don't assume git is installed

Added apt install git, so that git clone works

* removed OpenBlas

Linked to Linux build instructions

* fix typo

Remove word "run"

* correct style

Co-authored-by: slaren <slarengh@gmail.com>

* correct grammar

Co-authored-by: slaren <slarengh@gmail.com>

* delete reference to Android API

* remove Fdroid reference, link directly to Termux

Fdroid is not required

Co-authored-by: slaren <slarengh@gmail.com>

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
@Jeximo
Contributor Author

Jeximo commented May 6, 2024

Tested with TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf using llama-bench: ./llama-bench -t 4 -p 512 -n 128, with --mmap 0 and --mmap 1.

Load from shared, -m /data/data/com.termux/files/home/storage/downloads/TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf

| model                          |       size |     params | backend    |    threads |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | pp 512     |     22.82 ± 0.27 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | tg 128     |     11.68 ± 0.23 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | pp 512     |     22.30 ± 0.09 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | tg 128     |     11.53 ± 0.21 |

build: 628b2991 (2794)

Load from ~/, ~/TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf

| model                          |       size |     params | backend    |    threads |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | pp 512     |     22.59 ± 0.22 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | tg 128     |     11.54 ± 0.08 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | pp 512     |     22.08 ± 0.08 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | tg 128     |     11.28 ± 0.25 |

build: 628b2991 (2794)

The results are nearly identical. TinyLlama (1.09 GiB) is probably too small to show a difference in this test; even mmap made no difference. I'll leave larger-model benchmarking to someone with a better device than mine.

teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 7, 2024
@gustrd
Contributor

gustrd commented May 11, 2024

Hey everyone,

As the original author of these README instructions, I have to admit that I now see how they might cause more confusion than clarity.

Just to clarify for future users: I've personally found CLBlast to be quite effective when used with llama.cpp, especially for certain model families like StableLM and OpenLlama (provided you're not offloading layers). In my experience, it has boosted prompt processing speed by roughly 40%.

However, it's important to note that while CLBlast does offer significant speed improvements, it's plagued by bugs. For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output. This is disappointing, considering the untapped potential of the GPUs nestled within our smartphones.

If there's any way I can assist, I'd like to offer a few insights based on my experimentation:

  • In my tests, OpenBLAS consistently outperforms a no-BLAS build, particularly for prompts exceeding 256 tokens (see the build sketch after this comment).

  • The tip regarding disabling mmap on Android devices is a game-changer. I hadn't been aware of it previously, and it substantially accelerates prompt processing. I strongly advocate for emphasizing this point in the README.

Here's hoping that Vulkan proves to be a more robust solution than OpenGL.
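For anyone who wants to try the OpenBLAS route from the first bullet above, a rough sketch using the Termux package and CMake options available around this time (flag names may have changed since):

```sh
pkg install libopenblas
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
# then compare pp numbers against a plain CPU build with llama-bench, as discussed earlier
```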

@mofosyne mofosyne added documentation Improvements or additions to documentation review complexity : low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix labels May 12, 2024
@gpokat

gpokat commented May 12, 2024

> Just to clarify for future users: I've personally found CLBlast to be quite effective when used with llama.cpp [...] However, it's important to note that while CLBlast does offer significant speed improvements, it's plagued by bugs. For many model families [...] it tends to produce nonsensical output.

Did your CLBlast experience involve running the corresponding tuners to achieve that speed on your device?
Just for reference: in my experience with CLBlast, the nonsensical inference was fixed once I ran and applied the tuners. However, on a low-end Android device the inference speed was the same as CPU-only, without any loss in output.

@gustrd
Contributor

gustrd commented May 13, 2024

> Did your CLBlast experience involve running the corresponding tuners to achieve that speed on your device? [...] In my experience with CLBlast, the nonsensical inference was fixed once I ran and applied the tuners.

No, I have not tried the tuners yet. Good idea; it's a nice experiment to do. Thanks for the suggestion!

@shibe2
Collaborator

shibe2 commented May 13, 2024

> For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output.

Is this specific to Android builds, or can it be reproduced on a PC too?

@gustrd
Contributor

gustrd commented May 13, 2024

> For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output.
>
> Is this specific to Android builds, or can it be reproduced on a PC too?

As far as I know, it only happens with Android builds. All my tests were conducted with Adreno GPUs on Snapdragon.
