Tidy Android Instructions README.md #7016

Merged
merged 9 commits into ggerganov:master from Jeximo:patch-2 on May 4, 2024

Conversation

Jeximo
Contributor

@Jeximo Jeximo commented Apr 30, 2024

It's better to tidy the README regarding the CLBlast instructions for Android.

Removed the CLBlast instructions (outdated) and simplified the Android CPU build instructions.

Remove CLBlast instructions (outdated), added OpenBLAS.
Added apt install git, so that git clone works
@slaren
Collaborator

slaren commented Apr 30, 2024

Is OpenBLAS actually worth using in Android? For quantized models, it may be faster without it. Ultimately though, without the OpenCL instructions, this basically looks like "install termux and follow the normal build instructions for linux". So maybe it would be simpler that way.
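For readers following along, the "normal build" inside Termux at the time came down to something like the sketch below (package names assume the standard Termux repos; adjust to your setup):

```sh
# Inside Termux on the device (not a cross-compile)
pkg install clang git cmake make    # build tools from the Termux repos
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make                                # plain CPU build, same as the Linux instructions
```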

@Jeximo
Contributor Author

Jeximo commented May 1, 2024

> Is OpenBLAS actually worth using in Android?

I like leaving the decision to the user on whether OpenBLAS is worth it. I don't use it, but I also don't use large prompts (supposedly that's where it shines).

> this basically looks like "install termux and follow the normal build instructions for linux". So maybe it would be simpler that way.

Agreed.

Linked to Linux build instructions
Remove word "run"
@teleprint-me
Contributor

I build with OpenBLAS on Android, not that it matters. My chiming in is, unfortunately, anecdotal. Is it really negligible? It's more difficult to tell on the phone, if I'm being honest.

@slaren
Collaborator

slaren commented May 1, 2024

The easiest way to tell if OpenBLAS helps would be to run llama-bench and look at the pp performance. BLAS is only used for prompts with at least 32 tokens.
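A minimal example of such a check (the model path is a placeholder): build once with OpenBLAS and once without, then run the same benchmark against each build and compare the pp rows.

```sh
# pp (prompt processing) is where BLAS can matter; tg (token generation) normally does not use it
./llama-bench -m ~/model.gguf -t 4 -p 512 -n 128
```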

@Jeximo
Contributor Author

Jeximo commented May 1, 2024

CPU is definitely faster with quants on my device:
OpenBLAS:

| model                          |       size |     params | backend    |    threads |    n_batch | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 8B IQ4_XS - 4.25 bpw     |   3.64 GiB |     7.24 B | BLAS       |          4 |         32 | pp 512     |      1.00 ± 0.00 |
build: a8f9b076 (2775)

CPU:

| model                          |     size   |     params | backend    |    threads |    n_batch | test       |              t/s |                 
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |                 
| llama 8B IQ4_XS - 4.25 bpw     |   3.64 GiB |     7.24 B | CPU        |          4 |         32 | pp 512     |      3.12 ± 0.06 |                
build: a8f9b076 (2775)

@teleprint-me
Contributor

I had to update, fix the convert script by adding the hash, then upload the model I use, rebuild, and then download the quant. Plus, I have a bunch of other scripts running, so I'll post once it's all set.

@teleprint-me
Contributor

CPU is much faster! Why is that?

~ $ ./llama.cpp/llama-bench -m models/stablelm-2-zephyr-1_6b.gguf -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | BLAS       |          8 | pp 512     |     10.19 ± 1.87 |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | BLAS       |          8 | tg 128     |      2.17 ± 0.17 |

build: a8f9b076 (2775)
~ $ ./llama.cpp/llama-bench -m models/stablelm-2-zephyr-1_6b.gguf -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | CPU        |          8 | pp 512     |     32.35 ± 2.42 |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | CPU        |          8 | tg 128     |      3.72 ± 0.79 |

build: a8f9b076 (2775)

@Jeximo
Contributor Author

Jeximo commented May 2, 2024

> CPU is much faster! Why is that?

I think libopenblas is not a full backend. Vulkan is the way forward for mobile GPU: #6395 (comment)

Jeximo and others added 3 commits May 3, 2024 09:53
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Fdroid is not required

Co-authored-by: slaren <slarengh@gmail.com>
@Jeximo
Contributor Author

Jeximo commented May 4, 2024

> This should only affect the load time of the model though, but performance during inference should be the same.

Thank you. I'll try various options and post results later.

Co-authored-by: slaren <slarengh@gmail.com>
@slaren slaren merged commit cf768b7 into ggerganov:master May 4, 2024
21 checks passed
@Jeximo Jeximo deleted the patch-2 branch May 5, 2024 16:26
@Jeximo
Contributor Author

Jeximo commented May 5, 2024

Tested --no-mmap on a model loaded from ~/ vs. shared storage (Downloads). Performance is improved. It appears the performance reduction is due to the combination of the Android SAF API and mmap.

Here's some quick numbers, loading from shared:

llama_print_timings:        load time =   26232.51 ms
llama_print_timings:      sample time =      19.78 ms /    33 runs   (    0.60 ms per token,  1668.01 tokens per second)
llama_print_timings: prompt eval time =  186348.29 ms /    51 tokens ( 3653.89 ms per token,     0.27 tokens per second)
llama_print_timings:        eval time =  248449.20 ms /    32 runs   ( 7764.04 ms per token,     0.13 tokens per second)
llama_print_timings:       total time =  443161.08 ms /    83 tokens

load from shared & --no-mmap

llama_print_timings:        load time =   15297.21 ms
llama_print_timings:      sample time =      26.22 ms /    44 runs   (    0.60 ms per token,  1677.85 tokens per second)
llama_print_timings: prompt eval time =   54639.93 ms /    51 tokens ( 1071.37 ms per token,     0.93 tokens per second)
llama_print_timings:        eval time =   39760.87 ms /    43 runs   (  924.67 ms per token,     1.08 tokens per second)
llama_print_timings:       total time =   96297.49 ms /    94 tokens

load from ~/:

llama_print_timings:        load time =    6302.93 ms
llama_print_timings:      sample time =      32.26 ms /    54 runs   (    0.60 ms per token,  1673.85 tokens per second)
llama_print_timings: prompt eval time =   58406.42 ms /    51 tokens ( 1145.22 ms per token,     0.87 tokens per second)
llama_print_timings:        eval time =   48915.58 ms /    53 runs   (  922.94 ms per token,     1.08 tokens per second)
llama_print_timings:       total time =  108573.70 ms /   104 tokens

load from ~/ & --no-mmap:

llama_print_timings:        load time =    5184.56 ms
llama_print_timings:      sample time =      28.71 ms /    49 runs   (    0.59 ms per token,  1706.54 tokens per second)
llama_print_timings: prompt eval time =   46939.36 ms /    51 tokens (  920.38 ms per token,     1.09 tokens per second)
llama_print_timings:        eval time =   44217.39 ms /    48 runs   (  921.20 ms per token,     1.09 tokens per second)
llama_print_timings:       total time =   92946.78 ms /    99 tokens

Based on these figures, --no-mmap with a model in ~/ is the best combination. I used Meta-Llama-3-8B-Instruct-IQ3_M.gguf. I'll get a small model and run llama-bench later.
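For completeness, a sketch of how such a run might be invoked with the main binary of that era (model name, prompt, and token count are illustrative):

```sh
# Copy the model into the Termux home directory first, then disable mmap
./main -m ~/Meta-Llama-3-8B-Instruct-IQ3_M.gguf --no-mmap -p "Hello" -n 48
```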

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* Tidy Android Instructions README.md

Remove CLBlast instructions(outdated), added OpenBlas.

* don't assume git is installed

Added apt install git, so that git clone works

* removed OpenBlas

Linked to Linux build instructions

* fix typo

Remove word "run"

* correct style

Co-authored-by: slaren <slarengh@gmail.com>

* correct grammar

Co-authored-by: slaren <slarengh@gmail.com>

* delete reference to Android API

* remove Fdroid reference, link directly to Termux

Fdroid is not required

Co-authored-by: slaren <slarengh@gmail.com>

* Update README.md

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
@Jeximo
Contributor Author

Jeximo commented May 6, 2024

Tested with TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf using llama-bench: ./llama-bench -t 4 -p 512 -n 128, with --mmap 0 and --mmap 1.

Load from shared, -m /data/data/com.termux/files/home/storage/downloads/TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf

| model                          |       size |     params | backend    |    threads |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | pp 512     |     22.82 ± 0.27 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | tg 128     |     11.68 ± 0.23 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | pp 512     |     22.30 ± 0.09 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | tg 128     |     11.53 ± 0.21 |

build: 628b2991 (2794)

Load from ~/, ~/TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf

| model                          |       size |     params | backend    |    threads |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | pp 512     |     22.59 ± 0.22 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | tg 128     |     11.54 ± 0.08 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | pp 512     |     22.08 ± 0.08 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | tg 128     |     11.28 ± 0.25 |

build: 628b2991 (2794)

The results are nearly identical. TinyLlama (1.09 GiB) is probably too small to show a difference in this test; even mmap made no difference. I'll leave larger-model benchmarking to someone with a better device than mine.

teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 7, 2024
@gustrd
Contributor

gustrd commented May 11, 2024

Hey everyone,

As the original author of these README instructions, I have to admit that I now see how they might cause more confusion than clarity.

Just to clarify for future users: I've personally found CLBlast to be quite effective when used with llama.cpp, especially for certain model families like StableLM and OpenLlama (provided you're not offloading layers). In my experience, it has boosted prompt processing speed by roughly 40%.

However, it's important to note that while CLBlast does offer significant speed improvements, it's plagued by bugs. For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output. This is disappointing, considering the untapped potential of the GPUs nestled within our smartphones.

If there's any way I can assist, I'd like to offer a few insights based on my experimentation:

  • In my tests, OpenBLAS consistently outperforms a no-BLAS build, particularly for prompts exceeding 256 tokens (see the build sketch after this comment).

  • The tip regarding disabling mmap on Android devices is a game-changer. I hadn't been aware of it previously, and it substantially accelerates prompt processing. I strongly advocate for emphasizing this point in the README.

Here's hoping that Vulkan proves to be a more robust solution than OpenGL.
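For anyone who wants to try the OpenBLAS route from the first bullet above, a rough sketch using the Termux package and CMake options available around this time (flag names may have changed since):

```sh
pkg install libopenblas
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
# then compare pp numbers against a plain CPU build with llama-bench, as discussed earlier
```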

@mofosyne mofosyne added documentation Improvements or additions to documentation review complexity : low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix labels May 12, 2024
@gpokat

gpokat commented May 12, 2024

> Just to clarify for future users: I've personally found CLBlast to be quite effective when used with llama.cpp [...] However, it's important to note that while CLBlast does offer significant speed improvements, it's plagued by bugs. For many model families [...] it tends to produce nonsensical output.

Did your CLBlast experience involve running the corresponding tuners to achieve that speed on your device?
Just for reference: in my experience with CLBlast, the nonsensical inference was fixed once I ran and applied the tuners. However, on a low-end Android device the inference speed was the same as CPU-only, without any loss in output.

@gustrd
Contributor

gustrd commented May 13, 2024

> Did your CLBlast experience involve running the corresponding tuners to achieve that speed on your device? [...] In my experience with CLBlast, the nonsensical inference was fixed once I ran and applied the tuners.

No, I have not tried the tuners yet. Good idea; it's a nice experiment to do. Thanks for the suggestion!

@shibe2
Collaborator

shibe2 commented May 13, 2024

> For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output.

Is this specific to Android builds, or can it be reproduced on a PC too?

@gustrd
Contributor

gustrd commented May 13, 2024

> For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output.
>
> Is this specific to Android builds, or can it be reproduced on a PC too?

As far as I know, it only happens with Android builds. All my tests were conducted with Adreno GPUs on Snapdragon.
