Uncaught SIGSEGV (SEGV_MAPERR) with Meta-Llama-3-8B-Instruct.Q2_K #378

Lathanao opened this issue Apr 27, 2024 · 0 comments

With:

./Meta-Llama-3-8B-Instruct.Q2_K.llamafile -ngl 9999

I get this error at the first prompt, whatever I prompt:

```
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
get_rocm_bin_path: note: rocminfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/rocminfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/rocminfo does not exist
get_amd_offload_arch_flag: warning: can't find hipInfo/rocminfo commands for AMD GPU detection
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=native -march=native -mtune=native -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/yo/.llamafile/ggml-rocm.so.vadb38 /home/yo/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
hipcc: No such file or directory
extract_cuda_dso: note: prebuilt binary /zip/ggml-rocm.so not found
link_cuda_dso: note: dynamically linking /home/yo/.llamafile/ggml-cuda.so
ggml_cuda_link: welcome to CUDA SDK with cuBLAS
link_cuda_dso: GPU support loaded
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2839,"msg":"build info","tid":"8545344","timestamp":1714201880}
{"function":"server_cli","level":"INFO","line":2842,"msg":"system info","n_threads":6,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"8545344","timestamp":1714201880,"total_threads":12}
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from Meta-Llama-3-8B-Instruct.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 10
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q2_K:  129 tensors
llama_model_loader: - type q3_K:   64 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 2.95 GiB (3.16 BPW) 
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   164.39 MiB
llm_load_tensors:      CUDA0 buffer size =  2859.99 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =   258.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.00 MiB
llama_new_context_with_model: graph nodes  = 1060
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":481,"msg":"initializing slots","n_slots":1,"tid":"8545344","timestamp":1714201882}
{"function":"initialize","level":"INFO","line":490,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"8545344","timestamp":1714201882}
{"function":"server_cli","level":"INFO","line":3060,"msg":"model loaded","tid":"8545344","timestamp":1714201882}

llama server listening at http://127.0.0.1:8080

opening browser tab... (pass --nobrowser to disable)
{"function":"server_cli","hostname":"127.0.0.1","level":"INFO","line":3183,"msg":"HTTP server listening","port":"8080","tid":"8545344","timestamp":1714201882}
{"function":"validate_model_chat_template","level":"ERR","line":470,"msg":"The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses","tid":"8545344","timestamp":1714201882}
{"function":"update_slots","level":"INFO","line":1619,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"8545344","timestamp":1714201882}
Opening in existing browser session.
{"function":"log_server_request","level":"INFO","line":2764,"method":"GET","msg":"request","params":{},"path":"/","remote_addr":"127.0.0.1","remote_port":54622,"status":200,"tid":"17594341382800","timestamp":1714201882}
{"function":"log_server_request","level":"INFO","line":2764,"method":"GET","msg":"request","params":{},"path":"/completion.js","remote_addr":"127.0.0.1","remote_port":54628,"status":200,"tid":"17594341384672","timestamp":1714201882}
{"function":"log_server_request","level":"INFO","line":2764,"method":"GET","msg":"request","params":{},"path":"/json-schema-to-grammar.mjs","remote_addr":"127.0.0.1","remote_port":54636,"status":200,"tid":"17594335595984","timestamp":1714201882}
{"function":"log_server_request","level":"INFO","line":2764,"method":"GET","msg":"request","params":{},"path":"/index.js","remote_addr":"127.0.0.1","remote_port":54622,"status":200,"tid":"17594341382800","timestamp":1714201882}
parse: error parsing grammar: expecting ::= at me how to draw a the mount Fuji. Detailed art, no color, clear weather.
Then, do:
- Create a html skeleton
- Add a canvas HTML tag in the middle.
Then
- Read again your previous answer where your listed all steps to draw the mount Fuji.
- For every steps, create the perfect code to draw exactly what is describe in the step.
- Check if the draw is beautiful or not.
llama_sampling_init: failed to parse grammar
{"function":"launch_slot_with_data","level":"INFO","line":871,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"8545344","timestamp":1714201886}

error: Uncaught SIGSEGV (SEGV_MAPERR) at 0x128 on Yocom pid 40372 tid 40372
  ./Meta-Llama-3-8B-Instruct.Q2_K.llamafile
  No such file or directory
  Linux Cosmopolitan 3.3.3 MODE=x86_64; #1 SMP PREEMPT_DYNAMIC Wed Apr 10 20:11:08 UTC 2024 Yocom 6.6.26-1-MANJARO

RAX 00001000814da5d0 RBX 00000000000007ec RDI 0000000000000000
RCX 0000000000000000 RDX 00000000000007ec RSI 0000100086c90010
RBP 00007ffd730c44f0 RSP 00007ffd730c4480 RIP 000000000056eaf0
 R8 0000100080040000  R9 00001000814da900 R10 00001000814e9360
R11 0000000000000080 R12 0000000000000000 R13 0000000000000000
R14 00007ffd730c7ce8 R15 00007ffd730c7f10
TLS 0000000000704e40

XMM0  00000000000000000000000000000000 XMM8  00007fbc3662301800007fbc36623020
XMM1  00001000814da2a000001000814da2a0 XMM9  00007fbc3662302800007fbc36623030
XMM2  222c3137383a22656e696c222c224f46 XMM10 00007fbc3662303800007fbc36623040
XMM3  4e49223a226c6576656c222c22617461 XMM11 00007fbc3662304800007fbc36623050
XMM4  6e656577746562206e6f697461737265 XMM12 00007fbc3662305800007fbc36623060
XMM5  5f6b736174222c303a2264695f746f6c XMM13 00007fbc3662306800007fbc36623070
XMM6  61645f687469775f746f6c735f68636e XMM14 00007fbc3662307800007fbc36623080
XMM7  75616c223a226e6f6974636e7566227b XMM15 00000000000000000000000000000000

cosmoaddr2line /media/Qemu/Model/Meta-Llama-3-8B-Instruct.Q2_K.llamafile 56eaf0 48c6a3 48abc5 43fadc 401b81 410a03 4015fb

0x000000000056eaf0: ?? ??:0
0x000000000048c6a3: ?? ??:0
0x000000000048abc5: ?? ??:0
0x000000000043fadc: ?? ??:0
0x0000000000401b81: ?? ??:0
0x0000000000410a03: ?? ??:0
0x00000000004015fb: ?? ??:0

10008004-10008009 rw-pa-      6x automap 384kB w/ 128kB hole
1000800c-1000800f rw-Sa-      4x automap 256kB
10008010-1000801f rw-pa-     16x automap 1024kB
10008020-1000803f rw-Sa-     32x automap 2048kB
10008040-10008077 rw-pa-     56x automap 3584kB
10008078-10008083 rw-Sa-     12x automap 768kB w/ 8256kB hole
10008105-100086cb rw-pa-  1'479x automap 92mB w/ 3328kB hole
10008700-100087e6 rw-pa-    231x automap 14mB w/ 1927mB hole
1001005a-1001bef9 r--s-- 48'800x automap 3050mB w/ 1040mB hole
10020000-1002bd85 r--s-- 48'518x automap 3032mB w/ 96tB hole
6fc00004-6fc00004 rw-paF      1x nsync 64kB w/ 64gB hole
6fd00004-6fd0000f rw-paF     12x zipos 768kB w/ 64gB hole
6fe00004-6fe00004 rw-paF      1x g_fds 64kB
# 6198mB total mapped memory
./Meta-Llama-3-8B-Instruct.Q2_K.llamafile -m Meta-Llama-3-8B-Instruct.Q2_K.gguf -ngl 9999
zsh: segmentation fault (core dumped)  ./Meta-Llama-3-8B-Instruct.Q2_K.llamafile -ngl 9999
```
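For context on the `parse: error parsing grammar: expecting ::=` lines above: the server's GBNF grammar parser appears to have received my prompt text where it expected grammar productions of the form `name ::= ...`, and the segfault happens right after `llama_sampling_init: failed to parse grammar`. As a rough illustration only (a hedged sketch, assuming the llama.cpp-style `/completion` JSON API that the embedded server exposes, with a hypothetical `root ::= ...` rule), a request with a well-formed grammar would look like this:

```bash
# Hedged sketch, assuming the llama.cpp-style /completion JSON API of the embedded server.
# The "grammar" field must contain GBNF productions ("name ::= ..."); plain prose in that
# field would trigger the same "expecting ::=" parse error shown in the log above.
curl http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{
        "prompt": "Explain, step by step, how to draw Mount Fuji.",
        "n_predict": 64,
        "grammar": "root ::= [a-zA-Z0-9 .,]+"
      }'
```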


Llamafile --version:

```bash
llamafile v0.8.1
```

CUDA version:

Version         : 12.3.2-1
Description     : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL             : https://developer.nvidia.com/cuda-zone
Licenses        : LicenseRef-NVIDIA-CUDA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk  libcudart.so=12-64  libcublas.so=12-64  libcusolver.so=11-64  libcusparse.so=12-64

My machine:

System:
  Kernel: 6.6.26-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
  Desktop: GNOME v: 45.4 tk: GTK v: 3.24.41 Distro: Manjaro
    base: Arch Linux
Machine:
  Type: Laptop System: HP product: HP Pavilion Gaming Laptop 15-cx0xxx
Memory:
  System RAM: total: 32 GiB available: 31.24 GiB used: 4.16 GiB (13.3%)
CPU:
  Info: model: Intel Core i7-8750H bits: 64 type: MT MCP arch: Coffee Lake
    gen: core 8 level: v3 note: 
Graphics:
  Device-2: NVIDIA GP107M [GeForce GTX 1050 Ti Mobile]
    vendor: Hewlett-Packard driver: nvidia v: 550.67
    alternate: nouveau,nvidia_drm non-free: 545.xx+ status: current (as of
    2024-04; EOL~2026-12-xx) arch: Pascal code: GP10x process: TSMC 16nm
    built: 2016-2021 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 link-max: gen: 3
    speed: 8 GT/s bus-ID: 01:00.0 chip-ID: 10de:1c8c class-ID: 0300