Misc. bug: [Mac M4] llama-server cannot run in release-4409 but can run in 4406 #11083

Open
bobleer opened this issue Jan 5, 2025 · 0 comments

Name and Version

llama-b4409-bin-macos-arm64.zip
llama-b4406-bin-macos-arm64.zip

Operating systems

Mac

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

b4409 run log (aborts at startup with a dyld error):

/Users/liwenbo/Downloads/4409-llamacpp/bin/llama-server -m /Users/liwenbo/models/qwen/Qwen2.5-1.5B-Instruct.Q4_K_M.gguf
dyld[18622]: Library not loaded: @rpath/libllama.dylib
  Referenced from: <A6F705D2-0AC3-32BD-8CF2-3A55262E9195> /Users/liwenbo/Downloads/4409-llamacpp/bin/llama-server
  Reason: tried: '/Users/runner/work/llama.cpp/llama.cpp/build/src/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/src/libllama.dylib' (no such file), '/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/libllama.dylib' (no such file), '/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-blas/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-blas/libllama.dylib' (no such file), '/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-metal/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-metal/libllama.dylib' (no such file), '/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-rpc/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-rpc/libllama.dylib' (no such file), '/Users/runner/work/llama.cpp/llama.cpp/build/src/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/src/libllama.dylib' (no such file), '/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/libllama.dylib' (no such file), '/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-blas/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-blas/libllama.dylib' (no such file), '/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-metal/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-metal/libllama.dylib' (no such file), 
'/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-rpc/libllama.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/ggml-rpc/libllama.dylib' (no such file)
zsh: abort      /Users/liwenbo/Downloads/4409-llamacpp/bin/llama-server -m 
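
Every path dyld tried in the message above sits under /Users/runner/work/llama.cpp/llama.cpp/build/, which looks like the build tree of the CI machine rather than anything inside the unpacked archive. That would suggest the rpath baked into the b4409 binary was not adjusted for the packaged layout, but this is only a reading of the log, not a confirmed diagnosis. The sketch below is a small helper for checking that reading: it pulls the candidate paths out of a dyld "tried:" message and lists the unique directories. The function name `tried_directories` and the shortened `sample` string are illustrative, not part of llama.cpp or dyld.

```python
import re
from pathlib import PurePosixPath


def tried_directories(dyld_message: str) -> list[str]:
    """Return the unique directories dyld searched, in first-seen order."""
    # Each candidate appears as '<path>' (no such file) in the message.
    paths = re.findall(r"'([^']+)' \(no such file\)", dyld_message)
    seen: dict[str, None] = {}
    for p in paths:
        seen.setdefault(str(PurePosixPath(p).parent), None)
    return list(seen)


# Shortened excerpt of the b4409 message; the full log repeats the same pattern.
sample = (
    "Reason: tried: "
    "'/Users/runner/work/llama.cpp/llama.cpp/build/src/libllama.dylib' (no such file), "
    "'/System/Volumes/Preboot/Cryptexes/OS/Users/runner/work/llama.cpp/llama.cpp/build/src/libllama.dylib' (no such file), "
    "'/Users/runner/work/llama.cpp/llama.cpp/build/ggml/src/libllama.dylib' (no such file), "
    "'/Users/runner/work/llama.cpp/llama.cpp/build/src/libllama.dylib' (no such file)"
)

dirs = tried_directories(sample)
for d in dirs:
    print(d)

# True when every candidate directory lies under the CI runner's work tree.
print(all("/Users/runner/work/" in d for d in dirs))
```

On macOS, running `otool -l` on the llama-server binary lists its LC_RPATH entries, which should show directly whether these build-tree paths are the only ones recorded.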

b4406 run log (starts normally and listens on port 8080):

/Users/liwenbo/Downloads/4406-llamacpp/bin/llama-server -m /Users/liwenbo/models/qwen/Qwen2.5-1.5B-Instruct.Q4_K_M.gguf
build: 4406 (0da5d860) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
system info: n_threads = 4, n_threads_batch = 4, total_threads = 10

system_info: n_threads = 4 (n_threads_batch = 4) / 10 | Metal : EMBED_LIBRARY = 1 | BF16 = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 9
main: loading model
srv    load_model: loading model '/Users/liwenbo/models/qwen/Qwen2.5-1.5B-Instruct.Q4_K_M.gguf'
llama_load_model_from_file: using device Metal (Apple M4) - 10922 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 338 tensors from /Users/liwenbo/models/qwen/Qwen2.5-1.5B-Instruct.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 1.5B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-1...
llama_model_loader: - kv   6:                   general.base_model.count u32              = 1
llama_model_loader: - kv   7:                  general.base_model.0.name str              = Qwen2.5 1.5B
llama_model_loader: - kv   8:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv   9:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-1.5B
llama_model_loader: - kv  10:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  11:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  12:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  13:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  14:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv  15:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv  16:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  17:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  18:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                          general.file_type u32              = 15
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 1536
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8960
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1.5B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 1.54 B
llm_load_print_meta: model size       = 934.69 MiB (5.08 BPW) 
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: Metal_Mapped model buffer size =   934.70 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   182.57 MiB
.....................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4
ggml_metal_init: picking default device: Apple M4
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M4
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = true
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28
llama_kv_cache_init:      Metal KV buffer size =   112.00 MiB
llama_new_context_with_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:      Metal compute buffer size =   299.75 MiB
llama_new_context_with_model:        CPU compute buffer size =    11.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: (built-in), example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
^C

First Bad Commit

No response

Relevant log output

No response
