
Garbled Chinese output (中文乱码) #104

Open
linonetwo opened this issue Jul 19, 2023 · 0 comments

linonetwo commented Jul 19, 2023

Following up on the discussion in #100 (reply in thread): using the official website's example code with ggml-vic7b-q5_1.bin produces garbled output, while running the same model directly with llama.cpp does not.

function toUnicode(string_) {
  return string_.split('').map(function (value) {
    const code = value.charCodeAt(0).toString(16).toUpperCase();
    if (code.length > 2) {
      // Pad to 4 hex digits so the escape matches unicodeToChar's \uXXXX regex
      return '\\u' + code.padStart(4, '0');
    }
    return value;
  }).join('');
}
function unicodeToChar(text) {
  return text.replace(/\\u[\dA-F]{4}/gi, function (match) {
    return String.fromCharCode(parseInt(match.replace(/\\u/g, ''), 16));
  });
}

Looking into it, the garbled characters in the output are all \uFFFD\uFFFD\uFFFD, which display as U+FFFD REPLACEMENT CHARACTER: https://www.fileformat.info/info/unicode/char/fffd/index.htm
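The U+FFFD symptom is exactly what you get when each byte of a multi-byte UTF-8 sequence is decoded on its own instead of as a whole. A minimal Node.js reproduction (using the standard `TextDecoder`, unrelated to llama-node's internals):

```javascript
// 灵 is encoded as the three bytes 0xE7 0x81 0xB5 in UTF-8.
const bytes = [0xe7, 0x81, 0xb5];

// Decoding each byte independently: every piece is an incomplete
// sequence, so each one becomes U+FFFD REPLACEMENT CHARACTER.
const perByte = bytes
  .map((b) => new TextDecoder('utf-8').decode(new Uint8Array([b])))
  .join('');
console.log(perByte); // "���" (three U+FFFD)

// Decoding the complete sequence at once recovers the character.
const whole = new TextDecoder('utf-8').decode(new Uint8Array(bytes));
console.log(whole); // "灵"
```

This matches the output above, where each Chinese character that fails appears as a run of replacement characters.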

% node "/Users/linonetwo/Desktop/repo/TiddlyGit-Desktop/scripts/tryllm.mjs"
llama.cpp: loading model from /Users/linonetwo/Documents/languageModel/ggml-vic7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 6612.59 MB (+ 2052.00 MB per state)
.
llama_init_from_file: kv self size  = 1024.00 MB
[Wed, 19 Jul 2023 05:53:18 +0000 - INFO - llama_node_cpp::context] - AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
[Wed, 19 Jul 2023 05:53:18 +0000 - INFO - llama_node_cpp::llama] - tokenized_stop_prompt: None
 Tiddlywiki是一种非常���活的单页网站,���可以用于记录任何类型的信息,包���文本、���加文件、URL等等。���的特点是:

1. 非常���活:Tiddlywiki可以根据自���的需求进行定制,可以添加任何类型的元素。
2. ���于使用:Tiddlywiki的界面非常���单,只需要点击一下���可开始编���。
3. 可���展性���:Tiddlywiki可以通过���件���展其功能,可以���展到任何需要。
4. 可���性���:Tiddlywiki使用文本文件存���数据,可以在任何时间和任何地方���问。

���的来说,Tiddlywiki是一种非常实用的工���,可以用于���种场景,包���个人用户、学生、工作人员等等。

<end>

Using https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/ you can see that in "Tiddlywiki是一种非常灵活的单页网站", the character 灵 should be tokenized into the byte tokens <0xE7><0x81><0xB5>, but those bytes are apparently not being reassembled correctly.
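If that is the cause, one possible fix is to decode token bytes through a single streaming `TextDecoder`, which buffers an incomplete trailing sequence until the next token arrives instead of emitting U+FFFD. This is only a sketch: `decodeTokenBytes` and the byte-array-per-token input shape are hypothetical, and the real fix would go wherever llama-node converts token bytes to JS strings.

```javascript
// Hypothetical helper: decode a sequence of byte-level tokens as one
// UTF-8 stream. `chunks` is an array of byte arrays, one per token.
function decodeTokenBytes(chunks) {
  const decoder = new TextDecoder('utf-8');
  let out = '';
  for (const chunk of chunks) {
    // { stream: true } keeps incomplete trailing bytes buffered
    // inside the decoder instead of replacing them with U+FFFD.
    out += decoder.decode(new Uint8Array(chunk), { stream: true });
  }
  out += decoder.decode(); // flush any remaining buffered bytes
  return out;
}

// 灵 (0xE7 0x81 0xB5) split across three tokens still decodes correctly:
console.log(decodeTokenBytes([[0xe7], [0x81], [0xb5]])); // "灵"
```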

[Screenshot attached: 截屏2023-07-19 14 55 53]