
Multiple issues with setting up the text chatbot service on SPR #1300

Open
tbykowsk opened this issue Feb 22, 2024 · 5 comments
@tbykowsk (Member)

Hi,

I have followed the instructions in intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/docs/notebooks/setup_text_chatbot_service_on_spr.ipynb (at main in intel/intel-extension-for-transformers on github.com) and written down a couple of issues, with potential solutions, which you may want to consider implementing.

I am using Ubuntu 22.04 LTS and Python 3.10.12.

  1. Setup backend / Setup environment
    !git clone https://github.com/intel/intel-extension-for-transformers.git

The instructions say to use HEAD of the main branch, even though the repository is being actively developed. This causes problems like "422 Unprocessable Entity using Neural Chat via OpenAI interface with meta-llama/llama-2-7b-chat-hf" (issue #1288 in intel/intel-extension-for-transformers).

I have also encountered the aforementioned issue and then decided to use the latest release, which is v1.3.1.
It would be useful to add information to the instructions about the commit/release they were validated with.

I continued with v1.3.1 from this point on.
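A pinned, shallow clone could be used instead of HEAD, for example (a sketch; adjust the tag to whichever release the notebook gets validated against):

    !git clone --branch v1.3.1 --depth 1 https://github.com/intel/intel-extension-for-transformers.git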

  2. Setup backend / Setup environment
    %cd ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/
    !pip install -r requirements.txt

The requirements.txt installs torch==2.1.0, but once the backend server is started, it complains about torch compatibility:

ERROR! Intel® Extension for PyTorch* needs to work with PyTorch 2.2.*, but PyTorch 2.1.0+cu121 is found. Please switch to the matching version and run again.

The work-around is to reinstall torch and its dependencies manually to get compatible versions:

pip uninstall torch torchaudio torchvision xformers -y
pip install torch torchaudio torchvision xformers
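If the unpinned reinstall ever starts pulling in a different major version, explicitly pinning the 2.2 line should also satisfy Intel Extension for PyTorch (a sketch based on the error message above; the exact patch release may differ):

    pip uninstall torch torchaudio torchvision xformers -y
    pip install "torch==2.2.*" torchaudio torchvision xformers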

  3. Setup backend / Setup environment
    !pip install nest_asyncio

It is a bit confusing that nest_asyncio has to be installed manually and is not listed in the backend's requirements.txt.
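Until it is added upstream, appending it to the local copy of requirements.txt keeps the setup in one place (a sketch):

    !echo "nest_asyncio" >> requirements.txt
    !pip install -r requirements.txt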

  4. Deploy frontend on your server / Install the required Python dependencies
    !pip install -r ./examples/deployment/textbot/frontend/requirements.txt

The requirements.txt again installs torch==2.1.0, which makes the backend unusable. Please consider using compatible packages for both components, or suggest in the instructions creating a separate Python virtual environment for each component when the same host is used (see the sketch below).
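One way to keep the two dependency sets from clobbering each other, run from a shell on the host and assuming the working directory is the neural_chat folder (a sketch, not the notebook's own flow):

    # isolated environment for the backend
    python3 -m venv backend_venv
    backend_venv/bin/pip install -r requirements.txt

    # isolated environment for the frontend
    python3 -m venv frontend_venv
    frontend_venv/bin/pip install -r ./examples/deployment/textbot/frontend/requirements.txt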

  5. Deploy frontend on your server / Run the frontend
    !nohup python app.py &

There is an issue with fastchat.utils when starting app.py:

Traceback (most recent call last):
File "[…]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/app.py", line 38, in
from fastchat.utils import (
ImportError: cannot import name 'violates_moderation' from 'fastchat.utils' ([…]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/fastchat/utils.py)

I have worked around this by removing the reference to violates_moderation in app.py, but you may want to investigate the problem further.

  6. Deploy frontend on your server / Run the frontend
    !nohup python app.py &

There is an issue with the gradio package which occurs when a NeuralChat URL is loaded in a browser:

2024-02-21 13:18:12 | INFO | gradio_web_server | Models: ['Intel/neural-chat-7b-v3-1']
2024-02-21 13:18:13 | ERROR | stderr | sys:1: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
2024-02-21 13:18:13 | INFO | stdout | Running on local URL: http://0.0.0.0:8080/
2024-02-21 13:18:13 | INFO | stdout |
2024-02-21 13:18:13 | INFO | stdout | To create a public link, set share=True in launch().
2024-02-21 13:18:48 | INFO | stdout | 1 validation error for PredictBody
2024-02-21 13:18:48 | INFO | stdout | event_id
2024-02-21 13:18:48 | INFO | stdout | Field required [type=missing, input_value={'data': [{}], 'event_dat...on_hash': 'w9eie3cvduh'}, input_type=dict]
2024-02-21 13:18:48 | INFO | stdout | For further information visit https://errors.pydantic.dev/2.6/v/missing

The solution to this issue is updating gradio to at least version 3.50.2.
The incompatible version is installed by the source code of app.py itself, so this line has to be changed (see the sketch below):

os.system("pip install gradio==3.36.0")

  7. Deploy frontend on your server / Run the frontend
    !nohup python app.py &

After the NeuralChat URL successfully loads in the browser, the chat replies only with:

NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE. (error_code: 4)

It is caused by an error in the backend:

ModuleNotFoundError: No module named 'neural_speed'

To fix this, one may add neural_speed to intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/requirements.txt, or install the package manually.
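A manual install, assuming the module is published on PyPI under the name neural-speed:

    pip install neural-speed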

  8. Just wondering why the frontend is able to handle Intel/neural-chat-7b-v3-1 and meta-llama/Llama-2-7b-chat-hf, but fails with, for example, meta-llama/Llama-2-13b-chat-hf:

2024-02-21 13:39:08 | ERROR | stderr | Traceback (most recent call last):
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/gradio/queueing.py", line 407, in call_prediction
2024-02-21 13:39:08 | ERROR | stderr | output = await route_utils.call_process_api(
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/gradio/route_utils.py", line 226, in call_process_api
2024-02-21 13:39:08 | ERROR | stderr | output = await app.get_blocks().process_api(
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/gradio/blocks.py", line 1550, in process_api
2024-02-21 13:39:08 | ERROR | stderr | result = await self.call_function(
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/gradio/blocks.py", line 1199, in call_function
2024-02-21 13:39:08 | ERROR | stderr | prediction = await utils.async_iteration(iterator)
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/gradio/utils.py", line 519, in async_iteration
2024-02-21 13:39:08 | ERROR | stderr | return await iterator.__anext__()
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/gradio/utils.py", line 512, in __anext__
2024-02-21 13:39:08 | ERROR | stderr | return await anyio.to_thread.run_sync(
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
2024-02-21 13:39:08 | ERROR | stderr | return await get_async_backend().run_sync_in_worker_thread(
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
2024-02-21 13:39:08 | ERROR | stderr | return await future
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
2024-02-21 13:39:08 | ERROR | stderr | result = context.run(func, *args)
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/gradio/utils.py", line 495, in run_sync_iterator_async
2024-02-21 13:39:08 | ERROR | stderr | return next(iterator)
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/veee/lib/python3.10/site-packages/gradio/utils.py", line 649, in gen_wrapper
2024-02-21 13:39:08 | ERROR | stderr | yield from f(*args, **kwargs)
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/app.py", line 331, in http_bot
2024-02-21 13:39:08 | ERROR | stderr | new_state = get_conv_template(model_name.split('/')[-1])
2024-02-21 13:39:08 | ERROR | stderr | File "[...]/intel-extension-for-transformers-1.3.1/intel_extension_for_transformers/neural_chat/ui/gradio/basic/./conversation.py", line 300, in get_conv_template
2024-02-21 13:39:08 | ERROR | stderr | return conv_templates[name].copy()
2024-02-21 13:39:08 | ERROR | stderr | KeyError: 'Llama-2-13b-chat-hf'

The backend seems to load meta-llama/Llama-2-13b-chat-hf correctly. Maybe enabling other models from the same family in the frontend would not require many changes.
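A quick way to check which template keys the frontend actually registers (a diagnostic sketch, run from the repository root and assuming the fastchat-style conversation.py visible in the traceback above):

    grep -nE "Llama-2|neural-chat" intel_extension_for_transformers/neural_chat/ui/gradio/basic/conversation.py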

Thank you for taking the time to read through all this text :)

@lvliang-intel (Collaborator)

@tbykowsk,
Apologies for the delayed response regarding this issue. I hadn't realized it was assigned to me.
You can refer to the quick start example for setting up the chatbot service. I've updated the notebook via PR 1482; you can also try the notebook from that branch.
If you have any questions, please feel free to ask, and you can reach me directly on Teams, which should be more efficient.
Thanks.

@tbykowsk (Member, Author)

Thank you @lvliang-intel,

I can confirm that issues 1-7 for the instructions in notebooks/setup_text_chatbot_service_on_spr.ipynb are resolved.
The 8th issue involves a model which is still not supported, so maybe it is beyond the scope of this project.

However, the instructions lack information about where to execute the step "Startup the backend server". A line could be added stating that the server has to be run from the main directory of the repository (./intel-extension-for-transformers). The previous step changes the current directory to ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/, and the server reports missing dependencies if started from there; see the example below.
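For example, a cell like this could precede "Startup the backend server" (a sketch, assuming the previous cell left the working directory at the neural_chat folder):

    %cd ../..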
For clarity, a full path could be provided again in the step "Deploy frontend on your server":
%cd ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/ui/gradio/basic

========================================================================

I also followed the quick start example you referred to, focusing on the steps for CPU.

Sections "1. Setup Environment" and "2. Run the chatbot in command mode" executed without any issues.
In the section "3. Run chatbot in server mode with UI", at step "3.1.1 Verify the client connection to server is OK.", I got this error:

$ curl -vv -X POST http://127.0.0.1:8000/v1/chat/completions
* Uses proxy env variable no_proxy == '10.0.0.0/8,localhost,.local,127.0.0.0/8'
*   Trying 127.0.0.1:8000...
* Connected to 127.0.0.1 (127.0.0.1) port 8000 (#0)
> POST /v1/chat/completions HTTP/1.1
> Host: 127.0.0.1:8000
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 422 Unprocessable Entity
< date: Fri, 19 Apr 2024 07:34:50 GMT
< server: uvicorn
< content-length: 81
< content-type: application/json
<
* Connection #0 to host 127.0.0.1 left intact
{"detail":[{"loc":["body"],"msg":"field required","type":"value_error.missing"}]}

Output from the server:

$ python chatbot_server.py
[…]
2024-04-19 07:28:28,318 - root - INFO - Model loaded.
Loading config settings from the environment...
INFO:     Started server process [935582]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000/ (Press CTRL+C to quit)
INFO:     127.0.0.1:53466 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity

However, the next step "3.1.2 Test request command at client side" executed correctly.
Also sections "3.2 Set up Server mode UI" and "3.3 Start the web service" did not return any errors.

The main issue I noticed with this quick start example is that NeuralChat replies over 10 times more slowly than NeuralChat from the previous instructions in notebooks/setup_text_chatbot_service_on_spr.ipynb.

Both times I used the same question:

Tell me about Intel Xeon Scalable Processors.

and received very similar answers. NeuralChat from the quick start example was generating a response for around 13 minutes, whereas NeuralChat from the previous instruction took only around 50 seconds. I ran this experiment on the same platform.

Log from the quick start example:

2024-04-19 08:27:34 | INFO | gradio_web_server | ==== request ====
{'model': 'Intel/neural-chat-7b-v3-1', 'messages': [{'role': 'system', 'content': '### System:\n    - You are a helpful assistant chatbot trained by Intel.\n    - You answer questions.\n    - You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.\n    - You are more than just an information source, you are also able to write poetry, short stories, and make jokes.</s>\n'}, {'role': 'user', 'content': 'Tell me about Intel Xeon Scalable Processors.'}], 'temperature': 0.001, 'top_p': 0.95, 'max_tokens': 512, 'stream': True}
2024-04-19 08:40:42 | INFO | gradio_web_server | Intel Xeon Scalable Processors, also known as the "Cascade Lake" family, are a series of high-performance central processing units (CPUs) designed for data centers, cloud computing, and other demanding workloads. These processors are built upon Intel's latest 14nm process technology and feature a modular design that allows for flexible configurations to meet various needs. The Xeon Scalable Processors offer significant improvements in performance, efficiency, and scalability compared to their predecessors. They are designed to handle a wide range of tasks, from general-purpose computing to specialized applications like artificial intelligence, high-performance computing, and virtualization. Some key features of the Intel Xeon Scalable Processors include: 1. Scalable architecture: The modular design allows for customization, enabling users to choose the right combination of cores, memory, and I/O capabilities to match their specific requirements. 2. High-performance cores: The processors feature Intel's latest high-performance cores, which deliver increased performance and efficiency for a variety of workloads. 3. Advanced security features: The Xeon Scalable Processors come with built-in security features, such as Intel Software Guard Extensions (SGX), which help protect sensitive data and code from unauthorized access. 4. Enhanced memory support: The processors support up to 4TB of memory, enabling them to handle large datasets and complex workloads efficiently. 5. Improved I/O capabilities: The Xeon Scalable Processors offer advanced I/O capabilities, including support for PCIe 3.0 and 4.0, which can significantly improve data transfer speeds and overall system performance. In summary, the Intel Xeon Scalable Processors are a powerful and versatile family of CPUs designed to meet the demands of modern data centers and high-performance computing environments. They offer a range of features that make them suitable for various applications, from general-purpose computing to specialized tasks like artificial intelligence and virtualization.

Log from the previous instruction:

2024-04-19 08:51:33 | INFO | gradio_web_server | ==== request ====
{'model': 'Intel/neural-chat-7b-v3-1', 'messages': [{'role': 'system', 'content': '### System:\n    - You are a helpful assistant chatbot trained by Intel.\n    - You answer questions.\n    - You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.\n    - You are more than just an information source, you are also able to write poetry, short stories, and make jokes.</s>\n'}, {'role': 'user', 'content': 'Tell me about Intel Xeon Scalable Processors.'}], 'temperature': 0.001, 'top_p': 0.95, 'max_tokens': 512, 'stream': True}
2024-04-19 08:51:33 | INFO | httpx | HTTP Request: POST http://127.0.0.1:8000/v1/chat/completions "HTTP/1.1 200 OK"
[…]
2024-04-19 08:52:24 | INFO | httpx | HTTP Request: POST http://localhost:8080/api/predict "HTTP/1.1 200 OK"
2024-04-19 08:52:24 | INFO | gradio_web_server | Intel Xeon Scalable Processors, also known as the "Cascade Lake" family, are a series of high-performance central processing units (CPUs) designed for data centers, cloud computing, and other demanding workloads. These processors are built upon Intel's latest 14nm process technology and feature a modular design that allows for flexible configurations to meet various needs. The Xeon Scalable Processors offer significant improvements in performance, efficiency, and scalability compared to their predecessors. They are designed to handle a wide range of tasks, from general-purpose computing to specialized applications like artificial intelligence, high-performance computing, and virtualization. Some key features of the Intel Xeon Scalable Processors include: 1. Scalable architecture: The modular design allows for customization, enabling users to choose the right combination of cores, memory channels, and other features to optimize performance for their specific workloads. 2. High-speed memory support: The processors support up to 6 memory channels, enabling faster data access and improved performance for memory-intensive applications. 3. Advanced security features: The Xeon Scalable Processors come with built-in security features like Intel Software Guard Extensions (SGX) and Intel Trusted Execution Technology (TXT) to protect sensitive data and prevent unauthorized access. 4. Enhanced virtualization capabilities: The processors are designed to support multiple virtual machines, allowing for efficient resource utilization and improved performance in virtualized environments. 5. Improved power efficiency: The Xeon Scalable Processors are designed to optimize power consumption, reducing operational costs and minimizing environmental impact. In summary, the Intel Xeon Scalable Processors are a powerful and versatile family of CPUs that cater to the needs of various industries and applications. They offer scalability, high performance, and advanced security features, making them a popular choice for data centers and other demanding computing environments.

Please notice the time difference between the question and reply. Are there any optimizations missing in the quick start example?

========================================================================

I have also read the main NeuralChat Readme for completeness.

In the step Launch OpenAI-compatible Service I was not able to run
neuralchat_server start --config_file ./server/config/neuralchat.yaml
even though I tried from different directories within the repository. The application always ended up with this error:

[...]
2024-04-19 10:19:19,158 - root - INFO - Model loaded.
Loading config settings from the environment...
[2024-04-19 10:19:19,192] [   ERROR] - Failed to start server.
[2024-04-19 10:19:19,192] [   ERROR] - No module named 'langchain_community'

I received the same error while running the example Python code from the ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat directory:

[...]
    if self.init(config):
  File "[…]/intel-extension-for-transformers/my2_venv/lib/python3.10/site-packages/intel_extension_for_transformers/neural_chat/server/neuralchat_server.py", line 326, in init
    from .restful.api import setup_router
  File "[…]/intel-extension-for-transformers/my2_venv/lib/python3.10/site-packages/intel_extension_for_transformers/neural_chat/server/restful/api.py", line 26, in <module>
    from .retrieval_api import router as retrieval_router
  File "[…]/intel-extension-for-transformers/my2_venv/lib/python3.10/site-packages/intel_extension_for_transformers/neural_chat/server/restful/retrieval_api.py", line 38, in <module>
    from ...pipeline.plugins.retrieval.parser.context_utils import clean_filename
  File "[…]/intel-extension-for-transformers/my2_venv/lib/python3.10/site-packages/intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/context_utils.py", line 21, in <module>
    from langchain_community.document_loaders import UnstructuredMarkdownLoader
ModuleNotFoundError: No module named 'langchain_community'

The Python code only worked when executed from the root directory of the repository (./intel-extension-for-transformers). Once the server started, I was able to connect to it using all methods from the Access the Service section.
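As an alternative to changing directories, installing the missing dependency directly should avoid that particular import error (a sketch; the langchain_community module is provided by the langchain-community distribution on PyPI):

    pip install langchain-community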

Is there a configuration step missing in this Readme (analogous to the one in the first instructions) regarding where the neuralchat_server executable or the Python code has to be run from? I create a separate Python virtual environment for each experiment to keep the machine tidy. Could there be something specific to my execution environment?
Thanks!

@tbykowsk (Member, Author)

Quick update: I have taken yesterday's release, v1.4.1, and can confirm that both path issues are resolved (for the instructions in notebooks/setup_text_chatbot_service_on_spr.ipynb and for the main NeuralChat Readme).
The sample code of the backend server runs without any errors from the ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat directory, and so does the neuralchat_server executable.

The performance issue with the quick start example is also fixed. The only remaining problem is in step "3.1.1 Verify the client connection to server is OK", unless receiving 422 Unprocessable Entity there is expected. Thanks!

@tbykowsk (Member, Author)

tbykowsk commented May 7, 2024

@lvliang-intel,
could you please clarify the expected outcome of step 3.1.1 Verify the client connection to server is OK? Receiving error 422 Unprocessable Entity when following the instructions might be misleading.

@letonghan (Contributor)

letonghan commented Jun 3, 2024

Hi @tbykowsk,
The 422 Unprocessable Entity output in step "3.1.1 Verify the client connection to server is OK" is expected.

Further explanation

  • 422 Unprocessable Entity means that you are able to connect to the backend and receive an HTTP status code.
  • It happens because there is no -d payload in this curl command, so the backend does not receive the expected data format. That is also why the curl command in 3.1.2 works (see the sketch below).
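A sketch of a request with a body that the endpoint should accept, mirroring the request logged by the gradio web server above (the exact field set may differ between releases):

    curl -X POST http://127.0.0.1:8000/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{"model": "Intel/neural-chat-7b-v3-1", "messages": [{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}]}'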
