Device does not exist / is not supported error with neuralchat deploy_chatbot_on_xpu notebook #1276

Open
brent-elliott opened this issue Feb 14, 2024 · 6 comments

@brent-elliott

Problem Summary and status of similar tests

I am having trouble getting neuralchat to work with my Intel Data Center Flex 170 GPU. Below is my procedure with the build_chatbot_on_xpu Jupyter notebook with a clean environment. I have tried this procedure multiple times and also attempted to follow different instructions from different sources but have the same outcome each time. When I get to the point of running the inference, I get either “Device does not exist” when I stick with the default device reference xpu or “Device is not supported” if I use xpu:0. I have tried this with several different Python versions, but use 3.9 below.

I have BigDL operational on this XPU and system (in a separate environment and not running during these tests below). I have also successfully used the deploy_chatbot_on_icx notebook (again in a separate environment and not running at the same time) using similar tweaks as outlined below to address missing dependencies in requirements.txt in my environment.

I also tried to get deploy_chatbot_on_xpu working (below I focus on build_chatbot_on_xpu). As long as I bring over the code from deploy_chatbot_on_cpu (to address the error relating to asyncio), I can run the server, but I again get “Device does not exist” with device='xpu' and “Device is not supported” with device='xpu:0'.

I am hoping to get feedback on what I am doing wrong so that I can operate neural chat and successfully employ the OpenAI APIs.

Installation Procedure

Install Data Center GPU Drivers – per https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps

Prepare clean environment

conda create -n jupyter2 python=3.9
conda activate jupyter2

Install OneAPI for PyTorch 2.1 with apt installer – per https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html

wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install -y intel-basekit

Add the following to ~/.bashrc and source ~/.bashrc

# Required step for APT or offline installed oneAPI. Configure oneAPI environment variables. Skip this step for pip-installed oneAPI since LD_LIBRARY_PATH has already been configured.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

Install Intel Extension for PyTorch – https://intel.github.io/intel-extension-for-pytorch/index.html#installation
Choose GPU, v2.1.10+xpu, Linux, pip

sudo apt install -y intel-oneapi-dpcpp-cpp-2024.0 intel-oneapi-mkl-devel=2024.0.0-49656	# nothing is updated since the newest version is already installed from above
python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

Preparation for Sanity Check

source {DPCPPROOT}/env/vars.sh
source {MKLROOT}/env/vars.sh

Since these folders were not explicitly described in the documentation, I assumed it should be the following two commands

source /opt/intel/oneapi/dpcpp-ct/2024.0/env/vars.sh
source /opt/intel/oneapi/mkl/2024.0/env/vars.sh
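
The candidate scripts can be listed instead of guessed. This is a minimal sketch, assuming the default APT install root /opt/intel/oneapi and the `<component>/<version>/env/vars.sh` layout implied by the two paths above. `find_env_scripts` is a hypothetical helper and not part of oneAPI.

```python
from pathlib import Path


def find_env_scripts(root="/opt/intel/oneapi"):
    """Return every <component>/<version>/env/vars.sh found under root, sorted."""
    base = Path(root)
    if not base.is_dir():
        return []  # oneAPI is not installed at this (assumed) location
    return sorted(str(p) for p in base.glob("*/*/env/vars.sh"))


if __name__ == "__main__":
    for script in find_env_scripts():
        print(script)
```

The output only shows which components ship an env script; which of them the sanity check actually needs is still a question for the IPEX documentation.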

Run Sanity Check

python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

Sanity Check Response

2.1.0a0+cxx11.abi
2.1.10+xpu
[0]: _DeviceProperties(name='Intel(R) Data Center GPU Flex 170', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=13535MB, max_compute_units=512, gpu_eu_count=512)
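
The one-line sanity check stops at the first exception, which hides where things go wrong. The sketch below collects the same facts step by step and records any failure instead of raising. It is only an illustration: `xpu_report` is a hypothetical helper, and it assumes that `torch.xpu` exposes `is_available`, `device_count` and `get_device_properties` as used in the one-liner above.

```python
def xpu_report():
    """Collect the facts the one-line sanity check prints, without raising."""
    report = {"torch": None, "ipex": None, "xpu_available": False, "devices": []}
    try:
        import torch
    except Exception as exc:  # torch missing or broken
        report["error"] = f"torch import failed: {exc}"
        return report
    report["torch"] = torch.__version__
    try:
        # Imported before querying torch.xpu, as in the one-liner above.
        import intel_extension_for_pytorch as ipex
        report["ipex"] = ipex.__version__
    except Exception as exc:
        report["error"] = f"ipex import failed: {exc}"
    xpu = getattr(torch, "xpu", None)
    if xpu is None:
        return report  # this torch build has no xpu namespace
    try:
        report["xpu_available"] = bool(xpu.is_available())
        report["devices"] = [
            str(xpu.get_device_properties(i)) for i in range(xpu.device_count())
        ]
    except Exception as exc:
        report["error"] = f"xpu query failed: {exc}"
    return report


if __name__ == "__main__":
    for key, value in xpu_report().items():
        print(f"{key}: {value}")
```

Running the same function in a terminal and in a notebook cell makes it easy to see whether the two disagree on `xpu_available`.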

Download relevant Notebooks

wget https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/docs/notebooks/build_chatbot_on_xpu.ipynb
wget https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_xpu.ipynb

Install and run Jupyter

pip install jupyter
jupyter notebook --ip 0.0.0.0

Connect to jupyter URL in browser

Try out build_chatbot on xpu notebook
Add conda env list to confirm that the proper env is in use, and pip install pickleshare (a later step warns that it is missing, although it is probably not required)
Skip the oneAPI step since it was installed in advance
setvars.sh reports that it has already been run, as expected
Skip the step on installing Intel Extension for PyTorch, etc. from source, as it should have been done above
Add !pip install pydub pymysql deepface exifread before Inference: Text Chat since these dependencies are missing
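
The missing dependencies can be checked up front rather than discovered one import error at a time. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module names are assumed to match the pip package names listed above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    needed = ["pydub", "pymysql", "deepface", "exifread"]
    absent = missing_modules(needed)
    print("missing:", ", ".join(absent) if absent else "none")
```
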
Inference: Text Chat Response

2024-02-14 09:59:28 [ERROR] neuralchat error: Device does not exist
Loading model Intel/neural-chat-7b-v3-1
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[19], line 5
      3 config = PipelineConfig(device='xpu')
      4 chatbot = build_chatbot(config)
----> 5 response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
      6 print(response)

AttributeError: 'NoneType' object has no attribute 'predict'
Inference: Text Chat Response (after changing device='xpu' to device='xpu:0')
2024-02-14 10:01:53 [ERROR] neuralchat error: Device is not supported
Loading model Intel/neural-chat-7b-v3-1
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[23], line 5
      3 config = PipelineConfig(device='xpu:0')
      4 chatbot = build_chatbot(config)
----> 5 response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
      6 print(response)

AttributeError: 'NoneType' object has no attribute 'predict'
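
Both tracebacks end in the same AttributeError because the builder logged an error and returned None, and the next line called .predict on that None. A small guard makes the failure point clearer. This is a sketch only: `require_chatbot` is a hypothetical helper, and `fake_build` stands in for build_chatbot so the snippet runs without neural_chat installed.

```python
def require_chatbot(chatbot, device):
    """Raise a descriptive error if the builder returned None."""
    if chatbot is None:
        raise RuntimeError(
            f"chatbot could not be built for device={device!r}; "
            "see the neuralchat error logged above"
        )
    return chatbot


def fake_build(device):
    """Stand-in for build_chatbot: returns None on failure, as in the traceback."""
    return None if device != "cpu" else object()


if __name__ == "__main__":
    try:
        require_chatbot(fake_build("xpu:0"), "xpu:0")
    except RuntimeError as err:
        print(err)
```
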

Copying the same Text Chat script into xputest.py and running it from the command line produces a different error (here with device='xpu'). Why is this a different response than from within Jupyter? I have confirmed that setvars.sh has been sourced and that I am in the same jupyter2 environment.

/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Loading config settings from the environment...
2024-02-14 10:18:37.549692: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-14 10:18:37.553245: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-14 10:18:37.599354: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-14 10:18:37.599393: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-14 10:18:37.600837: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-14 10:18:37.609277: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-14 10:18:37.609563: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-14 10:18:38.533993: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-14 10:18:42,841 - datasets - INFO - PyTorch version 2.1.0a0+cxx11.abi available.
2024-02-14 10:18:42,841 - datasets - INFO - TensorFlow version 2.15.0.post1 available.
Loading model Intel/neural-chat-7b-v3-1
Loading checkpoint shards: 100%| 2/2 [00:01<00:00,  1.23it/s]
2024-02-14 10:19:17,805 - root - ERROR - Exception: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
2024-02-14 10:19:17 [ERROR] neuralchat error: Generic error
Traceback (most recent call last):
  File "/home/REDACTED/jupyter/./cputest.py", line 7, in <module>
    response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
AttributeError: 'NoneType' object has no attribute 'predict'

Running the same command with device='xpu:0' shows the following:

/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/home/REDACTED/miniconda3/envs/jupyter2/lib/python3.9/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Loading config settings from the environment...
2024-02-14 10:20:00.620315: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-14 10:20:00.623828: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-14 10:20:00.671369: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-14 10:20:00.671411: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-14 10:20:00.672846: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-14 10:20:00.681503: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-14 10:20:00.681998: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-14 10:20:01.604245: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-02-14 10:20:05,906 - datasets - INFO - PyTorch version 2.1.0a0+cxx11.abi available.
2024-02-14 10:20:05,906 - datasets - INFO - TensorFlow version 2.15.0.post1 available.
Loading model Intel/neural-chat-7b-v3-1
2024-02-14 10:20:06 [ERROR] neuralchat error: Device is not supported
Traceback (most recent call last):
  File "/home/REDACTED/jupyter/./cputest.py", line 7, in <module>
    response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
AttributeError: 'NoneType' object has no attribute 'predict'
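
Since the notebook and the command-line run behave differently in the same conda environment, it can help to compare what each process actually sees. The sketch below gathers the interpreter path, a few environment variables mentioned earlier, and any libstdc++ copies mapped into the process. `runtime_fingerprint` is a hypothetical helper; reading /proc/self/maps is Linux-specific, and the parsing assumes library paths without spaces.

```python
import os
import sys


def runtime_fingerprint():
    """Summarise interpreter, selected env vars, and loaded libstdc++ copies."""
    watched = (
        "CONDA_PREFIX",
        "LD_LIBRARY_PATH",
        "USE_XETLA",
        "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS",
    )
    info = {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "env": {name: os.environ.get(name) for name in watched},
        "libstdcxx": [],
    }
    try:
        # Each line of /proc/self/maps ends with the path of the mapped file.
        with open("/proc/self/maps") as maps:
            paths = {line.split()[-1] for line in maps if "libstdc++" in line}
        info["libstdcxx"] = sorted(paths)
    except OSError:
        pass  # not Linux, or /proc is unavailable
    return info


if __name__ == "__main__":
    for key, value in runtime_fingerprint().items():
        print(f"{key}: {value}")
```

A libstdc++ entry only appears once something has loaded it, so the comparison is most useful right after `import torch` in both the notebook cell and the script.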

Some additional system debug showing proper operation of the Flex 170:

$ sudo xpu-smi discovery
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Data Center GPU Flex 170                                       |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0000-d9d5-e18be95b77d2                                       |
|           | PCI BDF Address: 0000:b3:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
$ sudo xpu-smi stats -d 0
+-----------------------------+--------------------------------------------------------------------+
| Device ID                   | 0                                                                  |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%)         | 0                                                                  |
| EU Array Active (%)         | N/A                                                                |
| EU Array Stall (%)          | N/A                                                                |
| EU Array Idle (%)           | N/A                                                                |
|                             |                                                                    |
| Compute Engine Util (%)     | 0; Engine 0: 0, Engine 1: 0, Engine 2: 0, Engine 3: 0              |
| Render Engine Util (%)      | 0; Engine 0: 0                                                     |
| Media Engine Util (%)       | 0                                                                  |
| Decoder Engine Util (%)     | Engine 0: 0, Engine 1: 0                                           |
| Encoder Engine Util (%)     | Engine 0: 0, Engine 1: 0                                           |
| Copy Engine Util (%)        | 0; Engine 0: 0                                                     |
| Media EM Engine Util (%)    | Engine 0: 0, Engine 1: 0                                           |
| 3D Engine Util (%)          | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| Reset                       | N/A                                                                |
| Programming Errors          | N/A                                                                |
| Driver Errors               | N/A                                                                |
| Cache Errors Correctable    | N/A                                                                |
| Cache Errors Uncorrectable  | N/A                                                                |
| Mem Errors Correctable      | N/A                                                                |
| Mem Errors Uncorrectable    | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W)               | 42                                                                 |
| GPU Frequency (MHz)         | 2050                                                               |
| Media Engine Freq (MHz)     | 1025                                                               |
| GPU Core Temperature (C)    | 58                                                                 |
| GPU Memory Temperature (C)  | N/A                                                                |
| GPU Memory Read (kB/s)      | 1452                                                               |
| GPU Memory Write (kB/s)     | 400                                                                |
| GPU Memory Bandwidth (%)    | 0                                                                  |
| GPU Memory Used (MiB)       | 31                                                                 |
| GPU Memory Util (%)         | 0                                                                  |
| Xe Link Throughput (kB/s)   | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+

$ sudo xpu-smi health -d 0
+----------------------------+---------------------------------------------------------------------+
| Device ID                  | 0                                                                   |
+----------------------------+---------------------------------------------------------------------+
| 1. GPU Core Temperature    | Status: OK                                                          |
|                            | Description: All temperature sensors are healthy.                   |
|                            | Throttle Threshold: 100 Celsius Degree                              |
|                            | Shutdown Threshold: 125 Celsius Degree                              |
+----------------------------+---------------------------------------------------------------------+
| 3. GPU Power               | Status: OK                                                          |
|                            | Description: All power domains are healthy.                         |
|                            | Throttle Threshold: 150 watts                                       |
+----------------------------+---------------------------------------------------------------------+
| 6. GPU Frequency           | Status: OK                                                          |
|                            | Description: The device frequency not throttled                     |
+----------------------------+---------------------------------------------------------------------+
$ sudo xpu-smi diag --precheck
Journal file /var/log/journal/90338a962e854ed39e4e7ece1f53d71e/user-1666601109@000610bd6beba70f-62677c8b509c641c.journal~ is truncated, ignoring file.
Journal file /var/log/journal/90338a962e854ed39e4e7ece1f53d71e/user-1666601109@000610bd6beba70f-62677c8b509c641c.journal~ is truncated, ignoring file.
+------------------+-------------------------------------------------------------------------------+
| Component        | Details                                                                       |
+------------------+-------------------------------------------------------------------------------+
| Driver           | Status: Pass                                                                  |
+------------------+-------------------------------------------------------------------------------+
| CPU              | CPU ID: 0                                                                     |
|                  | Status: Pass                                                                  |
+------------------+-------------------------------------------------------------------------------+
| CPU              | CPU ID: 1                                                                     |
|                  | Status: Pass                                                                  |
+------------------+-------------------------------------------------------------------------------+
| GPU              | BDF: 0000:b3:00.0                                                             |
|                  | Status: Pass                                                                  |
+------------------+-------------------------------------------------------------------------------+
$ sudo xpu-smi diag -d 0 -l 3
+-------------------------------+------------------------------------------------------------------+
| Device ID                     | 0                                                                |
+-------------------------------+------------------------------------------------------------------+
| Level                         | 3                                                                |
| Result                        | Pass                                                             |
| Items                         | 12                                                               |
+-------------------------------+------------------------------------------------------------------+
| Software Env Variables        | Result: Pass                                                     |
|                               | Message: Pass to check environment variables.                    |
+-------------------------------+------------------------------------------------------------------+
| Software Library              | Result: Pass                                                     |
|                               | Message: Pass to check libraries.                                |
+-------------------------------+------------------------------------------------------------------+
| Software Permission           | Result: Pass                                                     |
|                               | Message: Pass to check permission.                               |
+-------------------------------+------------------------------------------------------------------+
| Software Exclusive            | Result: Pass                                                     |
|                               | Message: Pass to check the software exclusive.                   |
+-------------------------------+------------------------------------------------------------------+
| Computation Check             | Result: Pass                                                     |
|                               | Message: Pass to check computation.                              |
+-------------------------------+------------------------------------------------------------------+
| Integration PCIe              | Result: Pass                                                     |
|                               | Message: Pass to check PCIe bandwidth. Its bandwidth is 17.908   |
|                               |   GBPS.                                                          |
+-------------------------------+------------------------------------------------------------------+
| Media Codec                   | Result: Pass                                                     |
|                               | Message: Pass to check Media transcode performance.              |
|                               |  1080p H.265 : 305 FPS                                           |
|                               |  1080p H.264 : 306 FPS                                           |
|                               |  4K H.265 : 85 FPS                                               |
|                               |  4K H.264 : 84 FPS                                               |
+-------------------------------+------------------------------------------------------------------+
| Performance Computation       | Result: Pass                                                     |
|                               | Message: Pass to check computation performance. Its              |
|                               |   single-precision GFLOPS is 11120.119.                          |
+-------------------------------+------------------------------------------------------------------+
| Performance Power             | Result: Pass                                                     |
|                               | Message: Pass to check stress power. Its stress power is 119 W.  |
+-------------------------------+------------------------------------------------------------------+
| Performance Memory Bandwidth  | Result: Pass                                                     |
|                               | Message: Pass to check memory bandwidth. Its memory bandwidth    |
|                               |   is 361.042 GBPS.                                               |
+-------------------------------+------------------------------------------------------------------+
| Performance Memory Allocation | Result: Pass                                                     |
|                               | Message: Pass to check memory allocation.                        |
+-------------------------------+------------------------------------------------------------------+
| Memory Error                  | Result: Pass                                                     |
|                               | Message: Pass to check memory error.                             |
+-------------------------------+------------------------------------------------------------------+
@huiyan2021
Contributor

Thanks for reporting this issue; I can reproduce it.

“Device does not exist” is due to this line: https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/models/model_utils.py#L441

“Device is not supported” occurs with device='xpu:0' because only "xpu" is handled.

Will fix both issues ASAP!
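
One way to accept both spellings is to split the device string before checking it. The following is a sketch of that idea, not the actual neural_chat fix: `split_device`, `is_supported` and the `SUPPORTED_KINDS` set are all hypothetical.

```python
SUPPORTED_KINDS = {"cpu", "xpu"}  # assumed set, for illustration only


def split_device(device):
    """Split 'xpu' or 'xpu:0' into (kind, index); index is None when omitted."""
    kind, sep, index = device.strip().lower().partition(":")
    if not kind:
        raise ValueError(f"empty device string: {device!r}")
    if not sep:
        return kind, None
    if not index.isdecimal():
        raise ValueError(f"invalid device index in {device!r}")
    return kind, int(index)


def is_supported(device):
    """Accept both the bare kind ('xpu') and the indexed form ('xpu:0')."""
    return split_device(device)[0] in SUPPORTED_KINDS


if __name__ == "__main__":
    for name in ("xpu", "xpu:0", "tpu"):
        print(name, split_device(name), is_supported(name))
```
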

@huiyan2021
Contributor

It seems torch.xpu.is_available() works in a .py script but not in the Jupyter notebook.

In the .py script, the issue is that the model is too large to fit on the Intel Data Center GPU Flex 170 using Intel Extension for PyTorch.
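
A rough back-of-the-envelope check of the "too large" point: the sketch below estimates the memory for the weights alone at different precisions and compares it with the total_memory=13535MB reported by the sanity check. The parameter count of 7 billion is an assumption read off the "7b" in the model name, the precision actually used when loading is not stated in this thread, and activations and the KV cache are ignored. `weight_mib` is a hypothetical helper.

```python
def weight_mib(n_params, bits_per_param):
    """Memory needed for the weights alone, in MiB (no activations or KV cache)."""
    return n_params * bits_per_param / 8 / 2**20


if __name__ == "__main__":
    n_params = 7_000_000_000  # assumption: "7b" taken at face value
    reported_mb = 13535       # total_memory from the sanity check output
    for label, bits in (("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)):
        need = weight_mib(n_params, bits)
        print(f"{label:>9}: {need:8.0f} MiB of weights vs {reported_mb} MB reported")
```

Under these assumptions, 16-bit weights alone need about 13,350 MiB, which leaves almost no headroom on this card, and 32-bit weights need about twice that.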

@huiyan2021
Contributor

@brent-elliott it seems that removing libstdc++* from /home/REDACTED/miniconda3/envs/jupyter2/lib fixes the issue of torch.xpu.is_available() returning False in the Jupyter notebook.

@brent-elliott
Author

brent-elliott commented Feb 19, 2024

> @brent-elliott it seems that removing libstdc++* from /home/REDACTED/miniconda3/envs/jupyter2/lib fixes the issue of torch.xpu.is_available() returning False in the Jupyter notebook.

Thank you. Removing these files resolved the issue of torch.xpu.is_available returning False. Do you know if this is already captured somewhere in the documentation or notebook to remove these files that I missed?

@brent-elliott
Author

> Thanks for reporting this issue; I can reproduce it.
>
> “Device does not exist” is due to this line: https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/models/model_utils.py#L441
>
> “Device is not supported” occurs with device='xpu:0' because only "xpu" is handled.
>
> Will fix both issues ASAP!

Thank you! I have shifted to using xpu in my scripts in the meantime.

@huiyan2021
Contributor

> > @brent-elliott it seems that removing libstdc++* from /home/REDACTED/miniconda3/envs/jupyter2/lib fixes the issue of torch.xpu.is_available() returning False in the Jupyter notebook.
>
> Thank you. Removing these files resolved the issue of torch.xpu.is_available returning False. Do you know if this is already captured somewhere in the documentation or notebook to remove these files that I missed?

The libstdc++.so version conflict between the conda environment and the OS is a known issue, described in the IPEX documentation: https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/performance_tuning/known_issues.html

In the build_chatbot_on_xpu Jupyter notebook, there is also a note at the end of the Prepare Environment section, though it may not be well marked:
Notes: If you face "GLIBCXX_3.4.30" not found issue in conda environment, please remove lib/libstdc++* from conda environment.

This time the issue shows up in a different way, though, and we are still figuring out the root cause.
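
Before removing anything, the bundled copies can be listed first. This is a minimal sketch that only reports the lib/libstdc++* files in a conda environment and leaves the decision to move or delete them to the user. `conda_libstdcxx` is a hypothetical helper, and falling back to `sys.prefix` when CONDA_PREFIX is unset is an assumption.

```python
import glob
import os
import sys


def conda_libstdcxx(prefix=None):
    """List lib/libstdc++* files under a conda prefix (default: the active one)."""
    prefix = prefix or os.environ.get("CONDA_PREFIX") or sys.prefix
    return sorted(glob.glob(os.path.join(prefix, "lib", "libstdc++*")))


if __name__ == "__main__":
    for path in conda_libstdcxx():
        print(path)
```
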
