Description
I have found that my Salt Master servers running 3007.0 become unresponsive on a weekly basis after our internal vulnerability scans run (Tenable Vulnerability Management). This is very similar to the issue described in #64061 that was fixed in versions 3005.2/3006.2 (CVE-2023-20897).
I took a packet capture while running a scan against the server and noticed that attempts to start TLS sessions on port 4506 are what trigger the errors below in /var/log/salt/master. The number of errors appears to equal the number of worker processes configured on the master. This only seems to occur when the scan probes TCP port 4506.
2024-05-13 16:47:48,935 [salt.transport.zeromq:572 ][ERROR ][189978] Exception in request handler
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
request = await asyncio.wait_for(self._socket.recv(), 0.3)
File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
result = recv(**kwargs)
File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
2024-05-13 16:47:49,963 [salt.transport.zeromq:572 ][ERROR ][189979] Exception in request handler
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
request = await asyncio.wait_for(self._socket.recv(), 0.3)
File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
result = recv(**kwargs)
File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
2024-05-13 16:47:49,963 [salt.transport.zeromq:572 ][ERROR ][189988] Exception in request handler
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
request = await asyncio.wait_for(self._socket.recv(), 0.3)
File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
result = recv(**kwargs)
File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
2024-05-13 16:47:50,474 [salt.transport.zeromq:572 ][ERROR ][189989] Exception in request handler
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
request = await asyncio.wait_for(self._socket.recv(), 0.3)
File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
result = recv(**kwargs)
File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
2024-05-13 16:47:50,730 [salt.transport.zeromq:572 ][ERROR ][189981] Exception in request handler
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
request = await asyncio.wait_for(self._socket.recv(), 0.3)
File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
result = recv(**kwargs)
File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
Once these errors occur, the master service becomes completely unresponsive to minion requests. Attempting to issue commands from the affected Salt Master results in an error stating that the master is not responding.
user@salt1:~$ sudo salt '*' test.ping
[ERROR ] Request client send timedout
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
Restarting the salt-master service resolves the issue.
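From the traceback, pyzmq's async recv() is surfacing EAGAIN as zmq.error.Again through asyncio.wait_for rather than as a plain timeout, and on 3007.0 that exception appears to take the request handler down with it. Below is a minimal standalone sketch (my own illustration, not Salt's actual handler; the ROUTER socket, demo port, and 0.3 s timeout are assumptions mirroring the traceback) of a receive loop that treats zmq.error.Again like a timeout, so an aborted or garbage connection does not kill the loop:

import asyncio
import zmq
import zmq.asyncio

async def recv_loop(sock):
    while True:
        try:
            # pyzmq's async recv can surface EAGAIN as zmq.error.Again
            # instead of an asyncio.TimeoutError, e.g. after a peer aborts
            # a connection mid-handshake.
            frames = await asyncio.wait_for(sock.recv_multipart(), 0.3)
        except asyncio.TimeoutError:
            continue  # nothing arrived in this poll window; keep serving
        except zmq.error.Again:
            continue  # transient "Resource temporarily unavailable"; keep serving
        print(f"received {len(frames)} frames")

async def main():
    ctx = zmq.asyncio.Context()
    sock = ctx.socket(zmq.ROUTER)
    sock.bind("tcp://127.0.0.1:5556")  # arbitrary demo port, not the real ReqServer
    try:
        await recv_loop(sock)
    finally:
        sock.close()
        ctx.term()

if __name__ == "__main__":
    asyncio.run(main())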
Setup
I am running a Master Cluster with 4 servers built on Oracle Linux 8.9 and Salt 3007.0.
Please be as specific as possible and give set-up details.
- on-prem machine
- VM (VMware)
- VM running on a cloud service, please be explicit and add details
- container (Kubernetes, Docker, containerd, etc. please specify)
- or a combination, please be explicit
- jails if it is FreeBSD
- classic packaging
- onedir packaging
- used bootstrap to install
Steps to Reproduce the behavior
Initiating a scan with Tenable against one of the master servers triggers this issue. Based on the similarity to #64061, I imagine a scan from Rapid7 InsightVM / Nexpose would also trigger the issue.
An easier way to reproduce the issue is to use openssl to open TLS connections to port 4506 in quick succession:
for i in {1..30}; do openssl s_client -connect salt.example.com:4506 -tls1_2 </dev/null; sleep .2; done
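If openssl is not handy, a rough equivalent is to push a few raw non-msgpack bytes at port 4506 from Python. This is my own sketch based on the assumption that any garbage reaching the ReqServer exercises the same path; master.example.com is a placeholder hostname:

import socket
import time

# First bytes of a TLS record (content type 0x16 = handshake), padded with zeros;
# none of this is valid msgpack.
GARBAGE = bytes.fromhex("160301") + b"\x00" * 64

for _ in range(30):
    try:
        with socket.create_connection(("master.example.com", 4506), timeout=2) as s:
            s.sendall(GARBAGE)
    except OSError:
        pass  # connections may be refused or reset once the master wedges
    time.sleep(0.2)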
A restart of the salt-master service brings it back to life.
Expected behavior
The Salt Master service should not become unresponsive when port 4506 is investigated by vulnerability scanners or receives other invalid requests.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
Salt: 3007.0

Python Version:
Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
cffi: 1.16.0
cherrypy: unknown
dateutil: 2.8.2
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: 1.7.2
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.7
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 23.1
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: 1.14.1
python-gnupg: 0.5.2
PyYAML: 6.0.1
PyZMQ: 25.1.2
relenv: 0.15.1
smmap: Not Installed
timelib: 0.3.0
Tornado: 6.3.3
ZMQ: 4.3.4

Salt Package Information:
Package Type: onedir

System Versions:
dist: oracle 8.9
locale: utf-8
machine: x86_64
release: 5.15.0-205.149.5.1.el8uek.x86_64
system: Linux
version: Oracle Linux Server 8.9
Additional context
This only seems to affect version 3007.0. I tested with versions 3005.5 and 3006.8 and they log the following messages when attempting to reproduce the issue, but do not become unresponsive.
2024-05-13 19:02:56,152 [salt.payload :111 ][CRITICAL][1358] Could not deserialize msgpack message. This often happens when trying to read a file not in binary mode. To see message payload, enable debug logging and retry. Exception: unpack(b) received extra data.
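For what it's worth, the older behaviour is easy to illustrate: handing non-msgpack bytes (such as the start of a TLS record) to msgpack raises an "extra data" error that can be caught and logged rather than propagated. A small sketch of my own, not Salt's payload code:

import msgpack

def try_decode(payload):
    try:
        return msgpack.unpackb(payload)
    except (msgpack.exceptions.ExtraData, ValueError) as exc:
        # Roughly what the 3005.x/3006.x log line above reports, instead of the
        # handler dying as it appears to on 3007.0.
        print(f"Could not deserialize msgpack message: {exc}")
        return None

# 0x16 0x03 0x01 ... is the start of a TLS record, not valid msgpack.
try_decode(bytes.fromhex("1603010200"))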
I use the same one-liner script to bring down a 3007.1 salt-master; 3006.8 withstands this test. A salt-master restart restores the service.
for i in {1..30}; do openssl s_client -connect salt.example.com:4506 -tls1_2 </dev/null; sleep .2; done
The one-liner test can still bring down port 4506 after the OS was upgraded from Rocky Linux 8.9 to 8.10 (released a few days ago).
Here is the log with trace-level logging enabled for the log file. The following entries show the ZMQ backend reporting that resources are temporarily unavailable when the one-liner script hits.
2024-06-02 09:33:05,632 [salt.transport.zeromq:572 ][ERROR ][42819] Exception in request handler
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
request = await asyncio.wait_for(self._socket.recv(), 0.3)
File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
result = recv(**kwargs)
File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
2024-06-02 09:33:11,074 [salt.utils.process:32 ][TRACE ][42741] Process manager iteration