Detailed Description of the Problem
We recently updated our app to add an extra websocket connection to our backend per client. Over the next hour, the CPU usage on our haproxy server increased from around 30% to 85%. After a few hours, it increased even more to 100%, causing an outage.
CPU Usage graph (rolled out around 14:20)
Here's a graph of the rate of incoming WS connections. I've cut off the Y axis at 20k/min, because it goes way higher after restarts / deployments.
Data transferred during that time
Active sessions
Expected Behavior
We expected the CPU usage to increase only a small amount, e.g. from 30% to 40%. Even that seems excessive, since not all our traffic is websockets, and the new websockets don't actually do anything.
Steps to Reproduce the Behavior
It won't be easy for you to reproduce it. We can reproduce it quite reliably by releasing this code and letting clients start sending the requests, but we can't just try random things, because this server is actively serving our customers.
Do you have any idea what may have caused this?
A few things we've considered:
There might be a bug with the new websocket causing it to spam reconnections. But from our logs, and from trying it ourselves, the connections seem to work OK.
There might have been a bug causing the new websockets to spam tons of traffic. But from our logs, the number of websocket connections matches the number of messages received exactly (so each websocket is sending one message, as expected).
There might have been a bug causing tons of PING / PONG traffic that wouldn't show up in logs. But from what we can see when testing it ourselves, the pings are sent every 25s as intended. (Pings are sent by the server.)
We thought maybe we were hitting some kind of internal limit that was causing sockets to be queued, leading to high CPU. We increased maxconn, but this didn't do anything (a sketch of how to check the live limits follows this list).
I saw that there were some previous performance issues with OpenSSL 3, which we are using. So maybe it's that, but it seems very suspicious that it would cause such a non-linear increase in CPU usage.
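On the internal-limit theory: the effective limit and live counters can be read from the runtime API over the admin socket defined in our global section. A minimal sketch of the check, assuming socat is installed (exact counter names vary slightly by version):

    # "show info" reports the effective Maxconn alongside live counters such as
    # CurrConns, ConnRate and SslRate, so you can see whether a configured
    # limit is actually being approached.
    echo "show info" | socat stdio /run/haproxy/admin.sock | grep -E 'Maxconn|CurrConns|ConnRate|SslRate'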
Do you have an idea how to solve the issue?
No response
What is your configuration?
global
    log /dev/log len 65535 local0 info
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
    hard-stop-after 1m

    # Default SSL material locations
    ca-base /etc/ssl/certs
    crt-base /etc/ssl/private

    # See: https://ssl-config.mozilla.org/#server=haproxy&version=2.4&config=old&openssl=3.0.2&hsts=false&ocsp=false&guideline=5.6
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-CHACHA20-POLY1305
    ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
    ssl-default-bind-options prefer-client-ciphers no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
    ssl-default-server-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-CHACHA20-POLY1305
    ssl-default-server-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
    ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

defaults
    log global
    #log-format '{"host":"%H","ident":"haproxy","pid":%pid,"time":"%Tl","haproxy":{"conn":{"act":%ac,"fe":%fc,"be":%bc,"srv":%sc},"queue":{"backend":%bq,"srv":%sq},"time":{"tq":%Tq,"tw":%Tw,"tc":%Tc,"tr":%Tr,"tt":%Tt},"termination_state":"%tsc","retries":%rc,"network":{"client_ip":"%ci","client_port":%cp,"frontend_ip":"%fi","frontend_port":%fp},"ssl":{"version":"%sslv","ciphers":"%sslc"},"request":{"method":"%HM","uri":"%[capture.req.uri,json(utf8s)]","protocol":"%HV","header":{"host":"%[capture.req.hdr(0),json(utf8s)]","xforwardfor":"%[capture.req.hdr(1),json(utf8s)]","referer":"%[capture.req.hdr(2),json(utf8s)]","traceparent":"%ID"}},"name":{"backend":"%b","frontend":"%ft","server":"%s"},"response":{"status_code":%ST,"traceparent":"%[capture.res.hdr(0),json(utf8s)]"},"bytes":{"uploaded":%U,"read":%B}}}'
    option httplog
    mode http
    option forwardfor
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    timeout client-fin 30s
    timeout tunnel 3600s
    maxconn 400000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

cache assets_cache
    # Total size of the cache in MB
    total-max-size 4
    # Max size of any single item in bytes
    # 30 kilobytes
    max-object-size 30720
    # Time to live for each item in seconds
    # Set max-age to 1 month
    max-age 2630000

frontend stats
    bind xx.xxx.xxx.xxx:xxxx
    option http-use-htx
    http-request set-log-level silent
    http-request use-service prometheus-exporter if { path /metrics }
    stats enable
    stats uri /stats
    stats refresh 10s

frontend talkjs
    bind :80
    bind :::80
    bind :443 ssl crt /etc/ssl/certs/talkjs.com.2024.pem alpn h2,http/1.1
    bind :::443 ssl crt /etc/ssl/certs/talkjs.com.2024.pem alpn h2,http/1.1

    stick-table type string len 200 size 100k expire 10s store http_req_rate(10s)

    # Rate limit per-path, only on internal api
    # Excludes websockets because everyone frantically tries to reconnect while we are releasing
    # Allow 1000 requests per 10s per path
    acl is_internal_api path_beg -i /api/v0
    acl is_websocket path_end -i /socket/websocket
    acl exceeds_limit path,table_http_req_rate() gt 1000
    http-request track-sc0 path unless is_internal_api !is_websocket exceeds_limit
    http-request deny deny_status 429 if is_internal_api !is_websocket exceeds_limit

    # Captures to satisfy our log format
    capture request header Host len 40
    capture request header X-Forwarded-For len 50
    capture request header Referer len 200
    capture request header User-Agent len 200
    capture request header Authorization len 200
    capture request header x-talkjs-client-build len 30
    capture request header x-talkjs-client-date len 20
    # Catch traceparent set by backend, used for logging
    capture response header traceparent len 50

    acl is_assets path_beg -i /__assets
    acl is_mp3 path_end -i .mp3
    http-request cache-use assets_cache if is_assets is_mp3
    http-response cache-store assets_cache
    http-response set-header X-Cache-Status HIT if !{ srv_id -m found }
    http-response set-header X-Cache-Status MISS if { srv_id -m found }

    # parse UUID [RFC 4122](https://www.rfc-editor.org/rfc/rfc4122.html)
    http-request set-var(txn.trace_id) uuid(4),regsub(\"^([^-]{8})-([^-]{4})-([^-]{4})-([^-]{4})-([^-]{12})$\",\"\1\2\3\4\5\")
    http-request set-var(txn.parent_id) uuid(4),regsub(\"^([^-]{8})-([^-]{4})-([^-]{4})-([^-]{4})-([^-]{12})$\",\"\1\2\3\")
    unique-id-format 00-%[var(txn.trace_id)]-%[var(txn.parent_id)]-03
    unique-id-header traceparent
    # Forward traceparent to backend
    http-request set-header traceparent %ID
    # return traceparent to the client
    http-after-response set-header traceparent %ID

    # Tell the backend if a request was http or https
    http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
    http-request set-header X-Forwarded-Proto https if { ssl_fc }

    # Inject this header on the way back so we can always see which node handled the response
    http-after-response set-header X-TalkJS-Node %s

    default_backend backend-1

    # round robin assets requests
    use_backend assets-backend if is_assets

    # sharding.map is a file with regex parts so backend can be overridden. we
    # route it either using a cookie set by the dashboard, or by checking if the
    # app_id in sharding.map is in the URL. not all dashboard paths have an app_id
    # in the URL so it also sets the cookie for good measure.
    acl has_cookie req.cook(lb_last_active_test_app_id) -m found
    use_backend %[req.cook(lb_last_active_test_app_id),map_reg(/etc/haproxy/maps/sharding.map)] if has_cookie
    use_backend %[path,map_reg(/etc/haproxy/maps/sharding.map)] unless has_cookie

backend assets-backend
    balance roundrobin
    server production-1-6080 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-1-6081 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-1-6082 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-1-6083 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-2-6080 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-2-6081 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-2-6082 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-2-6083 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    # Uncomment these lines if you want to serve assets from the prod 3 node
    # Since we do roundrobin, enabling these would directly send requests
    # towards the node.
    # server production-3-6080 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    # server production-3-6081 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    # server production-3-6082 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    # server production-3-6083 XX.XXX.XXX.XXX:XXXX check maxconn 63000

backend backend-1
    server production-1-6080 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-1-6081 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-1-6082 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-1-6083 XX.XXX.XXX.XXX:XXXX check maxconn 63000

backend backend-2
    server production-2-6080 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-2-6081 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-2-6082 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-2-6083 XX.XXX.XXX.XXX:XXXX check maxconn 63000

backend backend-3
    server production-3-6080 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-3-6081 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-3-6082 XX.XXX.XXX.XXX:XXXX check maxconn 63000
    server production-3-6083 XX.XXX.XXX.XXX:XXXX check maxconn 63000

backend blackhole
    # No servers defined. You can point users here as a last resort
    # if they are causing high load or other issues, and haproxy
    # will reject all their requests
    # Keep a newline here
Output of haproxy -vv
HAProxy version 2.4.22-0ubuntu0.22.04.3 2023/12/04 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.22.html
Running on: Linux 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64
Build options :
TARGET = linux-glibc
CPU = generic
CC = cc
CFLAGS = -O2 -g -O2 -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1 USE_SYSTEMD=1 USE_PROMEX=1
DEBUG =
Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL +EPOLL -EVPORTS +FUTEX +GETADDRINFO -KQUEUE +LIBCRYPT +LINUX_SPLICE +LINUX_TPROXY +LUA -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PRIVATE_CACHE -PROCCTL +PROMEX -PTHREAD_PSHARED -QUIC +RT +SLZ -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 3.0.2 15 Mar 2022
Running on OpenSSL version : OpenSSL 3.0.2 15 Mar 2022
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.6
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.39 2021-10-29
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 11.4.0
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTTP side=FE|BE mux=H2 flags=HTX|CLEAN_ABRT|HOL_RISK|NO_UPG
fcgi : mode=HTTP side=BE mux=FCGI flags=HTX|HOL_RISK|NO_UPG
h1 : mode=HTTP side=FE|BE mux=H1 flags=HTX|NO_UPG
<default> : mode=HTTP side=FE|BE mux=H1 flags=HTX
none : mode=TCP side=FE|BE mux=PASS flags=NO_UPG
<default> : mode=TCP side=FE|BE mux=PASS flags=
Available services : prometheus-exporter
Available filters :
[SPOE] spoe
[CACHE] cache
[FCGI] fcgi-app
[COMP] compression
[TRACE] trace
Last Outputs and Backtraces
No response
Additional Information
Before this update, clients had 1 WS connection by default, and 1 additional WS per UI. So most clients had 2. This update made it 2 connections by default, and 1 additional per UI.
The new WS is incredibly simple. It's a JS WebSocket that sends 1 message on startup ("here is my token, please authenticate") and gets one response. The server sends PINGs every 25s and the client responds with PONGs. No other traffic happens.
The 2 pre-existing websockets are much heavier. They are Phoenix websockets, with a significant amount of traffic in both directions.
We are operating at scale: at peak times, the haproxy server has ~150k active connections.
The haproxy server is pretty beefy: 8 CPUs with 32GB of RAM.
After the rollout of this new websocket, we were receiving around 10k phoenix WS connections per minute, and 3.5k of the new WS connections per minute.
There are also HTTP requests in addition to the WS connections, but their volume was the same before and after releasing this new websocket.
We attempted and reverted this rollout 3 times, and the same pattern repeated each time
We had been running this new websocket for about 10% of our clients for a week before this full rollout, and didn't notice any issues.
This only causes high CPU on haproxy, not on any of the actual backend nodes. So it seems unlikely that it's a legit huge increase in traffic, as that would cause an increase in backend CPU too.
The new WS is routed via Cloudflare, whereas the phoenix WS are not. Also, a weird thing we noticed: in Chrome, the phoenix WS show up as getting a 200 status response, while in Firefox they show up as 101; the new WS shows as a 101 response in both browsers. But this seems more like a browser quirk than anything real. They're still being used as websockets (and it's the old websockets being weird, not the new ones).
The average connection time reported by haproxy also increases every time we try to roll it out
We ran the haproxy profiler twice. After rollout, with high CPU:
After revert, with lower CPU. CPU takes some time to go back down because existing clients will continue to open the WS connection. This was a few hours after reverting, with CPU back down around 30%:
We noticed the huge difference in process_stream and especially other, but can't find any documentation to help us track down what exactly is slow.
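For context, the profiler numbers above come from HAProxy's runtime API, not from custom tooling. A minimal sketch of collecting them over the same admin socket (task profiling is available in 2.4; socat assumed installed):

    # Enable task profiling, wait a while under load, then dump per-task stats;
    # process_stream and "other" are rows in this output.
    echo "set profiling tasks on" | socat stdio /run/haproxy/admin.sock
    echo "show profiling" | socat stdio /run/haproxy/admin.sock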
After reverting, the CPU usage doesn't decrease smoothly. It's blocky, e.g. dropping from 55% to 35% in a minute, hours after the revert.
And for proof that these websockets are causing the high CPU, we:
blocked all new WS requests via Cloudflare
restarted the backend node, killing all active websockets
observed that CPU instantly dropped from 35% to 15% (this was after reverting, so only about 10% of clients were trying to open this websocket)
un-blocked the requests
observed the CPU instantly jump back to 35%
Reading all this back, it really does just look like a websocket that is extremely computationally heavy, transferring a ton of data, etc. But I feel the need to stress: it's doing almost nothing. The client opens a connection, sends one message like [0,"POST","/session/renew",{"token":"eyJhbGciOiJIUzI1NiIsInR5cCI6... and gets one response like [0,200,{"user":{"id":"e21b106b-fd2b-459a-a5b0-3886b5f880fa","name":"Visitor.... And then the WS just sits there, open. In that same time, the two phoenix websockets have sent 40 messages.
You can see this for yourself: go to www.talkjs.com and look at the dev tools network tab for websockets. The one with a URL starting wss://api.talkjs.com/v1/efdOVFl/realtime is the new one.
Yes, we actually already started working on updating to see if it would help. We'll update and re-release and let you know what happens. Probably on Monday. Thanks!
We defined additional servers to reduce the number of active sessions per server. This looked promising, but once we rolled out the new WS again, CPU usage went up.
Commented out our stick-table and all associated lines of config. This immediately resolved the problem, dropping CPU down to around 30%, and it stayed there. We're not sure why that had any impact, since the new websocket isn't included in the rate limiting.
We then started to get errors after hitting 100k active connections on one backend, which we assume is nothing to do with haproxy and is instead some missing tweak on the application server, e.g. file descriptors.
But yeah, I'd be interested if you have any ideas why the stick-table lines in the config could cause such a huge increase in CPU usage from relatively few additional connections.
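Re-reading the config, our best guess is the unless on the track-sc0 line: it tracks nearly every request, including every websocket handshake, since those don't match is_internal_api, so each one stores or updates an entry in the string stick-table. A minimal sketch of the narrower rule we think we wanted (untested, and unlike the original it keeps counting requests that are already being denied):

    # Only track internal API requests, so websocket handshakes and all other
    # traffic never touch the stick-table.
    http-request track-sc0 path if is_internal_api !is_websocket
    http-request deny deny_status 429 if is_internal_api !is_websocket exceeds_limit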