Limit custom LUA error stats while allowing non LUA error stats to function normally #500

KarthikSubbarao · 2024-05-15T08:59:48Z

Implementing the change proposed here: #487

In this PR, we prevent tracking new error messages from Lua if the number of error messages (in the errors RAX) is greater than 128. Instead, any additional Lua error type is counted in a new counter, errorstat_LUA_ERRORSTATS_DISABLED. Non-Lua errors (e.g. MOVED / CLUSTERDOWN) continue to be tracked as usual.

This addresses the issue of spammed error messages and the resulting memory usage of the errors RAX. Additionally, we no longer have to execute CONFIG RESETSTAT to restore error stats functionality, because normal error messages continue to be tracked.

Example:

# Errorstats
.
.
.
errorstat_127:count=2
errorstat_128:count=2
errorstat_ERR:count=1
errorstat_LUA_ERRORSTATS_DISABLED:count=2
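
The behaviour described above can be modelled in a few lines. This is only a sketch under stated assumptions, not the code in src/networking.c: the class `ErrorStats`, the method `record` and the flag `from_lua_custom` are hypothetical names, a plain counter stands in for the errors RAX, and the threshold check (`>=`) is inferred from the later example in this thread where the 129th distinct custom error lands in the overflow counter. It also assumes that a custom error already present in the table keeps counting under its own name. The counter name follows the description above; later comments in this thread call it errorstat_LUA_ERRORSTATS_OVERFLOW.

```python
from collections import Counter

ERRORSTATS_LIMIT = 128  # limit named in the PR description
OVERFLOW_KEY = "LUA_ERRORSTATS_DISABLED"  # counter name from the description


class ErrorStats:
    """Toy stand-in for the errors RAX: maps an error name to a count."""

    def __init__(self, limit=ERRORSTATS_LIMIT):
        self.limit = limit
        self.counts = Counter()

    def record(self, name, from_lua_custom=False):
        """Count one error and return the key it was counted under."""
        is_new = name not in self.counts
        # Assumption: only *new* custom Lua errors are diverted once the
        # table is full; everything else is tracked as before.
        if from_lua_custom and is_new and len(self.counts) >= self.limit:
            self.counts[OVERFLOW_KEY] += 1
            return OVERFLOW_KEY
        self.counts[name] += 1
        return name


stats = ErrorStats()
for i in range(1, 129):
    stats.record(str(i), from_lua_custom=True)   # fills the table
stats.record("customerror", from_lua_custom=True)  # diverted to overflow
stats.record("MOVED")                              # non-Lua, still tracked
```

In this model the table can still grow past the limit through non-Lua errors, which matches the stated goal that errors such as MOVED are never dropped.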

KarthikSubbarao · 2024-05-15T09:25:25Z

(Force-pushed, as recommended by the bot here, because I did not originally include the commit sign-off. The PR has also not been reviewed yet.)

src/networking.c

codecov · 2024-05-18T13:29:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.82%. Comparing base (6e4a610) to head (3ceea58).
Report is 10 commits behind head on unstable.

❗ Current head 3ceea58 differs from pull request most recent head a03333b

Please upload reports for the commit a03333b to get more accurate results.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #500      +/-   ##
============================================
- Coverage     69.92%   69.82%   -0.10%     
============================================
  Files           109      109              
  Lines         61797    61781      -16     
============================================
- Hits          43211    43140      -71     
- Misses        18586    18641      +55

Files               Coverage Δ
src/networking.c    85.05% <100.00%> (+0.10%) ⬆️
src/script_lua.c    90.26% <100.00%> (ø)
src/server.c        88.53% <ø> (-0.08%) ⬇️

... and 13 files with indirect coverage changes

ranshid · 2024-05-19T10:26:23Z

@KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce degradation.
For example, let's include tests verifying that function and redis.call cases ARE getting into errorStats.

srgsanky

lgtm. My comments are minor.

src/server.h

tests/unit/info.tcl

src/script_lua.c

src/networking.c

KarthikSubbarao · 2024-05-20T00:28:47Z

@ranshid - When functions are used and contain Lua code that calls server.error_reply with custom error messages, they are still caught by this new logic once we are past the 128 limit.
If the error is not from Lua (e.g. a syntax error in server.call), it continues to be tracked.

Did you mean adding test cases where functions (with Lua) are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW when the errors RAX is over the limit?

Example:

for ((i=1; i<=128; i++))                                                            
do
    ./valkey-cli EVAL "return server.error_reply('$i a');" 0
done

% cat customerr.lua 
#!lua name=mylib
redis.register_function(
  'custom_error',
  function() return server.error_reply('customerror 0') end
)

cat customerr.lua | ./valkey-cli -x FUNCTION LOAD REPLACE

% ./valkey-cli                                                                        
127.0.0.1:6379> fcall custom_error 0
(error) customerror 0
127.0.0.1:6379> info errorstats
errorstat_1:count=1
.
.
.
errorstat_128:count=1
errorstat_LUA_ERRORSTATS_OVERFLOW:count=1
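
To check output like this from a script rather than by eye, the errorstats lines can be turned into a dictionary. The helper below is a minimal sketch; `parse_errorstats` is a hypothetical name and not part of the test suite in this PR, and it assumes each line has the `errorstat_<name>:count=<n>` shape shown above.

```python
def parse_errorstats(info_text):
    """Return {error_name: count} from `info errorstats`-style text."""
    stats = {}
    prefix = "errorstat_"
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skip section headers, blank lines and "." elisions
        name, _, fields = line.partition(":")
        for field in fields.split(","):
            key, _, value = field.partition("=")
            if key == "count":
                stats[name[len(prefix):]] = int(value)
    return stats


sample = """errorstat_1:count=1
.
.
.
errorstat_128:count=1
errorstat_LUA_ERRORSTATS_OVERFLOW:count=1"""

parsed = parse_errorstats(sample)
```

With the sample above, `parsed` contains the keys `1`, `128` and `LUA_ERRORSTATS_OVERFLOW`, each with a count of 1.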

ranshid · 2024-05-20T05:59:25Z

@KarthikSubbarao I think this is somewhat problematic. Functions are more like modules IMO and I think we should allow function errors to overflow. I think maybe we can flag the client (or check if we are in the context of eval/evalsha) in order to enforce the overflow?

KarthikSubbarao · 2024-05-20T15:30:06Z

Functions still use Lua and the same APIs, such as server.error_reply, to reply with custom errors. Because of this, functions can still spam the errorstats section of the INFO output, just as EVAL/EVALSHA can.

To handle this, once a server exceeds the limit, any additional custom errors raised from functions are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW.

IMO, this behavior makes sense, but I am also curious to hear from others.

KarthikSubbarao · 2024-05-20T21:05:08Z

Sorry for the force push. I did not include the sign-off on the previous commit, and the DCO check requires it.

KarthikSubbarao force-pushed the lua branch from 14c9484 to daae0d8 on May 15, 2024 09:24

ranshid reviewed May 16, 2024

src/networking.c (outdated, resolved)

srgsanky reviewed May 19, 2024

src/server.h (outdated, resolved)
tests/unit/info.tcl (outdated, resolved)
src/script_lua.c (outdated, resolved)
src/networking.c (resolved)

KarthikSubbarao added 4 commits May 20, 2024 20:47

b6e0f19  Add support for limiting custom LUA errors when over 128 error messages while allowing non LUA errors to function as usual
c86cb62  Minor clean up + update documentation
bc0eaef  Rename LUA Error Stat overflow error message
1c77f69  Update documentation, minor refactor, additional test case

Each commit: Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>

KarthikSubbarao force-pushed the lua branch from 3ceea58 to 1c77f69 on May 20, 2024 21:02

KarthikSubbarao added 2 commits May 21, 2024 04:06

8206c4a  update tests
a03333b  Add tests for Valkey Functions

Each commit: Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>

KarthikSubbarao requested review from srgsanky and ranshid May 22, 2024 18:21
