Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit custom LUA error stats while allowing non LUA error stats to function normally #500

Open
wants to merge 6 commits into
base: unstable
Choose a base branch
from

Conversation

KarthikSubbarao
Copy link

@KarthikSubbarao KarthikSubbarao commented May 15, 2024

Implementing the change proposed here: #487

In this PR, we prevent tracking new error messages from LUA if the number of error messages (in the errors RAX) is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA_ERRORSTATS_DISABLED and if a non-LUA error (e.g. MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.

This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore error stats functionality because normal error messages continue to be tracked.

Example:

# Errorstats
.
.
.
errorstat_127:count=2
errorstat_128:count=2
errorstat_ERR:count=1
errorstat_LUA_ERRORSTATS_DISABLED:count=2

@KarthikSubbarao
Copy link
Author

KarthikSubbarao commented May 15, 2024

(Force pushed because this was recommended by the bot here since I did not include the commit sign off originally. Also because it is not reviewed yet)

src/networking.c Outdated Show resolved Hide resolved
Copy link

codecov bot commented May 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.82%. Comparing base (6e4a610) to head (3ceea58).
Report is 10 commits behind head on unstable.

Current head 3ceea58 differs from pull request most recent head a03333b

Please upload reports for the commit a03333b to get more accurate results.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #500      +/-   ##
============================================
- Coverage     69.92%   69.82%   -0.10%     
============================================
  Files           109      109              
  Lines         61797    61781      -16     
============================================
- Hits          43211    43140      -71     
- Misses        18586    18641      +55     
Files Coverage Δ
src/networking.c 85.05% <100.00%> (+0.10%) ⬆️
src/script_lua.c 90.26% <100.00%> (ø)
src/server.c 88.53% <ø> (-0.08%) ⬇️

... and 13 files with indirect coverage changes

@ranshid
Copy link
Contributor

ranshid commented May 19, 2024

@KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce degradation.
For example lets include function and redis.call cases ARE getting into errorStats.

Copy link

@srgsanky srgsanky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. My comments are minor.

src/server.h Outdated Show resolved Hide resolved
tests/unit/info.tcl Outdated Show resolved Hide resolved
src/script_lua.c Outdated Show resolved Hide resolved
src/networking.c Show resolved Hide resolved
@KarthikSubbarao
Copy link
Author

KarthikSubbarao commented May 20, 2024

@KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce degradation. For example lets include function and redis.call cases ARE getting into errorStats.

@ranshid - When functions are used and contain LUA code with server.error_reply with custom error messages, they will still be caught by this new logic when we are past the 128 limit.
If the error is not from LUA (e.g. syntax error in server.call), it will continue to be tracked

Did you mean add test cases where functions (with lua) are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW when the errors RAX is over the limit?

Example:

for ((i=1; i<=128; i++))                                                            
do
    ./valkey-cli EVAL "return server.error_reply('$i a');" 0
done
% cat customerr.lua 
#!lua name=mylib
redis.register_function(
  'custom_error',
  function() return server.error_reply('customerror 0') end
)
cat customerr.lua | ./valkey-cli -x FUNCTION LOAD REPLACE 
% ./valkey-cli                                                                        
127.0.0.1:6379> fcall custom_error 0
(error) customerror 0
127.0.0.1:6379> info errorstats
errorstat_1:count=1
.
.
.
errorstat_128:count=1
errorstat_LUA_ERRORSTATS_OVERFLOW:count=1

@ranshid
Copy link
Contributor

ranshid commented May 20, 2024

@KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce degradation. For example lets include function and redis.call cases ARE getting into errorStats.

@ranshid - When functions are used and contain LUA code with server.error_reply with custom error messages, they will still be caught by this new logic when we are past the 128 limit. If the error is not from LUA (e.g. syntax error in server.call), it will continue to be tracked

Did you mean add test cases where functions (with lua) are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW when the errors RAX is over the limit?

Example:

for ((i=1; i<=128; i++))                                                            
do
    ./valkey-cli EVAL "return server.error_reply('$i a');" 0
done
% cat customerr.lua 
#!lua name=mylib
redis.register_function(
  'custom_error',
  function() return server.error_reply('customerror 0') end
)
cat customerr.lua | ./valkey-cli -x FUNCTION LOAD REPLACE 
% ./valkey-cli                                                                        
127.0.0.1:6379> fcall custom_error 0
(error) customerror 0
127.0.0.1:6379> info errorstats
errorstat_1:count=1
.
.
.
errorstat_128:count=1
errorstat_LUA_ERRORSTATS_OVERFLOW:count=1

@KarthikSubbarao I think this is somewhat problematic. Functions are more like modules IMO and I think we should allow function errors to overflow. I think maybe we can flag the client (or check if we are in the context of eval/evalsha) in order to enforce the overflow?

@KarthikSubbarao
Copy link
Author

KarthikSubbarao commented May 20, 2024

@KarthikSubbarao I think this is somewhat problematic. Functions are more like modules IMO and I think we should allow function errors to overflow. I think maybe we can flag the client (or check if we are in the context of eval/evalsha) in order to enforce the overflow?

Functions are still using LUA and are using the same APIs such as server.error_reply to reply with custom errors. Because of this, we can still get into the spamming error section output with functions - as we do with EVAL/EVALSHA.

To handle this, when a server exceeds the limit, when functions are used with additional custom errors, they are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW.

IMO, this behavior makes sense - but I am also curious to hear from others

…es while allowing non LUA errors to function as usual

Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
Signed-off-by: Karthik Subbarao <karthikrs2021@gmail.com>
@KarthikSubbarao
Copy link
Author

Sorry for the force push. I did not include the sign off on the previous commit and the DCO check required this

Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>
Signed-off-by: KarthikSubbarao <karthikrs2021@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants