Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upstream: implementation of outlier detection extensions #34154

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

cpakulski
Copy link
Contributor

@cpakulski cpakulski commented May 14, 2024

Commit Message:
upstream: implementation of outlier detection extensions

Additional Description:
The idea and need for the extensions are described in RFC document: https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link

In a nutshell, the design decouples types of result reported to outlier detector from an algorithm which marks a host as unhealthy. For example, an outlier detector may be configured to count 3 consecutive errors and type of those errors can be defined by a user. For example:

  • count http codes in range 500-503
  • count http codes in range 401-403
  • count locally originated errors (resets, timeouts)

So, the algorithm does not really care about the exact type of reported result. It is only interested whether reported result should be considered an error or not.
The idea of using user-defined errors can be expanded to database errors. See issue #24215 (I have working prototype for errors reported by Redis).

This design puts extensions on top of already existing outlier detector. It means that the solution is 100% backwards compatible. Previous configs are accepted. But a user may configure outlier detection extension to

  • use "old" outlier detection to react to 5xx errors and add another range of HTTP errors (say 4xx)
  • disable "old" outlier detection and configure everything using extensions

The implementation of extensions is built on top of already existing outlier detection structures. This minimizes code changes and re-uses event logger, timers, etc.

The implementation is built on top of already approved API for extensions: #31205

Risk Level: Low (previous configuration still works. Extensions do not have to be configured)
Testing: Added unit tests for new code and tests checking co-existence of "old" outlier and "new" extensions
Docs Changes: Yes. Added.
Release Notes: Yes.
Platform Specific Features: No
Fixes #18789

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
… monitor.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #34154 was opened by cpakulski.

see: more, trace.

Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #34154 was opened by cpakulski.

see: more, trace.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

/retest

@cpakulski cpakulski changed the title WIP upstream: implementation of outlier detection extensions upstream: implementation of outlier detection extensions May 17, 2024
@cpakulski cpakulski marked this pull request as ready for review May 17, 2024 19:13
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

@wbpcode I see that you have been assigned for API approval, but changes in proto files are not really API changes, but rather changes to event log which is defined in terms of protobufs. Hope it helps!

@nezdolik
Copy link
Member

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

Copy link
Member

@nezdolik nezdolik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did partial first pass, will do more tomorrow. Looks like the extension docs (*.rst) also need to be added.

@@ -40,5 +45,6 @@ message ErrorBuckets {
repeated LocalOriginErrors local_origin_errors = 2;

// List of buckets "catching" database errors.
// [#not-implemented-hide:]
repeated DatabaseErrors database_errors = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense not to have this field as part of ErrorBuckets message since DatabaseErrors are not yet implemented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was originally added to illustrate how new error types can be added. I think it does no harm to keep it to be honest. That code can only be sent by terminal database filter (like redis) and it currently does not send that code, so it is no-op. I am planning to add support for databases as next step and will have to bring it back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am more in favor of adding fields into api together with the relevant PRs/implementation, otherwise there is a risk of having too many unimplemented features in api since people tend to be busy with work/can change priorities. cc @wbpcode

@@ -345,6 +415,10 @@ void DetectorImpl::initialize(Cluster& cluster) {
void DetectorImpl::addHostMonitor(HostSharedPtr host) {
ASSERT(host_monitors_.count(host) == 0);
DetectorHostMonitorImpl* monitor = new DetectorHostMonitorImpl(shared_from_this(), host);

monitor->setExtensionsMonitors(
config_.createMonitorExtensions(monitor->getOnFailedExtensioMonitorCallback()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any perf penalty for calling createMonitorExtensions each time host is added in a large cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is called only for the newly added host only and does not affect already existing hosts. I could imagine creating pool of extensions, say on an idle thread, and attach it to newly added host. In general extensions are simple structures, so instantiation consists of couple of memory allocs and fields initializations.
I do not mind moving it somewhere else, but have no idea if there is a better place from performance point of view.

@cpakulski
Copy link
Contributor Author

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like frequency, success rate always run regardless whether users use them to eject nodes or not.

@cpakulski
Copy link
Contributor Author

Looks like the extension docs (*.rst) also need to be added.

Do you have a specific doc in mind? Maybe I missed something. I checked the docs output and "outlier detection" extensions are automatically added to docs based on proto's annotations.

Added release notes.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

/retest

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

/retest

envoy/upstream/outlier_detection.h Outdated Show resolved Hide resolved
@@ -67,12 +68,12 @@ void DetectorHostMonitorImpl::updateCurrentSuccessRateBucket() {

void DetectorHostMonitorImpl::putHttpResponseCode(uint64_t response_code) {
external_origin_sr_monitor_.incTotalReqCounter();
std::shared_ptr<DetectorImpl> detector = detector_.lock();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternative approach we could take here is to disallow users to configure same outlier detection algorithm via traditional (old) way and via extension, or to fallback to new mechanism if both are configured. That would simplify the flow and make it easier to reason about ejection events. Does it even make sense to configure 2 same algorithms with diff params for same cluster, wonder what could be the use case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that use case could be to use "old" method for 5xx errors and "new" method for 4xx range. "old" has been battle tested and users trust it, but it is not expandable for ranges outside of 5xx. Once there is enough usage of extensions, we can start limiting config options. WDYT?


void ConsecutiveErrorsMonitor::onReset() { counter_ = 0; }

class ConsecutiveErrorsMonitorFactory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of Envoy extensions follow a specific code layout, this is not enforced but helps to easier navigate the code, group tests etc. The factories that parse the extension proto config and create extension instances are usually placed into config.h+cc files. Extension related code goes into its dedicated header+impl files. For example:
extension config
extension

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont have string opinion here since some extensions deviate from this layout and this file does not have that much code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Thanks. I will leave it for now as it is.

namespace Outlier {

bool ConsecutiveErrorsMonitor::onError() {
if (counter_ < max_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if block does not avoid race conditions, e.g. you have counter_ == 29 and max_ == 30:

  • T1 reads the counter_ value in line 12
  • T2 increments counter_ value to 30
  • T1 executes line 13 and increases counter_ to 31

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah! That is good point. Will change the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Converted to using CAS primitives. Thanks for catching it!

@nezdolik
Copy link
Member

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like frequency, success rate always run regardless whether users use them to eject nodes or not.

Deprecating old mechanism would as well reduce maintenance load on cognitive load for outlier detection api (where same thing can be achieved via different config mechanisms). Think we should do a proper deprecation cycle going forward.

@cpakulski
Copy link
Contributor Author

cpakulski commented May 23, 2024

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

It depends on the community and adoption of extensions. In general, everything what the current outlier offers will be implemented in form of extensions, so deprecation should be easy. I believe that removing "old" outlier should improve performance a bit, because in the current implementation all monitors like frequency, success rate always run regardless whether users use them to eject nodes or not.

Deprecating old mechanism would as well reduce maintenance load on cognitive load for outlier detection api (where same thing can be achieved via different config mechanisms). Think we should do a proper deprecation cycle going forward.

OK. Let me try to add warnings that "old" consecutive errors will be deprecated. Thanks!

====================================================================================
Update:
Actually after some thinking I believe that it would be best to start deprecating and displaying warnings when extensions's functionality is equivalent to the "old" outlier. I mean after success rate and failure frequency monitors are implemented. Once we have this PR merged, the framework is ready and implementing those missing monitors should not take too much time. The reason is that "old" outlier is "flat". Once one option is enabled, it actually enables all methods and we would like to display warning only when a user uses "old" consecutive_5xx or consecutive_gateway or consecutive_local_origin.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

/retest

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

/retest

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

@nezdolik @wbpcode Thanks for reviewing this PR. I think I addressed all the comments. In some cases, I provided explanation why I am not planning to change anything and leave it as is. I would appreciate if you could do another pass. Thanks!

// no-op. Just keep executing compare_exchange_strong until threads synchronize.
do {
;
} while (!counter_.compare_exchange_strong(expected_count, expected_count + 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the default memory order memory_order_seq_cst here requires too much synchronization across worker threads, you could use relaxed order for cas failure and make writes to atomic visible to other threads on success (e.g. memory order release)

@nezdolik
Copy link
Member

@cpakulski thank you for all the work done! i added one more comment and believe this is now ready for further review. @wbpcode do you have capacity to review this one? (otherwise will tag senior maintainers)

@nezdolik
Copy link
Member

/assign-from @envoyproxy/senior-maintainers

Copy link

@envoyproxy/senior-maintainers assignee is @wbpcode

🐱

Caused by: a #34154 (comment) was created by @nezdolik.

see: more, trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Outlier Detection for non-error status codes
3 participants