[TRACKING] `metrics` framework future refactors / cleanups #3194

incertum · 2024-05-14T19:23:42Z

Motivation

Tracking pending cleanups, refactors or additional features for the Falco internal metrics framework https://falco.org/docs/metrics/

incertum · 2024-05-14T19:24:05Z

CC @FedeDP @sgaist @leogr

incertum · 2024-05-14T19:39:18Z

- Make libs_metrics_collector static; see new(metrics): add rules_counters_enabled option #3192 (comment).
- sanitize_metric_name, which follows the OpenMetrics standard, was introduced in Falco. Perhaps it should be pushed down to the libs metrics collector as well.
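
For readers unfamiliar with the naming rule, here is a minimal Python sketch of what such sanitization could look like. It is not Falco's actual implementation (which lives in the C++ codebase); it only assumes the OpenMetrics rule that a metric name matches `[a-zA-Z_:][a-zA-Z0-9_:]*`, and the replacement strategy (underscores) is an illustrative choice.

```python
import re

# Any character that is not allowed in an OpenMetrics metric name.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def sanitize_metric_name(name: str) -> str:
    """Map an arbitrary string onto the pattern [a-zA-Z_:][a-zA-Z0-9_:]*.

    Illustrative only: invalid characters become underscores, and an
    underscore is prepended when the result is empty or starts with a digit.
    """
    sanitized = _INVALID_CHARS.sub("_", name)
    if not sanitized or sanitized[0].isdigit():
        sanitized = "_" + sanitized
    return sanitized


if __name__ == "__main__":
    print(sanitize_metric_name("falco.rules.counters"))
```

Keeping one such helper in a shared place (for example the libs metrics collector, as suggested above) would let every output channel agree on the final metric names.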

Great discussions are happening in #3140. A few follow-up items:

- Falco's wrapper metric num_evts is still missing from the Prometheus output, since adding it requires larger code refactors.
- Falco can run with a single event source (either syscalls or one plugin source) or with multiple event sources. Initially, the goal was to have metrics work with either syscalls or a plugin source, but challenges arise with two or more event sources. For example, should the metrics display the number of events for the syscalls source or for the plugin source? What should the value of evt.source be? Since we can only provide one view of the metrics at a time, Falco metrics should focus on the syscalls source or the primary plugin source only, rather than adding nested fields or implementing another solution. When running Falco with syscalls and one plugin, the new plugin metrics API should be used to retrieve plugin metrics in addition to the syscalls metrics.
  - See new(plugin_api): add plugin metrics support libs#1828.
  - Multi-platform metrics support: see fix(libsinsp): enable metrics collector on all platforms libs#1870. In addition, Regression: host metrics are not available running only source plugins #2821 still needs fixing.
- libs now includes a new metrics collector class that consolidates metrics across the libs codebase. Falco needs a similar solution. In feat(webserver): implement metrics endpoint #3140, @sgaist referred to it as a "proper Falco metrics model," especially since we now have more output channels for metrics (e.g., Prometheus, web server, output rule, output file). The goal is to simplify the codebase and reduce code duplication (e.g., see the duplication and fragmentation across falco_metrics.cpp, stats_writer.cpp, and stats_manager.cpp).
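
To make the "single model, many output channels" idea concrete, here is a small Python sketch. It is purely illustrative: the `Metric` class, both render functions, and the metric names are hypothetical and do not mirror any existing Falco or libs class. The point is only that a snapshot is collected once and each channel is a thin formatter over it.

```python
import json
from dataclasses import dataclass


@dataclass
class Metric:
    """One sample in a metrics snapshot (hypothetical model)."""
    name: str
    value: float


def to_prometheus(snapshot):
    """Render the snapshot as Prometheus-style text lines, one per metric."""
    return "".join(f"{m.name} {m.value}\n" for m in snapshot)


def to_json(snapshot):
    """Render the same snapshot as a flat JSON object, e.g. for a file output."""
    return json.dumps({m.name: m.value for m in snapshot}, sort_keys=True)


if __name__ == "__main__":
    # Metric names below are made up for the example.
    snapshot = [Metric("falco_cpu_usage_ratio", 0.5), Metric("falco_events_total", 10)]
    print(to_prometheus(snapshot), end="")
    print(to_json(snapshot))
```

With this shape, adding a new output channel means adding one formatter rather than another copy of the collection logic.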

leogr · 2024-05-16T14:49:03Z

> The metrics framework should target the primary event source only, as the metrics snapshots can realistically only expose one current view, especially for Prometheus. Plugin metrics should instead be supported via the new plugin metrics support; see new(plugin_api): add plugin metrics support libs#1828

> A consolidated and proper Falco metrics model is needed given that we now have even more outputs channels for the metrics (e.g. Prometheus)

Hey @incertum could you elaborate more on these two points?

incertum · 2024-05-16T19:59:50Z

@leogr I rewrote the text in #3194 (comment). Is it clearer now? Happy to add more details.

leogr · 2024-05-17T13:32:47Z

Much clearer now, thank you!

Just one thought:

> Since we can only provide one view of the metrics at a time

Why? I guess this is a current limitation, but we can fix it in the future. Am I wrong?
I believe that in the long run, all data sources should be first-class citizens, and it shouldn't be technically impossible to accommodate this.

incertum · 2024-05-18T07:53:55Z

> Much clearer now, thank you!
>
> Just one thought:
>
> > Since we can only provide one view of the metrics at a time
>
> Why? I guess this is a current limitation, but we can fix it in the future. Am I wrong? I believe that in the long run, all data sources should be first-class citizens, and it shouldn't be technically impossible to accommodate this.

We can emit multiple rule outputs or multiple lines into the output file (I would not do it, though), but for Prometheus there is just one endpoint to scrape at a time. IMO plugin-specific metrics should be handled separately, something that was started in libs. Most metrics are either specific to the syscalls source or generic (e.g. CPU and memory usage, or rules counters) anyway. Right now, the number of events is the only metric I can think of that would be useful to break down per plugin / source when running multiple sources.

incertum · 2024-05-31T14:32:12Z

CC @sboschman (metrics for Falco w/ plugin only)

sboschman · 2024-06-03T09:39:03Z

From an operational point of view I like to have the falco metrics easily integrated with our metrics platform. So, I would like to thank everyone involved with exposing the falco metrics in a Prometheus compatible way.

I am not familiar with the falco code at all, so consider the following comments more as an outside view of things, not in any way directly mapping to any part of the code.

Falco metrics:

1. General metrics; unrelated to syscall or any plugin
   - falco version
   - start_timestamp
   - ...
2. Process resource utilization metrics; unrelated to syscall or any plugin, preferably with standard naming conforming to the default C library for Prometheus
   - num_cpus
   - cpu_seconds_total
   - memory_bytes_total
   - memory_used_bytes
   - ...
3. Falco rule engine metrics; the syscall/plugin event source can be a labeled dimension of the time series
   - events_processed_total{event_source="syscall"} or events_processed_total{event_source="k8saudit"}
   - rule_matches_total{event_source="syscall"} or rule_matches_total{event_source="k8saudit"}
   - ...
4. syscall specific metrics
   - ...
5. Custom plugin metrics; provided with the plugin API / SDK, implemented by the plugin
   - cloudtrail_xxx
   - github_repositories
   - gcpaudit_pubsub_errors_total
   - okta_xxx
   - ...
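
As an illustration of the labeled-dimension idea for the rule engine metrics, the following Python sketch keeps one counter per event source and renders it in the Prometheus text style used above. The `LabeledCounter` class is hypothetical and not part of Falco or any Prometheus client library; the metric and label names are taken from the list.

```python
from collections import Counter


class LabeledCounter:
    """A counter with a single label dimension (hypothetical helper)."""

    def __init__(self, name, label):
        self.name = name
        self.label = label
        self._values = Counter()

    def inc(self, label_value, amount=1):
        """Increase the counter for one label value, e.g. one event source."""
        self._values[label_value] += amount

    def render(self):
        """Return one text line per label value, sorted for stable output."""
        return [
            f'{self.name}{{{self.label}="{value}"}} {count}'
            for value, count in sorted(self._values.items())
        ]


if __name__ == "__main__":
    events = LabeledCounter("events_processed_total", "event_source")
    events.inc("syscall", 3)
    events.inc("k8saudit")
    print("\n".join(events.render()))
```

Because all sources share one metric name and differ only by label, a single scrape endpoint can expose every event source at once.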

Notes:

- Process metrics (2) are not Falco specific; any application/process should be able to provide these metrics in a standard way. If you are familiar with these standard metrics, you can easily apply your existing knowledge to any application/process. E.g. for Golang and Java we even have default dashboards for this base set of metrics.
- Falco rule engine metrics (3) are dimensioned by event source. An overall total can easily be calculated by the metrics platform, e.g. with the PromQL query `sum without(event_source) (events_processed_total{})`, and does not have to be explicitly exposed by Falco.
- I realise syscall is the original Falco event source, and that the plugin framework and support for other event sources were implemented later. As of 0.35, plugins can also output syscall events, so to me 'falco drivers' and 'plugins' are just ways to provide event input to the Falco rule engine, and syscall is just one of the event sources. Hence the metrics being split into items 3, 4 and 5.
- Different plugins can provide the same event source, e.g. the k8saudit, k8saudit_eks and k8saudit_gke plugins all provide the k8s_audit event source. So (5) are plugin specific metrics, not event source specific metrics.
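
For readers less familiar with PromQL, the aggregation mentioned in the notes can be mimicked in a few lines of Python. This is only a sketch of what `sum without(event_source)` does conceptually: drop one label and add up the samples that become identical. The `sum_without` function and the `instance` label in the sample data are assumptions made for the example.

```python
from collections import defaultdict


def sum_without(samples, label):
    """Sum sample values after dropping `label` from each label set.

    `samples` is an iterable of (labels_dict, value) pairs. The result maps
    the remaining labels, as a sorted tuple of (key, value) pairs, to a total.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        remaining = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        totals[remaining] += value
    return dict(totals)


if __name__ == "__main__":
    samples = [
        ({"event_source": "syscall", "instance": "a"}, 100),
        ({"event_source": "k8saudit", "instance": "a"}, 5),
        ({"event_source": "syscall", "instance": "b"}, 7),
    ]
    for labels, total in sum_without(samples, "event_source").items():
        print(dict(labels), total)
```

This is why a per-source breakdown loses nothing compared with a single total: the total is always recoverable on the metrics platform side.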

incertum · 2024-06-03T16:02:17Z

A few more thoughts:

- @mrgian is working on exposing (5); see the PRs linked above in [TRACKING] metrics framework future refactors / cleanups #3194 (comment).
- As we explained earlier, most of the current hiccups stem from a very complicated code refactor in the scap module that broke many of the metrics you listed under (2) when running Falco with a plugin only. @mrgian is also working on that, but it's a bit of a larger refactor. Meanwhile, CPU and memory usage can be consumed externally (see Separating Metrics Reporting Responsibilities Between CNCF Project and TAG Environmental Sustainability Initiative cncf-green-review-testing#14 (reply in thread)), but as said, we are working on fixing the Falco native support for that as well.

incertum added the kind/feature label May 14, 2024

incertum added this to the 0.39.0 milestone May 14, 2024

incertum mentioned this issue May 16, 2024

fix(libsinsp): enable metrics collector on all platforms falcosecurity/libs#1870

Draft

incertum mentioned this issue May 31, 2024

Scraping prometheus metrics endpoint crashes falco process #3229

Closed

sboschman mentioned this issue Jun 3, 2024

cleanup(metrics): improve prometheus and plugin metrics info falcosecurity/falco-website#1328

Merged
