Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check_presence plugin: Add new plugin to detect new and missing hosts #4055

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

minfrin
Copy link
Contributor

@minfrin minfrin commented Oct 10, 2022

ChangeLog: check_presence plugin: Add new plugin to detect new and missing hosts.

This PR adds a new plugin that keeps track of hosts that have been seen by collectd, and sends notifications when a new host is seen, or an existing is no longer seen for a period of time, respectively.

It depends on liblmdb to provide the underlying index, chosen for low cost.

Example notification when a host is seen:

Notification: severity = OKAY, host = little-net.local, plugin = host, type = host, type_instance = found, message = Host is found: little-net.local

Example notification when a host goes missing:

Notification: severity = FAILURE, host = little-net.local, plugin = host, type = host, type_instance = lost, message = Host not seen for 45 seconds: little-net.local

…ssing hosts.

This PR adds a new plugin that keeps track of hosts that have been seen by collectd, and
sends notifications when a new host is seen, or an existing is no longer seen for a
period of time, respectively.

It depends on liblmdb to provide the underlying index, chosen for low cost.

Example notification when a host is seen:

Notification: severity = OKAY, host = little-net.local, plugin = host, type = host, type_instance = found, message = Host is found: little-net.local

Example notification when a host goes missing:

Notification: severity = FAILURE, host = little-net.local, plugin = host, type = host, type_instance = lost, message = Host not seen for 45 seconds: little-net.local
@minfrin minfrin requested a review from a team as a code owner October 10, 2022 19:09
@minfrin
Copy link
Contributor Author

minfrin commented Oct 10, 2022

Replaces #3934

@eero-t
Copy link
Contributor

eero-t commented Oct 25, 2022

How this differs from the original version in #3357?

Nowadays new plugins are preferred to be added to collectd-6.0 branch. While there is no release of the v6 yet, there are few upsides to using v6 branch:

  • Support for labels e.g. in write_prometheus output plugin
  • Fewer CI issues than in main branch

(I'm not a collectd developer, just a developer of another new plugin.)

@mrunge
Copy link
Member

mrunge commented Nov 8, 2022

I can only fully support @eero-t s comments. There's nothing to add, other than the CI failures don't look connected to the PR here.

@minfrin
Copy link
Contributor Author

minfrin commented Jan 6, 2023

How this differs from the original version in #3357?

Original version was submitted in 2019, patch has been reapplied as the underlying codebase moved and conflicted.

Nowadays new plugins are preferred to be added to collectd-6.0 branch. While there is no release of the v6 yet, there are few upsides to using v6 branch:

  • Support for labels e.g. in write_prometheus output plugin
  • Fewer CI issues than in main branch

(I'm not a collectd developer, just a developer of another new plugin.)

This was written before collectd 6 existed.

I took a look a while back for details of what was involved in supporting collectd 6, but I couldn't find anything the described the changes needed. Does anyone know?

Alternatively if there is a "hello world" collectd 6 plugin that shows the required structure I can rewrite it again.

@eero-t
Copy link
Contributor

eero-t commented Jan 9, 2023

Collectd documentation migration guide is a bit sparse: https://collectd.org/wiki/index.php/V5_to_v6_migration_guide

But you can use some of the simpler plugins that are in process of being, or have already been migrated to v6, as examples.

You'll find them from the open & closed pull requests list: https://github.com/collectd/collectd/pulls?q=is%3Apr+%22collectd+6%22+is%3Aopen

@minfrin
Copy link
Contributor Author

minfrin commented Jan 9, 2023

Collectd documentation migration guide is a bit sparse: https://collectd.org/wiki/index.php/V5_to_v6_migration_guide

Am I right in understanding that metric_t is going away, and metric_family_t is replacing it?

But you can use some of the simpler plugins that are in process of being, or have already been migrated to v6, as examples.

You'll find them from the open & closed pull requests list: https://github.com/collectd/collectd/pulls?q=is%3Apr+%22collectd+6%22+is%3Aopen

I could find only one notification plugin in the list above, and that was new so didn't show what changes were needed.

The check_presence plugin does not make use of metric_t. Can you confirm if any changes are needed so I can make them? So far it looks like the notification mechanism is unchanged?

@eero-t
Copy link
Contributor

eero-t commented Jan 9, 2023

Am I right in understanding that metric_t is going away, and metric_family_t is replacing it?

Yes.

The check_presence plugin does not make use of metric_t. Can you confirm if any changes are needed so I can make them?

I'm just an author of one v6 metric plugin, and did not participate in v6 API design or discussions (they happened before I got involved). Maybe @mrunge can answer?

So far it looks like the notification mechanism is unchanged?

Best may be if you just checkout collectd-6.0 branch and try building your plugin against it, and whether the resulting plugin works.

@minfrin
Copy link
Contributor Author

minfrin commented Jan 11, 2023

Best may be if you just checkout collectd-6.0 branch and try building your plugin against it, and whether the resulting plugin works.

Is collectd-6.0 older or newer than the main branch?

Alas the collectd-6.0 branch currently fails to build on both MacOS and Linux:

  CC       src/aggregation.lo
src/aggregation.c:718:22: error: unknown type name 'metric_single_t'; did you mean 'metric_list_t'?
 static int agg_write(metric_single_t const *m, /* {{{ */
                      ^~~~~~~~~~~~~~~
                      metric_list_t
src/aggregation.c: In function 'module_register':
src/aggregation.c:744:40: error: 'agg_write' undeclared (first use in this function); did you mean 'agg_read'?
   plugin_register_write("aggregation", agg_write, /* user_data = */ NULL);
                                        ^~~~~~~~~
                                        agg_read
src/aggregation.c:744:40: note: each undeclared identifier is reported only once for each function it appears in
make[1]: *** [Makefile:7862: src/aggregation.lo] Error 1
make[1]: Leaving directory '/home/minfrin/src/collectd/collectd-6.0'
make: *** [Makefile:5528: all] Error 2

@eero-t
Copy link
Contributor

eero-t commented Jan 11, 2023

Is collectd-6.0 older or newer than the main branch?

Both, since it was branched couple of years ago. Many of the plugins have not been updated since that, as people have made their PRs against v5 (main branch). Before v6 is released, those changes need to be synched to v6.

For now it's better to to use --disable-all-plugins configure option and enable just the things you're explicitly interested about (= your own plugin).

@minfrin
Copy link
Contributor Author

minfrin commented Jan 11, 2023

I am stuck looking for detailed examples of write plugins. Specifically I have this code, because I need the name of the value_list_t host:

key.mv_size = sizeof(vl->host);
key.mv_data = (void *)vl->host;

What function would I need to use to extract a host from a metric?

@eero-t
Copy link
Contributor

eero-t commented Jan 12, 2023

I need the name of the value_list_t host:

key.mv_size = sizeof(vl->host);
key.mv_data = (void *)vl->host;

What function would I need to use to extract a host from a metric?

I'm not sure I understand you question. host member in value_list_t is just a string: https://github.com/collectd/collectd/blob/collectd-6.0/src/daemon/plugin.h#L93 ?

Or do you mean function for getting value of a given label for a metric: https://github.com/collectd/collectd/blob/collectd-6.0/src/daemon/metric.h#L115 ?

@minfrin
Copy link
Contributor Author

minfrin commented Jan 12, 2023

Or do you mean function for getting value of a given label for a metric: https://github.com/collectd/collectd/blob/collectd-6.0/src/daemon/metric.h#L115 ?

What I'm after specifically is, given a metric_family_t*, how do I extract the host from that?

There are no examples the existing code for use of metric_label_get(), and this function operates on a single metric, not the whole metric_family.

I'm stuck as to what the intended way is to gain access to explicit values. All the examples loop through all values, and don't care about the content.

@eero-t
Copy link
Contributor

eero-t commented Jan 12, 2023

I'm not collectd developer myself, but this seems fairly obvious from the metric.h header.

Just loop through the family (metric type) metric array and query host label for each of the items:
https://github.com/collectd/collectd/blob/collectd-6.0/src/daemon/metric.h#L121 ?

Something like (completely untested):

const char* has_family_label_value(family_t *fam, const char *label, *const char *check) {
  assert(fam && label && check);
  metric_list_t list = fam.metric;
  for (int i = 0; i < list.num; i++) {
     const char *value = metric_label_get(list.ptr[i], label);
     if (!value) continue;
     if (strcmp(value, check) == 0) return true;
  }
  return false;
}

If your intention is checking whether some host value is new, you obviously need to think carefully how you store your items (should you e.g. qsort items so you can bsearch) and do the queries & checks (e.g. is your inner loop for the list of hosts, or for metrics in given family) to avoid quadratic behavior when number of hosts and metrics grows.

@minfrin
Copy link
Contributor Author

minfrin commented Jan 12, 2023

When what is previously a trivial one liner is now nine lines of code, that's a red flag that something is being done incorrectly, or the function call I'm supposed to use is being missed.

@eero-t
Copy link
Contributor

eero-t commented Jan 12, 2023

There were no labels in v5, it's a completely new feature that matches to what modern telemetry components need (see e.g. OpenTelemetry spec), so comparing it to v5 is a bit pointless.

Because of your question, I was assuming that something in v6 would be setting host info to a label in metrics, and you would then want to get them from there, but maybe you want something else, I don't know... It all is in headers though, just look at them, and if that's not enough, check the corresponding .c file(s). They are all pretty straightforward.

PS. Adding one more database dependency to collectd seems a bit dubious, when existing plugins already depend on several different ones...

@minfrin
Copy link
Contributor Author

minfrin commented Jan 12, 2023

There were no labels in v5, it's a completely new feature that matches to what modern telemetry components need (see e.g. OpenTelemetry spec), so comparing it to v5 is a bit pointless.

Because of your question, I was assuming that something in v6 would be setting host info to a label in metrics, and you would then want to get them from there, but maybe you want something else, I don't know... It all is in headers though, just look at them, and if that's not enough, check the corresponding .c file(s). They are all pretty straightforward.

The reason I'm lost is that this piece of information no longer exists, containing the host that sent the metric:

https://github.com/collectd/collectd/blob/main/src/daemon/plugin.h#L108

What I don't see is any coherent mapping from these old values in collectd v5 to collectd v6. There are no #defines coming up in any searches corresponding to "host". Is the "host" now a user defined string that should be made configurable?

Is "host" optional now, or can I still get the name of the host that logged the metric?

PS. Adding one more database dependency to collectd seems a bit dubious, when existing plugins already depend on several different ones...

I covered this a number of years ago. The plugin needs to do a read on every metric write, and so the database used to do this must do extremely cheap lock-free reads. The database has been carefully chosen to meet this requirement. The database is purely a library, has no configuration, and takes up very little space. The collectd plugin mechanism already ensures that dependencies are limited in scope to the plugins that need them.

@eero-t
Copy link
Contributor

eero-t commented Jan 13, 2023

The host member is still there in the values list as can be seen from link above.

However, when looking at the code:

So yeah, it seems to be available only as label now, and by default, only for metrics coming from plugins that have not been converted yet.

However, already in v5:

So your plugin is indirectly tracking changes in non-changing global variable value by constantly inspecting (all?) metrics??? WTF?

That's something that makes sense to do only once. And you don't need collectd metrics for that, you can just query it directly.

Also, in #3357, you claim that "we have to detect hosts that have disappeared during times when collectd is not running, for example during restarts of collectd or reboots of the machine running collectd".

But this plugin is instead tracking changes in collectd configuration Hostname setting, and when that is not set, changes in the host hostname (hardcoded in /etc/ or coming from DNS). Which could change independently of whether the machine is running or not.

If you want to actually identify different host machines, use their ID: https://www.freedesktop.org/software/systemd/man/machine-id.html

Additionally, doing this in collectd, does not detect host going away. That's something that you would need to do outside of the host itself. So this whole plugin seems kind of bogus...

@minfrin
Copy link
Contributor Author

minfrin commented Jan 13, 2023

That's what I was looking for - host maps to "instance". Ideally this should be a #define and not a string buried in the code, as the risk is that plugins might start choosing their own constants.

However, already in v5:

So your plugin is indirectly tracking changes in non-changing global variable value by constantly inspecting (all?) metrics??? WTF?

That's something that makes sense to do only once. And you don't need collectd metrics for that, you can just query it directly.

Also, in #3357, you claim that "we have to detect hosts that have disappeared during times when collectd is not running, for example during restarts of collectd or reboots of the machine running collectd".

But this plugin is instead tracking changes in collectd configuration Hostname setting, and when that is not set, changes in the host hostname (hardcoded in /etc/ or coming from DNS). Which could change independently of whether the machine is running or not.

If you want to actually identify different host machines, use their ID: https://www.freedesktop.org/software/systemd/man/machine-id.html

Additionally, doing this in collectd, does not detect host going away. That's something that you would need to do outside of the host itself. So this whole plugin seems kind of bogus...

Please tone down the aggressive responses like "WTF" and "kind of bogus", they make working on this unnecessarily difficult.

For the vast majority of the plugins in collectd, this host is the source of the data, and so the host value will be picked up by a static mechanism, like a global static variable. Obviously this plugin is not relevant for these cases.

For network plugins which receive data from other hosts using protocols like the collectd protocol, SNMP, etc, the hostname of the remote host supplying the data is now important. Given that none of the network plugins appear to have been upgraded to collectd-6.0 yet, it's easy to miss this detail.

When network plugins are receiving data from other hosts, the host field is now no longer our host but the remote host instead, and the check_presence plugin kicks in. It will generate a notification the very first time some remote host was seen, and will generate a notification for you as soon as your remote host is not longer being seen.

Crucially you do not have to configure anything at all to get this information, it is completely automatic and zero effort.

We do not want a flood of false alerts when collectd is stopped and restarted. Nobody deserves a heart attack. For this reason we keep track of hosts across a collectd restart, so if the host was there when we stopped, and is still there when we started again, we don't say anything and panic anybody unnecessarily.

@eero-t
Copy link
Contributor

eero-t commented Jan 13, 2023

Sorry about the tone. Your clarifications should really have been in the initial PR description and/or the plugin documentation. :-/

(I.e. plugin being intended to extract host info from metrics provided by network plugins that collect data from other hosts, and which metric data includes remote host names.)

Other things that the PR description / docs should state is this being a write plugin. And like with other write plugins, I think that should also be reflected in its name.

Additionally, I don't think you should be registering your functions anywhere else than in module_register() function. There might be locking problems otherwise. If init fails, write function should not get called. I'm not sure about shutdown, but to be sure, your shutdown code should handle partial init.

Because module is threading internally... Have you run collectd with your plugin under Valgrind --tool=memcheck, --tool=drd and --tool=helgrind options?

For the vast majority of the plugins in collectd, this host is the source of the data, and so the host value will be picked up by a static mechanism, like a global static variable. Obviously this plugin is not relevant for these cases.

For network plugins which receive data from other hosts using protocols like the collectd protocol, SNMP, etc, the hostname of the remote host supplying the data is now important.

So this is intended for collectd configurations that include only networking plugins?

Given that none of the network plugins appear to have been upgraded to collectd-6.0 yet, it's easy to miss this detail.

Guilty. There's some recent discussion for migrating "network" plugin, see: #3675

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants