-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check_presence plugin: Add new plugin to detect new and missing hosts #4055
base: main
Are you sure you want to change the base?
Conversation
…ssing hosts. This PR adds a new plugin that keeps track of hosts that have been seen by collectd, and sends notifications when a new host is seen, or an existing is no longer seen for a period of time, respectively. It depends on liblmdb to provide the underlying index, chosen for low cost. Example notification when a host is seen: Notification: severity = OKAY, host = little-net.local, plugin = host, type = host, type_instance = found, message = Host is found: little-net.local Example notification when a host goes missing: Notification: severity = FAILURE, host = little-net.local, plugin = host, type = host, type_instance = lost, message = Host not seen for 45 seconds: little-net.local
Replaces #3934 |
How this differs from the original version in #3357? Nowadays new plugins are preferred to be added to
(I'm not a collectd developer, just a developer of another new plugin.) |
I can only fully support @eero-t s comments. There's nothing to add, other than the CI failures don't look connected to the PR here. |
Original version was submitted in 2019, patch has been reapplied as the underlying codebase moved and conflicted.
This was written before collectd 6 existed. I took a look a while back for details of what was involved in supporting collectd 6, but I couldn't find anything the described the changes needed. Does anyone know? Alternatively if there is a "hello world" collectd 6 plugin that shows the required structure I can rewrite it again. |
Collectd documentation migration guide is a bit sparse: https://collectd.org/wiki/index.php/V5_to_v6_migration_guide But you can use some of the simpler plugins that are in process of being, or have already been migrated to v6, as examples. You'll find them from the open & closed pull requests list: https://github.com/collectd/collectd/pulls?q=is%3Apr+%22collectd+6%22+is%3Aopen |
Am I right in understanding that metric_t is going away, and metric_family_t is replacing it?
I could find only one notification plugin in the list above, and that was new so didn't show what changes were needed. The check_presence plugin does not make use of metric_t. Can you confirm if any changes are needed so I can make them? So far it looks like the notification mechanism is unchanged? |
Yes.
I'm just an author of one v6 metric plugin, and did not participate in v6 API design or discussions (they happened before I got involved). Maybe @mrunge can answer?
Best may be if you just checkout |
Is collectd-6.0 older or newer than the main branch? Alas the collectd-6.0 branch currently fails to build on both MacOS and Linux:
|
Both, since it was branched couple of years ago. Many of the plugins have not been updated since that, as people have made their PRs against v5 (main branch). Before v6 is released, those changes need to be synched to v6. For now it's better to to use |
I am stuck looking for detailed examples of write plugins. Specifically I have this code, because I need the name of the value_list_t host:
What function would I need to use to extract a host from a metric? |
I'm not sure I understand you question. Or do you mean function for getting value of a given label for a metric: https://github.com/collectd/collectd/blob/collectd-6.0/src/daemon/metric.h#L115 ? |
What I'm after specifically is, given a metric_family_t*, how do I extract the host from that? There are no examples the existing code for use of metric_label_get(), and this function operates on a single metric, not the whole metric_family. I'm stuck as to what the intended way is to gain access to explicit values. All the examples loop through all values, and don't care about the content. |
I'm not collectd developer myself, but this seems fairly obvious from the Just loop through the family (metric type) metric array and query host label for each of the items: Something like (completely untested):
If your intention is checking whether some host value is new, you obviously need to think carefully how you store your items (should you e.g. qsort items so you can bsearch) and do the queries & checks (e.g. is your inner loop for the list of hosts, or for metrics in given family) to avoid quadratic behavior when number of hosts and metrics grows. |
When what is previously a trivial one liner is now nine lines of code, that's a red flag that something is being done incorrectly, or the function call I'm supposed to use is being missed. |
There were no labels in v5, it's a completely new feature that matches to what modern telemetry components need (see e.g. OpenTelemetry spec), so comparing it to v5 is a bit pointless. Because of your question, I was assuming that something in v6 would be setting host info to a label in metrics, and you would then want to get them from there, but maybe you want something else, I don't know... It all is in headers though, just look at them, and if that's not enough, check the corresponding .c file(s). They are all pretty straightforward. PS. Adding one more database dependency to collectd seems a bit dubious, when existing plugins already depend on several different ones... |
The reason I'm lost is that this piece of information no longer exists, containing the host that sent the metric: https://github.com/collectd/collectd/blob/main/src/daemon/plugin.h#L108 What I don't see is any coherent mapping from these old values in collectd v5 to collectd v6. There are no #defines coming up in any searches corresponding to "host". Is the "host" now a user defined string that should be made configurable? Is "host" optional now, or can I still get the name of the host that logged the metric?
I covered this a number of years ago. The plugin needs to do a read on every metric write, and so the database used to do this must do extremely cheap lock-free reads. The database has been carefully chosen to meet this requirement. The database is purely a library, has no configuration, and takes up very little space. The collectd plugin mechanism already ensures that dependencies are limited in scope to the plugins that need them. |
The host member is still there in the values list as can be seen from link above. However, when looking at the code:
So yeah, it seems to be available only as label now, and by default, only for metrics coming from plugins that have not been converted yet. However, already in v5:
So your plugin is indirectly tracking changes in non-changing global variable value by constantly inspecting (all?) metrics??? WTF? That's something that makes sense to do only once. And you don't need collectd metrics for that, you can just query it directly. Also, in #3357, you claim that "we have to detect hosts that have disappeared during times when collectd is not running, for example during restarts of collectd or reboots of the machine running collectd". But this plugin is instead tracking changes in collectd configuration If you want to actually identify different host machines, use their ID: https://www.freedesktop.org/software/systemd/man/machine-id.html Additionally, doing this in collectd, does not detect host going away. That's something that you would need to do outside of the host itself. So this whole plugin seems kind of bogus... |
That's what I was looking for - host maps to "instance". Ideally this should be a #define and not a string buried in the code, as the risk is that plugins might start choosing their own constants.
Please tone down the aggressive responses like "WTF" and "kind of bogus", they make working on this unnecessarily difficult. For the vast majority of the plugins in collectd, this host is the source of the data, and so the host value will be picked up by a static mechanism, like a global static variable. Obviously this plugin is not relevant for these cases. For network plugins which receive data from other hosts using protocols like the collectd protocol, SNMP, etc, the hostname of the remote host supplying the data is now important. Given that none of the network plugins appear to have been upgraded to collectd-6.0 yet, it's easy to miss this detail. When network plugins are receiving data from other hosts, the host field is now no longer our host but the remote host instead, and the check_presence plugin kicks in. It will generate a notification the very first time some remote host was seen, and will generate a notification for you as soon as your remote host is not longer being seen. Crucially you do not have to configure anything at all to get this information, it is completely automatic and zero effort. We do not want a flood of false alerts when collectd is stopped and restarted. Nobody deserves a heart attack. For this reason we keep track of hosts across a collectd restart, so if the host was there when we stopped, and is still there when we started again, we don't say anything and panic anybody unnecessarily. |
Sorry about the tone. Your clarifications should really have been in the initial PR description and/or the plugin documentation. :-/ (I.e. plugin being intended to extract host info from metrics provided by network plugins that collect data from other hosts, and which metric data includes remote host names.) Other things that the PR description / docs should state is this being a write plugin. And like with other write plugins, I think that should also be reflected in its name. Additionally, I don't think you should be registering your functions anywhere else than in module_register() function. There might be locking problems otherwise. If init fails, write function should not get called. I'm not sure about shutdown, but to be sure, your shutdown code should handle partial init. Because module is threading internally... Have you run collectd with your plugin under Valgrind
So this is intended for collectd configurations that include only networking plugins?
Guilty. There's some recent discussion for migrating "network" plugin, see: #3675 |
ChangeLog: check_presence plugin: Add new plugin to detect new and missing hosts.
This PR adds a new plugin that keeps track of hosts that have been seen by collectd, and sends notifications when a new host is seen, or an existing is no longer seen for a period of time, respectively.
It depends on liblmdb to provide the underlying index, chosen for low cost.
Example notification when a host is seen:
Notification: severity = OKAY, host = little-net.local, plugin = host, type = host, type_instance = found, message = Host is found: little-net.local
Example notification when a host goes missing:
Notification: severity = FAILURE, host = little-net.local, plugin = host, type = host, type_instance = lost, message = Host not seen for 45 seconds: little-net.local