Like any other monitoring system, Nagios can produce false alarms. This usually happens when Nagios does not receive a reply from the monitored host within a predefined timeout. Before marking a service as down, Nagios checks it three times; if all three checks fail, the service is marked down and the administrator gets an alert about its critical status. Depending on the configuration, Nagios may also notify the administrator when only one of those checks fails (by e-mail, Twitter, chat message, SMS, etc.).
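The number of checks is set per service, so it may help to see where it lives. Below is a rough sketch of a service definition, not a copy of any real config: the host name web01, the generic-service template and the remote check_load command are assumptions made for illustration.

```cfg
define service {
    use                   generic-service      ; assumed template name
    host_name             web01                ; hypothetical host
    service_description   Load via NRPE
    check_command         check_nrpe!check_load ; check_load is assumed to exist on the remote side
    max_check_attempts    3                    ; failed checks in a row before the service is marked down
    check_interval        5                    ; time between regular checks (minutes with the default interval_length)
    retry_interval        1                    ; time between re-checks after a failure
}
```

With values like these, a single timed-out check is retried twice more before the critical state becomes final.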
If you occasionally get false alarms while the service is actually online, it makes sense to increase the timeout from the default 10 seconds to, say, 20 seconds. And if you have phone call alarms configured in Nagios, this small change may help you sleep better.
Open the Nagios config file where check commands are defined (usually /etc/nagios/commands.cfg), find the block named check_nrpe and add -t 20 to the end of its command_line, so that it looks like this:
define command {
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t 20
}
Then restart Nagios.
Besides check_nrpe there are other commands such as check_http and check_smtp. They also support the -t option, so you can modify them the same way as check_nrpe, depending on which of your checks time out.
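For example, a check_http command with the same 20 second timeout might look like the block below. This is only a sketch: the -I $HOSTADDRESS$ $ARG1$ part is an assumption about how the command is defined, so keep whatever arguments your own commands.cfg already has and just append -t 20.

```cfg
define command {
    command_name    check_http
    ; existing arguments are assumed; only "-t 20" is the actual change
    command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$ -t 20
}
```

Restart Nagios after this change as well.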