jb… a weblog by Jonathan Buys

Managing Nagios Configs

We don’t have a very big Nagios installation, comparatively anyway, but it is big enough to find that the default layout for configurations is insane. I tried using the provided layout, until I wound up with single text files with thousands of lines in them. This made it very hard to do individual customizations for servers, and separating out who wants to be notified for what. Here is what I came up with for managing our Nagios configs.

It seems that the repositories are always behind in Nagios, so it is one of the very few apps that I recommend installing from source. I install Nagios in /usr/local/nagios, the default when compiling, I’ll just call it $nag. The Nagios binary is in $nag/bin, the plugins in $nag/libexec, and the config files in $nag/etc. The easiest way to understand nagios is to follow its start up procedures. I keep an /etc/init.d/nagios file for initialization, The file defines, among other things, where the home directory for Nagios is, what config file to use as its base, and where the Nagios binary and plugins are. The important thing to understand is that this file is the first pointer in a long string of pointers that Nagios uses for configuration.

Inside the nagios.cfg file are the cfg_dir directives. These are pointers that tell Nagios that it can find additional configurations inside the directories listed. Once Nagios is given a directory to look at, it will read each file ending in .cfg inside of that directory. The first directory that I have listed is $nag/etc/defaults. I keep four files in this directory: commands.cfg, dependencies.cfg, generic.cfg, and timeperiods.cfg.

The file “commands.cfg” contains the definitions of all check commands that Nagios can understand. They look like this:

 # 'check_local_load' command definition
 define command{
        command_name    check_local_load
        command_line    $USER1$/check_load -w $ARG1$ -c $ARG2$
        }

The file also contains the alert commands, or what Nagios will do when it finds something that it needs to let you know about:

define command{
command_name notify-by-email
command_line	/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: 	$NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: 	$HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/	bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" 	$CONTACTEMAIL$
}

This allows us to call a command later in Nagios by it’s defined command_name,such as check_local_load, instead of having to call the entire command including arguments. Keeps the configs clean.

The next file, “generic.cfg”, contains templates for host configurations. This file allows us to do two things: list common options that are defined for all of the hosts, and separate hosts into notification groups. The definitions look like this:

define host{
        name                            generic-admin
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        check_command           check-host-alive
        max_check_attempts      3
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
				contact_groups          admin,admin_pager
        action_url /nagios/pnp/index.php?host=$HOSTNAME$
        }

There are two separate types of generic definitions, hosts and services, for the two types of monitoring that Nagios does. The important section for most of my purposes above is the “contact_groups” line. This allows me to group contacts with hosts, so it answers the question of “who gets notified if this server goes down?”. The same thing applies to the service template below.

define service{
        name                            generic-full	
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        notification_interval           120
        notification_period             24x7
        notification_options            w,c,r
	contact_groups                  admins,admin_pager,webmin
	process_perf_data 1
	action_url /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
        }

The other two files, timeperiods.cfg and dependencies.cfg, I haven’t done a whole lot with yet.

The next directory parsed as defined in nagios.cfg is $nag/etc/users, which, surprisingly enough, is where all of the users are defined. I keep two files in this directory, users.cfg and contactgroups,cfg. The users.cfg file contains a list of every user, and since I have different needs for pagers and regular email alerts, each user is defined twice:

define contact{
        contact_name                    Jon
        alias                           Jon Buys
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        service_notification_commands   service-notify-by-email
        host_notification_commands      host-notify-by-email
        email                           jbuys@dollarwork.com
        }

define contact{
        contact_name                    Jon_pager
        alias                           Jon Buys
  	  service_notification_period     24x7
  	  host_notification_period        24x7
  	  service_notification_options    u,c,r
  	  host_notification_options       d,u,r
  	  service_notification_commands   notify-for-disk
  	  host_notification_commands      host-notify-by-email
  	  email 				5555555555@my.phone.company.net
  	  }

This lets me group the users more effectively in the second file, contactgroups.cfg:

define contactgroup{
        contactgroup_name admins
  	  alias           sysadmins
  	  members Jon,Gary,nagios_alerts
  	  }

define contactgroup{
        contactgroup_name admin_pager
        alias           sysadmin pagers
        members Jon_pager,Gary_pager,OSS_Primary_Phone,nagios_alerts
}

Now, check the definitions in the generic.cfg file above, and you’ll start to see the chain of config files coming together. The glue sticking it all together is the server definition files. Each logical group of servers gets their own directory, defined in nagios.cfg. For example, we have a group of servers that provides a specific web service (which I’ll call “mesh”), there are web servers, application servers, and database servers that I group together in one directory, named “mesh”. Inside of this directory, each server has its own config file, named like $hostname.cfg. There is also a mesh.cfg, which groups all of the servers together in a host group. The $hostname.cfg files look like this:

 define host{
        use                     generic-host
  	  host_name              	m-app1 
  	  alias                   m-app1
  	  address                 10.10.10.1
  	  }

define service{
  	  use                             generic-full
  	  host_name                       m-app1
  	  service_description             PING
  	  check_command                   check_ping!100.0,20%!500.0,60%
  	  }

define service{
  	  use                             generic-full
  	  host_name                       m-app1
  	  service_description             DISKUSE
  	  check_command                   check_nrpe!check_df
  	  }

Each server has a host definition at the top, and all of the services that are monitored on that server at the bottom. The first section’s line “use generic-host” calls the “generic-host” template from the generic.cfg file above. Each subsequent “define service” section has a “use” line that also calls the templates defined in generic.cfg. Putting each server in its own file makes it very easy to add and remove servers from Nagios. To remove them, just remove (or, safer, rename) the $hostname.cfg file and delete the name from the $groupname.cfg file. It’s also very easy to script the creation of new hosts given a list of host names and IP addresses.

The mesh.cfg file contains the hostgroup configuration for the group:

define hostgroup{
   	 hostgroup_name  mesh
   	 alias           Mesh Production
   	 members         mdbs1,mdbs2,mdbs3,mdbs4,mdbs5,mdbs6,mdbs7,m-app1,m-app2,m-app3,m-store1,m-store2,m-nfs1,m-nfs2
   	 }

This file is not as important, but it makes the Nagios web interface a little more helpful.

You’ll also notice that the check_command line above contains “check_nrpe!check_df”. This means that I use the nrpe (Nagios Remote Plugin Execution) add-on to actually monitor the services on the remote hosts. Each server has nrpe installed, and has one configuration file (/usr/local/nagios/etc/nrpe.cfg). The nrpe.cfg file has a corresponding line that says

command[check_df]=/usr/local/nagios/libexec/check_disk -e -L -w 6% -c 4%

This translates the check_df command sent by the check_nrpe command into the longer command defined above. This makes it easy to install and configure nrpe once, then zip up the /usr/local/nagios directory and unzip it on all new servers.

Nagios is nearly limitless in its abilities, but but because of the complexity of its configuration it can be daunting to newcomers. This setup is designed to make it just a little bit easier to understand, and easier to script.

December 9, 2009 - 5 minute read - sysadmin nagios

Nagios Check Scheduling

Or, maybe a better title for this would be “They rebooted the server, why didn’t I get a page?” I’ve had that question asked of me a few times, and I’ve never had a good answer, so I thought I’d take a closer look at Nagios and see what is going on.

November 5, 2008 - 3 minute read - nagios sysadmin linux work