jb… a weblog by Jonathan Buys

Managing Nagios Configs

December 9, 2009

We don’t have a very big Nagios installation, comparatively anyway, but it is big enough to find that the default layout for configurations is insane. I tried using the provided layout, until I wound up with single text files with thousands of lines in them. This made it very hard to customize individual servers, or to separate out who gets notified for what. Here is what I came up with for managing our Nagios configs.

It seems that the distribution repositories are always behind on Nagios, so it is one of the very few apps that I recommend installing from source. I install Nagios in /usr/local/nagios, the default when compiling, which I’ll just call $nag. The Nagios binary is in $nag/bin, the plugins in $nag/libexec, and the config files in $nag/etc. The easiest way to understand Nagios is to follow its startup procedure. I keep an /etc/init.d/nagios file for initialization. The file defines, among other things, where the home directory for Nagios is, what config file to use as its base (nagios.cfg), and where the Nagios binary and plugins are. The important thing to understand is that this file is the first pointer in a long string of pointers that Nagios uses for configuration.

Inside the nagios.cfg file are the cfg_dir directives. These are pointers that tell Nagios that it can find additional configurations inside the directories listed. Once Nagios is given a directory to look at, it will read each file ending in .cfg inside of that directory. The first directory that I have listed is $nag/etc/defaults. I keep four files in this directory: commands.cfg, dependencies.cfg, generic.cfg, and timeperiods.cfg.
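To see the chain of pointers for yourself, a small shell function can follow the cfg_dir lines and print every file Nagios would read. This is only a sketch: the function name list_cfg_files is made up for illustration, and it assumes cfg_dir lines are written as cfg_dir=/some/path with nothing after the path.

```shell
#!/bin/sh
# Hypothetical helper: print every *.cfg file reachable through the
# cfg_dir directives of a main Nagios config file.
list_cfg_files() {
    main_cfg="$1"
    # Keep only the path part of lines such as:
    #   cfg_dir=/usr/local/nagios/etc/defaults
    # Commented-out lines (#cfg_dir=...) do not match and are skipped.
    sed -n 's/^[[:space:]]*cfg_dir[[:space:]]*=[[:space:]]*//p' "$main_cfg" |
    while IFS= read -r dir; do
        # find also descends into subdirectories of each cfg_dir.
        find "$dir" -type f -name '*.cfg' | sort
    done
}
```

With the layout described here, list_cfg_files /usr/local/nagios/etc/nagios.cfg would start with the four files in $nag/etc/defaults.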

The file “commands.cfg” contains the definitions of all check commands that Nagios can understand. They look like this:

# 'check_local_load' command definition
define command{
        command_name    check_local_load
        command_line    $USER1$/check_load -w $ARG1$ -c $ARG2$
}

The file also contains the alert commands, or what Nagios will do when it finds something that it needs to let you know about:

define command{
        command_name    notify-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}

This allows us to call a command later in Nagios by its defined command_name, such as check_local_load, instead of having to spell out the entire command including arguments. It keeps the configs clean.

The next file, “generic.cfg”, contains templates for host configurations. This file allows us to do two things: list common options that are defined for all of the hosts, and separate hosts into notification groups. The definitions look like this:

define host{
        name                            generic-admin
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        check_command                   check-host-alive
        max_check_attempts              3
        notification_interval           120
        notification_period             24x7
        notification_options            d,u,r
        contact_groups                  admin,admin_pager
        action_url                      /nagios/pnp/index.php?host=$HOSTNAME$
}

There are two separate types of generic definitions, hosts and services, for the two types of monitoring that Nagios does. The important section for most of my purposes above is the “contact_groups” line. This allows me to group contacts with hosts, so it answers the question of “who gets notified if this server goes down?”. The same thing applies to the service template below.

define service{
        name                            generic-full
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        notification_interval           120
        notification_period             24x7
        notification_options            w,c,r
        contact_groups                  admins,admin_pager,webmin
        action_url                      /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
}

The other two files, timeperiods.cfg and dependencies.cfg, I haven’t done a whole lot with yet.

The next directory parsed as defined in nagios.cfg is $nag/etc/users, which, surprisingly enough, is where all of the users are defined. I keep two files in this directory, users.cfg and contactgroups.cfg. The users.cfg file contains a list of every user, and since I have different needs for pagers and regular email alerts, each user is defined twice:

define contact{
        contact_name                    Jon
        alias                           Jon Buys
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        service_notification_commands   service-notify-by-email
        host_notification_commands      host-notify-by-email
        email                           jbuys@dollarwork.com
}

define contact{
        contact_name                    Jon_pager
        alias                           Jon Buys
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    u,c,r
        host_notification_options       d,u,r
        service_notification_commands   notify-for-disk
        host_notification_commands      host-notify-by-email
        email                           5555555555@my.phone.company.net
}

This lets me group the users more effectively in the second file, contactgroups.cfg:

define contactgroup{
        contactgroup_name       admins
        alias                   sysadmins
        members                 Jon,Gary,nagios_alerts
}

define contactgroup{
        contactgroup_name       admin_pager
        alias                   sysadmin pagers
        members                 Jon_pager,Gary_pager,OSS_Primary_Phone,nagios_alerts
}

Now, check the definitions in the generic.cfg file above, and you’ll start to see the chain of config files coming together. The glue sticking it all together is the server definition files. Each logical group of servers gets its own directory, defined in nagios.cfg. For example, we have a group of servers that provides a specific web service (which I’ll call “mesh”). There are web servers, application servers, and database servers that I group together in one directory, named “mesh”. Inside of this directory, each server has its own config file, named like $hostname.cfg. There is also a mesh.cfg, which groups all of the servers together in a host group. The $hostname.cfg files look like this:

define host{
        use                             generic-host
        host_name                       m-app1
        alias                           m-app1
}

define service{
        use                             generic-full
        host_name                       m-app1
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
}

define service{
        use                             generic-full
        host_name                       m-app1
        service_description             DISKUSE
        check_command                   check_nrpe!check_df
}

Each server has a host definition at the top, and all of the services that are monitored on that server at the bottom. The first section’s line “use generic-host” calls the “generic-host” template from the generic.cfg file above. Each subsequent “define service” section has a “use” line that also calls the templates defined in generic.cfg. Putting each server in its own file makes it very easy to add and remove servers from Nagios. To remove them, just remove (or, safer, rename) the $hostname.cfg file and delete the name from the $groupname.cfg file. It’s also very easy to script the creation of new hosts given a list of host names and IP addresses.
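Scripting the creation of new hosts could look something like the sketch below. It reads "hostname ip" pairs on standard input and writes one $hostname.cfg per line. The function name make_host_cfgs is hypothetical, the address line is my addition (the listing above does not show one), and only the PING service is generated; the template names and check_command are copied from the examples above.

```shell
#!/bin/sh
# Hypothetical generator: write one $hostname.cfg per "hostname ip" line
# read from standard input into the directory given as $1.
make_host_cfgs() {
    outdir="$1"
    while read -r host ip; do
        # Skip blank lines and comment lines in the input list.
        case "$host" in ''|'#'*) continue ;; esac
        cat > "$outdir/$host.cfg" <<EOF
define host{
        use                             generic-host
        host_name                       $host
        alias                           $host
        address                         $ip
}

define service{
        use                             generic-full
        host_name                       $host
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
}
EOF
    done
}
```

Something like make_host_cfgs $nag/etc/mesh < hosts.txt would populate the group directory; running nagios -v against nagios.cfg before restarting is a sensible check that the generated files parse.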

The mesh.cfg file contains the hostgroup configuration for the group:

define hostgroup{
        hostgroup_name  mesh
        alias           Mesh Production
        members         mdbs1,mdbs2,mdbs3,mdbs4,mdbs5,mdbs6,mdbs7,m-app1,m-app2,m-app3,m-store1,m-store2,m-nfs1,m-nfs2
}

This file is not as important, but it makes the Nagios web interface a little more helpful.
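Because every server has its own file, the members line can in principle be rebuilt from the directory listing instead of being edited by hand. The sketch below assumes that every .cfg file in the directory other than the group file is named after a host; make_hostgroup is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print a hostgroup definition whose members are the
# hosts that have a $hostname.cfg file in the given directory.
make_hostgroup() {
    group="$1"; alias_text="$2"; dir="$3"
    members=$(for f in "$dir"/*.cfg; do
        name=$(basename "$f" .cfg)
        # The group file itself (for example mesh.cfg) is not a host.
        [ "$name" = "$group" ] || printf '%s\n' "$name"
    done | sort | paste -sd, -)
    printf 'define hostgroup{\n'
    printf '        hostgroup_name  %s\n' "$group"
    printf '        alias           %s\n' "$alias_text"
    printf '        members         %s\n' "$members"
    printf '}\n'
}
```

For example, make_hostgroup mesh "Mesh Production" $nag/etc/mesh > $nag/etc/mesh/mesh.cfg would regenerate the file after a host is added or removed.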

You’ll also notice that the check_command line above contains “check_nrpe!check_df”. This means that I use the nrpe (Nagios Remote Plugin Executor) add-on to actually monitor the services on the remote hosts. Each server has nrpe installed, and has one configuration file (/usr/local/nagios/etc/nrpe.cfg). The nrpe.cfg file has a corresponding line that says

command[check_df]=/usr/local/nagios/libexec/check_disk -e -L -w 6% -c 4%

This translates the check_df command sent by the check_nrpe command into the longer command defined above. This makes it easy to install and configure nrpe once, then zip up the /usr/local/nagios directory and unzip it on all new servers.

Nagios is nearly limitless in its abilities, but because of the complexity of its configuration it can be daunting to newcomers. This setup is designed to make it just a little bit easier to understand, and easier to script.

Blizzard 2009

December 9, 2009

100_1931, originally uploaded by jonbuys.

Iowa got its first big storm of the winter season yesterday, and as of right now it's still going on. We couldn't go anywhere even if we wanted to. We got about 13" of snow so far, but the wind gusts up to 50mph are the big problem. Just about everything is shut down: schools, work places, and even some of the larger roads.

Good day to get caught up on some things I've been meaning to get done.

New SysAdmin Tips

December 4, 2009

My answer to a great question over at serverfault.

First off, find your logs. Most Linux distros log to /var/log/messages, although I’ve seen a couple log to /var/log/syslog. If something is wrong, most likely there will be some relevant information in the logs. Also, if you are dealing with email at all, don’t forget /var/log/mail. Double-check your applications, find out if any of them log somewhere ridiculous, outside of syslog.

Brush up on your vi skills. Nano might be what all the cool kids are using these days, but experience has taught me that vi is the only text editor that is guaranteed to be on the system. Once you get used to the keyboard shortcuts, and start creating your own triggers, vi will be like second nature to you.

Read the man page, and then run the following commands on each machine, and copy the results into your documentation:

    cat /etc/*release*
    cat /etc/hosts
    cat /etc/resolv.conf
    cat /etc/nsswitch.conf
    df -h
    ifconfig -a
    free -m
    crontab -l
    ls /etc/cron.d
    echo $SHELL

That will serve as the beginnings of your documentation. Those commands let you know your environment, and can help narrow down problems later on.

Grep through your logs and search for “error” or “failed”. That will give you an idea of what’s not working as it should. Your users will give you their opinion on what’s wrong; listen closely to what they have to say. They don’t understand the system, but they see it in a different way than you do.
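A minimal version of that search, wrapped in a function so it can be pointed at whichever log files a system uses. The name scan_log is just for illustration; the match is case-insensitive so “Failed” and “ERROR” are caught too.

```shell
#!/bin/sh
# Hypothetical helper: show lines mentioning "error" or "failed" in the
# given log files, case-insensitively, with line numbers.
scan_log() {
    grep -Ein 'error|failed' "$@"
}
```

Typical use would be scan_log /var/log/messages, or scan_log /var/log/syslog on distros that log there.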

When you have a problem, check things in this order:

  1. Disk Space (df -h): Linux, and some apps that run on Linux, do some very strange things when disk space runs out. It may seem unrelated, until you check and find a filesystem 100% full.

  2. Top: Top will let you know if you’ve got some process that’s stuck out there eating up all of your available CPU cycles. Nothing should consume 99% CPU for any extended period of time. If it’s a legitimate process, it should probably fluctuate up and down. While you are in top, check…

  3. System Load: The system load should normally be below 3 on a standard server or workstation. The load average counts processes that are running, waiting for a CPU, or blocked on I/O (heavy swapping shows up here too), so read it against the number of CPU cores in the machine.

  4. Memory (free -m): RAM use in Linux is a little different. It’s not uncommon to see a server with nearly all of its RAM used up. Don’t panic if you see this; it’s mostly just cache, and it will be released as needed. However, pay close attention to the amount of swap in use. If possible, keep this as close to zero as you can. Insufficient memory can lead to all kinds of performance problems.

  5. Logs: Go back to your logs, run tail -n 500 /var/log/messages | more, and start reading through to see what’s been going on. Hopefully, the logs will be able to point you in the direction you need to go next.
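The checklist above can be strung together in a small script. This is a sketch, not a finished tool: the function names and the 90% threshold are arbitrary choices, top -b assumes the procps version of top found on most Linux systems, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print the mount point and usage of every
# filesystem at or above the given percentage.
full_filesystems() {
    threshold="$1"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5; sub(/%/, "", use)        # "95%" -> "95"
        if (use + 0 >= limit) print $6, $5
    }'
}

# Run the five checks in the order listed above.
triage() {
    df -P | full_filesystems 90      # 1. disk space (threshold is an assumption)
    top -b -n 1 | head -n 15         # 2. top, one pass in batch mode
    uptime                           # 3. system load
    free -m                          # 4. memory and swap
    tail -n 500 /var/log/messages    # 5. recent log entries
}
```

Calling triage prints everything in one go; df -P | full_filesystems 90 on its own is handy in a cron job.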

A well maintained Linux server can run for years without problems. We just shut one down that had been running for 748 days, and we only shut it down because we had migrated the application over to new hardware. Hopefully, this will help you get your feet wet, and get you off to a good start.

One last thing, always make a copy of a config file you intend to change, and always copy the line you are changing, and comment out the original, adding your reason for changing it. This will get you into the habit of documenting as you go, and may save your hide 9 months down the road.
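The first half of that habit is easy to automate. The function below is one way to do it, with a hypothetical name and a timestamp suffix chosen for illustration; commenting out the original line and noting the reason still has to be done by hand in the editor.

```shell
#!/bin/sh
# Hypothetical helper: copy a config file to a timestamped backup next to
# the original and print the name of the backup.
backup_config() {
    file="$1"
    backup="$file.$(date +%Y%m%d-%H%M%S).bak"
    # -p keeps the mode and timestamps of the original.
    cp -p "$file" "$backup" && printf '%s\n' "$backup"
}
```

For example, backup_config /etc/resolv.conf before opening the file in vi.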

Linux Hidden ARP

October 9, 2009

To enable an interface on a web server to be part of an IBM load balanced cluster, we need to be able to share an IP address between multiple machines. This breaks the IP protocol however, because you could never be sure which machine will answer a request for that IP address. To fix this problem, we need to get down into the IP protocol and investigate how the Address Resolution Protocol, or ARP, works.

Bear with me as I go into a short description of how an IP device operates on an IP network. When a device receives a packet from its network, it will look at the destination IP address and ask a series of questions about it:

  1. Is this MY ip address?
  2. Is this ip address on a network that I am directly connected to?
  3. Do I know how to get to this network?

If the answer to the first question is yes, then the job is done, because the packet reached its destination. If the answer is no, it asks the second question. If the answer to the second question is no, it asks the third question, and either drops the packet as unroutable, or forwards the packet on to the next IP hop, normally the device’s default gateway.

However, if the answer to the second question is yes, the device follows another method to determine how to get the packet to its destination. IP addresses are not really used on local networks except by higher level tools or network aware applications. On the lower level, all local subnet traffic is delivered by MAC address. So when the device needs to send a packet to an IP address on the subnet that it is attached to, it follows these steps:

  1. Check my ARP table for an IP to MAC address mapping
  2. If needed, issue an ARP broadcast for the IP address: a question going out to all devices on the subnet that simply asks “if this is your IP address, give me your MAC address”
  3. Once the reply to the ARP request is received, the packet is forwarded to the appropriate host.

So, to put this all in perspective, when multiple machines share the same IP address, each of the machines will reply to the ARP request, and depending on the order in which the replies are received, it is entirely possible that a different machine will respond each time. When this happens, it breaks the load balancing architecture, and brings us down to one server actually in use.

The next question is normally: Why is that? Why do the web servers need that IP address anyway? The answer to this is also deep in the IP protocol, and requires a brief explanation of how the load balancing architecture works.

To the outside world, there is one IP address for myserv.whatever, our public address. This address is assigned in three places on one subnet: the load balancer, the first web server, and the second web server. The only machine that needs to respond to ARP requests is the load balancer. When the load balancer receives a packet destined for that address, it replaces the destination MAC address with the MAC address of one of the web servers, and forwards it on. This packet still has the original source and destination IP addresses on it, so remember what happens when an IP device on an IP network receives a packet… it asks the three questions outlined above. So, if the web servers did not have the address assigned to them, they would drop the packet (because they are not set up to route, they would not bother asking the second or third questions). Since the web servers do have the IP address assigned to one of their interfaces, they accept the packet and respond to the request (usually an HTTP request).

So, that covers the why; now let’s look at the how. Enable the hidden ARP function by entering the following into /etc/sysctl.conf:

# Disable response to broadcasts. 
# You don't want yourself becoming a Smurf amplifier.
net.ipv4.icmp_echo_ignore_broadcasts = 1 
# enable route verification on all interfaces 
net.ipv4.conf.all.rp_filter = 1 
# enable ipV6 forwarding 
#net.ipv6.conf.all.forwarding = 1 
net.ipv4.conf.all.arp_ignore = 3 
net.ipv4.conf.all.arp_announce = 2
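Before loading the file with sysctl -p, it can be worth confirming that it really sets the two ARP keys and that nothing later in the file overrides them. The parser below is a rough sketch with a made-up name; it only understands simple key = value lines and skips lines that start with # or ;.

```shell
#!/bin/sh
# Hypothetical helper: print the last value assigned to a key in a
# sysctl.conf-style file (the last assignment is the one that sticks).
sysctl_conf_value() {
    file="$1"; key="$2"
    awk -F= -v key="$key" '
        /^[[:space:]]*[#;]/ { next }       # skip comment lines
        {
            k = $1; v = $2
            gsub(/[[:space:]]/, "", k)     # strip spaces around the key
            gsub(/[[:space:]]/, "", v)     # and around the value
            if (k == key) last = v
        }
        END { if (last != "") print last }
    ' "$file"
}
```

With the file above, sysctl_conf_value /etc/sysctl.conf net.ipv4.conf.all.arp_ignore should print 3, and the commented-out IPv6 forwarding key should print nothing.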

The relevant settings are explained here:

arp_ignore = 3: Do not reply for local addresses configured with scope host, only resolutions for global and link addresses are replied.

For this setting the really interesting part is the “configured with scope host” part. Before, when using ifconfig to assign addresses to interfaces, we did not have the option to configure a scope on an address. A newer (well, relatively speaking) command, ip addr, is needed to assign the scope of host to an address on the loopback device. The command to do this is:

ip addr add scope host dev lo label lo:1

There are some important differences in the syntax of this command that need to be understood to make use of it on a regular basis. The first is the idea of a label being added to an interface. ip addr does not attempt to fool you into thinking that you have multiple physical interfaces; it lets you add multiple addresses to an existing interface and apply labels to them to distinguish them from each other. The labels allow ifconfig to read the configuration and see the labels as different devices.


lo	Link encap:Local Loopback 
	inet addr: Mask: 
	inet6 addr: ::1/128 Scope:Host 
	RX packets:9477 errors:0 dropped:0 overruns:0 frame:0 
	TX packets:9477 errors:0 dropped:0 overruns:0 carrier:0 
	collisions:0 txqueuelen:0 
	RX bytes:902055 (880.9 Kb) TX bytes:902055 (880.9 Kb)
lo:1	Link encap:Local Loopback
		inet addr: Mask:
lo:2	Link encap:Local Loopback 
		inet addr: Mask: 

Here, lo, lo:1, and lo:2 are viewed as separate devices by ifconfig.
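To make the shape of the labelled ip addr add command concrete, here is a sketch that only prints the command for a given address and label rather than running it. The function name is hypothetical, 192.0.2.10 is a documentation address standing in for the real shared address, and the /32 prefix length is my assumption.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the command that would attach a
# hidden address to the loopback device under the given label.
hidden_addr_cmd() {
    vip="$1"; label="$2"
    # /32 is assumed so that no extra route is implied for the address.
    printf 'ip addr add %s/32 scope host dev lo label %s\n' "$vip" "$label"
}

hidden_addr_cmd 192.0.2.10 lo:1
```

Printing the command first makes it easy to review a batch of addresses before piping the output to a root shell.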

Here is the output from the ip addr show command:

1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet brd scope host lo
    inet scope host lo:1
    inet scope host lo:2
    inet scope host lo:3
    inet scope host lo:4
    inet scope host lo:5
    inet scope host lo:8
    inet scope host lo:9
    inet scope host lo:10
    inet scope host lo:11
    inet scope host lo:12
    inet scope host lo:13
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever

Here we can see that the lo:1 (etc…) addresses are assigned directly under the standard lo interface, and are only differentiated from the standard loopback address by their label.

Here is the same output from the eth2 device:

4: eth2: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:10:18:2e:2e:a2 brd ff:ff:ff:ff:ff:ff
    inet brd scope global eth2
    inet brd scope global secondary eth2:1
    inet brd scope global secondary eth2:2
    inet brd scope global secondary eth2:3
    inet brd scope global secondary eth2:4
    inet brd scope global secondary eth2:7
    inet brd scope global secondary eth2:8
    inet brd scope global secondary eth2:9
    inet brd scope global secondary eth2:10
    inet brd scope global secondary eth2:11
    inet brd scope global secondary eth2:5
    inet brd scope global secondary eth2:12
    inet brd scope global secondary eth2:13
    inet6 fe80::210:18ff:fe2e:2ea2/64 scope link 
       valid_lft forever preferred_lft forever

Same as above, the addresses do not create virtual interfaces, they are simply applied to the real interface and assigned a label for management by ifconfig. Without the label, ifconfig will not see the assigned address.

arp_announce = 2: Always use the best local address for this target. In this mode we ignore the source address in the IP packet and try to select local address that we prefer for talks with the target host. Such local address is selected by looking for primary IP addresses on all our subnets on the outgoing interface that include the target IP address. If no suitable local address is found we select the first local address we have on the outgoing interface or on all other interfaces, with the hope we will receive reply for our request and even sometimes no matter the source IP address we announce.

This one is a little tricky, but I believe it deals with how the web servers talk with the clients requesting web pages. In order for the web page to come up and maintain a session, when the server sends a packet back to the client, it needs to come from the source IP address of the hidden IP address. In order to do this, the web server looks at the destination address of the client packet, and then responds using that as its source IP address instead of its actual IP address. Clear as mud, right?

I hope this helps explain things a little better about the hows and whys of the web server’s side of load balancing. Note however, that I didn’t talk at all about the edge server. That’s because the edge server’s job is done at the application level, and correct configuration of it does not require special consideration at the OS level.


September 2, 2009

Comparing two server operating systems, like SuSE Linux Enterprise Server (SLES) and RedHat Enterprise Linux (RHEL), comes down to answering one question: “what do we want to do with the overall system?” The version of Linux running underneath the application is immaterial, as long as the application supports that version. It is my opinion that we should choose the OS that supports all of our applications, and gives us the best value for our money.


RHEL advertises “Unlimited Virtual Machines”, but what they put in the small print is that the number of virtual machines you can run is unlimited only if you are using their Xen virtualization. We already have a significant investment in both money and knowledge in VMware, so the RHEL license doesn’t apply, and we have to purchase a new license for each virtual machine. There is an option to purchase a RHEL for VMware license, but it is expensive, and still limits you to a maximum of 10 virtual machines per server.

SLES allows unlimited virtual machines regardless of the virtualization technology used. SLES also has a special license for an entire blade center, which (and I’d have to double check on this fact) may let us license the blade center, and purchase additional blades without having to license those blades separately. This license would allow us to run unlimited virtual machines and add physical capacity to the blade center as needed. This is the license we have for one of our blade centers, and I believe it cost $4500 for a three year contract. As I understand it, that means that for the 9 blades we have in it now, we spent $500 each for a three year license, which equates to $167 per blade per year. We also have the ability to add an additional five blades to this blade center, which would also be covered under the agreement. Doing so would bring our total per blade per year cost down to $108, for unlimited virtual machines.

For a comparison, right now, if we want to bring up a new RHEL server in our environment, we have to purchase another minimal RHEL license for $350, more if we actually want support and not just patches.

Even without the special blade center pricing (which may be IBM only), a single license for SLES priced by CDW costs $910 for three years. So, for $304 per blade per year, we can license two blade centers for $6,080 annually, which will cover all virtual machines. That price is off-the-shelf, so I’m sure our vendors could lower the price even more. In another pre-production environment, which resides on three physical servers running VMware, there are 40 virtual machines, which, if we migrate them to RHEL, would cost $14,000 annually.
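The arithmetic behind those figures can be reproduced with a few lines of shell. Per-year costs are rounded up to the next whole dollar, which matches the numbers quoted; the function name is made up, and the 20-blade count is inferred from $6,080 / $304 rather than stated anywhere.

```shell
#!/bin/sh
# Cost per unit per year, rounded up to a whole dollar:
# ceil(total / (units * years)) using integer arithmetic.
per_unit_per_year() {
    total="$1"; units="$2"; years="$3"
    divisor=$(( units * years ))
    echo $(( (total + divisor - 1) / divisor ))
}

per_unit_per_year 4500 9 3     # blade center license, 9 blades  -> 167
per_unit_per_year 4500 14 3    # same license, 14 blades         -> 108
per_unit_per_year 910 1 3      # single SLES license from CDW    -> 304
echo $(( 20 * 304 ))           # 20 blades (inferred) at $304    -> 6080
echo $(( 40 * 350 ))           # 40 RHEL guests at $350 each     -> 14000
```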

Related to base price is what is included with the base price. SLES gives you the option to create a local patching mirror and synchronize regularly with their servers. This same functionality is available for RHEL as the “RHN Satellite Server” at a cost of $13,500, annually.


As far as I can tell, neither RHEL nor SLES has a significant performance advantage. However, SLES has the option to do a very bare-bones, minimal install with no reliance on a graphical user interface. RHEL requires either a remote or local X session running to access its management tools. There are versions of the management tools in the command line, but they are either marked as deprecated, or do not offer all of the options of the GUI.

One of our environments is run on SLES 9, and another ran on SLES 8 for several years and all systems have had excellent performance.


RHEL has no YaST equivalent, and the individual command line configuration tools do not have all of the options of their GUI counterparts. To effectively manage RHEL, we either have to keep an X server running locally and tunnel X, or use the old school Unix tools, and edit text files. Also, RedHat keeps its text files in several different places, and it has taken us a lot of trial and error to find out which one is right. Admittedly, that’s more of an annoyance than anything, but it still takes time.

RHEL has major problems with LDAP. We had an outage on a database server that was a result of an improper LDAP configuration, the same LDAP config we have on all of the other servers. RHEL was attempting to authenticate a local daemon that inspects hardware against LDAP, before the NIC card was even discovered, much less started. I can think of no good reason that would ever be an option.

I’m not sure how RedHat is competing these days. CentOS and Scientific Linux distribute rebuilds of the RHEL source for free, Oracle has a lower price option, and Novell’s SLES kills RedHat in pricing. It almost seems to me that RedHat is living on its name alone.

Writing about Jekyll

August 25, 2009

I’m writing an article for TAB about my new blogging engine, Jekyll. I’ve taken most of the reliance on the command line out of dealing with Jekyll on a day to day basis, and instead have a few Automator workflows in the scripts menu in the Mac menubar. It’s a great setup, I’m really enjoying it. I’m sure there will be quite a bit of enhancement yet to come, but my initial workflow looks like this:

  1. Click “New Blog Post”
  2. Write the article
  3. Click “Run Jekyll”
  4. Make sure everything worked using the local WEBrick web server.
  5. Click “Kill Jekyll”
  6. Click “Sync Site”

Here’s what I’ve got so far in the Automator workflows:

New Blog Post

First, I run the “Ask for Text” action to get the name of the post. Then, I run this script:

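# $1 is the post title from the "Ask for Text" action; turn spaces into dashes.
NAME=`echo "$1" | sed 's/ /-/g'`
POSTNAME=`date "+%Y-%m-%d"`-$NAME
# POST_FQN is never set in the listing as published. The path and extension
# below are assumptions: Jekyll reads posts from _posts under the site directory.
POST_FQN="/Users/USERNAME/Sites/_posts/$POSTNAME.markdown"
touch "$POST_FQN"
echo "---" >> "$POST_FQN"
echo "layout: post" >> "$POST_FQN"
echo "title: $1" >> "$POST_FQN"
echo "---" >> "$POST_FQN"
/usr/bin/mate "$POST_FQN"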

Run Jekyll

First, I run this script:

cd /Users/$USERNAME/Sites
/usr/bin/jekyll > /dev/null
/usr/bin/jekyll --server  > /dev/null 2>&1 &
/usr/local/bin/growlnotify --appIcon Automator Jekyll is Done -m 'And there was much rejoicing.'
echo "http://localhost:4000"

Followed by the “New Safari Document” Automator action. This runs Jekyll which converts the blog post I just wrote in markdown syntax to html, updates the site navigation, starts the local web server and opens the site in Safari to preview.

Kill Jekyll

Since I start the local server in the last step, I need to kill it in this step. This action does just that.

PID=`ps -eaf | grep "jekyll --server" | grep -v grep | awk '{ print $2 }'`
kill $PID
/usr/local/bin/growlnotify --appIcon Automator Jekyll is Dead -m 'Long Live Jekyll.'

This is entered in as a shell script action, and is the only action in this workflow.

Sync Site

Once I’m certain everything looks good, I run the final Automator action to upload the site:

cd /Users/USERNAME/Sites/_site/
rsync -avz -e ssh . USERNAME@jonathanbuys.com:/home/USERNAME/jonathanbuys.com/ > /dev/null
/usr/local/bin/growlnotify --appIcon Automator Site Sync Complete -m 'Check it out.'

This is also a single Automator action workflow. You’ll notice that I use Growl to notify me that the script is finished. This is also not really necessary, but it’s fun anyway.

Like I said, there’s a lot of improvement yet to go, but I think it’s a solid start. I’m at a point now where I’m tempted to start writing a WordPress import feature, which seems to be the only major piece missing from the Jekyll puzzle. I’m not sure what this would take just yet, but I’ve got a few ideas. I haven’t tried uploading any images or media yet, but since everything is static, I assume it would just be a matter of placing the image in an /images folder and embedding it in HTML. So far, I’m having a lot of fun, and that’s what blogging is really all about.

The Unix Love Affair

August 10, 2009

There have been times when I’ve walked away from the command line, times when I’ve thought about doing something else for a living. There have even been brief periods of time when I’ve flirted with Windows servers. However, I’ve always come back to Unix, in one form or another. Starting with Solaris, then OpenBSD, then every flavor of Linux under the sun, to AIX, and back to Linux. Unix is something that I understand, something that makes sense.

Back in ‘96 when I started in the tech field, I discovered that I have a knack for understanding technology. Back then it was HF receivers and transmitters, circuit flow and 9600 baud circuits. Now I’m binding dual gigabit NICs together for additional bandwidth and failover in Red Hat. The process, the flow of logic, and the basics of troubleshooting still remain the same.

To troubleshoot a system effectively, you need to do more than just follow a list of pre-defined steps. You need to understand the system, you need to know the deep internals of not only how it works, but why. In the past 13 years of working in technology, I’ve found that learning the why is vastly more valuable.

Which brings me back to why I love working with Unix systems again. I understand why they act the way that they do, I understand the nature of the behavior. I find the layout of the filesystem to be elegant, and a minimally configured system to be best. I know that there are a lot of problems with the FSH, and I know that it’s been mangled more than once, but still. In Unix, everything is configured with a text file somewhere, normally in /etc, but from time to time somewhere else. Everything is a file, which is why tools like lsof work so well.

Yes, Unix can be frustrating, and yes, there are things that other operating systems do better. It is far from perfect, and has many faults. But, in the end, there is so much more to love about Unix than there is to hate.

Slowly Evolving an IT System

July 18, 2009

We are going through a major migration at work, upgrading our four-and-a-half-year-old IBM blades to brand spanking new HP BL460 G6s. We run a web infrastructure, and the current plan is to put our F5s, application servers, and databases in place, test them all out, and then take a downtime to swing IPs over and bring up the new system. It’s a great plan, it’s going to work perfectly, and we will have the least amount of downtime with this plan. Also… I hate it.

The reason I hate it has more to do with technical philosophy than with actual hard facts. I prefer a slow and steady evolution, a recognition that we are not putting in a static system, but a living organism whose parts are made up of bits and silicon. What I’d like to do is put in the database servers first, then swing over the application servers, and then the F5, which is going to replace our external web servers and load balancers. One part at a time, and if we really did it right, we could do each part with very little downtime at all. However, I can see the point in putting in everything at once: you test the entire system from top to bottom, make sure it works, and when everyone is absolutely certain that all the parts work together, flip the switch and go live. But… then what?

What about six months down the road, when we are ready to add capacity to the system? What about adding another database server? What about adding additional application servers to spread out the load? What about patches?

Operating systems are not something that you put into place and never touch again. IT systems made up of multiple servers should not be viewed as fragile, breakable things that should not be touched. We can’t set this system up and expect it to be the same three years from now when the lease on the hardware is up. God willing, it’s going to grow, flourish, change.

Our problems are less about technology, and more about our corporate culture.

Teach A Man To Fish

July 13, 2009

As a general rule, I really don’t like consultants. Not that I have anything against any of them personally; it’s just that as a whole, most consultants I’ve worked with are no better than our own engineers and administrators. The exception that proves this rule is our recent VMware consultant, who was both knowledgeable and willing to teach. Bringing in an outside technical consultant to design, install, or configure a software system is admitting that not only do we as a company not know enough about the software, we don’t plan on learning enough about it either. Bringing in a consultant is investing in that company’s knowledge, and not investing in our own.

It costs quite a bit of money to bring in a consultant; they do not come cheap. For one, if there is no one local you have to pay for travel and lodging. Most consultants charge by the hour, so you have billable time bringing them up to speed on what your new system is and what you are trying to accomplish with it so they can start helping you. If you are bringing in a consultant for an IBM product, you need to be prepared to sit on the phone with him and put in several PMRs.

I would rather spend the equivalent amount of money on sending employees to training than on a consultant. Once a consultant leaves after performing their task, the regular employees who maintain the system are on their own, and without the appropriate training they are often lost. When the consultant leaves, he takes all of his expertise with him. Expertise that was used to set up a system that he has no personal stake in, other than his reputation as a consultant, which may or may not matter depending on the relationship between the two companies. When you send employees to training on a new software product or technology, you are building that same expertise internally. Initially, the internal expertise will not be on the same level as the consultant’s, but over time, as the employees administer the system that they built, their expertise grows deeper and stronger.

Teams that are experts on the systems they are in charge of can build on that system. They can recognize shortcomings and bottlenecks, and troubleshoot problems faster than on systems that they simply maintain. They know the internal architecture of the system, not only how it works, but why it works.

In the Navy, I was lucky enough to work for a Senior Chief who believed that we needed to be experts on the systems we managed, because once we were out to sea, no one was going to come out to help us. He sent us to training, or brought training in, two or three times a year, for one to two week sessions on everything from Unix to Exchange. This same mindset could apply equally well to companies who operate 24x7 infrastructures. Once the system crashes at 2 AM, there’s not going to be anyone there to help you. Your team will be on their own, and if you have not invested in the team, it’s going to be a very long night.

If your company is not going to invest in you, you need to invest in yourself.

Regarding OS Zealotry

July 9, 2009

Today I found myself in the unfortunate situation of defending Linux to a man I think I can honestly describe as a Windows zealot. I hate doing this, as it leads to religious wars that are ultimately of no use, but it’s really my own fault for letting myself be sucked into it. It started when we were attempting to increase the size of a disk image in VMware while the Red Hat guest was running. It didn’t work, and we couldn’t find any tools to rescan the SCSI bus, or anything else to get Linux to recognize that the disk was bigger. I was getting frustrated, and the zealot began to laugh, saying how easy this task was in Windows. Obviously, I felt slighted since I’m one of the Unix admins at $work, and decided I needed to defend the operating system and set of skills that pays the bills here at home. And so, we started trading snide remarks back and forth about Linux and Windows.

At one point, I told him that since Windows was so easy, MCSEs were a dime a dozen. This is probably wrong; I don’t have anything to back it up with. Really, the entire argument was wrong, and I was dragged down to the level of grade school arguing about whose OS was “better”. The entire thing was pointless, and wound up costing time and effort that should have been spent on the task at hand. After giving it some thought while doing yard work this evening, I’ve decided to get out of OS debates altogether.

I’ve worked as a Windows admin in the past; I even took a few tests and earned the MCSA certification back in 2002. I don’t have anything against Windows, but I don’t personally feel that it’s a technically superior server to Linux or even AIX. Windows has some administrative tools that make me drool in envy. I wish I could set up group policy, I wish I could get a Linux host to authenticate centrally as easily as a Windows server joins a domain, and evidently disk management is extremely easy now. However, the real strengths of Linux are not that it is easy to use, or easy to administer. The strengths of Linux are its stability and security.

Case in point: I’ve personally seen web hosting environments built on a default install of SLES 8 that were not patched for four and a half years, and never had a problem. Best practice? Of course not, but it worked. I’m not sure I could say the same for Windows in that same situation. Another example: another place I worked had a Linux web server whose root partition was 100% full. This particular server was not built with LVM, so we couldn’t just extend the disk, and we also couldn’t just delete data, since we didn’t know what was needed and what was not. This server kept up and running for at least a year, and may still be running now, happily serving up web pages with a full root partition. What happens if you fill up the C:\ drive of a Windows server? I’m thinking that it crashes, but I’m not sure.

So, is a Linux server “better” than a Windows server? Is Windows “better”? In this, as in most things, the answer is: it depends. The two systems come at things from different directions, and each shows its strengths and weaknesses differently. In my experience, I’ve gained a respect for both. I prefer Linux, because honestly, I think it’s more fun.

I’ve really only worked with one other Windows zealot, and we used to argue over the use of Linux on the desktop. Linux on the desktop and Linux on the server are two totally different animals. Sure, they use the same kernel, and same basic userland apps in the shell, but other than that, they have different purposes. Arguments against using Linux on the desktop are more often than not aimed at Gnome or KDE, and not at the actual Linux OS underneath. I come to the same conclusion there as well, though: certainly some things do not work as well in Linux as they do in Windows. Some things work better, but all in all, I just think Linux is more interesting.

I think what our argument came down to today was that he doesn’t understand why anyone would use Linux, since Windows, in his opinion, is so much better at everything than Linux. I think a little bit of professional courtesy would have gone a long way here, but it’s just as much my fault for continuing the argument as it was his. My position on the comparison is this: yes, some things are much, much harder in Linux than in Windows… but it’s so much more fun managing Linux. A stripped-down, well-oiled Linux server can be a screamer for performance and reliability. Is it easy getting there? No, but it’s worth the extra effort.