This Should Be Default

Posted on Wednesday 30 December 2009

I’m crawling out of my cave to make a quick point before I get back to coding. In Cocoa, if you want to have your window shown when you click on its icon in the Dock, you need to add this method to the app delegate:

- (BOOL)applicationShouldHandleReopen:(NSApplication *)theApplication
                    hasVisibleWindows:(BOOL)flag
{
    if( !flag )
        [window makeKeyAndOrderFront:nil];

return YES;

}

Seriously, this should be included in the default app delegate template generated by Xcode. I can’t think of a single reason why its not.

Jon @ 6:23 am
Posted under: Technology
Managing Nagios Configs

Posted on Wednesday 9 December 2009

We don’t have a very big Nagios installation, comparatively anyway, but it is big enough to find that the default layout for configurations is insane. I tried using the provided layout, until I wound up with single text files with thousands of lines in them. This made it very hard to do individual customizations for servers, and separating out who wants to be notified for what. Here is what I came up with for managing our Nagios configs.

It seems that the repositories are always behind in Nagios, so it is one of the very few apps that I recommend installing from source. I install Nagios in /usr/local/nagios, the default when compiling, I’ll just call it $nag. The Nagios binary is in $nag/bin, the plugins in $nag/libexec, and the config files in $nag/etc. The easiest way to understand nagios is to follow its start up procedures. I keep an /etc/init.d/nagios file for initialization, The file defines, among other things, where the home directory for Nagios is, what config file to use as its base, and where the Nagios binary and plugins are. The important thing to understand is that this file is the first pointer in a long string of pointers that Nagios uses for configuration.

Inside the nagios.cfg file are the cfg_dir directives. These are pointers that tell Nagios that it can find additional configurations inside the directories listed. Once Nagios is given a directory to look at, it will read each file ending in .cfg inside of that directory. The first directory that I have listed is $nag/etc/defaults. I keep four files in this directory: commands.cfg, dependencies.cfg, generic.cfg, and timeperiods.cfg.

The file “commands.cfg” contains the definitions of all check commands that Nagios can understand. They look like this:

 # 'check_local_load' command definition
 define command{
        command_name    check_local_load
        command_line    $USER1$/check_load -w $ARG1$ -c $ARG2$
        }

The file also contains the alert commands, or what Nagios will do when it finds something that it needs to let you know about:

define command{
command_name notify-by-email
command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type:  $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress:  $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/   bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **"    $CONTACTEMAIL$
}

This allows us to call a command later in Nagios by it’s defined command_name,such as check_local_load, instead of having to call the entire command including arguments. Keeps the configs clean.

The next file, “generic.cfg”, contains templates for host configurations. This file allows us to do two things: list common options that are defined for all of the hosts, and separate hosts into notification groups. The definitions look like this:

define host{
        name                            generic-admin
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        check_command           check-host-alive
        max_check_attempts      3
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
                contact_groups          admin,admin_pager
        action_url /nagios/pnp/index.php?host=$HOSTNAME$
        }

There are two separate types of generic definitions, hosts and services, for the two types of monitoring that Nagios does. The important section for most of my purposes above is the “contact_groups” line. This allows me to group contacts with hosts, so it answers the question of “who gets notified if this server goes down?”. The same thing applies to the service template below.

define service{
        name                            generic-full    
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        notification_interval           120
        notification_period             24x7
        notification_options            w,c,r
    contact_groups                  admins,admin_pager,webmin
    process_perf_data 1
    action_url /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
        }

The other two files, timeperiods.cfg and dependencies.cfg, I haven’t done a whole lot with yet.

The next directory parsed as defined in nagios.cfg is $nag/etc/users, which, surprisingly enough, is where all of the users are defined. I keep two files in this directory, users.cfg and contactgroups,cfg. The users.cfg file contains a list of every user, and since I have different needs for pagers and regular email alerts, each user is defined twice:

define contact{
        contact_name                    Jon
        alias                           Jon Buys
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        service_notification_commands   service-notify-by-email
        host_notification_commands      host-notify-by-email
        email                           jbuys@dollarwork.com
        }

define contact{
        contact_name                    Jon_pager
        alias                           Jon Buys
      service_notification_period     24x7
      host_notification_period        24x7
      service_notification_options    u,c,r
      host_notification_options       d,u,r
      service_notification_commands   notify-for-disk
      host_notification_commands      host-notify-by-email
      email                 5555555555@my.phone.company.net
      }

This lets me group the users more effectively in the second file, contactgroups.cfg:

define contactgroup{
        contactgroup_name admins
      alias           sysadmins
      members Jon,Gary,nagios_alerts
      }

define contactgroup{
        contactgroup_name admin_pager
        alias           sysadmin pagers
        members Jon_pager,Gary_pager,OSS_Primary_Phone,nagios_alerts
}

Now, check the definitions in the generic.cfg file above, and you’ll start to see the chain of config files coming together. The glue sticking it all together is the server definition files. Each logical group of servers gets their own directory, defined in nagios.cfg. For example, we have a group of servers that provides a specific web service (which I’ll call “mesh”), there are web servers, application servers, and database servers that I group together in one directory, named “mesh”. Inside of this directory, each server has its own config file, named like $hostname.cfg. There is also a mesh.cfg, which groups all of the servers together in a host group. The $hostname.cfg files look like this:

 define host{
        use                     generic-host
      host_name                 m-app1 
      alias                   m-app1
      address                 10.10.10.1
      }

define service{
      use                             generic-full
      host_name                       m-app1
      service_description             PING
      check_command                   check_ping!100.0,20%!500.0,60%
      }

define service{
      use                             generic-full
      host_name                       m-app1
      service_description             DISKUSE
      check_command                   check_nrpe!check_df
      }

Each server has a host definition at the top, and all of the services that are monitored on that server at the bottom. The first section’s line “use generic-host” calls the “generic-host” template from the generic.cfg file above. Each subsequent “define service” section has a “use” line that also calls the templates defined in generic.cfg. Putting each server in its own file makes it very easy to add and remove servers from Nagios. To remove them, just remove (or, safer, rename) the $hostname.cfg file and delete the name from the $groupname.cfg file. It’s also very easy to script the creation of new hosts given a list of host names and IP addresses.

The mesh.cfg file contains the hostgroup configuration for the group:

define hostgroup{
     hostgroup_name  mesh
     alias           Mesh Production
     members         mdbs1,mdbs2,mdbs3,mdbs4,mdbs5,mdbs6,mdbs7,m-app1,m-app2,m-app3,m-store1,m-store2,m-nfs1,m-nfs2
     }

This file is not as important, but it makes the Nagios web interface a little more helpful.

You’ll also notice that the check_command line above contains “check_nrpe!check_df”. This means that I use the nrpe (Nagios Remote Plugin Execution) add-on to actually monitor the services on the remote hosts. Each server has nrpe installed, and has one configuration file (/usr/local/nagios/etc/nrpe.cfg). The nrpe.cfg file has a corresponding line that says

command[check_df]=/usr/local/nagios/libexec/check_disk -e -L -w 6% -c 4%

This translates the check_df command sent by the check_nrpe command into the longer command defined above. This makes it easy to install and configure nrpe once, then zip up the /usr/local/nagios directory and unzip it on all new servers.

Nagios is nearly limitless in its abilities, but but because of the complexity of its configuration it can be daunting to newcomers. This setup is designed to make it just a little bit easier to understand, and easier to script.

Jon @ 8:43 am
Posted under: Technology
Blizzard 2009

Posted on Wednesday 9 December 2009


100_1931, originally uploaded by jonbuys.

Iowa got its first big storm of the winter season yesterday, and as of right now its still going on. We couldn’t go anywhere even if we wanted to. We got about 13″ of snow so far, but the wind gusts up to 50mph are the big problem. Just about everything is shut down, schools, work places, and even some of the larger roads.

Good day to get caught up on somethings I’ve been meaning to get done.

Jon @ 7:30 am
Posted under: Uncategorized
PHOTO: Meetings: The practical alternative to work. (via Ariel)

Posted on Tuesday 8 December 2009

PHOTO: Meetings: The practical alternative to work. (via Ariel):

meetings-oldtime-ad.png

Meetings: The practical alternative to work. (via Ariel)

(Via Signal vs. Noise.)

Jon @ 7:46 pm
Posted under: Shared
New SysAdmin Tips

Posted on Friday 4 December 2009

My answer to a great question over at serverfault.

First off, find your logs. Most Linux distros log to /var/log/messages, although I’ve seen a couple log to /var/log/syslog. If something is wrong, most likely there will be some relevant information in the logs. Also, if you are dealing with email at all, don’t forget /var/log/mail. Double-check your applications, find out if any of them log somewhere ridiculous, outside of syslog.

Brush up on your vi skills. Nano might be what all the cool kids are using these days, but experience has taught me that vi is the only text editor that is guaranteed to be on the system. Once you get used to the keyboard shortcuts, and start creating your own triggers, vi will be like second nature to you.

Read the man page, and then run the following commands on each machine, and copy the results into your documentation:

hostname
cat /etc/release
cat /etc/hosts
cat /etc/resolv.conf
cat /etc/nsswitch
df -h
ifconfig -a
free -m
crontab -l
ls /etc/cron.d
echo $SHELL

That will serve as the beginnings of your documentation. Those commands let you know your environment, and can help narrow down problems later on.

Grep through your logs and search for “error” or “failed”. That will give you an idea of what’s not working as it should. Your users will give you their opinion on whats wrong, listen closely to what they have to say. They don’t understand the system, but they see it in a different way than you do.

When you have a problem, check things in this order:

  1. Disk Space (df -h): Linux, and some apps that run on Linux, do some very strange things when disk space runs out. It may seem unrelated, until you check and find a filesystem 100% full.

  2. Top: Top will let you know if you’ve got some process that’s stuck out there eating up all of your available CPU cycles. Nothing should consume 99% CPU for any extended period of time. If its a legitimate process, it should probably fluctuate up and down. While you are in top, check…

  3. System Load: The system load should normally be below 3 on a standard server or workstation. The system load is based on CPU, memory, and I/O.

  4. Memory (free -m): RAM use in Linux is a little different. It’s not uncommon to see a server with nearly all of its RAM used up. Don’t Panic, if you see this, it’s mostly just cache, and will be cleared out as needed. However, pay close attention to the amount of swap in use. If possible, keep this as close to zero as you can. Insufficient memory can lead to all kinds of performance problems.

  5. Logs: Go back to your logs, run tail -500 /var/log/messages | more and start reading through and seeing what’s been going on. Hopefully, the logs will be able to point you in the direction you need to go next.

A well maintained Linux server can run for years without problems. We just shut one down that had been running for 748 days, and we only shut it down because we had migrated the application over to new hardware. Hopefully, this will help you get your feet wet, and get you off to a good start.

One last thing, always make a copy of a config file you intend to change, and always copy the line you are changing, and comment out the original, adding your reason for changing it. This will get you into the habit of documenting as you go, and may save your hide 9 months down the road.

Jon @ 1:55 pm
Posted under: Personal and Technology
SLES and RHEL

Posted on Friday 4 December 2009

Comparing two server operating systems, like SuSE Linux Enterprise Server (SLES) and RedHat Enterprise Linux (RHEL), needs to answer one question, “what do we want to do with the overall system”? The version of Linux running underneath the application is immaterial, as long as the application supports that version. It is my opinion that we should choose the OS that supports all of our applications, and gives us the best value for our money.

Price

RHEL advertises “Unlimited Virtual Machines”, but what they put in the small print is that the number of virtual machines you can run is unlimited only if you are using their Xen virtualization. We already have a significant investment in both money and knowledge in VMWare, so the RHEL license doesn’t apply, and we have to purchase a new license for each virtual machine. There is an option to purchase a RHEL for VMWare license, but it is expensive, and still limits you to a maximum of 10 virtual machines per-server.

SLES allows unlimited virtual machines regardless of the virtualization technology used. SLES also has a special license for an entire blade center, which (and I’d have to double check on this fact) may let us license the blade center, and purchase additional blades without having to license those blades separately. This license would allow us to run unlimited virtual machines and add physical capacity to the blade center as needed. This is the license we have for one of our blade centers, and I believe it cost $4500 for a three year contract. As I understand it, that means that for the 9 blades we have in it now, we spent $500 each for a three year license, which equates to $167 per blade per year. We also have the ability to add an additional five blades to this blade center, which would also be covered under the agreement. Doing so would bring our total per blade per year cost down to $108, for unlimited virtual machines.

For a comparison, right now, if we want to bring up a new RHEL server in our environment, we have to purchase another minimal RHEL license for $350, more if we actually want support and not just patches.

Even without the special blade center pricing (which may be IBM only), a single license for SLES priced by CDW costs $910 for three years. So, for $304 per blade per year, we can license two blade centers for $6,080 annually, which will cover all virtual machines. That price is off-the-shelf, so I’m sure our vendors could lower the price even more. In another pre-production environment, which resides on three physical servers running VMWare, there are 40 virtual machines, which, if we migrate them to REHL, would cost $14,000 annually.

Related to base price is what is included with the base price. SLES gives you the option to create a local patching mirror and synchronize regularly with their servers. This same functionality is available for RHEL as the “RHN Satellite Server” at a cost of $13,500, annually.

Performance

As far as I can tell, neither RHEL or SLES have a significant performance advantage. However, SLES has the option to do a very bare-bones, minimal install with no reliance on a graphical user interface. RHEL requires either a remote or local X windows session running to access its management tools. There are versions of the management tools in the command line, but they are either marked as depreciated, or do not offer all of the options of the GUI.

One of our environments is run on SLES 9, and another ran on SLES 8 for several years and all systems have had excellent performance.

Elsewheres

RHEL has no YaST equivalent, and the individual command line configuration tools do not have all of the options of their GUI counterparts. To effectively manage RHEL, we either have to keep an X server running locally and tunnel X, or use the old school Unix tools, and edit text files. Also, RedHat keeps its text files in several different places, and it has taken us a lot of trial and error to find out which one is right. Admittedly, that’s more of an annoyance than anything, but it still takes time.

RHEL has major problems with LDAP. We had an outage on a database server that was a result of an improper LDAP configuration, the same LDAP config we have on all of the other servers. RHEL was attempting to authenticate a local daemon that inspects hardware against LDAP, before the NIC card was even discovered, much less started. I can think of no good reason that would ever be an option.

I’m not sure how RedHat is competing these days. Cent OS and Scientific Linux distribute the source of RHEL for free, Oracle has a lower price option, and Novell’s SLES kills RedHat in pricing. It almost seems to me that RedHat is living on it’s name alone.

Jon @ 1:54 pm
Posted under: Technology
Writing about Jekyll

Posted on Tuesday 25 August 2009

I’m writing an article for TAB about my new blogging engine, Jekyll. I’ve taken most of the reliance on the command line out of dealing with Jekyll on a day to day basis, and instead have a few Automator workflows in the scripts menu in the Mac menubar. It’s a great setup, I’m really enjoying it. I’m sure there will be quite a bit of enhancement yet to come, but my initial workflow looks like this:

  1. Click “New Blog Post”
  2. Write the article
  3. Click “Run Jekyll”
  4. Make sure everything worked using the local webrick web server.
  5. Click “Kill Jekyll”
  6. Click “Sync Site”

Here’s what I’ve got so far in the automator workflows:

New Blog Post

First, I run the “Ask for Text” action to get the name of the post. Then, I run this script:

NAME=echo $1 | sed s/\ /-/g
USERNAME=whoami
POSTNAME=date "+%Y-%m-%d"-$NAME
POST_FQN=/Users/$USERNAME/Sites/_posts/$POSTNAME.markdown
touch $POST_FQN
echo "---" >> $POST_FQN
echo "layout: post" >> $POST_FQN
echo "title: $1" >> $POST_FQN
echo "---" >> $POST_FQN
/usr/bin/mate $POST_FQN

Run Jekyll

First, I run this script:

USERNAME=whoami
cd /Users/$USERNAME/Sites
/usr/bin/jekyll > /dev/null
/usr/bin/jekyll --server  > /dev/null 2>&1 &
/usr/local/bin/growlnotify --appIcon Automator Jekyll is Done -m 'And there was much rejoicing.'
echo "http://localhost:4000"

Followed by the “New Safari Document” Automator action. This runs Jekyll which converts the blog post I just wrote in markdown syntax to html, updates the site navigation, starts the local web server and opens the site in Safari to preview.

Kill Jekyll

Since I start the local server in the last step, I need to kill it in this step. This action does just that.

PID=ps -eaf | grep "jekyll --server" | grep -v grep | awk '{ print $2 }'
kill $PID
/usr/local/bin/growlnotify --appIcon Automator Jekyll is Dead -m 'Long Live Jekyll.'

This is entered in as a shell script action, and is the only action in this workflow.

Sync Site

Once I’m certain everything looks good, I run the final Automator action to upload the site:

cd /Users/USERNAME/Sites/_site/
rsync -avz -e ssh . USERNAME@jonathanbuys.com:/home/USERNAME/jonathanbuys.com/ > /dev/null
/usr/local/bin/growlnotify --appIcon Automator Site Sync Complete -m 'Check it out.'

This is also a single Automator action workflow. You’ll notice that I use Growl to notify me that the script is finished. This is also not really necessary, but it’s fun anyway.

Like I said, there’s a lot of improvement yet to go, but I think it’s a solid start. I’m at a point now where I’m tempted to start writing a Wordpress import feature, which seems to be the only major piece missing from the Jekyll puzzle. I’m not sure what this would take just yet, but I’ve got a few ideas. I haven’t tried uploading any images or media yet, but since everything is static, I assume it would just be a matter of placing the image in a /images folder and embedding it in html. So far, I’m having a lot of fun, and that’s what blogging is really all about.

Jon @ 12:38 pm
Posted under: Technology
The Unix Love Affair

Posted on Monday 10 August 2009

There’s been times when I’ve walked away from the command line, times when I’ve thought about doing something else for a living. There’s even been brief periods of time when I’ve flirted with Windows servers. However, I’ve always come back to Unix, in one form or another. Starting with Solaris, then OpenBSD, then every flavor of Linux under the sun, to AIX, and back to Linux. Unix is something that I understand, something that makes sense.

Back in ‘96 when I started in the tech field, I discovered that I have a knack for understanding technology. Back then it was HF receivers and transmitters, circuit flow and 9600 baud circuits. Now I’m binding dual gigabit NICs together for additional bandwidth and failover in Red Hat. The process, the flow of logic, and the basics of troubleshooting still remain the same.

To troubleshoot a system effectively, you need to do more than just follow a list of pre-defined steps. You need to understand the system, you need to know the deep internals of not only how it works, but why. In the past 13 years of working in technology, I’ve found that learning the why is vastly more valuable.

Which brings me back to why I love working with Unix systems again. I understand why they act the way that they do, I understand the nature of the behavior. I find the layout of the filesystem to be elegant, and a minimally configured system to be best. I know that there are a lot of problems with the FSH, and I know that it’s been mangled more than once, but still. In Unix, everything is configured with a text file somewhere, normally in /etc, but from time to time somewhere else. Everything is a file, which is why tools like lsof work so well.

Yes, Unix can be frustrating, and yes, there are things that other operating systems do better. It is far from perfect, and has many faults. But, in the end, there is so much more to love about Unix then there is to hate.

Jon @ 7:04 pm
Posted under: Technology