Enterprise Software Again
I realized today that it’s been ten years since I dedicated an entire post to complaining about enterprise software. In those ten years not much has changed, unfortunately. Enterprise software is still crap, and it’s still more hassle than it’s worth. It’s best avoided whenever possible, so when you find yourself evaluating software or services for your company, here are a few easy markers to identify the products you should let pass by.
cloudchain
Today, the team I’m a part of at TargetSmart is releasing our first open source project, a bit of Python I like to call “cloudchain”. cloudchain is designed to make it easy to store and retrieve secrets using AWS. cloudchain relies on the AWS Key Management Service (KMS) to securely store and manage encryption keys, uses Identity and Access Management (IAM) to control access to those keys, and stores the encrypted secret in a DynamoDB table.
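The pattern itself is simple enough to sketch with nothing but the AWS CLI; the key alias and table name below are illustrative placeholders, not necessarily what cloudchain uses:
# Encrypt a secret with a KMS key, then store the base64 ciphertext in DynamoDB.
# "alias/cloudchain" and "safedb" are made-up names for this sketch.
# (With AWS CLI v2 the --plaintext value must itself be base64 encoded.)
CIPHERTEXT=$(aws kms encrypt \
    --key-id alias/cloudchain \
    --plaintext "super-secret-password" \
    --query CiphertextBlob --output text)

aws dynamodb put-item \
    --table-name safedb \
    --item "{\"service\": {\"S\": \"proddb\"}, \"secret\": {\"S\": \"$CIPHERTEXT\"}}"

# Retrieval is the reverse: dynamodb get-item, then aws kms decrypt on the blob.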
Parsing iostat Results
In the course of load testing a new system, we gathered the output from iostat from a group of servers. In addition to parsing through the device statistics, we thought it would be handy to graph the CPU stats as well. We set iostat to run every five seconds and captured the output in a text file, one per server. This gave me a sizable pool of data, but with everything I needed on separate lines.
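The fix turned out to be plain text processing; as a sketch, assuming one capture file per server (server01.txt here) and the standard six-column avg-cpu header, an awk one-liner pulls the CPU numbers onto single lines ready for graphing:
# Each "avg-cpu:" header is followed by a line of values in this order:
# %user %nice %system %iowait %steal %idle
# Grab %user, %iowait, and %idle from every sample into CSV rows.
awk '/avg-cpu/ { getline; print $1 "," $4 "," $6 }' server01.txt > server01-cpu.csv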
The Hardware Racket
Every now and then something just gets to me, and for the past few weeks, that something has been the process of purchasing enterprise hardware. Servers, SANs, load balancers, the kind of equipment that, instead of a price and an “Add to Cart” link, comes with directions on who to call.
Cutting Corners
After reading the MacSparky piece on craftsmanship, I’m reminded of how I like to look at my career as a systems administrator. I find that there are times when things that are not quite right just bother me. Like when there are inconsistencies or one-offs scattered throughout the environment I am responsible for. There may well be perfectly logical reasons why some systems are monitored and some are not, why some are registered with configuration management and others are not, but in my mind it is these little inconsistencies that add up and make your work look sloppy.
Solving The Right HA Problem
High Availability, HA for short, refers to an application’s ability to continue operating after a hardware failure. HA comes in many different shapes and sizes, but two methods in production today are the presence of multiple machines performing the same task, and pairs of machines in a master-slave setup. Sometimes the master-slave setup is extended to include several slaves, but the main idea is that if the master should go away, the slave will pick up where the master left off, with no interruption in service.
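As a concrete sketch of the master-slave pattern, a floating service IP managed by keepalived (VRRP) is one common way to do it; the interface name and addresses here are only placeholders:
# /etc/keepalived/keepalived.conf on the master. The slave runs the same
# block with "state BACKUP" and a lower priority, and claims the virtual
# IP only when the master stops sending VRRP advertisements.
cat > /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.10.10.100/24
    }
}
EOF
service keepalived start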
Quality
We’ve been having a months long discussion at work around which Linux OS to use. It’s all come to a head recently, and it looks like the winner is going to be Red Hat. The decision leaves a slightly sour taste in my mouth, but over the course of the past year I’ve gotten used to having it around. While trying to understand why I’ve got such a dislike for this particular flavor of Linux, I thought it might help to take another look at OpenBSD.
Add a User - Send an Email
I was asked on Twitter the other day why I disliked IBM’s enterprise software. This, in addition to my previous TWS rant, is my answer to that question.
Managing Nagios Configs
We don’t have a very big Nagios installation, comparatively anyway, but it is big enough to find that the default layout for configurations is insane. I tried using the provided layout, until I wound up with single text files with thousands of lines in them. This made it very hard to do individual customizations for servers, and to separate out who wants to be notified for what. Here is what I came up with for managing our Nagios configs.
It seems that the repositories are always behind on Nagios, so it is one of the very few apps that I recommend installing from source. I install Nagios in /usr/local/nagios, the default when compiling, which I’ll just call $nag. The Nagios binary is in $nag/bin, the plugins in $nag/libexec, and the config files in $nag/etc. The easiest way to understand Nagios is to follow its startup procedure. I keep an /etc/init.d/nagios file for initialization. The file defines, among other things, where the home directory for Nagios is, what config file to use as its base, and where the Nagios binary and plugins are. The important thing to understand is that this file is the first pointer in a long string of pointers that Nagios uses for configuration.
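A trimmed-down sketch of the relevant pieces of that init script, using the paths above, looks something like this (variable names may differ slightly from the stock sample script):
#!/bin/sh
# /etc/init.d/nagios (excerpt): point Nagios at its home, binary,
# and base configuration file.
prefix=/usr/local/nagios
NagiosBin=$prefix/bin/nagios
NagiosCfgFile=$prefix/etc/nagios.cfg

# "start" boils down to: verify the configuration, then launch the daemon.
$NagiosBin -v $NagiosCfgFile && $NagiosBin -d $NagiosCfgFile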
Inside the nagios.cfg file are the cfg_dir directives. These are pointers that tell Nagios that it can find additional configurations inside the directories listed. Once Nagios is given a directory to look at, it will read each file ending in .cfg inside of that directory. The first directory that I have listed is $nag/etc/defaults. I keep four files in this directory: commands.cfg, dependencies.cfg, generic.cfg, and timeperiods.cfg.
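Backing up a step, that means nagios.cfg itself only needs a handful of cfg_dir lines, one per directory (these match the layout described here; yours may differ):
# The only lines in nagios.cfg that matter for this layout:
grep '^cfg_dir' /usr/local/nagios/etc/nagios.cfg
# cfg_dir=/usr/local/nagios/etc/defaults
# cfg_dir=/usr/local/nagios/etc/users
# cfg_dir=/usr/local/nagios/etc/mesh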
The file “commands.cfg” contains the definitions of all check commands that Nagios can understand. They look like this:
# 'check_local_load' command definition
define command{
command_name check_local_load
command_line $USER1$/check_load -w $ARG1$ -c $ARG2$
}
The file also contains the alert commands, or what Nagios will do when it finds something that it needs to let you know about:
define command{
command_name notify-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}
This allows us to call a command later in Nagios by its defined command_name, such as check_local_load, instead of having to call the entire command including arguments. It keeps the configs clean.
The next file, “generic.cfg”, contains templates for host configurations. This file allows us to do two things: list common options that are defined for all of the hosts, and separate hosts into notification groups. The definitions look like this:
define host{
name generic-admin
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
register 0
check_command check-host-alive
max_check_attempts 3
notification_interval 120
notification_period 24x7
notification_options d,u,r
contact_groups admin,admin_pager
action_url /nagios/pnp/index.php?host=$HOSTNAME$
}
There are two separate types of generic definitions, hosts and services, for the two types of monitoring that Nagios does. The important section for most of my purposes above is the “contact_groups” line. This allows me to group contacts with hosts, so it answers the question of “who gets notified if this server goes down?”. The same thing applies to the service template below.
define service{
name generic-full
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
register 0
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
notification_interval 120
notification_period 24x7
notification_options w,c,r
contact_groups admins,admin_pager,webmin
action_url /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
}
The other two files, timeperiods.cfg and dependencies.cfg, I haven’t done a whole lot with yet.
The next directory parsed, as defined in nagios.cfg, is $nag/etc/users, which, surprisingly enough, is where all of the users are defined. I keep two files in this directory, users.cfg and contactgroups.cfg. The users.cfg file contains a list of every user, and since I have different needs for pagers and regular email alerts, each user is defined twice:
define contact{
contact_name Jon
alias Jon Buys
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands service-notify-by-email
host_notification_commands host-notify-by-email
email jbuys@dollarwork.com
}
define contact{
contact_name Jon_pager
alias Jon Buys
service_notification_period 24x7
host_notification_period 24x7
service_notification_options u,c,r
host_notification_options d,u,r
service_notification_commands notify-for-disk
host_notification_commands host-notify-by-email
email 5555555555@my.phone.company.net
}
This lets me group the users more effectively in the second file, contactgroups.cfg:
define contactgroup{
contactgroup_name admins
alias sysadmins
members Jon,Gary,nagios_alerts
}
define contactgroup{
contactgroup_name admin_pager
alias sysadmin pagers
members Jon_pager,Gary_pager,OSS_Primary_Phone,nagios_alerts
}
Now, check the definitions in the generic.cfg file above, and you’ll start to see the chain of config files coming together. The glue sticking it all together is the server definition files. Each logical group of servers gets its own directory, defined in nagios.cfg. For example, we have a group of servers that provides a specific web service (which I’ll call “mesh”): the web servers, application servers, and database servers are grouped together in one directory, named “mesh”. Inside of this directory, each server has its own config file, named like $hostname.cfg. There is also a mesh.cfg, which groups all of the servers together in a host group. The $hostname.cfg files look like this:
define host{
use generic-host
host_name m-app1
alias m-app1
address 10.10.10.1
}
define service{
use generic-full
host_name m-app1
service_description PING
check_command check_ping!100.0,20%!500.0,60%
}
define service{
use generic-full
host_name m-app1
service_description DISKUSE
check_command check_nrpe!check_df
}
Each server has a host definition at the top, and all of the services that are monitored on that server at the bottom. The first section’s “use generic-host” line pulls in a host template from generic.cfg, just like the generic-admin definition shown above. Each subsequent “define service” section has a “use” line that calls the service templates defined in generic.cfg. Putting each server in its own file makes it very easy to add and remove servers from Nagios. To remove one, just remove (or, safer, rename) the $hostname.cfg file and delete the name from the $groupname.cfg file. It’s also very easy to script the creation of new hosts given a list of host names and IP addresses.
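A rough sketch of that kind of generation script, assuming an input file of “hostname address” pairs, might look like this:
#!/bin/sh
# new-hosts.sh: generate $hostname.cfg files for the "mesh" group from a
# list of "hostname ip" pairs, one pair per line, in hostlist.txt.
nag=/usr/local/nagios
while read host ip; do
  cat > $nag/etc/mesh/$host.cfg <<EOF
define host{
        use             generic-host
        host_name       $host
        alias           $host
        address         $ip
        }

define service{
        use                     generic-full
        host_name               $host
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        }
EOF
  echo "wrote $nag/etc/mesh/$host.cfg"
done < hostlist.txt
# Remember to add the new host names to mesh.cfg, then verify with:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg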
The mesh.cfg file contains the hostgroup configuration for the group:
define hostgroup{
hostgroup_name mesh
alias Mesh Production
members mdbs1,mdbs2,mdbs3,mdbs4,mdbs5,mdbs6,mdbs7,m-app1,m-app2,m-app3,m-store1,m-store2,m-nfs1,m-nfs2
}
This file is not as important, but it makes the Nagios web interface a little more helpful.
You’ll also notice that the check_command line above contains “check_nrpe!check_df”. This means that I use the nrpe (Nagios Remote Plugin Executor) add-on to actually monitor the services on the remote hosts. Each server has nrpe installed, and has one configuration file (/usr/local/nagios/etc/nrpe.cfg). The nrpe.cfg file has a corresponding line that says
command[check_df]=/usr/local/nagios/libexec/check_disk -e -L -w 6% -c 4%
This translates the check_df command sent by the check_nrpe command into the longer command defined above. This makes it easy to install and configure nrpe once, then zip up the /usr/local/nagios directory and unzip it on all new servers.
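A quick way to prove the whole chain works, before Nagios ever schedules a check, is to run the plugin by hand from the Nagios server (using one of the mesh hosts above):
# Call the remote check exactly the way Nagios will.
/usr/local/nagios/libexec/check_nrpe -H m-app1 -c check_df
# A healthy reply looks something like:
# DISK OK - free space: / 14522 MB (72% inode=98%);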
Nagios is nearly limitless in its abilities, but because of the complexity of its configuration it can be daunting to newcomers. This setup is designed to make it just a little bit easier to understand, and easier to script.
New SysAdmin Tips
My answer to a great question over at serverfault.
First off, find your logs. Most Linux distros log to /var/log/messages, although I’ve seen a couple log to /var/log/syslog. If something is wrong, most likely there will be some relevant information in the logs. Also, if you are dealing with email at all, don’t forget /var/log/mail. Double-check your applications, find out if any of them log somewhere ridiculous, outside of syslog.
Brush up on your vi skills. Nano might be what all the cool kids are using these days, but experience has taught me that vi is the only text editor that is guaranteed to be on the system. Once you get used to the keyboard shortcuts, and start creating your own triggers, vi will be like second nature to you.
Read the man page, and then run the following commands on each machine, and copy the results into your documentation:
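The exact list will vary from shop to shop, but a minimal environment inventory along these lines is the idea:
# Basic "what am I looking at?" inventory, run once per server.
hostname                # server name
cat /etc/*release*      # distribution and version
uname -a                # kernel version and architecture
df -h                   # filesystems, sizes, and free space
free -m                 # memory and swap
ip addr                 # network interfaces and addresses (or ifconfig -a)
crontab -l              # scheduled jobs for the current user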
That will serve as the beginnings of your documentation. Those commands let you know your environment, and can help narrow down problems later on.
Grep through your logs and search for “error” or “failed”. That will give you an idea of what’s not working as it should. Your users will give you their opinion on what’s wrong; listen closely to what they have to say. They don’t understand the system, but they see it in a different way than you do.
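Something as simple as this is enough to get started (adjust the path if your distro logs to /var/log/syslog):
# Case-insensitive sweep of the main log for anything that looks unhappy.
grep -iE 'error|failed' /var/log/messages | less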
When you have a problem, check things in this order:
- Disk Space (df -h): Linux, and some apps that run on Linux, do some very strange things when disk space runs out. It may seem unrelated, until you check and find a filesystem 100% full.
- Top: Top will let you know if you’ve got some process that’s stuck out there eating up all of your available CPU cycles. Nothing should consume 99% CPU for any extended period of time. If it’s a legitimate process, it should probably fluctuate up and down. While you are in top, check…
- System Load: The system load should normally be below 3 on a standard server or workstation. The system load is based on CPU, memory, and I/O.
- Memory (free -m): RAM use in Linux is a little different. It’s not uncommon to see a server with nearly all of its RAM used up. Don’t panic; if you see this, it’s mostly just cache, and will be cleared out as needed. However, pay close attention to the amount of swap in use. If possible, keep this as close to zero as you can. Insufficient memory can lead to all kinds of performance problems.
- Logs: Go back to your logs, run tail -500 /var/log/messages | more, and start reading through to see what’s been going on. Hopefully, the logs will be able to point you in the direction you need to go next.
A well maintained Linux server can run for years without problems. We just shut one down that had been running for 748 days, and we only shut it down because we had migrated the application over to new hardware. Hopefully, this will help you get your feet wet, and get you off to a good start.
One last thing, always make a copy of a config file you intend to change, and always copy the line you are changing, and comment out the original, adding your reason for changing it. This will get you into the habit of documenting as you go, and may save your hide 9 months down the road.
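For example, before touching sshd_config (the file, date, and change here are purely illustrative):
# Keep a dated copy of the original...
cp -p /etc/ssh/sshd_config /etc/ssh/sshd_config.20091001.bak

# ...and inside the file, leave the old line commented out with a reason:
# #PermitRootLogin yes    # changed 2009-10-01 jbuys: root logs in via sudo only
# PermitRootLogin no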
Linux Hidden ARP
To enable an interface on a web server to be part of an IBM load balanced cluster, we need to be able to share an IP address between multiple machines. This breaks the IP protocol, however, because you could never be sure which machine will answer a request for that IP address. To fix this problem, we need to get down into the IP stack and investigate how the Address Resolution Protocol, or ARP, works.
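On a reasonably recent kernel, the usual approach is to put the shared address on the loopback interface and tell the box not to answer ARP for it; a minimal sketch (the VIP and interface are placeholders, and this is the generic Linux method rather than anything IBM-specific):
# Bring the shared (virtual) IP up on loopback with a /32 mask so the
# real server accepts traffic for it without owning it on the wire.
ifconfig lo:0 10.10.10.50 netmask 255.255.255.255 up

# Keep the kernel from answering ARP requests for addresses that live
# only on loopback, so the load balancer stays the one ARP responder.
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
sysctl -w net.ipv4.conf.eth0.arp_ignore=1
sysctl -w net.ipv4.conf.eth0.arp_announce=2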
SLES and RHEL
Comparing two server operating systems, like SuSE Linux Enterprise Server (SLES) and RedHat Enterprise Linux (RHEL), needs to answer one question, “what do we want to do with the overall system”? The version of Linux running underneath the application is immaterial, as long as the application supports that version. It is my opinion that we should choose the OS that supports all of our applications, and gives us the best value for our money.
The Unix Love Affair
There’s been times when I’ve walked away from the command line, times when I’ve thought about doing something else for a living. There’s even been brief periods of time when I’ve flirted with Windows servers. However, I’ve always come back to Unix, in one form or another. Starting with Solaris, then OpenBSD, then every flavor of Linux under the sun, to AIX, and back to Linux. Unix is something that I understand, something that makes sense.
Back in ‘96 when I started in the tech field, I discovered that I have a knack for understanding technology. Back then it was HF receivers and transmitters, circuit flow and 9600 baud circuits. Now I’m binding dual gigabit NICs together for additional bandwidth and failover in Red Hat. The process, the flow of logic, and the basics of troubleshooting still remain the same.
To troubleshoot a system effectively, you need to do more than just follow a list of pre-defined steps. You need to understand the system, you need to know the deep internals of not only how it works, but why. In the past 13 years of working in technology, I’ve found that learning the why is vastly more valuable.
Which brings me back to why I love working with Unix systems again. I understand why they act the way that they do; I understand the nature of the behavior. I find the layout of the filesystem to be elegant, and a minimally configured system to be best. I know that there are a lot of problems with the FHS, and I know that it’s been mangled more than once, but still. In Unix, everything is configured with a text file somewhere, normally in /etc, but from time to time somewhere else. Everything is a file, which is why tools like lsof work so well.
Yes, Unix can be frustrating, and yes, there are things that other operating systems do better. It is far from perfect, and has many faults. But, in the end, there is so much more to love about Unix than there is to hate.
Slowly Evolving an IT System
We are going through a major migration at work, upgrading our four and a half year old IBM blades to brand spanking new HP BL460 G6’s. We run a web infrastructure, and the current plan is to put our F5’s, application servers, and databases in place, test them all out, and then take a downtime to swing IPs over and bring up the new system. It’s a great plan, it’s going to work perfectly, and we will have the least amount of downtime with this plan. Also… I hate it.
The reason I hate it has more to do with technical philosophy than with actual hard facts. I prefer a slow and steady evolution, a recognition that we are not putting in a static system, but a living organism whose parts are made up of bits and silicon. What I’d like to do is put in the database servers first, then swing over the application servers, and then the F5, which is going to replace our external web servers and load balancers. One part at a time, and if we really did it right, we could do each part with very little downtime at all. However, I can see the point in putting in everything at once: you test the entire system from top to bottom, make sure it works, and when everyone is absolutely certain that all the parts work together, flip the switch and go live. But… then what?
What about six months down the road when we are ready to add capacity to the system, what about adding another database server, what about adding additional application servers to spread out the load, what about patches?
Operating systems are not something that you put into place and never touch again. IT systems made up of multiple servers should not be viewed as fragile, breakable things that should not be touched. We can’t set this system up and expect it to be the same three years from now when the lease on the hardware is up. God willing, it’s going to grow, flourish, change.
Our problems are less about technology, and more about our corporate culture.
Systems Administrator
From time to time I’m asked by members of my family or friends of mine outside the tech industry what it is that I do for a living. When I respond that I’m a sysadmin, or systems administrator for Linux and UNIX servers, more times than not I get the “deer in the headlights” look that says I may as well be speaking Greek. So, for a while, I’ve taken to saying “I work in IT”, or “I work with computers, really big computers” or even “I’m a computer programmer”, which isn’t exactly accurate. Although I do write scripts, or even some moderate perl, I’m still not officially a programmer. I’m a systems administrator, so, let me try to explain, my dear friends and family, what it is I do in my little box all day.
First, some basics, let’s start at square one. Computers are comprised of two parts, hardware and software. Sort of like the body and soul of a person. Without hardware, software is useless, and vice-versa. The most basic parts of the hardware are the CPU, which is the brain, the RAM, which is the memory, the disk, which is a place to put things, and the network card, which lets you talk to other computers. For each of these pieces of hardware there needs to be some way to tell them how to do what they are intended to do. Software tells the hardware what to do. I forgot two important pieces of hardware: the screen and the keyboard/mouse. They let us interact with the computer, at least until I can just tell it what to do Star Trek style.
Getting all of these pieces of hardware doing the right thing at the right time is complicated, and requires a structured system, along with rules that govern how people can interact with the computer. This system is the Operating System (OS). There are many popular operating systems: Windows, OS X, and Linux are the big three right now. The OS tells the hardware what to do, and allows the user to add other applications (programs) to the computer.
Smaller computers, like your home desktop or laptop, have network cards to get on the Internet. The network card will be either wired or wireless; that doesn’t really matter. When you get on the Internet, you can send and receive information to and from other computers. This information could be an email, a web page, music, or lots of other media. Most of the time, you are getting this information from a large computer, or large group of computers, that give out information to lots of home computers just like yours. Since these computers “serve” information, they are referred to as Servers.
Large servers are much like your home computer. They have CPU, RAM, disk, etc… They just have more of it. The basics still apply though. Servers have their own operating system, normally either Windows, Linux or UNIX. Some web sites or web services (like email) can live on lots of different servers, each server having its own job to do to make sure that you can load a web page in your browser. To manage, or “administer” these servers is my job. I administer the system that ensures the servers are doing what they are supposed to do. I am a systems administrator. It is my responsibility to make sure that the servers are physically where they are supposed to be (a data center, in a rack), that they have power and networking, that the OS is installed and up to date, and that the OS is properly configured to do its job, whatever that job may be.
I am specifically a UNIX sysadmin, which means that I’ve spent time learning the UNIX interface, which is mostly text typed into a terminal, and it looks a lot like code. This differs from Windows sysadmins, who spend most of their time in an interface that looks similar to a Windows desktop computer. UNIX has evolved into Linux, which is more user friendly and flexible, and also where I spend most of my time.
Being a sysadmin is a good job in a tech driven economy. I’ve got my reservations about its future, but I may be wrong. Even if I’m not, the IT field changes so rapidly that I’m sure what I’m doing now is not what I’ll be doing 5-10 years from now. One of these days, maybe I’ll open a coffee shop or a restaurant, or I’ll finally write a book.
JeOS
For better or worse, we are starting to put Ubuntu JeOS images into production in our network. Starting off, we will only put these systems in for our non-IBM services, no WebSphere or DB2, as IBM doesn’t officially support this configuration yet, but for everything else, JeOS looks like a perfect fit.
The Sorry State of Enterprise Software
I’ve been unlucky enough to be working with quite a few pieces of so called “enterprise” software, the worst of which I’ve been working with lately is called the Tivoli Workload Scheduler. TWS is, at its core, a glorified cron. It is a scheduler, you can create jobs, or scripts, and have them executed at given times. You are supposed to be able to cascade jobs, and create dependencies between jobs. This is all well and good, but there are some serious problems with this software.
The first problem is the price. List price for TWS is $33 per value unit. IBM bases its pricing scheme on how many CPU cores are in the server that you install their software on: 100 value units per single-core CPU, and 50 value units per core for dual- or quad-core CPUs. So, if you have four servers, and each server has four quad-core CPUs in them, that comes out to around $26,400. I think we just went ahead and bought 1000 value units up front. That’s a fairly good sized amount, and it does not include the cost of the consultant it’s going to take to install, configure, and actually use the software.
Why tie the cost of the software to the number of cores in the system? TWS doesn’t use CPU resources to actually do any work; it passes the work off to other applications and simply schedules them to run. The price would almost be bearable if the software actually worked. For $26,000 I’d think that it ought to make me coffee and pancakes in the morning. The reality is that after several months of enduring the software, it still doesn’t work properly.
The end user of the system has been trying to add event rules that fire off an email if a job doesn’t end correctly. Wow, that’s like, what… one line of shell script? But, since this is the TWS, we have to put in a call to IBM. IBM will call back, and ask for a ton of information. They’ll ask for directories that don’t exist, ask you to run commands that may or may not work, and generally take up a lot of time. Meanwhile, I’m starting to think that we are actually beta testing this software for IBM, and they just didn’t bother to tell us.
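For the record, that one line of shell script would look roughly like this under plain old cron (the job name and address are made up):
# Fire off an email if the job exits non-zero. That's the whole "event rule".
/usr/local/bin/nightly_batch.sh || \
    echo "nightly_batch exited non-zero on $(hostname)" | \
    mail -s "nightly_batch FAILED" oncall@example.com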
And then there’s the user interface. The UI, like many IBM applications, is quite obviously built on Java, evidenced by the length of time it takes to launch. Once it is launched, there are cascading left-to-right areas of a single window that allow you to perform separate tasks. At $work, I’ve got a 22” monitor, and this is the only application that I expand to full screen. It needs it. The application, called the “Job Scheduling Console”, provides its own tabbed MDI interface. It is extremely confusing. Part of the confusion is that evidently the developers decided that there were too many options in the main application window, and chose to add a second interface to TWS through its integrated WebSphere application server. The second interface, also Java, is accessed through a web browser. Unfortunately, not just any web browser: it seems to only support Internet Explorer. I tried to access it first through Chrome, which did not work at all, and then through Firefox, which almost worked, but there were pieces of the application missing. IE worked well. The web interface is just as jumbled as the fat client on the desktop. Buttons are seemingly randomly placed, some options are hidden in drop-down menus, and others are placed either above or below the data.
There is no clear, obvious method to accomplish anything with this user interface.
And that is not all my friends, oh no, that is not all. You must also have access to the command line on the server where TWS is installed. Even on the command line TWS is not a good citizen. There is no man page or online help shipped with the application, you have to load a ton of special environmental variables, and they provide scripts that launch a faux-shell that only accepts certain commands. One such command, conman, offers the ability to view the logs in real time (why, for the love of God, do you not log everything to syslog?), but only if you enter the command “con se” at the conman prompt. Also, you should enter “lev=4” to make sure you get all the logs. Proper logging in an application can be a lifesaver, and it could have been an area where TWS could redeem itself somewhat. That is not what has happened. The “con se” command only works sometimes. Other times it simply says that it submitted that command to be processed and returns you to your prompt. Great, thanks… so where’s my logs?
Having multiple interfaces to the application is fine, if you could accomplish everything needed in any one interface. However, that is also not the case. You need all three, and the end user must switch between the web interface and the fat client, and I as the administrator must switch between the web client, the fat client, and the command line to try to coax this monster into doing what it is supposed to do. Which is… schedule jobs. That’s really all this is supposed to do, schedule jobs to run. I don’t think it should be this hard.
Take these points into consideration in the light of the cost of the application. Now, let your jaw slowly close and realize that IBM can charge this much because it has found a market that no one else is tapping. TWS is only one example of horrible “enterprise” software, there’s a lot more of it out there. Personally, I see an opportunity here. An opportunity for well thought out, beautifully crafted software that works well, is easy to use, and gets the job done.
AutoYast
I wrote this last year and never posted it. I’m glad I found it and can post it now.
Nagios Check Scheduling
Or, maybe a better title for this would be “They rebooted the server, why didn’t I get a page?” I’ve had that question asked of me a few times, and I’ve never had a good answer, so I thought I’d take a closer look at Nagios and see what is going on.
End of an Era
The hard thing about keeping a job in the technology field is that it is constantly changing. Just this past summer $WORK fired several mainframe workers who could not keep up. They got stuck on one technology that they knew how to operate, and failed to evolve when the field did. Now I think it’s clear that another sector of the job market is on its way out, the one that I, and thousands of others, occupy: the job title of systems administrator.
System P
The more I learn about IBM’s P-Series UNIX systems, the more impressed I am. I’ve been a very harsh critic of them in the past, but that may have just been my ignorance of the platform. The P is, no doubt, expensive… however, when you look at what it can do, and at how many x86 systems you’d need to do the same thing, the P begins to justify its cost.
As an example, we are looking at building a new web hosting environment off of WebSphere. To accomplish this, we are looking at four database servers (DB2), and between six and eight application servers. The total cost for the project, not including the F5 switch, I’d imagine to be somewhere around $100,000. With that money, we could purchase one P-Series that would do everything we need on one box. That equates to less cabling, less administration, less network overhead, and a smaller footprint for the PCI auditors. One box, maybe four Logical Partitions (LPARs), and that’s it.
AIX, IBM’s version of UNIX, is another big win for the P-Series. Creating a mksysb gives you a bootable DVD clone of a running system, so you can clone an LPAR and install it, along with all the applications you have installed, on a new P-Series. Very impressive, and I wish more systems had this feature built in. AIX has its peculiarities. SMITTY, the administration interface, is confusing and difficult to navigate, and expanding a logical volume on the fly requires more steps than I think should be necessary. Many of the shortcomings of AIX can be solved by installing the AIX Toolbox for Linux, which includes a lot of the basic Linux tools compiled for AIX. Like bash… I can’t live without my tab-completion and vi keyboard bindings! On the whole, AIX is an extremely stable operating system. Configuration is more complex than other systems, but once it’s set up, you can let it run for years without intervention.
I’ll be getting more in-depth with a P550, P561, P570, and one more I’m not sure of the model number of. The next couple of months should be interesting.
The Linux Box and Upgrading Java
As a general rule, I really don’t like to go outside of the box when it comes to Linux. And by that, I mean that I don’t like going outside of what is provided by whatever distribution you are using, be that SLES, Red Hat, or Ubuntu. A lot of people put a lot of work into making sure that the packages that are available for the distribution actually work in the distribution and do not interfere with any other apps. Linux will let you do whatever you want, but just because you can do something doesn’t mean that you should.
Going outside the box can have disastrous results with Linux. Back in early 2000 and 2001 when I was installing SuSE and Mandrake on my old IBM box, I wound up in dependency hell more than once. If you’ve never been there, it goes something like this:
OK, I want to upgrade my music player to the latest version, so I’ll download the latest RPM. Wait, that failed, because it depends on a newer version of some library file that I don’t have, so I’ll go search the Internet and try to find that. OK, found it, downloaded the rpm, and it failed to install because it depends on a newer version of some other library file that I don’t have. Looks like there’s no RPM for that library, so I’ll download the source code and compile it. OK, ./configure; make; make install; Nope, that failed because of a gigantic list of dependencies that are not available! At this point, you have to make a decision: Do you go ahead and find the dependencies, or do you give up and have a drink instead. If you choose to go ahead, you download the source to a dozen different packages and install them, then compile your library, then compile your other library, then go to install the rpm to find that it fails because one of the applications you upgraded along the way is, get this, too new to support your music player, and the install still fails. Oh, and by the way, half of your other apps that used to work, don’t work anymore.
Agility
To create the perfect datacenter, what would you recommend? For me, the perfect datacenter would be based on agility. We would be able to add new capacity when needed, and reallocate resources whenever needed, quickly and easily. We would be able to backup everything, securely and easily, off-site. We would use, whenever possible, open source software so we would not be constricted by licensing schemes. Would we have a SAN? Yes, most likely something very simple to administrate, like a NetApp. We would boot from the SAN, have no moving parts in the servers themselves, so we would have very few hardware failures. Whenever possible we would keep to one style of hardware, ie: all blades, or all 1U rack mounts, etc…
License Restrictions
Software licensing is one of the biggest expenses of high-end server systems. The vendors charge you not only to use the software, but they charge you for how efficiently you want to use the software as well. IBM, for example, charges a different license fee for AIX determined by how many CPUs are in the system. So, to scale in response to load, whether it’s up or out, you have to pay for additional hardware, and then you have to pay for the ability to use that hardware. We are not talking small numbers here either; we are talking upwards of six figures, in addition to the cost of the hardware. On top of that, if you are using proprietary applications on top of the OS, you are going to have to pay additional licensing fees for those as well. WebSphere in particular charges on a per-CPU basis.