December 31, 2004

Ops, Time Management, and the proper use of laze

This is actually about Operations, not Code. I apologize for offkilter categories. I'm trying to resist the temptation to balloon the number of categories out to fit what I feel they should be, as I think that would approach roughly the number of posts in the blog.

Hi ho.

Anyhow, on to the meat of it.

I was recently involved in a bit of a disagreement with a cow orker (moo) about the proper way to go about 'fixing a problem' at work. I am an Op. What does that mean? Well, that's part of the problem. Since this is my blog, I get to pontificate for a moment. When I say I'm an Op, I mean I am responsible for Operations, in the computer and information technology sense, for my firm. I do not mean simply that "I work for IT." That's not specific enough. I do a strange mix of jobs that doesn't fit well in a large corp, and fits much more neatly at a small dotcom - I do whatever is required to keep things running at the corporate level if it involves computers and infrastructure. While this can include fixing the CEO's secretary's laptop, you'll have to convince me that the CEO's secretary's laptop is damn well mission critical, unless I signed on to do end-user support when you hired me...and believe me, you're paying me more if I did.

The problem with being an Op is that it is extremely difficult to explain to those who are not Ops what I do with my time, and how I prioritize it and parcel it out. Please note: this is not an attempt to pretend that I am not, at some level, lazy. Far from it. In fact, I am strongly of the belief that at the core of every truly inspired Op is a tightly held and well-managed streak of pure slothfulness, which when properly channeled can produce genius of time-fu. I shall explain. And no, I don't have it, really; I can only approximate it.

I spend my work time performing three types of task - or rather, in three states of work. I shall call them

  • Routinized
  • Scheduled
  • Chaotic

Routinized tasks are pre-planned, scheduled tasks. Maintenance, in other words; expected, allocated time, used to perform regular tasks in a (usually) repetitive cycle. Backups, security audits, user account maintenance, inventory checks, system management (e.g. installing OS upgrades, as opposed to testing them), etc. Anything and everything that you know in advance needs to be done, that you've scheduled time for, and that you can check off your task list for the day/week/month/whatever when you're done. Whack-a-mole type stuff.

These tasks are usually boring, but important. Stuff that cannot get skipped, or things stop working. It is the goal of all good Ops to attempt, on a continuous basis, to minimize the amount of time they spend on these tasks - through automation, through process management, and through careful sanity checking on their task lists. The reasons are manifold, but here are some of the most important. First of all, the more of these tasks you have to do, the more time each routine 'cycle' takes - and the less time you have for other types of tasks, as I'll outline below. Second, the more tasks there are, and the more complex they are for humans to carry out (i.e. the more manual they are) the easier it is for mistakes to be made, steps to be forgotten, or subtasks to be missed - resulting in bad state which then causes Problems. Problems are Not Good, and result in Chaotic Tasks. Third, we're Ops, and we're lazy - we don't like working on boring things. It's not fun and makes us irritable, and you won't like us when we're irritable.

With me so far? Okay, good. This brings us to our next type of task - the Scheduled task, which occupies scheduled time. These are 'one off' tasks which are not routine, but are usually preplanned, known-outcome tasks with deadlines. While they can be interrupted, doing so exacts both 'shutdown' and 'startup' costs for the task, and ties up the resources (lab space, spare/swap machines, etc.) needed to maintain potentially fragile 'middle states' of the task. Some examples include hacking/scripting to develop new automation, configuring machines to perform new duties, configuring new machines to replace older servers in a preplanned switchover, implementing new services on existing servers, implementing test plans for new services/software, etc. In short, the expected use of time to carry out more demanding work with less tolerance for interruption, but which does not represent a 'regular task list.' Time spent in Routinized tasks, naturally, cannot be used for Scheduled tasks, and vice versa. So this is a primary reason to avoid overburdening Ops with Routinized task loads.
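For the arithmetically inclined, here is a rough sketch of that trade-off in Python. Everything in it is made up for illustration: the task names, the minutes, the 40-hour week, and the `RoutinizedTask` and `scheduled_minutes_left` names are all assumptions of mine, not a real accounting of anybody's job. It only shows that whatever the Routinized cycle eats is no longer there for Scheduled work.

```python
from dataclasses import dataclass


@dataclass
class RoutinizedTask:
    """A recurring maintenance chore (hypothetical model)."""
    name: str
    minutes_per_run: int
    runs_per_week: int

    def weekly_minutes(self) -> int:
        return self.minutes_per_run * self.runs_per_week


def scheduled_minutes_left(week_minutes: int, tasks: list[RoutinizedTask]) -> int:
    """Minutes left for Scheduled work once the Routinized cycle is paid for."""
    routinized = sum(t.weekly_minutes() for t in tasks)
    return max(week_minutes - routinized, 0)


# Hypothetical numbers, for illustration only.
WEEK = 40 * 60
tasks = [
    RoutinizedTask("check backups", 20, 5),
    RoutinizedTask("user account maintenance", 30, 2),
    RoutinizedTask("security audit", 120, 1),
]
before = scheduled_minutes_left(WEEK, tasks)

# One more "small" manual chore: 10 minutes, every working day.
tasks.append(RoutinizedTask("hand-rotate a log file", 10, 5))
after = scheduled_minutes_left(WEEK, tasks)

print(before, after, before - after)
```

With these invented numbers, the one extra ten-minute chore costs 50 minutes of Scheduled time every single week, and that is before anybody forgets to do it.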

Now, if these were all an Op had to worry about (like, say, if s/he were a coder, or a web developer, perhaps) then things wouldn't be too bad. Nice Gantt Charts could be drawn, timeslices could be set up, and spreadsheets could be done showing managers precisely what was being done when. The problem is that this isn't what Ops do. This brings us to the third type of task and time.

Sometimes, a server gets compromised at 2 am. Sometimes the mailserver eats its disk array at noon on Saturday. Sometimes for whatever reason, somebody needs a new public-facing machine up in 5 hours, and even the Op is forced to agree they may have a point. These are Chaotic tasks. I call them that because even in the case of building a new machine, being forced to do so with no advance warning usually introduces enough uncertainty (What machine are we using? Do we have enough RAM for that? Are the disks OK? Do we have drivers for that? Well, will it fit that rack? Oh, they want SLES, but it's a Dell 2650? Well, we have to run RHEL then, because we don't have RAID drivers for SLES for that box...etcetera, etcetera) that claiming these are preplanned processes is a bit of a joke. In the case of a full recovery, making time predictions may even be criminally stupid.

So there's the rub. You have three different types of demands. You want to make sure that when chaotic events, or tasks, arise, they'll get done - because they are almost always mission-critical for somebody or other. You need to make sure that the routinized tasks get done without too much interruption, or things start breaking. You need to make sure that the scheduled tasks get done in some reasonable timeframe, or you start falling behind in your ability to implement services and become a purely maintenance organization (and a poorly performing one, at that).

This will mean people will come in and want to know why you're reading mail rather than implementing the new mail server. One answer: Because I don't have enough of a scheduled time block to do so at the moment. Another: I'm on call at the moment, and can't get too deep into something that I can't drop.

Whatever the reason, it's true, there are limits. They're different for people and places, and they're hard to know without knowledge of the organization and people in question. But the next time you're told, as an Op, that you can easily fix something with a minor little tweak that requires semi-regular touching by a human, remember - if you say yes, you just added something small but recurring to your routinized workload - another thing that might get forgotten or missed if time runs short, and another thing that steals cycle time on a regular basis.

Posted by jbz at December 31, 2004 1:37 AM
