Upon investigation, we found that in all cases, the messages were showing up in mailman's list archives, indicating that they had been received by postfix on the list server, passed to mailman, processed by mailman, and added to the list archives in a timely fashion. So the hangup was somewhere either in mailman's outbound message processing or afterwards.
Looking at mailman's queue, we found that mailman was running and its queue contained only seven or eight messages. Nope.
Moving along, we checked postfix on the list server. Whoa. Nine thousand messages. Running mailq showed us all of them. Although a couple of thousand of them were in various states which indicated 'normal' deferral, such as 'Connection timed out' or 'Host not found' or 'Mailbox full', several thousand had the following error message:
status=deferred (delivery temporarily suspended: unknown mail transport error)
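Sorting a nine-thousand-message queue by deferral reason is a one-liner against a saved mailq listing. A minimal sketch; the two sample queue entries below are made up for illustration, not our actual queue:

```shell
# Tally deferral reasons from a saved queue listing. mailq (an alias
# for postqueue -p) prints each deferred message's reason in
# parentheses; the sample lines below are illustrative only.
cat > queue.txt <<'EOF'
A1B2C3D4E5*    1024 Tue Sep 20 10:00:12  owner@lists.example.com
(connect to mx.example.net[192.0.2.1]:25: Connection timed out)
                                         user@example.net

F6G7H8I9J0     2048 Tue Sep 20 10:01:33  owner@lists.example.com
(delivery temporarily suspended: unknown mail transport error)
                                         other@example.org
EOF
# One line per distinct reason, with a count, most common first.
grep -o '([^)]*)' queue.txt | sort | uniq -c | sort -rn
```

This is how you spot that several thousand messages share one abnormal reason while the rest are garden-variety deferrals.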
Checked the process table. Postfix had multiple smtp agents running. Tailed the logs, and things were happening...but as we watched, messages flew by, deferring with that same error. We reloaded postfix, and confirmed that it reloaded its configuration - but no difference, same behavior. Okay, so postfix was functional, as far as it could tell us, but something was definitely wrong. Corrupt files? Perhaps.
We stopped postfix and had it run a check. It helpfully found the five or six known corrupt message files that it had already quarantined, announced that all of its permissions were correct, and returned. We started it back up. This time, mail began to flow properly. Looking in the mailqueue, however, many of the messages were still marked as deferred with that error. We manually flushed the queue (postqueue -f) and, watching the logs, saw them flood back into active status. A few minutes later, our list mail folders filled with backlogged list mail.
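For reference, the recovery sequence above was nothing exotic, just the standard Postfix admin tools run as root. Sketched here with a guard so it is a no-op on a box without Postfix:

```shell
# Recovery sequence as described above. Requires root on a machine
# running Postfix; the guard makes this sketch harmless elsewhere.
if command -v postfix >/dev/null 2>&1; then
    postfix stop       # shut down all Postfix daemons
    postfix check      # verify ownership/permissions, report damage
    postfix start      # bring the daemons back up
    postqueue -f       # flush the queue: retry all deferred mail now
else
    echo "postfix not installed; skipping"
fi
```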
A few hours later, it happened again.
Then a couple of hours later, again.
I investigated further. Looking in the logs, I found the moment the first message came back as deferred. A few seconds prior to this, I found a warning:
postfix/qmgr[9083]: warning: premature end-of-input on private/smtp socket while reading input attribute name
Going back to the prior incidents, I found similar but not identical errors a few seconds prior to them as well. In one case, I found a 'panic: myfree' error warning of corrupt or unallocated memory. In each incident, however, what apparently happened was that one of the smtp agents had crashed and the qmgr (queue manager) had caught wind of that fact. The process ID number given in the warning message by the qmgr (9083 in the example above) was the proc number of the agent that had died, so by backtracking in the logs I could figure out which message ID the agent in question had been handling when it stopped reporting.
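The backtracking amounts to grepping the mail log for the dead agent's PID and reading off the last message ID it logged. A minimal sketch, with made-up log lines standing in for the real ones (PID 9083 follows the example above):

```shell
# Find the last message ID a crashed smtp agent logged before the
# qmgr noticed it was gone. These log lines and PIDs are invented
# for illustration, following the layout of the warning above.
cat > maillog <<'EOF'
Sep 20 12:01:05 lists postfix/smtp[9083]: 4F2A1B: to=<user@example.net>, relay=mx.example.net[192.0.2.1]:25, status=sent (250 Ok)
Sep 20 12:01:09 lists postfix/smtp[9083]: 7C3D2E: to=<other@example.org>, relay=mx.example.org[192.0.2.2]:25
Sep 20 12:01:12 lists postfix/qmgr[9083]: warning: premature end-of-input on private/smtp socket while reading input attribute name
EOF
# Field 6 of the agent's final log line is the queue/message ID.
grep 'postfix/smtp\[9083\]' maillog | tail -n 1 | awk '{print $6}' | tr -d ':'
# -> 7C3D2E
```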
The problem was that when I restarted postfix, I found that in every case, those same messages that were in process during the crashes seemed to go out just fine. So it wasn't a case of a malformed or corrupt message, which is what had always been the cause of such behavior in my experience with postfix up to this point.
At this point, I had a problem. Well, two problems, potentially three. Problem one: an unknown factor was causing the intermittent but frequent crash of smtp agents on my lists machine. This by itself was merely annoying, given that postfix's master process was properly figuring this out, reporting it, and spawning new agents. However, it was rendered critical by problem two: Postfix, for unknown reasons of its own, upon discovering the crash of one of these agents, would start marking large numbers of messages in the queue as 'undeliverable.' Potential problem three: As far as I could tell, there wasn't any attempt later on to revisit those messages marked as such.
I'm not going to even try to address why postfix was behaving that way. Suffice to say it pissed me off immensely.
I attempted the almighty Google search. I found a couple of hits on the error messages I was receiving. The answers, from the Postfix coders themselves, seemed to fall into two categories. One: this kind of problem could be introduced when using mail-scanning or routing scripts that directly touch the mail queue. Okay, not me; I'm using mailman, but only in the prescribed manner, i.e. postfix delivers to mailman and mailman uses postfix's available tools to inject messages back into the queue. Two: memory or hard drive troubles on the server. Okay, fair enough, I'll check.
Ran full hard drive checks. Nothing. Just for giggles, took the system drive of the server out of mirror mode, ran it on each of the mirrors individually. Showed the problem both times. Put it back into (PERC hardware) mirror. I consider that to come as close as I can to eliminating hard drive corruption as a cause. To be safe, I disabled swap on the machine in case there was a problem with the swapfile, and ran it on physical RAM. Nope, problem still showed up. Okay. Problem isn't in swap disk.
What about RAM? Downed the server and ran memtest86+. Note that this program doesn't really like PowerEdge 2650s - it shows a constant error on one address in high RAM which I suspect is used by the ServerWorks management hardware. Other than that...nothing, really. Decided to be safe. Ordered new DIMMs from Dell. Installed. Nope. Same problem. I consider that, again, to essentially rule out the 'bad RAM' problem.
Finally, I was forced to do what I didn't want to do. I upgraded postfix past the official Red Hat release, to 2.2.5.
It's been 12 hours, and it hasn't crashed an agent once, nor marked any of the mail as temporarily suspended.
I'm a little annoyed about this, and here's why.
First of all, the postfix crew were adamant that this wasn't postfix's fault. I understand their bias, but, um, heh. I'll let that one go. I wasn't using the 'current' version anyway.
What really pisses me off: Why the hell did postfix 'handle' a crashed smtp agent like that anyway? How am I to know it won't *still* do that? Please don't tell me to go read the code. I'm an op. I'm not a software engineer. If the only way I can be reasonably sure your software will work the way it should is to read the code, because testing a prior version has resulted in behavior I can't explain and this behavior hasn't been addressed in docs, then there's a problem. Back to the point: Why does a crashed smtp agent result in messages in the queue being flagged as undeliverable? The entire reason for there being a watchdog process and multiple smtp agents (well, one reason) is so that one agent dying shouldn't be able to tank mail delivery as a whole.
Anyway, just another day as an Op.
Posted by jbz at September 20, 2005 1:20 PM