Off-site status information for cornerhost.com.

3/01/2008

the queues look good

Things seem to be running smoothly again with the mail queues. Pretty much everything has been delivered.

Some people reported getting copies of mail they had already received. I'm not sure why the duplicates were in the queue, but now that things are cleared out, that shouldn't happen anymore.

I was working on a queue management script as I went along, and have dramatically improved my ability to deal with this kind of problem in the future. However, I'd much rather prevent it from happening in the first place. I think one of the most important steps I can take here is to stop enabling catchall rules by default, and very likely to prevent catchall addresses from being forwarded at all. I'm still working out the details on that, but expect to hear more soon.

2/29/2008

mail queue triage

The queue size and server load on manganese are under control, but I am still struggling with the massive backlog of undelivered mail.

There are several broad categories of mail to deal with, in order of importance:

  • mail that goes to a local mailbox that is checked on a regular basis.
  • mail that is forwarded to external mailboxes
  • mail that goes to a local mailbox that is rarely or never checked

The first and third categories (local mail) should be the quickest to process, but unfortunately they are mixed in with mail that needs to be forwarded.

Right now, I'm building a report of "last read" dates for each mailbox so I can separate out users who check their mail regularly and move local mail for these users to the front of the queue.

The next step will be to sort forwarded mail by destination server, so that I can uncover any specific services that are slow to accept mail and deal with them appropriately.
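
A rough sketch of what sorting forwarded mail by destination could look like, in Python. This is only an illustration, not the actual script: the function name group_by_destination, the (message id, recipient) input format, and the sample addresses are all assumptions made up for the example.

```python
from collections import defaultdict


def group_by_destination(forwards):
    """Group (message_id, recipient_address) pairs by recipient domain.

    Hypothetical helper: the input format is an assumption, not the
    layout of a real sendmail queue.
    """
    groups = defaultdict(list)
    for message_id, recipient in forwards:
        # Everything after the last "@" is the destination domain;
        # lowercase it so Example.org and example.org land together.
        domain = recipient.rsplit("@", 1)[-1].lower()
        groups[domain].append(message_id)
    return dict(groups)


# Made-up sample data.
sample = [
    ("a1", "alice@example.com"),
    ("b2", "bob@Example.org"),
    ("c3", "carol@example.com"),
]
by_domain = group_by_destination(sample)
```

With the messages grouped this way, a destination that is slow to accept mail shows up as one large group that can be moved aside on its own.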

The idea is for this to be a standard policy going forward, so (for example) if you are currently forwarding your mail to Gmail or Yahoo, you will get much faster service by having the mail delivered to a local box and having the outside service poll that box through POP3.

2/28/2008

manganese queue again

Well, the mail queue on manganese grew to 6000 entries again overnight.

Things are slow on the server, but I'll have a report within the hour with a "top 20" list of domains with the worst problems.

I can already see at least three accounts that got a ton of spam trapped in the queue from last night, so I'm going to deal with those now while the reporting system does its thing.

2/27/2008

mail queue status

I'm currently breaking mail queues on all machines into smaller queues of 500 to 1000 messages each. This is proving to be a rather long process, and may take several more hours.

Manganese in particular had over 50,000 messages in the queue. Vanadium and mercury both have about 20,000 - though these messages appear to be mostly old spam that has piled up over time.

My plan is to have this queue-splitting process running periodically from now on so that slow mail (such as mail to unreachable hosts) automatically gravitates to the secondary queues and sending can be retried less often. This in turn should speed up delivery of healthy mail since the server has less to do, and also free up CPU resources for everything else.
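
The splitting idea can be sketched roughly as follows, in Python. This is an illustration only, not the real queue splitter: the name split_queue and the message IDs are made up, and the real process moves sendmail queue files on disk rather than slicing a list in memory.

```python
def split_queue(message_ids, batch_size=500):
    """Break one big list of queued message IDs into smaller batches.

    Hypothetical helper: batch_size=500 is just the low end of the
    500 to 1000 range mentioned above.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        message_ids[i:i + batch_size]
        for i in range(0, len(message_ids), batch_size)
    ]


# Made-up IDs: 1,200 queued messages split into batches of at most 500.
batches = split_queue([f"msg{n:05d}" for n in range(1200)])
batch_sizes = [len(b) for b in batches]
```

Each batch would then become its own secondary queue, which can be retried on its own schedule.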

I'm just letting the queue splitters run for now. There's not much else I can do until they finish.

manganese update

The mail queue on manganese is still having troubles.

Part of the problem is that as the queue gets large, it takes longer and longer to process it. Sendmail itself has this problem, but it also affects my ability to find, select, move, and delete messages.

Sendmail stores each message as up to three separate files, each of which has to be parsed. Simply finding all messages to a particular domain, or running a report to see which domains are getting the most messages, can take over an hour.

Normally, the queues are quite small so this isn't a problem, but when the queue gets big, it can take a long long time to fix things just because everything is so slow.

I'm just as sick of dealing with this as you guys, so I'm updating my queue management script to cache all the data in a sqlite database. This will make it much easier for me to inspect the queue, and dramatically speed up batch commands like creating secondary queues.
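
To give a feel for why a sqlite cache helps, here is a minimal sketch in Python using the standard sqlite3 module. The table layout, column names, and the helpers build_cache and top_domains are assumptions invented for this example; the real script's schema is not described here. The point is that once the queue data is parsed a single time, a "which domains have the most mail" report becomes one query instead of a pass over every queue file.

```python
import sqlite3


def build_cache(rows):
    """Load (message_id, recipient_domain) rows into an in-memory database.

    Hypothetical schema: a real cache would likely live in a file and
    hold more columns (queue name, arrival time, and so on).
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE queue (message_id TEXT, recipient_domain TEXT)")
    db.executemany("INSERT INTO queue VALUES (?, ?)", rows)
    db.commit()
    return db


def top_domains(db, limit=20):
    """Return (domain, message_count) pairs, busiest domains first."""
    cur = db.execute(
        "SELECT recipient_domain, COUNT(*) AS n FROM queue "
        "GROUP BY recipient_domain "
        "ORDER BY n DESC, recipient_domain "
        "LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


# Made-up sample data standing in for parsed queue files.
db = build_cache([
    ("a1", "example.com"),
    ("b2", "example.org"),
    ("c3", "example.com"),
])
report = top_domains(db)
```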

As for what's actually causing the problem on manganese, it looks like it might be a combination of rejected mail forwards (people forwarding all their mail to another server) and a possible bottleneck when it comes to spamassassin.

I'm isolating the blocked forwards into separate queues for further analysis as I go along (again, this is very slow), and I've also increased the number of spamassassin daemons.

The box should be back up to speed by tomorrow. It may be a day or two before the mail clears out in the secondary queues. (I'll let you know if I isolate your mail.)
