Off-site status information for cornerhost.com.

1/11/2006

30 second timeouts, part 2

This is a follow-up to my earlier post about the 30 second timeout rule and its implications for programs like Movable Type.

Thank you, everyone who offered feedback. I read every reply and tried to address every point raised. I'm working from notes here and responding to a lot of issues at once, so forgive me if I don't credit everyone who offered an idea individually.

clarifications

First of all, I'm not trying to ban MT, and I don't ever want to be pushing people from one CMS to another every time there's a problem. Banning MT would be disastrous for cornerhost's financial prospects, so it's not at all what I was trying to say. MT is simply the most popular piece of software that routinely conflicts with the 30 second execution rule, and that rule happens to be very effective at protecting the servers.

Nor did I mean to imply that MT wasn't mature. I just meant that there are other blogging tools that are also mature and don't have MT's particular problem. WordPress blogs get just as much comment spam as MT blogs, and as Chuck Welch pointed out, there are hosts out there that have had to ban WordPress sites for hogging resources.

how MT degrades

The scalability problems in MT and WordPress are very different. MT uses static html files so that it's easy for thousands of people to view the same page at once. WordPress (or MT in dynamic page mode) has to make a database connection for each page view and then execute a bunch of database queries, so a massive influx of traffic can tie up the database and really slow things down.

MT degrades in a different way. Because all the pages on an MT site are rendered as html files, those pages need to be rebuilt every time something changes. For example, if someone posts a comment on an entry, then the front page and any archive or category page that shows that entry have to be updated, because those pages all show the number of comments at the bottom of the entry. The more comments and entries there are, the longer this takes. However, MT often runs much slower than it should, even accounting for these requirements. For example, someone told me last night that the time required for posting a comment to his site went from about 4 seconds to about 40 over the course of a couple of days.

If you think about the graph of an exponential function, this makes sense. Things run smoothly, almost flat, for a while, and then abruptly curve upward. I haven't jumped into the MT source code to prove this, but I suspect that the comment posting algorithm in MT displays this behavior: you grow too big, you hit a tipping point, and suddenly your site takes forever to do anything. Like I said, I haven't proved this, but it seems to fit the data.

Comment spam + Slow MT = dead server

Then along comes a spammer, ready to post 500 or even just 5 comments to a blog. Since the spam comes in faster than a "slow" MT site can deal with it, the mt-comment.cgi and mt-trackback.cgi processes pile up in memory. And remember, these aren't little tiny CGI scripts any more. Each one has inflated into a server-crippling resource hog of its own.
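To put rough numbers on the pile-up, here is a minimal Python sketch. It is only an illustration: the 2 second gap between spam comments is an assumed figure, the 4 and 40 second processing times are the ones reported above, and the model ignores the extra slowdown that comes from the processes competing with each other, so the real situation is worse.

```python
def concurrent_processes(arrival_interval, service_time, n_requests):
    """Return the peak number of comment processes alive at the same time.

    One process starts per incoming comment and runs for service_time
    seconds. Contention between processes is NOT modelled (assumption).
    """
    starts = [i * arrival_interval for i in range(n_requests)]
    peak = 0
    for t in starts:
        # Processes that started at or before t and have not finished by t.
        alive = sum(1 for s in starts if s <= t < s + service_time)
        peak = max(peak, alive)
    return peak


# Hypothetical spam run: one comment every 2 seconds.
print(concurrent_processes(2, 4, 500))   # healthy site, 4 s per comment
print(concurrent_processes(2, 40, 500))  # "slow" site, 40 s per comment
print(concurrent_processes(2, 40, 5))    # even 5 comments overlap completely
```

With these assumed numbers the healthy site never has more than 2 processes running, while the slow site holds 20 at once, and a burst of just 5 comments keeps all 5 in memory together.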

Each server at cornerhost has dual Xeon 2.8 GHz processors. These are FAST machines. But a program with exponential complexity can tie them up just the same. Think of the exponential graph as representing the amount of hardware resources necessary to complete the task: it shoots up so fast that it doesn't matter how powerful your computer is.
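Here is a small Python sketch of why faster hardware barely helps against exponential growth. It assumes a made-up task that needs 2**n basic operations for a site of "size" n, and made-up machine speeds. It says nothing about what MT actually does, which remains unproven.

```python
def max_size_within_budget(ops_per_second, budget_seconds):
    """Largest n for which 2**n operations fit inside the time budget.

    The 2**n cost model is purely illustrative (assumption).
    """
    budget_ops = ops_per_second * budget_seconds
    n = 0
    while 2 ** (n + 1) <= budget_ops:
        n += 1
    return n


baseline = max_size_within_budget(10**9, 30)   # hypothetical machine
faster = max_size_within_budget(10**11, 30)    # 100 times the hardware
print(baseline, faster)
```

Under these assumptions the baseline machine handles size 34 inside 30 seconds, and a machine 100 times faster only gets to size 41. A hundredfold hardware upgrade buys seven more steps along the curve.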

Not every MT site is affected

I haven't looked through the MT code to prove that the algorithm they use has this problem; that's just my theory. Some sites, even older ones, haven't tipped, and may not ever tip. I don't have enough data to nail down the cause. All I know is that SOMETIMES MT goes nuts, and then even moderate comment traffic can take down a server. I usually call these comment spam attacks, but of course it can happen even with normal commenting by real users.

Six Apart is not very cooperative

Someone made the suggestion of working with Six Apart to solve this problem. This is troublesome. If you look at their license, they are somewhat hostile to any kind of outside commercial support. For example, a host is not permitted to preinstall MT for users (only Six Apart can do that) and of course they are actively competing against hosting companies with Typepad. They're aware that MT has this problem. I really can't offer any explanation as to why it still happens in even the latest versions. Either they want to fix it but don't know how, or they don't want to fix it. And since MT is proprietary software, nobody else can (legally) fix it either.

But I really don't hate MT

MT causes me more problems than just about anything else, but I don't really hate it or want to ban it. For the most part, the problem is manageable. I much prefer it when people use other tools, but I understand that people like MT and want to use it, and I'm happy to host MT sites as long as I can keep the server running.

MT speedups and workarounds

There are some ways to improve MT performance. They have a paper up here:

http://www.sixapart.com/pronet/articles/how_to_speed_up.html

I'm not convinced these solutions address the core problem (the algorithm) but they can help with the symptoms.

Enable background tasks

Some people have told me that enabling background tasks (see the "how to speed up" link above) prevents their sites from running up against the 30 second timeout. I suspect that while the pages appear to run faster, the background tasks may still pile up. However, I haven't had a chance to really look into this yet. In any case, it does seem to improve user experience and should prevent duplicate comment attempts due to timeouts.

FastCGI?

Someone asked about FastCGI.

FastCGI keeps the MT process in memory deliberately, so as to avoid launching a new copy of perl with each page request. If the core problem is (as I am suggesting) an exponential-time algorithm, then shaving a few milliseconds off the startup for perl wouldn't help much. It can only offer a small, linear improvement, and again the servers are very fast to begin with.
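The arithmetic is simple enough to write down. This Python sketch assumes a half second startup cost for perl, which is a deliberately generous made-up figure, and uses the 40 second comment time mentioned earlier.

```python
TIMEOUT = 30  # seconds, the 30 second rule


def request_time(startup_seconds, work_seconds, persistent):
    """Wall time for one request. A persistent (FastCGI-style) process
    skips the interpreter startup but still does all of the work."""
    if persistent:
        return work_seconds
    return startup_seconds + work_seconds


plain_cgi = request_time(0.5, 40, persistent=False)
fast_cgi = request_time(0.5, 40, persistent=True)
print(plain_cgi, fast_cgi, fast_cgi > TIMEOUT)
```

Even with the startup cost removed entirely, the slow request still takes 40 seconds and still runs past the limit.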

However, I am considering offering FastCGI for other reasons: people who use certain web frameworks keep asking for it, so I may offer this at some point in the future, as an add-on service or new account type.

The 30 Second Rule

Let me explain the 30 second rule.

The way it works is there's a script called vengeance that sits in memory, continually monitoring all the processes running on the machine. It knows the difference between a program you're running through the shell (pine or emacs or vim) and one that's being run from cron or apache.

If a script has been running for more than about 15 seconds, vengeance will renice it, which means giving it a lower priority in the operating system. Then at the 30 second point, the process is simply killed.

For a long time, I turned off the "kill" part. Why? Because so many people complained that MT was getting killed. So for a time, all vengeance did was renice processes. Unfortunately, this just gives you a lot of low-priority scripts that still hog the server. I now whitelist mt.cgi itself, but the mt-comments and mt-trackback scripts still get killed.
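The decision logic described above can be sketched in a few lines of Python. This is not the actual vengeance code, just an illustration of the policy. The function name, the 'shell' / 'cron' / 'apache' labels and the exact thresholds are assumptions, and so is the choice to keep renicing a whitelisted script instead of leaving it alone.

```python
RENICE_AFTER = 15        # seconds (the post says "about 15")
KILL_AFTER = 30          # seconds, the 30 second rule
WHITELIST = {"mt.cgi"}   # scripts exempt from the kill step


def decide(script_name, age_seconds, source):
    """Return 'leave', 'renice' or 'kill' for a single process.

    source is 'shell', 'cron' or 'apache' (hypothetical labels).
    """
    if source == "shell":
        # Interactive programs such as pine, emacs or vim are never touched.
        return "leave"
    if age_seconds >= KILL_AFTER:
        # Assumption: whitelisted scripts are only lowered in priority.
        return "renice" if script_name in WHITELIST else "kill"
    if age_seconds >= RENICE_AFTER:
        return "renice"
    return "leave"


print(decide("mt-trackback.cgi", 31, "apache"))
print(decide("mt.cgi", 45, "apache"))
print(decide("vim", 3600, "shell"))
```

A real monitor would also need to read the process table and call the operating system to renice or kill. That part is left out here.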

The thirty-second rule isn't a cure-all (plenty of other things can go wrong) but the servers run a whole lot better, for longer, when it's in place.

Is the rule too strict?

I chose the number 30 because that's what pair.com used to do back when I was a customer there, and pair's servers run pretty well. But the number is just an arbitrary cutoff point.

In general, it just shouldn't take 30 seconds for a high speed modern computer to put a comment on a blog.

Of course there are other scripts that time out, and some of these are harmless. Vengeance doesn't consider the actual resources used by the script.

Nor does it consider the environment. The degree to which a script is tolerated should vary depending on the server load, and other factors. If the server is really busy, it makes sense to shorten the timeout, and if the server is calm, then things can be allowed to run longer.

There's a field of technology called fuzzy logic that's very good at problems like this, and I'm planning to incorporate a simple fuzzy inference engine into a new replacement for vengeance. This is actually a lot easier than it sounds, and it should make the system much more lenient while still keeping the servers stable.
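To give a feel for how small such an engine can be, here is a Python sketch of a load-dependent timeout. It is a guess at the general shape, not the planned replacement for vengeance. The load breakpoints (0, 2, 4) and the timeouts (60, 30, 15 seconds) are invented for illustration, and the output is a plain weighted average of the rule outputs, in the style of a zero-order Sugeno system.

```python
def triangle(x, left, peak, right):
    """Triangular membership: 0 at or beyond left/right, 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def timeout_for_load(load):
    """Map the server load average to a timeout in seconds."""
    # How strongly the load counts as calm, normal or busy (0.0 to 1.0).
    calm = max(0.0, min(1.0, (2.0 - load) / 2.0))
    normal = triangle(load, 0.0, 2.0, 4.0)
    busy = max(0.0, min(1.0, (load - 2.0) / 2.0))

    # Rules: calm -> 60 s, normal -> 30 s, busy -> 15 s (all hypothetical).
    rules = [(calm, 60.0), (normal, 30.0), (busy, 15.0)]
    total_weight = sum(weight for weight, _ in rules)
    return sum(weight * timeout for weight, timeout in rules) / total_weight


for load in (0.0, 1.0, 2.0, 3.0, 10.0):
    print(load, timeout_for_load(load))
```

With these made-up rules an idle server allows 60 seconds, a load of 1 allows 45, a load of 2 gives the familiar 30, and a heavily loaded server tightens to 15. The timeout slides smoothly between those points instead of jumping at a hard cutoff.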

blacklisting spammers

Another approach that strikes at the symptom, but which could benefit a whole lot of people, is a blacklist for comment spammers. I've been hesitant to do this in the past because it's way too easy for the blacklist to be wrong. However, a good blacklist would certainly ward off comment spam, which would ease the burden of the slow MT sites. I haven't made up my mind about this yet, and am very open to feedback.

better analysis of incidents

One problem with vengeance is that it doesn't notify people that their scripts have been killed. It's actually pretty easy to identify a script's owner, even when it's running through apache without SuExec. I consider it a bug that nobody is notified when a script is killed. The data just goes into a logfile, which I look at occasionally, but the information could definitely be put to better use.

The only real obstacle is that I don't want to send someone 500 emails if one of their sites goes nuts. It makes more sense to filter this information automatically and tie it into some kind of ticket system. It's simply a matter of having the time and resources to make it happen.
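The filtering step could be as simple as grouping the kill log by owner before anything is sent. This is a Python sketch under assumptions: the (owner, script, timestamp) event format and the function name are hypothetical, since the real logfile layout isn't described here.

```python
from collections import defaultdict


def summarize_kills(events):
    """Collapse (owner, script, timestamp) kill events into one
    summary per owner, suitable for a single ticket or email."""
    grouped = defaultdict(lambda: defaultdict(int))
    for owner, script, _timestamp in events:
        grouped[owner][script] += 1
    return {
        owner: {"total": sum(scripts.values()), "by_script": dict(scripts)}
        for owner, scripts in grouped.items()
    }


# Hypothetical incident: one site goes nuts, another has a single kill.
events = [("alice", "mt-trackback.cgi", t) for t in range(500)]
events.append(("bob", "report.cgi", 1))

for owner, summary in summarize_kills(events).items():
    print(owner, summary["total"], summary["by_script"])
```

In this example 501 log entries turn into two notifications, one per account, each listing how many times each script was killed.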

and finally: rantelope

Even though I'm happy with diversity and will continue to let people host whatever software they want, I also know that I'd have a very strong competitive advantage if there were a tool that I knew inside and out. I'd be able to offer much more in-depth support, even down to the level of optimizing and debugging the code, and doing the initial installation and setup work.

Once upon a time, I decided to write a blogging tool called rantelope for just these reasons, but the project never got very far. It's been on the back burner ever since, and is certainly not ready for prime time.

However, lately I've been thinking more and more about the idea of building rantelope by porting and extending an existing tool (probably WordPress or Drupal). This would give me the chance to understand each part of the system without having to create it from whole cloth.

coming soon...

All of this is a lot of work, so it's going to take a while to make it happen. I'm still working out what's coming first, and how all this balances with the many other projects on my plate. In the meantime, at least for a while, the 30 second rule is staying put.
