IPv6 bites again

April 10th, 2015 by

Every now and again, one of our users will either get their SMTP credentials stolen, or will get a machine on our network compromised. More often than not, the miscreants responsible will then proceed to send a whole bunch of adverts for V1@gr@ or whatever through our mail servers. This typically results in our mail servers getting (not unreasonably) added to various blacklists, which affects all our users, creates work for us and generally makes for sad times.

We’ve got various measures to counter this, one of which relies on the fact that spam lists are typically very dirty and will generate a lot of rejections. We can use this fact to freeze outgoing mail for a particular user or IP address if it is generating an unreasonable number of delivery failures. The approach we use is based on the generally excellent Block Cracking config.

Unfortunately, both we and the author of the above overlooked what happens when you start adding IPv6 addresses to a file which uses “:” as its key/value separator, such as that used by Exim’s lsearch lookup. Yesterday evening, a customer’s compromised machine started a spam run to us over IPv6.

Our system raises a ticket in our support queue every time it adds a new IP to our block list so that we can get in touch with the customer quickly. Unfortunately, if the lookup doesn’t work because you haven’t correctly escaped an IPv6 address, it’ll happily keep adding the same IP for each spam email seen, and raising a new ticket each time. Cue one very busy support queue.
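To illustrate (a sketch using a documentation-prefix address rather than a real customer IP): Exim’s lsearch ends a key at the first unquoted colon, so an unquoted IPv6 key is silently truncated and never matches.

# Broken: lsearch ends the key at the first colon, so the key becomes "2001"
2001:db8::1: frozen

# Fixed: a key containing colons must be enclosed in double quotes
"2001:db8::1": frozen

Exim also provides the iplsearch lookup type for exactly this situation; it requires IPv6 keys to be quoted so the colons aren’t taken as key terminators.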

Needless to say, the fix was simple enough, but the moral, if there is one, is: a) test everything that you do with both IPv4 and IPv6, and b) start preparing for IPv6 now, as it’s going to take you ages to find everything that it breaks.

Code that makes assumptions about what an IP address looks like, and that will be broken by IPv6, is almost certainly more prevalent than two-digit-year assumptions were 15 years ago.

Helping RachelPi

March 4th, 2015 by

Some time ago we were forwarded a plea, on behalf of World Possible, by Liz Upton, who’s sort of famous on the internet for some sort of cheap computer. It said:

This brings us to good news / bad news.  Last month we pushed through 5TB of
FTP traffic, and over 20TB of FTP traffic on the year.  That's great, about
700 RACHEL downloads - but our web host isn't as excited with our success
and cut us off yesterday.

Liz thought this was the sort of thing we might be able to help with. So we got in contact and set them up with one of our older, inexpensive servers to act as a new host. As it’s an educational project that we’d like to support, we thought we’d donate some bandwidth to help out, and since this nicely coincided with a substantial bandwidth upgrade in our Cambridge data centre, we put the service there.

So far they seem pleased! Which is handy, because some of their other suppliers, who pay Amazon rates for bandwidth, were a little bit annoyed with them.


WP Super Cache vs Raspberry Pi 2

March 3rd, 2015 by

On Monday, the Raspberry Pi 2 was announced, and The Register’s prediction of global geekgasm proved to be about right. Slashdot, BBC News, global trending on Twitter and many other sources covering the story resulted in quite a lot of traffic: Monday’s logs showed 11 million page requests from over 700,000 unique IP addresses, around six times the normal load.

The Raspberry Pi website is hosted on WordPress using the WP Super Cache plugin. This plugin generally works very well, resulting in the vast majority of page requests being served from a static file rather than hitting PHP and MySQL. The second major part of the site is the forums, and the different parts of the site have wildly differing performance characteristics. In addition, the site is fronted by four load balancers, which supply most of the downloads directly and scrub some malicious requests. We can cope with roughly:

Cached WordPress page: 160 pages/second
Non-cached WordPress page: 10 pages/second
Forum page: 10 pages/second
Maintenance page: at least 10,000 pages/second

Back in 2012, during the original launch, we had a rather smaller server setup, so we simply put up a maintenance page and directed everyone to buy a Pi direct from Farnell or RS, both of whom had some trouble coping with the demand. We also launched at 6am GMT so that most of our potential customers would still be in bed, spreading the initial surge over several hours.

This time, being a larger organisation with coordination across multiple news outlets and press conferences, the launch time was fixed for 9am on Feb 2nd 2015. Everything would happen then, apart from the odd journalist with premature timing problems – you know who you are.

Our initial plan was to leave the site up as normal, but set the maintenance page to be the launch announcement. That way, if the launch overwhelmed things, everyone would see the announcement served directly from the load balancers; otherwise the site would function as normal. Plan B was to disable the forums, giving more resources to the main blog so people could comment there.

The Launch

[Photo: turtlebeach]

It is a complete coincidence that our director Pete took off to go to this isolated beach in the tropics five minutes after the Raspberry Pi 2 launch.

At 9:00 the announcement went live. Within a few minutes, traffic volumes on the site had increased by more than a factor of five, and the forum users were starting to make comments and chatter to each other. The server load increased from its usual level of 2 to over 400: we now had a massive queue of users waiting for page requests, because all of the server CPU time was being taken up generating those slow forum pages, which starved the main blog of the server time needed to deliver its fast cached pages. At this point our load balancers started to kick in and deliver the maintenance page to a large fraction of our site users – the fallback plan. This did annoy the forum and blog users who had posted comments and received the maintenance page back, having just had their submission thrown away – sorry.

During the day we did a little bit of tweaking to the server to improve throughput: removing the nf_conntrack module from the firewall to free up CPU for page rendering, and changing the Apache settings to queue earlier, so that people received either their requested page or the maintenance page more quickly.
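For the curious, taking connection tracking out of the picture is less drastic than it sounds. A sketch of one way to do it (not necessarily exactly what we did): exempt web traffic in the raw table, after which the conntrack module has nothing to track and can be unloaded.

# Skip connection tracking for web traffic in both directions
# (newer iptables spells this '-j CT --notrack')
iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK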

Disabling the forums freed up lots of CPU time for the main page and gave us a mostly working site. Sometimes it’d deliver the maintenance page, but mostly people were receiving cached WordPress pages of the announcement and most of the comments were being accepted.

Super Cache not quite so super

Unfortunately, we were still seeing problems. The site would cope with the load happily for a good few minutes, and then suddenly have a load spike to the point where pages were not being generated fast enough. It appears that WP Super Cache wasn’t behaving exactly as intended.

When someone posts a comment, Super Cache invalidates its cached copy of the corresponding page and starts to build a new one, but provided you have this option ticked…

[Screenshot: supercache-anonymouse]

…(we did), the now out-of-date cached page should continue to be served until it is overwritten by the newer version.

After a while, we realised that the symptoms we were seeing were entirely consistent with this not working correctly, and at very high traffic levels this behaviour becomes critical. If cached versions are not served whilst the page is being rebuilt, then subsequent requests also trigger rebuilds, so you spend more and more CPU time generating copies of the missing cached page, which makes the rebuild take even longer, so you have to build more copies, each of which now takes even longer.

Now we can build a ludicrously overly simple model of this with a short bit of perl and draw a graph of how long it takes to rebuild the main page based on hit rate – and it looks like this.

[Graph: Supercache performance]

This tells us that performance falls off a cliff fairly suddenly at around 60–70 hits/second. At 12 hits/sec (typical usage) a rebuild of the page completes in considerably under a second; at 40 hits/sec (very busy) it takes about 4s; at 60 hits/sec, 30s; and at 80 hits/sec, well over five minutes. At that point the load balancers kick in and just display the maintenance page, waiting for the load to die down before serving traffic as normal again.
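We haven’t published the original perl, but a toy model along the same lines is easy to sketch in Python (the 0.125 CPU-seconds per page render is an assumption, chosen to match the numbers above): share the CPU equally among all in-flight rebuilds, let every hit that misses the cache start another rebuild, and ask how long the first rebuild takes to finish.

WORK = 0.125  # assumed CPU-seconds to render one page (~8 pages/second)

def rebuild_time(hits_per_sec, dt=0.001):
    """Seconds until the first rebuild finishes while new rebuilds pile up."""
    t, done, jobs = 0.0, 0.0, 1.0
    while done < WORK:
        done += dt / jobs          # the first job gets a 1/jobs share of the CPU
        jobs += hits_per_sec * dt  # every cache miss spawns another rebuild
        t += dt
    return t

for h in (12, 40, 60, 80):
    print(f"{h:3d} hits/sec -> first rebuild finishes in {rebuild_time(h):7.1f}s")

This toy model gives roughly 0.3s at 12 hits/sec, 4s at 40, 30s at 60 and around four and a half minutes at 80 – the same cliff shape as the graph.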

We still don’t know exactly what the cause was: either it’s something else with exactly the same symptoms, or this setting wasn’t working, or it was interacting badly with another plugin. But as soon as we’d figured out the issue, we implemented the sensible workaround: we put in a rewrite hack to serve the front page and announcement page completely statically, then recreated the page afresh every five minutes from cron, picking up all the newest comments. As if by magic the load returned to sensible levels, although there was now a small delay on new comments appearing.
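We didn’t publish the exact hack, but its shape was roughly this (a sketch: the paths, filename and backend hostname here are all made up):

# Apache: short-circuit WordPress and serve the static snapshot of the front page
RewriteEngine On
RewriteRule ^/?$ /static/front-page.html [L]

# Cron: regenerate the snapshot every five minutes, picking up new comments,
# then move it into place atomically
*/5 * * * * wget -q -O /tmp/front-page.html http://backend.example.com/ && mv /tmp/front-page.html /var/www/static/front-page.html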

Re-enabling the forums

With stable traffic levels, we turned the forums back on. And then immediately off again: they very quickly backed up the database server with connections, causing the forums to cease working and the main website to run slowly. A little further investigation into the InnoDB parameters revealed some contention on database locks; we reconfigured, and the server promptly fell over.

Our company pedant points out that actually only the database server process fell over, and it needed restarting, not rebooting. Cunningly, we’d managed to find a set of improved settings for InnoDB that allowed us to see all the tables in the database but not read any data out of them. A tiny bit of fiddling later and everything was happy.

The bandwidth graphs

We end up with a traffic graph that looks like this.

[Graph: raspi-launch-bwgraph]

On the launch day it’s a bit lumpy; this is because when we’re serving the maintenance page nobody can get to the downloads page, and downloads of operating system images and NOOBS normally dominate the traffic graphs. Over the next few days the HTML volume starts dropping and the number of system downloads for newly purchased Raspberry Pis starts increasing rapidly. At this point we were reminded of the work we did last year to build a fast distributed downloads setup, and were rather thankful, because we’re considerably beyond the traffic levels you can sanely serve from a single host.

Could do a bit better

The launch of Raspberry Pi 2 was a closely guarded secret, and although we were told in advance, we didn’t have a lot of time to prepare for the increased traffic. There are a few things we’d like to have done better, and we’ll be talking them over with Raspberry Pi during the coming months. One is to upgrade the hardware, adding some more cores and RAM to the setup. Whilst we’re doing this it would be sensible to look at splitting the parts of the site into different VMs, so that the forums, database and WordPress have some separation from each other, making it easier to scale things. It would have been really nice to have put our extremely secret test setup with HipHop Virtual Machine into production, but that’s not yet well enough tested for primetime, although a seven-fold performance increase on page rendering certainly would be nice.

Schoolboy error

Talking with Ben Nuttall, we realised that the stripped-down, minimal, super-fast maintenance page didn’t have analytics on it. So the difference between our stats of 11 million page requests and Ben’s of 1.5 million indicates how many people during the launch saw the static maintenance page rather than a WordPress-generated page with comments. In hindsight, putting analytics on the maintenance page would have been a really good idea. Not every HTTP request which received the maintenance page was necessarily a request to see the launch, nor was each definitely a different visitor, so without detailed analytics that we don’t have, we can estimate the number of people who saw the announcement at more than 1.5 million but fewer than 11 million.

Flaming, Bleeding Servers

Liz occasionally has slightly odd ideas about exactly how web servers work:

[Screenshot: is-this-thing-on]

Now, much to her disappointment we don’t have any photographs of servers weeping blood or catching fire. [Liz interjects: it’s called METAPHOR, Pete.] But when we retire servers we like to give them a bit of a special send-off.

Happy Birthday to Raspberry Pi

March 2nd, 2015 by

Mythic Beasts has been supporting Raspberry Pi since we saw a small Atmel-based prototype that Eben Upton was tremendously proud of and that we thought nobody would ever want. However, we’ve always been wary of betting against Eben, and the fact that we’re now providing enough bandwidth to download copies of NOOBS considerably faster than we can make cups of coffee suggests that, much though it pains us to admit it, we might have been wrong and Eben might have been right.

On Saturday Pete went to join Raspberry Pi at their 3rd birthday party. It was a lot of fun. He drank beer brewed by a brewery controlled by Raspberry Pis, and saw the magical RFID announcing machine declare Liz Upton ‘The Tyrannical Goddess of Time and Space’, which clearly had been set to maximum flattery mode. There was also a neat synthesiser with keyboards and a drum machine hooked up, doing all the instrument synthesis on an original single-core Model B, which resulted in this sort of Raspberry Jam:


ModMyPi also had a stock of quad-core Pis, meaning Pete was able to buy one in person for real money and skip the ordering delay on the ones he’d ordered online.


But mostly it was just great to see how far we’d come. At the original Raspberry Jam, soon after launch in 2012, we met a lot of people who were excited and fired up with plans to do awesome things. Now lots and lots of awesome things have been done.


But I think it was Helen Lynn who summed it up best. She quietly said to me, while surveying the amazing stuff in the room, ‘It really is loads better than when I was six’. Eben Upton’s attempt to recreate the computers of his childhood in the 1980s has completely and utterly failed: it’s much cooler this time round.

Of Raspberries and Reptiles

February 17th, 2015 by

[Photo: Steven Allain]

On Sunday night Pete was in the Hopbine, and while he was buying some drinks the bartender asked him about his Raspberry Pi t-shirt and whether he knew anything about it. One of the hazards of drinking in Cambridge is that the bar staff are often considerably more knowledgeable than you might at first expect.

Steven not only sells beer but is also a student at ARU studying zoology, and has been using a Raspberry Pi and camera to look into monitoring and photographing things under water with motion detection. He commented that he’d just bought a Raspberry Pi Model B+, and only a couple of weeks later the much faster Pi 2 Model B had come out; he wished he’d bought one of those instead, but as an impoverished student he couldn’t really justify replacing it.

Now we think taking photographs of fish and reptiles is pretty cool, so Pete took pity on him and gave him his model 2 Raspberry Pi in exchange for a future promise of some photographs of underwater things taken with his setup.

Ultimately this gets back to the real reason Mythic Beasts support Raspberry Pi. Not because it makes it cheap to run a formal curriculum for teaching in schools, but because it’s a catalyst for people to teach themselves. Steven may or may not have success in making a motion detecting under water camera but either way he’ll learn a lot in the process.

The mistake in all this? Not checking the Raspberry Pi stock levels: Pete has since realised it’s going to take a few weeks before the replacement model 2 arrives – he’s back on his old, much slower Model B+ now and grumbling about it.



We’ll settle for pictures of Sea Bass with frickin’ Laser Beams



Bandwidth Upgrades for Cambridge servers

February 16th, 2015 by

Taking a break from our usual articles about upgrades for VPS customers and mocking the hopelessly incompetent, we’d like to announce an upgrade for dedicated and colo customers in our Cambridge data centre. We’ve finally completed the upgrade of both of our links into Cambridge, so we have increased bandwidth quotas and reduced excess rates to just 7p/GB.

Details of the new specs can be found on our Dedicated Server, Colocation and Mac Mini Colo pages.

Virtual Server performance boost

February 6th, 2015 by

We’ve just added an option to allow Virtual Servers to get full access to the CPU extensions available on the host server.

By default, virtual servers see a subset of CPU features that is available consistently across all of our hosts. For most users this has no impact on performance, but for some applications, such as performing certain types of encryption, speed can be substantially improved if certain processor extensions are available.

We’ve noticed significant improvements in OpenVPN throughput and latency after turning on this option on some of our servers.
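As a quick sanity check from inside a guest (a sketch – AES-NI is just the obvious example, and your workload may care about different extensions):

# Does the guest CPU advertise AES-NI?
grep -m1 -o aes /proc/cpuinfo

# Benchmark AES throughput; expect a large jump if AES-NI is exposed
openssl speed -evp aes-256-cbc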

CPU mode on our virtual servers can be configured using the “cpu” command in the admin shell.

Bring Your Own ISO

January 30th, 2015 by

Our Virtual Servers come with a virtual CD drive, allowing you to load an ISO image from our library and install an operating system of your choice, configured exactly how you want it.

We’ve just launched our “Bring Your Own ISO” feature, allowing you to upload your own ISO images, giving you complete freedom to install your choice of operating system, or to run a “live CD” distribution.

All users have a free 5GB allocation on our storage cluster for images, and files can be fetched from anywhere on the internet via HTTP, HTTPS, git, FTP or rsync.

Customers can upload a boot image via the “Boot Media” option on our customer control panel.

glibc 0-day exploit (GHOST), how we’re handling it

January 28th, 2015 by


I would like to introduce our all-new female GHOSTbusting team, to tenuously tie in with a new Hollywood movie and gratuitously include a cool staff photo in this blog post, and for marketing reasons I’m going to ignore the reality that Toby did all the updates for GHOST.

During a code audit, Qualys found a buffer overflow in gethostbyname() in glibc, which they’ve named GHOST. This means that any internet-facing software that can be persuaded to do a DNS lookup is potentially vulnerable. To a first approximation, that’s everything that’s listening on an internet socket.

The details are in CVE-2015-0235. Note that this explains quite comprehensively how to exploit the vulnerability, so we expect active exploitation to have started already.

The vulnerability was announced at 16:30 on Tuesday; at 16:40 the first ticket was opened automatically in our queue. We started reviewing the information shortly thereafter and deployed the updated packages to our shared hosting servers on Tuesday evening. This gives a short window to discover any critical issues with the new packages before we start deploying updates to our managed hosting customers.

At 8:30am on Wednesday, we emailed every managed customer running vulnerable code (which is almost, but not quite, all of them), explaining the issue and indicating that we’d be applying the patches immediately unless instructed otherwise. After giving customers a short window to reply before going ahead (some deploy updates automatically via Puppet and don’t want us to update for them), we applied the updates to the customer servers, which involved very brief interruptions to listening services as they restarted.

Subsequent spot-auditing of some customer machines indicates that the glibc update via the package manager may not have restarted every vulnerable process, so we’re now writing some audit tools to check for missing service restarts. Tomorrow morning at 6am, our reporting package will pull in lots of data about the status of all our managed customer machines, including the complete process list and the complete list of listening services. On our reporting box we can then do a complete audit for every listening process that hasn’t been restarted in the last 24 hours, and investigate and fix where necessary.
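The core of such a check is simple enough (a sketch, not our actual tooling): after a glibc upgrade, any process that hasn’t restarted is still mapping the old, now-deleted copy of the library, which shows up in /proc:

# List PIDs still mapping a deleted copy of glibc, i.e. not yet restarted
grep -l 'libc-.*(deleted)' /proc/[0-9]*/maps 2>/dev/null | cut -d/ -f3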

If you aren’t a managed hosting customer of Mythic Beasts, we implore you to update your systems as soon as possible; we strongly expect that someone is going to build a very big denial-of-service botnet very quickly from this vulnerability. If you have no idea how to update and audit your server, please get in contact with us at support @ mythic-beasts.com, even if you’re not hosted with Mythic Beasts.

A very personal opinion

January 22nd, 2015 by

[Photo: BadSecurityDevice]

Today we’re at the UK Network Operators Forum and we’ve just had a talk from Kevin Williams, Partnership Engagement and National Cyber Crime Capabilities Manager at the National Crime Agency.

He was asked,

‘Do you believe that banning secure encryption will make the UK more secure?’

His answer was,

‘My personal opinion is no, and you can quote me on that’.

Which shows that at least one person in our government has some clue even if David Cameron doesn’t.