I do not accept your silly software license

September 9th, 2013 by

So our newest Mythic Beast started working for us today. The first tasks are to set up the new laptop and to read and sign the employment contract. Today our newest employee fell at the first hurdle.

The laptop in question is a shiny Toshiba z930. This one came with Windows 8 and a fully charged battery. On first powering it on it comes up with the Windows 8 licence. This has a tickbox option for ‘I accept the license’ and a big button labelled ‘Accept’ to click on.

If you don’t tick the box, it tells you you have to. There’s no option to reject the license.

If you press the power button the laptop suspends itself. If you press and hold the power button the laptop still suspends itself. Ctrl-Alt-Delete doesn’t work. You can’t remove the battery as it’s built in. In frustration our newest employee suggested pouring his coffee over the damn thing to make it power cycle. This was a really stupid idea: not only does the laptop have a spill-proof keyboard, he’d also then have no coffee.

The best plan we could come up with was to wait for the battery to run out, which requires pressing a key about every five minutes to stop the thing suspending itself.

New DNS resolvers

August 28th, 2013 by

We’ve upgraded our DNS resolvers in our SOV and HEX data centres. New features include DNSSEC validation and IPv6.

The addresses are,

SOV : 2a00:1098:0:80:1000::12 / 93.93.128.2
HEX : 2a00:1098:0:82:1000::10 / 93.93.130.2

They’re now DNSSEC-aware, validating resolvers. That means that if a site has correctly configured DNSSEC and we receive an answer that fails the security check, we will return no answer rather than an incorrect/forged one.

To demonstrate the difference,

A non-DNSSEC-validating resolver:
# dig +short sigfail.verteiltesysteme.net
134.91.78.139

A Mythic Beasts server using our resolvers:
# dig +short sigfail.verteiltesysteme.net
<no answer>
#

and on the DNS server it logs an error,

debug.log:28-Aug-2013 15:44:57.565 dnssec: info: validating @0x7fba880b69e0: sigfail.verteiltesysteme.net A: no valid signature found

and correctly drops the reply.
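You can also check that validation is happening by querying a correctly signed name and looking for the ‘ad’ (authenticated data) flag in the response. A quick sketch, assuming the sigok test record (the working counterpart to sigfail) is still published; output trimmed to the relevant flags line:

# dig +dnssec sigok.verteiltesysteme.net @93.93.128.2 | grep flags
;; flags: qr rd ra ad; ...

If the ad flag is missing, the resolver you’re using isn’t validating.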

Google’s DNS servers on 8.8.8.8 behave the same way as ours, so we’re fairly confident that there will be no compatibility issues.

Downstream ASN

August 12th, 2013 by

We have set up one of our customers with their own full BGP network, split across two of our London sites. With advice from us we have:

  • Helped them join RIPE as an LIR
  • Helped them apply for an IPv6 /32 and an ASN
  • Set up a full BGP IPv6 only network
  • Helped them apply for a final /22 of IPv4 space
  • Configured this in the global routing table

They now have the option to run cable or fibre directly to peering exchanges and other ISPs, should they wish to, from individual machines hosted within our rack space. In the meantime they’re taking advantage of our colocation, out-of-band access to their routers via serial, and our IPv4 and IPv6 transit.

Joining the London Internet Exchange

August 7th, 2013 by

We’ve now joined the London Internet Exchange and are present on both of their peering LANs for redundancy. We’re connected to the Juniper LAN in Sovereign House and the Extreme LAN in Harbour Exchange. That brings us to three peering exchanges – Edge-IX, LoNAP and LINX (Juniper) – in Sovereign House, and two – LINX (Extreme) and LoNAP – in Harbour Exchange.

You can see the current traffic over the LINX public exchanges here, which is best described as rather a lot. We’re in the process of setting up more direct peers in addition to the route servers, which provided immediate peering with hundreds of ISPs and tens of thousands of routes. Many UK destinations are now a few hops shorter – which probably won’t be very noticeable – but we have improved redundancy and increased capacity.

Dark Fibre

August 5th, 2013 by

Over the last twelve months we’ve made a series of networking changes and completely failed to blog about them. Our first announcement is that we now have a dark fibre ring around our core London sites.

This isn’t actually true. We now have a lit fibre ring around our core London sites. It’s currently running at 10Gbps and connects all of our routers together. All our routers connect to the local networks at 10Gbps, so our entire network core is now 10Gbps. We also have some customers with direct connections who are using our fibre as a layer 2 interlink between Telecity Sovereign House, Telecity Meridian Gate and Telecity Harbour Exchange 6/7. Our standard is to offer a pair of ports in each site on redundant switches (so 6 x 1Gbps ports) with unlimited traffic between them.

As a result of our upgrade we’re able to continue to offer free traffic between all London hosted servers irrespective of the building the machines are in or which customer owns them – we bill only for traffic that leaves our network. Upgrading to progressively higher bandwidths is now straightforward, as we can add CWDM/DWDM as required to increment in multiples of 10Gbps, or move to 40Gbps or multiples of 40Gbps.

For those of you who are interested, the fibre lengths are (see the note after the latency figures for how nanoseconds translate into feet)

  • MER <-> SOV : 1672ns (or 1122ft)
  • SOV <-> HEX : 6423ns (or 4310ft)
  • HEX <-> MER : 5456ns (or 3687ft)

and the latencies across the network from core router to core router (average over 10 pings) are

  • MER <-> SOV : 0.096ms
  • SOV <-> HEX : 0.082ms
  • HEX <-> MER : 0.076ms

and from customer machine in SOV to customer machine in HEX, passing through at least two routers – 0.5ms.
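In case the units above look odd: those figures are (presumably OTDR-measured) light travel times along the fibre. Light in glass covers roughly 0.2 metres per nanosecond (assuming a typical refractive index of about 1.47), so converting to feet is a single multiplication. A rough check for the MER <-> SOV run:

# awk 'BEGIN { print 1672 * 0.204 * 3.281 " ft" }'
1119.11 ft

which is pleasingly close to the 1122ft quoted above.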

Power efficiency

July 31st, 2013 by

We’ve just done a performance comparison of one of our little dedicated servers versus a dual core VM hosted on VMware through another provider.

The VMware machine has two cores of an Intel Xeon E5530 (Nehalem) at 2.4GHz; we have four hyperthreaded cores of an i7-3615QM (Ivy Bridge) at 2.3GHz.

Both machines are running the same operating system install and the same application code, so we ran siege for 30 seconds at a time at different concurrency levels to find out whether our machine was faster, and by how much.
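Each run was roughly along these lines (the URL and concurrency are placeholders; we stepped the concurrency up between runs):

# siege -c 8 -t 30S http://www.example.com/

siege then reports the transaction rate and average response time for that concurrency level.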

The initial comparison was the dual core VMware service (green) versus our VM (red). At very low concurrency (1-2 simultaneous requests) our machine is slightly slower to render each page. Beyond this the existing machine has exactly the predicted load curve, slowing linearly with additional simultaneous users – the new machine appears to slow only very slightly, with minimal performance degradation.

By default we’re running the ondemand CPU frequency governor, which means that when idle the cores are clocked at 1.2GHz. The page render time remains almost constant up to a concurrency of four as the host spreads the load around the four 1.2GHz cores. Beyond this the governor starts to turn up the core speed as the load rises, so at a concurrency of eight we’re still rendering pages in the same average time because each core is now clocked at 2.3GHz, almost twice as fast – we’ve doubled the amount of CPU available. Only then does the performance begin to drop off, and then sub-linearly.

On the existing host the performance is much more predictable – it takes a constant amount of CPU to render each page request and as you double the concurrency the render time doubles.

If you turn off the power-saving on the i7 and set it to performance mode it gives the expected linear performance decrease with increasing load. Interestingly, it’s slightly slower at maximum CPU utilisation. I think (but haven’t confirmed) that this is because it can’t use the turbo boost feature to increase the clock speed as much as in the power-saving configuration: it’s always running at a warmer temperature because it doesn’t cool down between benchmark runs.
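For anyone who wants to repeat the comparison, the governor can be switched through sysfs on a typical Linux install (repeat for each CPU; paths may vary slightly by distribution and kernel):

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor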

We’re going to leave the machine in ondemand mode. Whilst it’s slightly slower in normal use, it uses less electricity, so it’s cheaper to run and less harmful to the polar bears. It also has significantly better performance for short peaks – it has a stockpile of cold that it can borrow against for short periods of time.

I wonder if they should start teaching thermodynamics in computer science courses.

Debugging IPv6 support

March 27th, 2013 by

One of our customers runs a monitoring network, First2Know, and has one of its monitoring nodes hosted with us in one of our London sites. He chose us because we have IPv4 and IPv6 support, and the monitoring network does full IPv4/IPv6 monitoring from every location. He kept seeing connectivity issues between his Raleigh node, hosted by RootBSD, and his London node, hosted by us.

Initial investigation indicated that only some IPv6 hosts on our network were affected; in particular, he could reliably ping only one of two machines with IPv6 addresses in the same netblock hosted on the same switch within our network. We escalated the issue at both ends, and RootBSD helpfully gave me a VM on their network so I could do some end-to-end testing.

Analysing at both ends with tcpdump indicated that packets were only being lost on the return path from RootBSD to Mythic Beasts; on the outbound path they always travelled fine. Testing more specifically showed that the connectivity issue was reproducible based on source/destination address and port numbers.

This connect command never succeeds,

# nc -p 41452 -6 2607:fc50:1:4600::2 22

This one reliably works,

# nc -p 41451 -6 2607:fc50:1:4600::2 22
SSH-2.0-OpenSSH_5.3

What’s probably happening is that somewhere along the line the packets are being shared across multiple links using a layer 3 hash, which means the link is chosen by an implementation along the lines of

md5($source_ip . $source_port . $destination_ip . $destination_port) % (number of links)

This means that each connection always sees its packets travel down the same physical link, minimising the risk of performance loss due to out-of-order packet arrival, but each connection effectively gets put down a different link at random.
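As a toy illustration of why changing only the source port flips the outcome (a sketch in shell with a documentation source address, not what any particular router actually runs): hash the four-tuple, take it modulo the number of links, and note that only the source port differs between the working and non-working connections above.

# linkhash() { printf '%s' "$1" | md5sum | cut -c1-8; }
# echo $(( 0x$(linkhash "2001:db8::1 41451 2607:fc50:1:4600::2 22") % 3 ))
# echo $(( 0x$(linkhash "2001:db8::1 41452 2607:fc50:1:4600::2 22") % 3 ))

The two commands will usually print different link numbers, so one source port lands on a healthy link and the other on a broken one.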

Statistically we think that either 1 in 2 or 1 in 3 links at the affected point were throwing our packets away on this particular route. In general nobody had noticed, because dual stack implementations fall back to IPv4 if the IPv6 connection doesn’t connect. We only found it because this application is IPv6 only; our IPv6 monitoring is single stack IPv6 only.

Conversation with RootBSD confirmed that the issue is almost certainly within one of the Tier 1 providers on the link between our networks; neither of us has any layer 3 hashing options enabled on any equipment on the path taken by the packets.

In this case we also discovered that we had some suboptimal IPv6 routing. Once we’d fixed the faulty announcement our inbound routes changed and became shorter via a different provider, all the problems went away, and we were unable to reproduce the issue again.

However, as a result of this we’ve become a customer of First2Know, and we’re using their worldwide network to monitor our global IPv4 and IPv6 connectivity so we can be alerted to and fix issues like these well before our customers find them.

If this sounds like the sort of problem you’d like to work on, we’re always happy to accept applications at our jobs page.

RaspberryPi crash

February 20th, 2013 by

After seven months and two weeks of uptime our Raspberry Pi mirror server fell over yesterday and required a power cycle to bring it back up. It lasted longer than its first USB hard disk, which failed after about six months. Examining the logs suggests that the flash card is dying: yesterday it remounted read-only and the network stack fell over. /var/log is on the external USB drive, so we were able to see that the machine was alive and could log Ethernet connect/disconnect events; it just couldn’t bring the network back up.

During the time it was up it shipped about 1.5TB of downloads, running an average of 3Mbps of traffic and quite regularly peaking at 10Mbps+.

Some more benchmarks

January 5th, 2013 by

CPU benchmarks

Here are some Geekbench readings (32-bit tryout version) for some of our servers and, for comparison, some Amazon EC2 images.

Server / Geekbench score
Dual Hex Core 2Ghz Sandybridge (debian) (E5-2630L) 18265
Hex Core 2Ghz Sandybridge (debian) (E5-2630L) 11435
Quad Core 2.3Ghz Ivy Bridge (ubuntu) (i7-3615QM) 12105
Quad Core 2.0Ghz Sandy Bridge (debian) (i7-2635QM) 9135
Dual Core 2.3Ghz Sandy Bridge (debian) (i5-2415M) 6856
Dual Core 2.66Ghz Core 2 Duo (debian) (P8800) 3719
Dual Core 1.83Ghz Core 2 Duo (debian) (T5600) 2547
Toshiba z930 laptop (Ivy Bridge i7-3667U) 6873
Amazon EC2 t1.micro instance (ubuntu) (E5430 1 virtual core) 2550
Amazon EC2 c1.xlarge instance (ubuntu) (E5506 8 virtual cores) 7830
Amazon EC2 hi1.4xlarge instance (ubuntu) (E5620 16 virtual cores) 10849
Azure Small (1 core AMD Opteron(tm) Processor 4171 HE @ 2.09 GHz / 1.75GB) 2510
Azure Extra Large (8 core AMD Opteron(tm) Processor 4171 HE 2.09Ghz / 14GB) 7471
Elastic Hosts ‘2000Mhz’ single core VM (Opteron 6128) 2163
ElasticHosts ‘20000Mhz’ eight core VM (Opteron 6128) 6942
Linode 512MB VDS (L5520 4 virtual cores) 4469
Mythic Beasts 1GB VDS (L5630 1 virtual core) 2966
Mythic Beasts 64GB VDS (L5630 4 virtual cores) 4166

The method here is pretty simple: take the default OS install, copy the Geekbench 32-bit tryout edition onto the machine, run it and record the results.

It’s important to remember that Geekbench performs a mixture of tests, some of which don’t parallelise. This means a server with a fast core will receive a higher score than one with lots of slower cores. As a result the Sandy Bridge and Ivy Bridge machines score very highly, because those servers will increase the speed of a single core if the other cores are idle.

Disk benchmarks

We have several disk subsystems available: single disk, dual-disk mirrored software RAID, dual-disk mirrored hardware RAID, 8-disk hardware RAID array, and PCI-E SSD accelerator card.

Read only benchmarks

The benchmark here is carried out with iops, a small Python script that does random reads.
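If you want to reproduce something similar without that script, fio’s random-read mode gives comparable numbers. An illustrative invocation rather than the exact methodology used here (the target device is a placeholder; --readonly keeps it from writing anything):

# fio --name=randread --filename=/dev/sda --readonly --rw=randread --bs=4k --direct=1 --iodepth=1 --runtime=30 --time_based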

4kB reads

IO Subsystem / IOPS / Data rate
Single SATA disk 60.5 242kB/sec
Mirrored SATA disk 149 597kB/sec
Hardware RAID 1 SATA disk 160.2 640kB/sec
Hardware RAID 10 SATA 6-disk 349 1.4MB/sec
Hardware RAID 10 4 disk Intel 520 SSD 21426 83.7MB/sec
Hardware RAID 0 6 disk SAS 15krpm 104 416kB/sec
Intel 910 SSD 28811 112MB/sec
Apple 256GB SATA SSD 21943 85.7MB/sec
Intel 710 300GB SSD RAID1 Hardware BBU 24714 96.5MB/sec
Amazon micro instance (EBS) 557 2.2MB/sec
Amazon c1.xlarge instance (local) 1746 6.8MB/sec
Amazon c1.xlarge instance xvda (local) 325 1.2MB/sec
Amazon m1.xlarge EBS optimised, 2000IOPS EBS 69 277kB/sec
Amazon hi.4xlarge software RAID on 2x1TB SSD 22674 88.6MB/sec
Azure small (sda) 73.3 293kB/sec
Azure small (sdb) 16010 62.5MB/sec
Azure Extra Large (sda) 86.4 345kB/sec
Azure Extra Large (sdb) 10136 39.6MB/sec
Elastic Hosts Disk storage 54.1 216.6kB/sec
Elastic Hosts SSD storage 437 1.7MB/sec
Mythic Beasts 1G VDS 65.3 261KB/sec
Linode 512MB VDS 475 1.9MB/sec

1MB reads

IO Subsystem / IOPS / Data rate
Single SATA disk n/a n/a
Mirrored SATA disk 48.7 48.7MB/sec
Hardware RAID 1 SATA disk 24.9 24.9MB/sec
Hardware RAID 10 SATA disk 23.2 23.2MB/sec
Intel 910 SSD 525 524MB/sec
Apple 256GB SATA SSD 477 477MB/sec
Intel 710 300GB SSD RAID1 Hardware BBU 215 215MB/sec
Hardware RAID 10 4 disk Intel 520 SSD 734 734MB/sec
Hardware RAID 0 6 disk SAS 15krpm 24 24MB/sec
Amazon micro instance (EBS) 71 71MB/sec
Amazon c1.xlarge instance xvdb (local) 335 335MB/sec
Amazon c1.xlarge instance xvda (local) 81.4 114MB/sec
Amazon m1.xlarge EBS optimised, 2000IOPS EBS 24 24MB/sec
Amazon hi.4xlarge software RAID on 2x1TB SSD 888 888MB/sec
Azure small (sda) n/a n/a
Azure small (sdb)
Azure Extra Large(sda) n/a n/a
Azure Extra Large(sdb) 1817 1.8GB/sec
Elastic Hosts Disk storage n/a n/a
Elastic Hosts SSD storage 49.6 49.6MB/sec
Mythic Beasts 1G VDS 44.7 44.7MB/sec
Linode 512MB VDS 28 28MB/sec

It’s worth noting that with 64MB reads the Intel 910 delivers 1.2GB/sec, and the hi1.4xlarge instance 1.1GB/sec (curiously the Amazon machine was quicker with 16MB blocks). At the smaller block sizes the machine appears to be bottlenecked on CPU rather than on the PCI-E accelerator card. The RAID10 array had a stripe size of 256kB, so a 1MB read requires a seek on every disk – hence performance similar to that of a single disk, as the limitation is seek rather than transfer time. There’s a reasonable argument that a more sensible setup is RAID1 pairs with LVM striping on top, giving much larger stripe sizes than the controller natively supports.
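A sketch of that alternative layout (device, volume group and volume names are placeholders): build RAID1 pairs with mdadm, then stripe across them with LVM using a 1MB stripe.

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
# pvcreate /dev/md0 /dev/md1
# vgcreate vg0 /dev/md0 /dev/md1
# lvcreate -i 2 -I 1024 -L 500G -n data vg0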

We’re not sure why the SAS array benchmarks so slowly; it is an old machine (five years old), but it is set up for performance, not reliability.

Write only benchmarks

I went back to rate.c, a synchronous disk benchmarking tool we wrote when investigating and improving UML disk performance back in 2006. I generated a 2GB file, ran random-sized synchronous writes into it, and then read out the performance for 4k and 1M block sizes. The reason for a 2GB file is that our Linode instance is a 32-bit OS and rate.c does all its benchmarking against a single file limited to 2GB.
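For a rough feel for what synchronous writes look like without rate.c, dd can be told to sync every write; a crude approximation of the two block sizes (the output file is a placeholder, and the numbers won’t exactly match rate.c’s):

# dd if=/dev/zero of=testfile bs=4k count=1000 oflag=dsync
# dd if=/dev/zero of=testfile bs=1M count=100 oflag=dsync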

Write performance

IO Subsystem / IOPS at 4k / IOPS at 1M
Software RAID 1 84 31
Linode 512MB VM 39 25
Mythic Beasts 1G VM 116 119
Mythic Beasts 1G VM 331 91
Mythic Beasts 1G VM 425 134
2x2TB RAID1 pair with BBU 746 54
6x2TB RAID10 with BBU 995 99
400GB Intel 910 SSD 2148 379
256GB Apple SATA SSD 453 96
2x300GB Intel 710 SSD RAID1 pair with BBU 3933 194
Hardware RAID 10 with 4xIntel 520 SSD 3113 623
Hardware RAID 0 with 6x15krpm SAS 2924 264
Amazon EC2 micro, EBS 78 23
Amazon EC2 m1.xlarge, EBS 275 24
Amazon EC2 m1.xlarge, EBS provisioned with 600IOPS 577 35
Amazon EC2 m1.xlarge, instance storage 953 45
Amazon EC2 m1.xlarge, EBS optimised, EBS 246 27
Amazon EC2 m1.xlarge, EBS optimised, EBS with 2000IOPS 670 42
Amazon EC2 hi.4xlarge, software RAID on 2x1TB SSD 2935 494
Azure small (sda) 24.5 5.8
Azure small (sdb) 14 11
Azure Extra Large (sda) 34 6
Azure Extra Large (sdb) 6.1 5.1
Elastic Hosts disk storage 12.8 7.7
Elastic Hosts ssd storage 585 50

I think there’s a reasonable argument that this is reading high for small writes on the BBU controllers (including the VMs & Linode VM). It’s entirely possible that the controllers manage to cache the vast majority of writes in RAM and the performance wouldn’t be sustained in the longer term.

Real world test

We presented these results to one of our customers who has a moderately large database (150GB). Nightly they take a database backup, post-process it, then reimport it into another database server in order to do some statistical processing on it. The bottleneck in their process is the database import. We borrowed their database, and this is the timing data for a PostgreSQL restore. The restore file is pulled from the same media the database is written to.
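The timing in each case is simply the wall-clock time of the restore, something along the lines of the following (database and dump file names are placeholders, and the exact pg_restore flags depend on how the dump was taken):

# time pg_restore -d statsdb nightly_backup.dump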

Server / Time for import
Hex core 2.0Ghz Sandy Bridge, 128GB RAM, 2TB SATA hardware RAID 1 with BBU 2h 35m 24s
Hex core 2.0Ghz Sandy Bridge, 128GB RAM, 400GB Intel 910 SSD 1h 45m 8s
Hex core 2.0Ghz Sandy Bridge, 128GB RAM, 2x300GB Intel 710 SSD hardware RAID 1 with BBU 2h 0m 33s
Quad core 2.3Ghz Ivy Bridge, 4GB RAM, 1TB SATA software RAID 1 4h 16m 14s
Quad core 2.3Ghz Ivy Bridge, 16GB RAM, 1TB SATA software RAID 1 3h 38m 3s
Quad core 2.3Ghz Ivy Bridge, 16GB RAM, 256GB SATA SSD 1h 54m 38s
Quad core E3-1260L 2.4Ghz Ivy Bridge, 32GB RAM, 4xIntel 520 SSD hardware RAID 10 with BBU 1h 29m 33s
Hex core E5450 3Ghz 24GB RAM, 6x15krpm SAS hardware RAID 0 with BBU 1h 58m
Amazon EC2 m1.xlarge with 200GB of 600IOPS EBS 5h 55m 36s
Amazon EC2 m1.xlarge with 200GB of 2000IOPS EBS 4h 53m 45s
Amazon EC2 hi.4xlarge with 2x1TB RAID1 SSD 2h 9m 27s
Azure Extra Large sdb (ephemeral storage) 6h 18m 29s
ElasticHosts 4000Mhz / 4GB / 200GB hard disk 5h 57m 39s
ElasticHosts 20000Mhz / 32GB / 200GB SSD 3h 16m 55s
KVM Virtual Machine (8GB / 8 cores) running on 16GB 2.3Ghz Ivy Bridge Server, software RAID1 with unsafe caching 4h 10m 30s

The postgres import is mostly single threaded – usually the servers sit at 100% CPU on one core with the others idle, with only occasional bursts of parallelism. Consequently the CPU is usually bursting to 2.5GHz (Sandy Bridge) or 3.3GHz (Ivy Bridge). The Ivy Bridge RAID1 machine is actually a Mac Mini. In many ways this is an application perfectly suited to ‘the cloud’, because you’d want to spin up a fast VM, import the database, then start querying it. It’s important to note that the estimated lifetime of the Intel 520 RAID 10 array in this test is six months; the performance gain there over the 910 SSD is entirely due to faster single-threaded performance on the CPU.

Bias

Whilst I’ve tried to be impartial, obviously these results are biased. When Mythic Beasts choose hardware for our dedicated and virtual server platforms we deliberately search out the servers that we think offer the best value, so to some extent our servers have been chosen because historically they’ve performed well at the type of benchmarks we test with. There’s also publication bias: if the results had said emphatically that our servers were slow and overpriced, we’d have fixed our offering and then rewritten the article based on the newer, faster servers we now had.

Notes

The real world test covers two scenarios: the delay in getting a test copy of the database for querying, for which temporary storage may be fine, and, in the event of something going hideously wrong, a measure of how long your site is down until it comes back up again, for which persistent storage is terrifically important.

I plan to add an m2.xlarge + EBS instance and a hi1.4xlarge instance. I originally didn’t include the hi1.4xlarge because they don’t have EBS optimised volumes for persistent storage. I might also add some test Mythic Beasts VMs with both safe and unsafe storage (i.e. all writes cached in host RAM, ignoring sync calls), which is a cheap and easy way to achieve instance storage with a huge performance benefit. I excluded the Linode VM from the final test as it’s too small.
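For reference, ‘unsafe’ caching on a KVM host is just a drive option; a minimal sketch of an invocation (disk path and sizes are placeholders, and a real command line would carry many more options):

# kvm -m 8192 -smp 8 -drive file=/srv/vm/guest.img,if=virtio,cache=unsafe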

New interview question

December 5th, 2012 by

One of our customers decided to send us a more unusual Christmas card in the form of a puzzle. It was this,

if you do

cd /dev
mv *. *..

how do you restore the system to a working state?

To add hypothetical pressure to this hypothetical question imagine that the users of the hypothetical server had just done a major press launch.

Our answer was that if your hypothetical panicking system administrator had bought our managed service, we could restore it from backup. As a favour we’d even restore the backup to a new virtual server, allowing us to keep the filesystem on the old machine so we’d have a chance of pulling out any missing updates that had happened since the backup that morning. We could then merge it all back onto the original platform at a slightly less stressful time.

Happily, our backup and restore procedures do get tested, so we were able fairly straightforwardly to restore from the backup server, build the new VDS and bring it up running the site.

One thing our newest employee didn’t know about was the usefulness of netcat, so I was able to make him one of today’s lucky 10,000.[*] If you’re running a Debian install CD you’ll notice it doesn’t have ssh or rsync, but it does have netcat. Consequently you can transfer your data nice and quickly as follows,

backup# iptables -I INPUT 1 -p tcp -s <target IP> -j ACCEPT
backup# cat backed-up-filesystem.tar.gz | nc -l -p xxxx
target# nc <backup IP> xxxx | gunzip | tar xv
backup# iptables -D INPUT 1

As always, if you read this and think “that’s clever, I wonder if I’d have thought of that” you should take a look at our managed services. If you’d definitely done it better[**] then we invite you to show off at our puzzle.

[*] Scaling for the UK birth rate rather than the US one, he’s one of today’s lucky 1,600. You’d have to be pedantic even by the standards of system administrators to complain about that though.

[**] Before someone points out that tar has a z flag: we are aware of that. However, computers have lots of processors, and this way tar / gunzip / nc can all sit on different cores and make it go faster, which is important if you have people waiting for the site to come back up.