LATEST NEWS - FAST WEB HOSTING UK
27th November 2016
We have purchased the FAST WEB HOSTING UK servers and software, and will soon be implementing a lot of new features (for no extra charge). The best part of this acquisition is that the hard drives are all SSDs, which offer blistering speeds compared with mechanical, spinning drives, especially for shopping cart and data-driven sites. Other advantages include FREE SEO tools and marketing analysis. Contact us for more information on all of this at: email@example.com
More in the Christmas Newsletter, coming soon.
10th October 2016
REVISED PRICES AND NEW SERVICES ACROSS ALL SECTORS
Message from the Digital Network's Proprietor - Price increases across all services
I've struggled to maintain my prices for many years now, and it's with great reluctance that I'm forced to raise them: the rising costs of postage, delivery services, servers and wholesale telecoms mean that this has now become necessary. Many of you will know that I've not raised my prices since 2001, when I started. The increases are small in many cases, and our costs remain lower than our rivals'; indeed, they are still amongst the best prices across the sectors I operate in.
My intention is to contact all of you individually, over the next few weeks, to see if there's anything I can do to give long-standing clients a competitive deal. I will also include this news item with all invoices that are due shortly.
The web site will shortly contain information on all the new price increases (for all services), and will also offer current comparisons to other sector rivals.
To conclude on a good note, we have now started to roll out our Fibre Optic products, and I know this has been hotly anticipated by many of you. I intend to install this first on those clients' lines that are furthest from the exchange (long-line clients), as those customers will see the biggest improvement in broadband speed. If you're already a customer with us there will be no charge for activation, but I may need to change the master socket and do the installation work. This will be charged at a flat rate of £40.00 per customer (excluding very long cable runs), plus the cost of the new VDSL/ADSL Master Socket, which is £12.00. All quoted prices exclude VAT at the prevailing rates.
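As a worked example of the charges above, here is a small sketch of the activation cost, assuming the standard UK VAT rate of 20% (the prices quoted exclude VAT, so the rate is an assumption on our part):

```python
# Illustrative cost calculation for fibre activation, assuming a 20% VAT rate.
INSTALL_FLAT = 40.00   # flat-rate installation charge per customer
MASTER_SOCKET = 12.00  # new VDSL/ADSL master socket

VAT_RATE = 0.20        # assumed prevailing UK rate

subtotal = INSTALL_FLAT + MASTER_SOCKET
total = subtotal * (1 + VAT_RATE)
print(f"£{subtotal:.2f} ex VAT -> £{total:.2f} inc VAT")  # £52.00 ex VAT -> £62.40 inc VAT
```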
In the meantime, thank you to those clients who have stood by me and supported me loyally since launch; you will get priority access to the new fibre products (where available), and I hope to have all those completed within the next month or so.
In closing, I will be announcing many new products, in all our service sectors, very soon.
James Gillan esq.
The Digital Network
16th February 2016
THE RECENT MAJOR DATA CENTRE OUTAGE
As you may be aware, we recently suffered the worst single incident in our Web Hosting history (16 years), due to a major power outage at our server co-location centre in Leeds last Wednesday afternoon.
Emergency maintenance work was being carried out on the load transfer module, which feeds power from the external energy supplies to the data centre hall that holds the majority of our servers. We also have another server in Manchester, which was unaffected. The data centre in Leeds has two dual-feed uninterruptible power supplies, both backed by diesel generators in case of any National Grid outages.
Unfortunately, a safety mechanism within the device triggered incorrectly, resulting in a power outage of less than 9 minutes. This caused approximately 15,000 servers at the centre to be hard booted. Short of a fire, this is the worst possible event that a hosting company can face. A full post mortem is currently being carried out, together with an external engineer from the hardware manufacturer, to determine how power was lost on both supplies.
What happens when servers hard reboot?
Web servers and virtual servers typically perform database transactions at a very high rate, meaning that the risk of database or file system corruption is quite high when a hard reboot occurs.
Following the restoration of power, the first priority was to get the primary infrastructure boxes back online, then the managed and unmanaged platforms. The managed platforms are built to be resilient, so although the centre lost a number of servers in the reboot, the majority of the platforms came up cleanly. There were, however, issues with the Premium Hosting load balancers, which needed repairing, so some customer sites were offline for longer than we would have hoped. The centre is adding additional redundant load balancers and modifying the failover procedure over the next 7 days, as an extra precaution for them and our own customers.
On the shared hosting platform, a number of NAS drives, which sit behind the front-end web servers and hold customer website data, crashed and could not be recovered. However, they are set up in fully redundant pairs and the NAS drives themselves contain 8+ disk RAID 10 arrays. In every case but one, at least one server in each pair came back up cleanly, or in an easily repairable state, and customer websites were back online within 2-3 hours.
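To illustrate why the 8+ disk RAID 10 arrays mentioned above survived in all but one case, here is a minimal sketch (not the provider's actual tooling) of the RAID 10 failure rule: disks are grouped into mirrored pairs, and the array is lost only if both disks of the same pair fail.

```python
# Illustrative sketch of RAID 10 fault tolerance: disks are mirrored in
# pairs (0,1), (2,3), ... and data is striped across the pairs, so the
# array survives any failure pattern that leaves one disk per pair intact.
def raid10_survives(num_disks: int, failed: set) -> bool:
    """Return True if the array survives the given set of failed disk indices."""
    assert num_disks % 2 == 0, "RAID 10 needs an even number of disks"
    for pair_start in range(0, num_disks, 2):
        # The array is lost only if both halves of some mirror pair fail.
        if pair_start in failed and pair_start + 1 in failed:
            return False
    return True

# An 8-disk array can lose four disks spread across different pairs...
print(raid10_survives(8, {0, 2, 4, 6}))  # True
# ...but not two disks within the same mirror pair.
print(raid10_survives(8, {0, 1}))        # False
```

This is why losing one NAS drive of each redundant pair, or one disk of each mirror, still left customer sites recoverable in almost every cluster.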
In one case, on the cluster containing web 75-79, which represents just under 2% of our entire shared platform, both NAS drives failed to come back up. Following their disaster recovery procedure, the centre commenced attempts to restore the drives, whilst simultaneously building new NAS drives should they be required. Unfortunately, the servers gave a strong, but false, indication that they could be brought back into a functioning state, so attempts to repair the file system were prioritised.
Regrettably, following a ‘successful’ repair, performance was incredibly poor due to the damage to the file system, and the centre was forced to proceed to the next rung of its disaster recovery procedure. The further you step into the disaster recovery process, the greater the recovery time, and here they were looking at a total 4TB restore from on-site backups to new NAS drives. (For your information, the steps after that are to restore from offsite backup and, finally, from tape backup, although neither was needed.) At this point, it became apparent that the issue would take days rather than hours to resolve, and the status page was updated with an ETA. They restored sites to the new NAS drives alphabetically, in a read-only state, and the restoration completed late on Sunday afternoon.
A full shared cluster restore from backups to new NAS is a critical incident for them, and they routinely train their engineers on disaster recovery steps. The disaster recovery process functioned correctly, but because the event did not occur in isolation, they were unable to offer the level of individual service that they really wanted to, and that we would expect from them (e.g. individual site migration during restoration).
Given the magnitude of this event, they are currently investigating plans to split their platform and infrastructure servers across two data centre halls, which would allow them/us to continue running in the event of a complete power loss to one. This added reliability is an extra step that the centre feels is necessary to put in place to ensure that this never happens again to them, or indeed, our own customers.
VPS and Dedicated Servers
For their unmanaged platforms (VPS and Dedicated Servers), the damage was more severe, as by default these servers are not redundant or backed up. In particular, one type of VPS was more susceptible to data corruption in the event of a power loss due to the type of caching the host servers use. They have reported that they have now remedied this issue on all re-built VPS involved in the outage, and no active or newly built VPS now suffer from this issue.
They did lose two KVM hosts (the host servers that hold VPS, approximately 60-80 servers per VPS KVM host, 6-12 servers per Hybrid KVM host). The relatively good news was that the underlying VPS data was not damaged. However, they also lost two KVM network switches, which needed to be swapped out, and this caused intermittent network performance on other VPS during the incident.
To bring these VPS back online, replacement KVM hosts had to be built and the VPS data copied across from each failed host. For every other VPS, the host servers were back up and running within 2 hours, but in many cases the file systems or databases of the virtual machines on those servers were damaged by the power loss. For these VPS, by far the quickest course of action for customers to get back up and running immediately was a rebuild and restore from backups (either offsite or via our backup service).
However, they informed us that many of the affected VPS customers did not have any backups (with us or elsewhere), and the only copy of the server’s data was held in a partially corrupted form on their KVM hosts, so they took steps to attempt to get our customers back online. For every affected VPS they ran an automated fsck (file system check) in an effort to bring the servers back online in an automated fashion. This would not, however, fix issues with MySQL (used in many shopping carts), which were the most common problems due to the high transaction rate. Tables left open during a power loss are likely to end up corrupted, so they provided a do-it-yourself guide to getting MySQL back into a working state.
Customers could also ask the centre to attempt a repair, which typically takes 2-3 hours per server with an expected success rate of approximately 20%. They currently have a backlog of servers they have agreed to attempt to recover, but given the time per investigation, this is likely to take most of the week. This situation is roughly equivalent to the total loss of a NAS pair, and is where the disaster recovery steps (server rebuild and backup restoration) should be followed.
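For readers attempting the do-it-yourself MySQL recovery mentioned above, a common approach (ours is a sketch, not the data centre's actual guide) is to run the standard mysqlcheck client against the affected database. The sketch below only assembles the command as a dry run, so it can be reviewed before being executed on a freshly recovered server; the database name is illustrative.

```python
# Hypothetical post-crash MySQL table check/repair, built as a dry run.
# Assumes the standard mysqlcheck command-line client is installed.
def build_repair_command(database: str, user: str = "root") -> list:
    """Assemble a mysqlcheck invocation that checks every table in the
    database and auto-repairs any found corrupt (MyISAM tables only)."""
    return [
        "mysqlcheck",
        "--check",        # verify each table first
        "--auto-repair",  # repair tables that fail the check
        f"--user={user}",
        "--password",     # prompt for the password rather than embedding it
        database,
    ]

cmd = build_repair_command("shop_db")  # "shop_db" is a placeholder name
print(" ".join(cmd))
# Once reviewed, run it with: subprocess.run(cmd, check=True)
```

Note that InnoDB tables perform their own crash recovery on startup; the auto-repair step mainly helps with MyISAM tables, which are the ones typically left corrupt by a sudden power loss.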
As these servers are unmanaged, there is no disaster recovery process in place by default. I know this isn’t the answer many of you want to hear, and most of all we want to ensure that this can never happen to you again. All VPS hosts are now set to be far more resilient in the event of a sudden power loss.
Support and Communications
During this incident, we worked our hardest with the data centre to keep our entire customer base informed of progress through our status page, and by personal texts and telephone calls, including those customers who don't pay for any support.
Given the scale of the issue, the load on our Customer Services support team was far in excess of normal levels. On a standard day they typically handle approximately 800 support tickets, which can rise to 1600 during a fairly major incident. At absolute capacity, they can handle approximately 2000 new tickets per day.
This event was unprecedented: during and following the incident they received in excess of 5000 new support tickets every day (excluding old tickets that were re-opened), and the ticket complexity was far higher than usual. Our own admin service, handled by James and three full-time data centre staff on behalf of the Digital Network, and the data centre's own admin system, were not set up to handle this number of requests (both being poll-heavy, to give our team quick updates on the ticket queue). This heavily impacted the performance of our/their control panel and ticketing system, until they made alterations to make it far less resource intensive.
After this, they took immediate steps to reduce the enormous support load via automated updates to affected customers; these reports are always updated in real time at www.webhostingstatus.com. Most of the tickets, however, required in-depth investigation and server repairs demanding a high level of technical capability, so they could only be addressed by the data centre's own second-line and sysadmin staff. It will take some time to clear the entire ticket backlog and restore normal ticket SLAs.
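The ticket figures above explain why the backlog will take time to clear. A back-of-the-envelope sketch, assuming the stated rates stay constant (roughly 5000 new tickets per day arriving against a 2000-per-day handling ceiling during the incident, dropping back to the normal 800 per day afterwards):

```python
# Rough backlog model using the ticket figures quoted above.
# All rates are assumed constant for the sake of illustration.
def backlog_after(incident_days: int, recovery_days: int) -> int:
    """Estimate outstanding tickets after the incident and recovery periods."""
    backlog = 0
    for _ in range(incident_days):
        backlog += 5000 - 2000                   # arrivals exceed capacity by 3000/day
    for _ in range(recovery_days):
        backlog = max(0, backlog + 800 - 2000)   # 1200/day of spare capacity afterwards
    return backlog

print(backlog_after(3, 0))  # 9000 tickets queued after a 3-day incident
print(backlog_after(3, 8))  # 0 - cleared after roughly 8 normal days
```

On these assumptions, each incident day adds about 3000 tickets to the queue, and each normal day afterwards removes about 1200, which is why "some time" realistically means a week or more.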
They had planned to go live with a brand new customer-specific status page on the day of the outage, as it would allow us to provide greater detail for customers without the requirement that messages be updated by ourselves.
They did not push this live during the incident, as they needed all hands on deck to fix the live issues, but they have just made it live. We will be looking at this in more detail ourselves, but in the meantime, please continue to use www.webhostingstatus.com. The new service allows for subscription via email, SMS, and RSS, so you will be kept up to date during any future major incident. Past events are also archived and remain fully visible. As many of you know, we also use this page to inform you of any changes to the platform or scheduled maintenance work.
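For the more technical among you, the RSS subscription mentioned above can be consumed with nothing more than the Python standard library. The sketch below parses an RSS 2.0 feed and lists the latest incident titles; the feed URL shown in the comment is an illustration, not a confirmed endpoint.

```python
# Minimal sketch of reading a status RSS feed with the Python standard library.
import xml.etree.ElementTree as ET

def latest_incidents(feed_xml: str, limit: int = 3) -> list:
    """Return the titles of the most recent items in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    # RSS 2.0 nests <item> elements inside <channel>; iter() finds them all.
    return [item.findtext("title", default="") for item in root.iter("item")][:limit]

# Live usage would fetch the feed first (URL is an assumption):
# from urllib.request import urlopen
# with urlopen("https://www.webhostingstatus.com/feed.rss") as resp:
#     print(latest_incidents(resp.read().decode()))

sample = """<rss version="2.0"><channel>
<item><title>NAS restore complete</title></item>
<item><title>Load balancer repair</title></item>
</channel></rss>"""
print(latest_incidents(sample))  # ['NAS restore complete', 'Load balancer repair']
```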
We chose to co-locate with our data centres because of the superb support we receive from them, and indeed from all our telephony and broadband services teams. We reinvest a huge amount of money in all our technologies, so that you, the customer, continue to receive the best service levels in the industry across all our platforms.
In closing this news item, I would like to apologise to you, our customers. We know as much as anyone how important staying online is to your business. The best thing we can do to regain your trust is to continue to offer good, uninterrupted service long into the future, and that has always been our utmost priority.
Thank you all for your patience during this major incident.
THE DIGITAL NETWORK