Last week and this week (Jan 14-23) we had several disruptions affecting the North American (NA) datacenter of Business Catalyst. The impact consisted of either short windows of downtime or performance degradation.
This is a summary of the service disruptions, covering the types of incidents, their impact, their root causes, and the measures taken to fix the issues.
Last week, on Tue, Fri and Sat, we experienced three windows of downtime of 10 to 30 minutes each, caused by a physical malfunction of a database server hosting a large portion of the sites in NA. The malfunction was coupled with a failure of the failover database to come online automatically, so the failover had to be performed manually; that manual operation is what caused the downtime. For the first two incidents the root cause was not properly identified at the time, and the server was kept in production in a secondary role, as failover for the new primary database server. The physical malfunction was confirmed on Sat, and the complication was that we did not have a spare server on site fitting the required profile: a high-performance, 72 GB RAM, non-virtualized machine for running a large database. Over the weekend we had to change the topology of the entire database setup to accommodate that database shard on existing hardware, in order to keep high availability for all databases.
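The failover decision described above can be sketched as follows. This is a minimal illustration, not Business Catalyst's actual tooling; the function name and health-check inputs are hypothetical.

```python
def choose_primary(primary_healthy: bool, replica_healthy: bool) -> str:
    """Decide which database server should serve traffic.

    Illustrative sketch: in the incident, the automatic promotion of the
    replica failed and this step had to be performed by hand, which is
    what produced the 10-30 minute downtime windows.
    """
    if primary_healthy:
        return "primary"
    if replica_healthy:
        # Promote the failover database to primary.
        return "replica"
    raise RuntimeError("no healthy database server available")
```

The key operational lesson is that the promotion path itself must be exercised regularly, since a failover that only works manually still translates into downtime.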
Meanwhile the malfunctioning equipment was replaced and successfully re-incorporated in the production infrastructure.
Unrelated to this, there were several incidents last week where our monitors indicated short interruptions of DNS and other internal services in the NA datacenter, in bursts of 1-4 minutes, for causes unknown at the time.
After a deep investigation we were able to pinpoint the root cause: the huge volume of spam generated by comments on blogs and forums had become a severe bottleneck for a piece of networking equipment in the datacenter (the internal switch between the web and database layers).
In hindsight, this was also the root cause of the performance degradation reported by several partners in the NA datacenter.
A temporary fix was deployed on Thu, limiting the number of comments retrieved for a page to 1,000. The proper fix was released this week with R183. Additionally, we took steps to initiate a hardware upgrade of the network infrastructure and to improve logging and monitoring for this layer.
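A cap like the temporary fix above amounts to bounding the result set at retrieval time. A minimal sketch, assuming a simple list-backed store (the names here are illustrative, not the actual Business Catalyst code):

```python
# Cap introduced by the temporary hot-fix described above.
MAX_COMMENTS_PER_PAGE = 1000

def fetch_comments_for_page(all_comments):
    """Return at most MAX_COMMENTS_PER_PAGE comments for rendering.

    In SQL terms this corresponds to appending LIMIT 1000 to the
    comment query, so a spam-flooded page can no longer push an
    unbounded result set through the web-to-database switch.
    """
    return all_comments[:MAX_COMMENTS_PER_PAGE]
```

The point of the cap is that it bounds per-request traffic across the affected switch regardless of how much spam a page accumulates; the proper fix in R183 then addresses the spam volume itself.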
This week, on Wed, we had to push a hot-fix release to production during NA business hours, which we usually avoid. During the release, performance was affected for a short period while part of the servers were taken offline to be upgraded to the latest version of the application.
We are exploring solutions that will allow deployment during peak traffic in a geography without impacting the performance of the respective datacenter.
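One common approach to the problem described above is a rolling deployment: upgrade servers in small batches so the rest of the pool keeps serving traffic. The sketch below is hypothetical (the `drain`/`upgrade`/`restore` helpers stand in for load-balancer and deployment operations), not a description of the solution we will adopt.

```python
# Recorded sequence of operations, for illustration.
log = []

def drain(server):
    """Remove a server from the load balancer rotation."""
    log.append(("drain", server))

def upgrade(server):
    """Install the new application version on the server."""
    log.append(("upgrade", server))

def restore(server):
    """Return the upgraded server to rotation."""
    log.append(("restore", server))

def rolling_deploy(servers, batch_size=1):
    """Upgrade servers a batch at a time so capacity never drops to zero."""
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for s in batch:
            drain(s)
        for s in batch:
            upgrade(s)
        for s in batch:
            restore(s)
```

With `batch_size` kept well below the pool size, the capacity loss at any moment is bounded, which is what would allow deploying during peak traffic without a noticeable performance impact.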
Also on Wed the NA datacenter experienced a massive traffic surge that temporarily impacted performance for admin operations and sites front-end serving.
The traffic spike was contained and load returned to normal shortly afterwards.
We know how critical our services are to our customers’ businesses. We regret the problems caused by these disruptions and apologize for the inconvenience.
This document was generated from the following discussion: North America datacenter performance issues