CloudShare experienced an outage on Sunday, April 22, 2012. The outage affected all customers and began at 1:13 PM PST. An initial fix was in place at 4:35 PM PST, which enabled some customers to access their already-running environments; however, performance and functionality remained degraded until 6:23 PM PST, when the service was restored. On Monday at 2:30 AM PST, some of our ProPlus customers experienced severe performance degradation: they could not resume their environments, or some of their environment machines performed poorly. The issue lasted until Monday 6:30 AM PST, when service and performance were fully restored.
Action & Root Cause
One of the nodes of our main storage unit, which is connected to most of our servers and services, shut down unexpectedly on the night of 4/21. This is a high-availability storage unit, so there was no service interruption at this stage. The service outage started later and was the result of an error by our storage vendor's field engineer, who was on-site to replace a malfunctioning motherboard in the storage node. After installing the new motherboard, the technician made an error during the disk assignment to the new motherboard; as a result, the whole storage cluster became unavailable, and all services connected to it, including our front-end web servers and backend servers, became unavailable as well. After the field engineer, with the assistance of the storage vendor's support team, straightened out the disk assignments, CloudShare infrastructure engineers had to manually re-attach the VMFS LUNs to our ESX servers and start a repair process for some of our servers and services that had issues as a result of the unexpected storage downtime.
Later on, the network interface of one of our cloud storage devices, which stores customers' virtual machine disks, became overloaded. As a result, machines connected to this device suffered severe performance degradation and became unusable, while other environments failed to resume. Our storage engineers, with the assistance of the storage vendor's support team, troubleshot the issue and changed the NIC configuration to reduce its load, which mitigated the issue and restored normal performance.
- 4/21/2012 10:02 PM PST - storage node shuts down unexpectedly and failover to the standby node begins (no downtime)
- 4/22/2012 4:15 AM PST - investigation by the storage vendor reveals a malfunctioning motherboard that needs to be replaced
- 4/22/2012 5:45 AM PST - storage vendor dispatches a field technician with a replacement motherboard
- 4/22/2012 12:15 PM PST - field technician starts the motherboard replacement
- 4/22/2012 1:13 PM PST - after replacing the motherboard, the field engineer errantly assigns disks to the new motherboard and, as a result, the storage cluster crashes
- 4/22/2012 1:30 PM PST - field engineer engages the storage vendor's escalation team
- 4/22/2012 3:04 PM PST - escalation team fixes the issue and brings both storage nodes back online
- 4/22/2012 3:10 PM PST - datacenter team starts powering on servers
- 4/22/2012 3:40 PM PST - some servers fail to start and a repair process is initiated
- 4/22/2012 3:45 PM PST - manual reattachment of VMFS LUNs starts
- 4/22/2012 4:35 PM PST - powering on of servers resumes
- 4/22/2012 5:42 PM PST - all servers are up and backend server warm-up is completed
- 4/22/2012 5:45 PM PST - service sanity tests start
- 4/22/2012 7:00 PM PST - service sanity tests completed successfully
- 4/23/2012 2:30 AM PST - NOC engineer receives alerts about storage performance issues
- 4/23/2012 2:45 AM PST - level 1 investigation of the performance issue starts
- 4/23/2012 3:10 AM PST - ticket is escalated to storage engineers
- 4/23/2012 3:40 AM PST - storage engineer contacts the storage manufacturer's support and opens a severity 1 support case
- 4/23/2012 4:15 AM PST - troubleshooting begins with the storage vendor's help
- 4/23/2012 4:50 AM PST - investigation reveals that one of the NICs is experiencing sporadic errors
- 4/23/2012 5:30 AM PST - changing a configuration parameter on the storage unit and refreshing the NIC settings resolves the issue
- 4/23/2012 6:00 AM PST - performance starts stabilizing
- 4/23/2012 6:30 AM PST - normal performance is restored and all machines return to normal
The vendor of our main storage unit provided us with a detailed post-mortem, and we are working with them to reduce the risk of human error.
We are also working with our cloud storage vendor to understand why the NIC became overloaded, and we are considering replacing some hardware components.
In addition, we are working on improving the channels we use to communicate with our customers during maintenance windows and unplanned outages. We plan to announce these improvements in the next couple of months.
We are continuing to investigate all aspects of the issues outlined above and plan to take steps to address them and improve our service. We take our SLA seriously, and we know that you as a customer rely on our services. We sincerely apologize for any impact and disruption you may have experienced, and thank you for taking the time to read this report.