Service Disruption - Environment Provisioning, 4/28/2014

On Monday, April 28th, 2014, we experienced a service disruption in which some environments' VMs failed to start or started without their memory state, along with a performance degradation that resulted in longer-than-normal environment provisioning times.

Timeline 

  • 4/28/2014, 3:00PM PDT: the issue began when we detected that several software components were failing to mount data volumes from one of our memory storage nodes. Our engineering team stabilized the situation and contacted the storage vendor's support team.
  • 11:00PM PDT: we attempted to reboot the storage appliance, but were unsuccessful. We configured our service to provision environments without memory state while we continued working on the boot issue.
  • 4/29/2014, 12:30AM PDT: our engineers concluded that the RAID controller on the storage appliance's motherboard was faulty and that RAID mirroring had failed as well, so we decided to reinstall the storage operating system from scratch.
  • 3:30AM PDT: we successfully rebooted the appliance and started reconfiguration.
  • 4:10AM PDT: we successfully imported the data volumes; no data was lost.
  • 5:30AM PDT: the appliance was reconnected to our provisioning engine.

During the following days (4/30/2014 to 5/1/2014), we experienced intermittent high latency from this storage cluster. As a result, some environments' provisioning times were longer than usual, and several environments' VMs failed to load their OS; rebooting these VMs typically resolved the problem.
We continue to monitor and optimize the system, and we have scheduled essential maintenance for Friday, May 2nd at 21:01 Pacific Time.

Cause

The incident was the result of a faulty RAID controller in one of our storage clusters. Replacing the RAID controller and reinstalling the storage operating system brought the cluster back online; however, we then experienced high latency from this cluster. The higher latency had an unexpected impact on our virtualization infrastructure, resulting in long provisioning times and, in some cases, VMs that could not load their operating system. We also identified a bug in one of our software components that customizes OS passwords during initial provisioning; this bug also caused very long provisioning times for a few environments.

Post Mortem

  • Storage maintenance was successful and the storage cluster is performing well
  • The software bug that caused operating systems to fail to load was fixed
  • We are reviewing the memory storage cluster architecture and planning to roll out a new architecture in the next few months
  • CloudShare Status Page - we launched a new service to better communicate incidents and maintenance to our customers. We encourage all of our customers to visit this page to get real-time, up-to-date status of our service
  • We are implementing additional "guards" to make our virtualization infrastructure more resilient to such incidents

We sincerely apologize for the disruption this may have caused and thank you for your patience.
