CloudShare customers experienced a performance degradation on Wednesday, March 28, 2012. The issue affected many customers who resumed their environments starting at 6:59 AM PST. The impacted system was rebooted at 7:10 AM PST, and by 7:35 AM the mitigation measures had returned the server to full capacity.
The performance degradation was the result of lengthy computations on one of our backend systems. This system manages part of our Cloud infrastructure and is also responsible for resuming and suspending environments. Because of these lengthy computations, environments resumed by this server took significantly longer to complete.
Actions & Root Cause Analysis
A standard storage migration routine unexpectedly caused one of our backend servers to perform lengthy computations, and as a result the performance of the server was significantly degraded. Our datacenter engineers were alerted to the issue and started troubleshooting it. Once the issue was isolated and identified, we restarted the impacted server and the problem was mitigated.
Our engineers have updated the code responsible for the migration routine to throttle the number of concurrent migration requests handled at a time.
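The report does not show CloudShare's actual implementation, but a concurrency throttle of this kind is commonly built with a counting semaphore: requests beyond the cap simply wait for a free slot instead of piling onto the server. The sketch below is a minimal Python illustration with hypothetical names (`migrate_volume`, `MAX_CONCURRENT_MIGRATIONS`), not the real migration code:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cap on how many migrations may run at once.
MAX_CONCURRENT_MIGRATIONS = 4
_migration_slots = threading.Semaphore(MAX_CONCURRENT_MIGRATIONS)

# Bookkeeping so we can observe the peak concurrency actually reached.
_lock = threading.Lock()
_active = 0
peak_concurrency = 0

def migrate_volume(volume_id):
    """Stand-in for the real storage-migration work."""
    global _active, peak_concurrency
    with _lock:
        _active += 1
        peak_concurrency = max(peak_concurrency, _active)
    time.sleep(0.01)  # simulate the expensive computation
    with _lock:
        _active -= 1
    return f"migrated {volume_id}"

def throttled_migrate(volume_id):
    # Block until a slot frees up, then run the migration.
    with _migration_slots:
        return migrate_volume(volume_id)

# A burst of 10 requests arrives, but at most 4 migrate concurrently;
# the rest wait for a free slot rather than overloading the server.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(throttled_migrate, range(10)))
```

The key design point is that the semaphore bounds load on the backend regardless of how large the incoming burst is, which is exactly the failure mode the migration routine exposed.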
Timeline (all times PST)
- 6:55 AM: storage migration routine started
- 6:59 AM: performance degradation started
- 7:02 AM: first performance alerts received
- 7:05 AM: engineers started troubleshooting the issue
- 7:10 AM: the impacted backend server was rebooted
- 7:35 AM: server warm up completed
- 7:52 AM: all pending environments were resumed
We sincerely apologize for any impact and disruption you may have experienced, and thank you for taking the time to read this report.
CloudShare is committed to continually and rapidly improving our technology, service, and operations to help prevent service disruptions and enhance our service.