CloudShare customers experienced a service disruption and performance degradation on Monday, July 2nd, 2012. The issues mainly affected customers who created or resumed their environments from 2:10 AM PST onward. The main remedy was implemented in less than 2 hours; however, our system recovered slowly and, due to subsequent events, standard performance was not fully restored until 8:50 AM PST.
The performance degradation was caused by an overloaded database, which resulted in high computation times on the backend servers that manage our cloud. To reduce the load on the database, we reduced the frequency of our software heartbeats (the heartbeats check the status and health of all running environments and VMs) to one every 30 seconds. Within minutes the database load went back to normal; however, our backend servers had many jobs queued and were slow to recover. During this time we encountered another issue: a backend process responsible for running scripts on the guest OS started crashing, resulting in high CPU usage on the backend servers. As soon as that issue was solved, the backend started recovering again.
We confirmed that the issue was indeed triggered by the frequency of our heartbeats; however, we believe that the heartbeats only triggered the database overload and are not its underlying cause.
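The relationship between heartbeat frequency and database load described above can be illustrated with a rough back-of-the-envelope sketch. The function name, the VM count, the number of queries per heartbeat, and the shorter intervals are all hypothetical values chosen for illustration; this report does not state the real figures or the interval used before the change.

```python
def heartbeat_queries_per_second(running_vms, interval_seconds, queries_per_heartbeat=1):
    """Estimate steady-state database queries per second caused by heartbeats.

    Assumes (hypothetically) that every running VM is checked once per
    interval and that each check issues a fixed number of queries.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return running_vms * queries_per_heartbeat / interval_seconds


# Hypothetical fleet size; not taken from the incident data.
vms = 3000
for interval in (5, 10, 30):
    load = heartbeat_queries_per_second(vms, interval)
    print(f"interval={interval:>2}s -> about {load:.0f} heartbeat queries/second")
```

Under these assumptions the heartbeat query rate falls in inverse proportion to the interval, which is consistent with the database load dropping within minutes of the change. It does not explain why the database was close to its limit in the first place, which is why the heartbeats are treated as a trigger.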
- Heartbeat frequency will be maintained at one every 30 seconds until further analysis and improvements are completed
- Heartbeat database queries will be reduced and optimized
- The backend will be enhanced to handle heavy loads better and to recover faster
- Review database architecture
- Replace the engine responsible for running scripts on the guest OS
- 02:10 AM: long environment preparation alert received
- 02:15 AM: NOC starts investigation
- 02:23 AM: NOC escalates the issue
- 04:02 AM: heartbeat frequency reduced to one every 30 seconds
- 05:07 AM: backend load stabilizes
- 06:30 AM: backend CPU at 100% due to crashes of the process responsible for running scripts on guest machines
- 08:30 AM: service returned to normal operation and performance
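The elapsed times implied by the timeline above can be checked with a short sketch. The timestamps are copied from the timeline; the helper name `minutes_between` is introduced here for illustration only, and the code assumes all events fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the timeline above (same day, AM).
TIMELINE = [
    ("02:10", "long environment preparation alert received"),
    ("02:15", "NOC starts investigation"),
    ("02:23", "NOC escalates the issue"),
    ("04:02", "heartbeat frequency reduced"),
    ("05:07", "backend load stabilizes"),
    ("06:30", "backend CPU at 100%"),
    ("08:30", "service returned to normal"),
]


def minutes_between(start, end):
    """Return whole minutes from start to end, both given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


first_time = TIMELINE[0][0]
for time, event in TIMELINE:
    print(f"{time}  +{minutes_between(first_time, time):>3} min  {event}")
```

By this calculation the heartbeat change at 04:02 AM came 112 minutes after the first alert, and the last timeline entry is 380 minutes after it.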
We sincerely apologize for any impact and disruption you may have experienced, and thank you for taking the time to read this report.
We are committed to implementing the lessons learned from this incident and to taking all the necessary steps to ensure that such an incident does not recur.