On Thursday, May 16th, 2013, CloudShare experienced a performance degradation resulting in long environment provisioning times. The problem started at 4:30am PST, when our NOC identified that a high percentage of environment provisioning operations were taking more than 10 minutes. Our engineering team immediately started troubleshooting and identified that the problem resided in the mechanism that handles notifications between the different system components. During the troubleshooting we restarted several system components. At 6:30am PST we identified the root cause and immediately started working on a fix. At 8am PST we deployed a fix to our Backend systems. The fix mitigated the problem; however, the Backend systems had accumulated a long queue of pending jobs and events, and the service returned to normal just before 10am PST.
During the investigation we identified the following software bugs:
- The Service Bus was flooded with redundant DNS update messages
- CloudShare Backend systems did not handle the large volume of events in the Service Bus well
The incident also revealed several additional bugs in our Backend system. The following corrective actions have been taken or are planned:
- All critical bugs were fixed in production on Sunday, 5/19
- Additional software enhancements for both Backend and Frontend systems will be deployed in our next release on Sunday, 6/2/2013
- New and improved Service Bus logs and monitors were added
- We will improve the procedure for handling high load after an incident, to shorten recovery time
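One way to prevent the kind of flooding described above is to coalesce redundant update messages before they reach the bus: if only the latest DNS value per record matters, earlier pending updates for the same record can be overwritten instead of queued. The sketch below is a minimal illustration of that idea in Python; the class, the record names, and the `publish` callback are hypothetical, not CloudShare's actual implementation.

```python
from collections import OrderedDict


class UpdateCoalescer:
    """Coalesces redundant DNS update messages before publishing.

    Hypothetical sketch: only the most recent value per record is kept,
    so a burst of N updates to one record produces a single bus message.
    """

    def __init__(self, publish):
        self._publish = publish        # callback that sends one message to the bus
        self._pending = OrderedDict()  # record name -> latest requested value

    def submit(self, record, value):
        # Overwrite any earlier pending update for this record rather
        # than queuing a duplicate message.
        self._pending[record] = value

    def flush(self):
        # Publish one message per distinct record, then clear the batch.
        sent = 0
        for record, value in self._pending.items():
            self._publish({"record": record, "value": value})
            sent += 1
        self._pending.clear()
        return sent
```

With this pattern, a burst of a thousand redundant updates to a single record collapses into one published message per flush, which keeps the bus queue bounded even when an upstream component misbehaves.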