Service disruption - long environment provisioning, Thursday 5/16/2013

On Thursday, May 16th 2013, CloudShare experienced a performance degradation that resulted in long environment provisioning times. The problem started at 4:30am PST, when our NOC identified that a high percentage of environment provisioning requests were taking more than 10 minutes. Our engineering team immediately started troubleshooting and traced the problem to the mechanism that handles notifications between the different system components. During the troubleshooting we restarted several system components. At 6:30am PST we identified the root cause and immediately started working on a fix. At 8am PST we deployed the fix to our Backend systems. The fix mitigated the problem; however, the Backend systems still had a long queue of pending jobs and events to work through, and the service returned to normal just before 10am PST.



We identified two software bugs:

  • Service Bus was 'flooded' with redundant DNS updates
  • CloudShare Backend systems did not handle the large number of events in the Service Bus well
These bugs caused a significant delay in handling new events, so the Backend system started processing each event several minutes late.
After the hot fix was deployed, the long queue of pending environments put a high load on our memory storage clusters, which also lengthened the time to full recovery.
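
For readers curious about the kind of problem described above, here is a minimal sketch of one common way to keep redundant updates from flooding a queue: keep only the latest pending event per key. This is an illustration only, not CloudShare's actual fix; the class name, the event keys, and the host name are all hypothetical.

```python
from collections import OrderedDict


class CoalescingQueue:
    """FIFO queue that keeps at most one pending event per key (hypothetical)."""

    def __init__(self):
        # Maps key -> latest payload, ordered by when the key was first queued.
        self._pending = OrderedDict()

    def publish(self, key, payload):
        """Queue an event. Returns False if it only replaced a pending one."""
        is_new = key not in self._pending
        # Overwriting an existing key keeps its original position in the queue.
        self._pending[key] = payload
        return is_new

    def pop(self):
        """Remove and return the oldest pending (key, payload) pair."""
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)


def coalesce_demo():
    """Publish 1000 redundant DNS updates plus one provisioning event."""
    queue = CoalescingQueue()
    for i in range(1000):
        # Assumed event shape: (event type, resource name) as the key.
        queue.publish(("dns", "vm-1.example.com"), f"10.0.0.{i % 250}")
    queue.publish(("provision", "env-42"), "start")
    return len(queue), queue.pop(), queue.pop()
```

In this sketch the 1000 DNS updates for the same host collapse into a single pending event carrying the most recent address, so the provisioning event behind them waits for one event rather than a thousand.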

Post Mortem
  • The incident revealed several bugs in our Backend system. All critical bugs were fixed in production on Sunday 5/19. 
  • Additional software enhancements for both backend and frontend systems will be deployed in our next release on Sunday, 6/2/2013.
  • New and improved service bus logs and monitors were added.
  • The procedure for handling high load after an incident will be improved to shorten the recovery time.

We apologize for any inconvenience this issue may have caused. 
We take our SLA seriously and we know that you, as a customer, rely on our services. It is our goal to always provide you with the highest quality of service possible, and to continue improving our service.