CloudShare customers experienced a performance degradation on Wednesday, April 4, 2012. The issue affected customers who resumed their environments starting at 3:50 AM PST. Because the issue was triggered by a software hot-fix, we initiated a roll-back at 4:56 AM PST, and by 6:08 AM the mitigation measures had returned the servers to full capacity.
The performance degradation was triggered by a software hot-fix that went wrong. A couple of high-priority bugs introduced in our latest application release on April 1, 2012 were identified, tested, and fixed. The fixes were activated on our production servers using CloudShare's continuous deployment system, which updates our backend servers without requiring downtime. Post-deployment tests identified that the backend servers' log files were not being updated, so we decided to initiate a roll-back; however, the automatic roll-back failed. We then resolved the log file issue and successfully completed the post-deployment tests. About an hour and a half later, our monitoring system indicated slow performance on our backend servers, and we noticed that environments were slow to resume. Since we could not identify the root cause, we decided to manually roll back the fixes we had deployed earlier.
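The deploy-verify-rollback flow described above can be sketched as a post-deployment health check that treats stale backend log files as a failure signal. This is a hypothetical illustration only; the function names, paths, and the 60-second freshness threshold are assumptions, not CloudShare's actual tooling.

```python
import time
from pathlib import Path

# Hypothetical post-deployment check: a backend server's log file is
# considered "fresh" if it was modified within the last `max_age` seconds.
def logs_are_fresh(log_paths, max_age=60, now=None):
    """Return True if every log file exists and was updated recently."""
    now = time.time() if now is None else now
    for path in log_paths:
        p = Path(path)
        if not p.exists():
            return False
        if now - p.stat().st_mtime > max_age:
            return False
    return True

def post_deploy_check(log_paths, rollback):
    """Run the freshness check; trigger the roll-back callback on failure."""
    if logs_are_fresh(log_paths):
        return "deploy-ok"
    rollback()
    return "rolled-back"
```

A check like this only catches the symptom (logs not being written); as the incident showed, the roll-back path it triggers needs to be tested just as carefully as the deployment itself.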
Actions & Root Cause Analysis
The initial automatic deployment procedure completed with no errors and no downtime. The failed attempt to roll back the hot-fix, prompted by the missing log files, was caused by a database caching issue. The performance issues that started a couple of hours later were, in essence, a result of the failed roll-back. Once we manually rolled back the hot-fix, our backend servers stabilized and standard performance resumed.
- The roll-back bug was fixed
- We have added test cases for automatic deployments and roll-backs
- The hot-fixes were successfully re-deployed on Sunday, April 8
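One way to exercise the automatic roll-back path in a test suite is to assert that deploying a hot-fix and then rolling it back restores the previous version. The sketch below is a minimal illustrative model; the `Server` class and version strings are hypothetical and do not reflect CloudShare's actual deployment system.

```python
# Minimal model of a deployment target with versioned releases and
# roll-back; all names here are illustrative assumptions.
class Server:
    def __init__(self, version):
        self.version = version
        self.history = [version]

    def deploy(self, version):
        """Record and activate a new release."""
        self.history.append(version)
        self.version = version

    def rollback(self):
        """Revert to the release that was active before the last deploy."""
        if len(self.history) > 1:
            self.history.pop()
            self.version = self.history[-1]

def test_rollback_restores_previous_version():
    s = Server("1.0")
    s.deploy("1.0-hotfix1")
    assert s.version == "1.0-hotfix1"
    s.rollback()
    assert s.version == "1.0"
```

Tests in this spirit make the roll-back path a first-class, continuously exercised feature rather than an emergency procedure that is only tried during an incident.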
Incident Timeline (all times PST)
- 00:45 AM: hot-fix deployment started
- 01:13 AM: automatic deployment completed
- 01:27 AM: an issue was identified where backend server log files were not being updated
- 01:44 AM: initiated software roll-back
- 01:46 AM: software roll-back failed
- 01:52 AM: log file issue identified and fixed. Hot-fix deployment testing started
- 02:26 AM: production tests were completed successfully
- 03:50 AM: performance alerts received
- 03:57 AM: engineers started troubleshooting the issue
- 04:20 AM: a configuration issue was identified and fixed, but performance did not improve
- 04:56 AM: initiated manual software roll-back to the previous version
- 05:20 AM: servers stabilized; a large queue of pending jobs remained
- 06:08 AM: last pending job was completed
We sincerely apologize for any impact and disruption you may have experienced, and thank you for taking the time to read this report.
We are committed to implementing the lessons learned from this incident and to taking all necessary steps to ensure that such an incident does not recur.