On Monday, September 9th, 2013, CloudShare experienced a performance degradation that resulted in longer-than-usual environment provisioning times. The incident began around 4:30AM PDT, when we detected that environment provisioning times were increasing. Our engineering team immediately began troubleshooting and found that all environments with longer provisioning times were managed by a specific instance of a component of our backend system. We disabled this instance, and by 6:15AM PDT environment provisioning times were back to normal while we continued to investigate the root cause.
At 2:30PM PDT we noticed that environment provisioning was again slowly degrading. Further investigation revealed that one of the storage nodes on which we store the memory state of suspended environments had higher latency than the other nodes. We reduced the load on this node and began working with the storage vendor to troubleshoot the issue. On September 10th at 1:10AM PDT, we shut down the storage node, removed it from the pool and continued to troubleshoot it offline. At 5AM PDT, provisioning times returned to normal.
Because we had to shut down the storage node on which the memory state is stored, a few customers' environments were resumed by booting rather than by resuming to their previous memory state.
The incident was the result of performance issues in a storage array dedicated to the memory state of suspended VMs. A replacement disk that was added over the weekend caused high latency while 'resilvering'. Jobs on this node took longer than usual and slowly started to queue up on our backend servers. After shutting down the storage node, we disabled the new disk and added the node back to the storage pool.
- We continue to investigate and work with the storage vendor to understand why the new disk and the resilvering process had such an impact. This is a standard procedure that had been performed several times before with no significant impact on provisioning performance.
- We will be adding more memory storage nodes to the array in the near future.
- We will make our backend system more resilient to degrading storage performance.
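
As a purely illustrative sketch of the kind of check that could flag a storage node whose latency stands out from its peers, the Python below compares each node's latency to the pool median and excludes outliers from new job placement. It does not describe CloudShare's actual backend: the function names, the node names, the sample latencies and the 3x threshold are all assumptions made for the example.

```python
from statistics import median


def find_slow_nodes(latencies_ms, factor=3.0):
    """Return names of nodes whose latency exceeds `factor` times the pool median.

    `latencies_ms` maps a (hypothetical) node name to a recent latency
    measurement in milliseconds. With fewer than three nodes there is no
    meaningful baseline, so nothing is flagged.
    """
    if len(latencies_ms) < 3:
        return []
    baseline = median(latencies_ms.values())
    return sorted(
        name for name, ms in latencies_ms.items() if ms > factor * baseline
    )


def eligible_nodes(latencies_ms, factor=3.0):
    """Return the nodes that should still receive new jobs."""
    slow = set(find_slow_nodes(latencies_ms, factor))
    return sorted(name for name in latencies_ms if name not in slow)


# Hypothetical measurements: one node is much slower than its peers.
samples = {"node-a": 4.0, "node-b": 5.0, "node-c": 42.0, "node-d": 4.5}
slow = find_slow_nodes(samples)
usable = eligible_nodes(samples)
print("slow:", slow)
print("usable:", usable)
```

Comparing against the median rather than the mean keeps a single very slow node from inflating the baseline it is measured against.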