Status Update

Update on the July mail server outage

On Friday 22nd July we observed that the email platform was performing below expected levels, leading to intermittent login failures and delays in email delivery. The issue was caused by a transient problem with our storage provider’s platform.

Fasthosts engineers worked with the supplier and monitored the system over the weekend to find the root cause of the issue. As usage increased on Monday morning, performance dropped further and customer-impacting issues increased. Because of the high-usage profile of the platform, any service issue, especially one affecting performance, can escalate quickly at peak times.

Mail queues were monitored and mitigating action was taken where possible. In situations like this, where the behaviour is quick to diagnose but the cause is unknown, we have to proceed cautiously until the investigation is complete: attempting a fix without fully understanding the problem could cause further disruption for customers.

The solution

Our engineers traced the issue to deleted data held in specific high-performance memory regions that was not being purged quickly enough, causing a gradual degradation in performance that was exacerbated by high load. Configuration changes were made across the 72-node cluster to increase the frequency of the purge operation. These changes had to be made in a way that safeguarded platform stability: each node was taken out of service, reconfigured and then re-introduced into the storage ring in turn, a process which took approximately 8 hours.
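The node-by-node rollout described above can be sketched as a simple rolling update loop. This is an illustrative simulation only: the node names, the `purge_interval_s` setting and the in-memory cluster are hypothetical stand-ins, as a real storage ring would be managed through the vendor's own tooling rather than Python dicts.

```python
def rolling_update(cluster, key, value):
    """Apply a config change to each node in turn, so that at any
    moment at most one node is out of the storage ring."""
    for node in cluster:
        node["in_service"] = False   # drain: take the node out of service
        node["config"][key] = value  # apply the new setting
        node["in_service"] = True    # re-introduce the node into the ring
        # every node is back in service before we touch the next one
        assert all(n["in_service"] for n in cluster)
    return cluster

# hypothetical 72-node cluster with a slow purge interval
cluster = [{"name": f"node{i:02d}", "in_service": True,
            "config": {"purge_interval_s": 600}} for i in range(72)]

rolling_update(cluster, "purge_interval_s", 60)  # purge more frequently
```

Updating one node at a time trades speed for safety: the ring keeps serving traffic throughout, which is why a change that is trivial per node still took around 8 hours across 72 nodes.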

Conclusion

Our engineers, working closely with our supplier, resolved the issue, and full service resumed on 28th July. Early indications suggest that, as a result of the changes, performance across the platform has improved. We continue to monitor performance closely and will review the data regularly.

We apologise to customers who were affected and hope this gives some insight into the issues encountered and how our engineers worked to resolve them.
