Virtual (SQL) server impaired by residual snapshot after Veeam backup

Recently I had to troubleshoot a SQL server that runs nightly batch jobs for a management information system. Under normal conditions the batch takes 6.5 hours, but this suddenly increased to 11.5 hours. An increase of roughly 77%!

Because of this delay the information wasn’t presented on time, which had a lot of implications. Several departments were asked what had changed in the past days; of course the answer was “nothing”.

Point in time

The delay of the batch started on the 6th of June, when the run time increased from 400 to 600+ minutes.
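If the batch run times are available as a list, the first abnormal run can be pinpointed with a few lines of Python. This is only a minimal sketch: the function name, the 25% tolerance and the sample values are assumptions for illustration, not the actual measurements.

```python
def first_slow_run(runs, baseline_minutes, tolerance=1.25):
    """Return the label of the first run slower than baseline * tolerance.

    `runs` is a chronologically ordered list of (label, minutes) pairs.
    Returns None when no run exceeds the limit.
    """
    limit = baseline_minutes * tolerance
    for label, minutes in runs:
        if minutes > limit:
            return label
    return None


# Illustrative values only, loosely based on the 400 -> 600+ minutes jump.
runs = [("June 4", 400), ("June 5", 405), ("June 6", 610)]
print(first_slow_run(runs, baseline_minutes=400))
```

Knowing the exact day makes it much easier to correlate the slowdown with other events, such as a failed backup.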

 

VMware vSphere Client

Performance

The performance metrics of the virtual machines showed a decrease in both processor and disk performance while the network was hardly affected.

This is unexpected since the content of the batch job was unchanged, and the same applies to the infrastructure. No (major) changes were made that would justify the decrease in performance.

Storage

There was a sudden increase of about 600 GB in allocated disk space, a substantial part of it used by snapshots. Aha!

Snapshot

Unless a change is being performed (and a rollback might be required), no snapshot should be present. However, there was a snapshot called “Consolidate Helper- 0”.

This snapshot was left behind after a failed Veeam backup (as described by Jim Jones in this article).
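Checking for such leftovers can be automated. The sketch below walks a snapshot tree and collects every snapshot whose name starts with a given prefix. It assumes nodes shaped like the snapshot tree objects exposed by pyVmomi (a `name` and a `childSnapshotList`, reachable through `vm.snapshot.rootSnapshotList`). Connecting to vCenter is left out, and the function name and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def find_snapshots(snapshot_nodes, name_prefix="Consolidate Helper"):
    """Return the names of all snapshots starting with name_prefix (depth-first)."""
    found = []
    for node in snapshot_nodes or []:
        if node.name.startswith(name_prefix):
            found.append(node.name)
        # Child snapshots may be missing or None on leaf nodes.
        children = getattr(node, "childSnapshotList", None)
        found.extend(find_snapshots(children, name_prefix))
    return found


# Stand-in objects for illustration; a real tree would come from the vSphere API.
tree = [
    SimpleNamespace(
        name="Before patching",
        childSnapshotList=[
            SimpleNamespace(name="Consolidate Helper- 0", childSnapshotList=[]),
        ],
    ),
]
print(find_snapshots(tree))
```

Passing an empty prefix lists every snapshot, which is useful on machines where no snapshot is expected at all.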

 

Veeam Backup & Replication

To confirm that the snapshot was indeed a leftover of a failed backup, I checked the backup log. And indeed, after a successful backup on the 4th, the backup of the 5th of June ended with a warning:

Removing snapshot
Unable to connect to the remote server 
No connection could be made because the target machine actively refused it xxx.xxx.xxx.xxx:443 
Veeam Backup will attempt to remove snapshot during the next job cycle, but you may consider removing snapshot manually. 
Possible causes for snapshot removal failure: 
- Network connectivity issue, or vCenter Server is too busy to serve the request 
- ESX host was unable to process snapshot removal request in a timely manner 
- Snapshot was already removed by another application

The backup on the 6th of June could not be completed at all and ended with an error:

Initializing target session
RemoveSnapshot failed, snapshotRef "snapshot-35436", timeout "3600000" 
Unable to access file  since it is locked
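Messages like the ones in the two log excerpts above are easy to miss when a job ends with "only" a warning. The sketch below scans log text for them. It assumes the log is available as plain text. The patterns are taken from the messages quoted above and are not an exhaustive or official list, and the function name is hypothetical.

```python
import re

# Phrases copied from the warning and error shown above.
FAILURE_PATTERNS = [
    re.compile(r"RemoveSnapshot failed", re.IGNORECASE),
    re.compile(r"attempt to remove snapshot during the next job cycle", re.IGNORECASE),
    re.compile(r"since it is locked", re.IGNORECASE),
]


def snapshot_removal_problems(log_text):
    """Return (line_number, line) pairs for lines matching a known failure phrase."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append((number, line.strip()))
    return hits


sample = """Initializing target session
RemoveSnapshot failed, snapshotRef "snapshot-35436", timeout "3600000"
"""
for number, line in snapshot_removal_problems(sample):
    print(number, line)
```

Running a check like this after every backup cycle gives an early signal that a snapshot may have been left behind, before the performance impact becomes visible.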

 

Result

After removing the snapshot, the storage space was reclaimed and the time required to perform the batch job was back to normal.

 

Moral of the story

Be careful with snapshots of virtual machines. The impact on performance can be dramatic, and finding the cause can take quite a while if you’re not aware of this.

 
