HP LogoCitrix XenServer 5.5.0 is installed on a HP Proliant DL380 G6 server. The customer complained that VM’s experienced unexpected shutdowns without finding the cause of the problem.

After spending some time troubleshooting on the virtual machines I couldn’t find the source of the problem. No BSOD on the VM’s, UPS showed no loss of power and there are no scheduled mechanismes that can cause the issue on the specified dates.

Eventually  I found that the cause of the issue is a bad driver as part of the HP Insight Manager Agent.

Symptoms

Usually after the weekend all servers on the same XenServer host are powered down resulting in a loss of functionality. One of the servers, the FS01, is the fileserver. So whenever the problem raised, the impact was clearly visible for users. The event log showed an event with event ID 6008 from the EventLog source describing an unexpected shutdown on 28-06-2011 at 04:04:48.

Unfortunately the (Nagios) monitoring server (MT03) was on the same server as the fileserver. Since no HA capabilities are available the server is never started and no administrator is warned.

XenCenter - VH04

Event 6008, Source EventLog: The previous system shutdown at 4:04:48 on 28-6-2011 was unexpected.

 

 

 

 

 

 

 

 

Caused by the host

Since all virtual machines are shutdown I expected this to be a problem caused by the host. First place to check would be the UPS. I knew there was a script active that would shutdown the VM’s and the Citrix XenServer when the power was lost for 5 minutes (300 seconds). Unfortunately no evidence was found that the power was lost. Since the power supply is redundant (and connected to two seperate UPS devices) this was more or less ruled out (although testing it is adviced).

I wanted to be check if the shutdown was issued by a command, which should be visible in the logfiles, and if the XenServer host would be shutdown aswell. So I started checking the logfile on the VH04 virtual host. Since I had a limited timeframe (around 04:04:48) this was to do in an acceptable time.

Citrix XenServer logfile are found in /var/log/messages and /var/log/xensource.

 

Citrix XenServer logfiles

The logfile showed no evidence of a shutdown command. Although I could find messages stating machines to reboot, but these where all issued from the VM itself (the terminal servers are rebooted at a 2-day schedule). No external command, for instance from the UPS, are found.

In /var/log.xensource.2, the logfile from 28-06-2011, I saw that the XenServer booted at 04:08:08.010 without any evidence of a shutdown procedure. The previous message in the logfile was a ‘regular’ message was logged at 04:04:59.079, a gap 3 minutes which is a lot in a XenServer logfile.

[20110628 04:04:58.868|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Running liveness logic
[20110628 04:04:58.868|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Domain 1 thought to be live
[20110628 04:04:58.868|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Domain 2 thought to be live
[20110628 04:04:58.868|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Domain 3 thought to be dead
[20110628 04:04:58.868|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Marking as alive!
[20110628 04:04:59.072|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Done
[20110628 04:04:59.072|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Domain 4 thought to be dead
[20110628 04:04:59.072|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Domain 5 thought to be dead
[20110628 04:04:59.072|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Domain 23 thought to be dead
[20110628 04:04:59.072|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Marking as alive!
[20110628 04:04:59.079|debug|VH04|14|Guest liveness monitor D:9bd3b429bfef|xapi_guest_metrics] Done
[20110628 04:08:08.010|debug|VH04|0 thread_zero|server_init D:a3005f827e60|startup] task [Initing stunnel path]
[20110628 04:08:08.014|debug|VH04|0 thread_zero|server_init D:a3005f827e60|startup] task [XAPI SERVER STARTING]
[20110628 04:08:08.016|debug|VH04|0 thread_zero|XAPI SERVER STARTING D:6350113c63cc|xapi] xapi server starting; on_system_boot=true
[20110628 04:08:08.016|debug|VH04|0 thread_zero|server_init D:a3005f827e60|startup] task [Parsing inventory file]
[20110628 04:08:08.017|debug|VH04|0 thread_zero|server_init D:a3005f827e60|startup] task [Initialising local database]
[20110628 04:08:08.018|debug|VH04|0 thread_zero|Initialising local database D:c438812bab34|localdb] loaded currently_attached_pbds -> ('(\'OpaqueRef:041c11a7-8727-cac4-4b8e-16bf49cb4fe4\')' '(\'OpaqueRef:e3604488-1716-b1cd-930b-edcfc108bc97\')' '(\'OpaqueRef:d50f9679-8cd5-2edf-ec7d-e12957b6540a\')' '(\'OpaqueRef:1b156a5f-049d-4b2c-3bd6-c0a59066519c\')' '(\'OpaqueRef:f1f8a7b3-fd17-6088-d276-ab0d571e99fe\')' '(\'OpaqueRef:c7448321-75dc-1a44-e1b1-6233498f0ecd\')' '(\'OpaqueRef:5622bc48-52d8-62b1-43ea-2509de9b5a4e\')' '(\'OpaqueRef:351b81d9-5021-c2ad-6bff-d36cdc937752\')' '(\'OpaqueRef:9d1e7380-76ed-5071-0e5d-21b5b4c2f941\')' '(\'OpaqueRef:4a2bbc3e-2953-cbab-2767-6c506328d55b\')' '(\'OpaqueRef:051184e7-3f94-4a95-cc0d-069f9f56a92c\')' '(\'OpaqueRef:746a7430-93ef-3a7f-d9b9-797109bcddb3\')' '(\'OpaqueRef:4845b673-0548-c130-a54a-7ca9be46e63d\')')
[20110628 04:08:08.019|debug|VH04|0 thread_zero|Initialising local database D:c438812bab34|localdb] loaded host_disabled_until_reboot -> false
[20110628 04:08:08.019|debug|VH04|0 thread_zero|Initialising local database D:c438812bab34|localdb] loaded this_node_just_became_master -> false
[20110628 04:08:08.019|debug|VH04|0 thread_zero|Initialising local database D:c438812bab34|localdb] loaded host_restarted_cleanly -> false
[20110628 04:08:08.019|debug|VH04|0 thread_zero|Initialising local database D:c438812bab34|localdb] loaded master_scripts -> false
[20110628 04:08:08.019|debug|VH04|0 thread_zero|Initialising local database D:c438812bab34|localdb] loaded ha.armed -> false
[20110628 04:08:08.019|debug|VH04|0 thread_zero|Initialising local database D:c438812bab34|xapi] ha.armed = false
[20110628 04:08:08.020|debug|VH04|0 thread_zero|server_init D:a3005f827e60|startup] task [Reading pool secret]
[20110628 04:08:08.026|debug|VH04|0 thread_zero|server_init D:a3005f827e60|startup] task [Logging xapi version info]
[20110628 04:08:08.028|debug|VH04|0 thread_zero|Logging xapi version info D:523d18dfdb50|xapi] Server configuration:
[20110628 04:08:08.028|debug|VH04|0 thread_zero|Logging xapi version info D:523d18dfdb50|xapi] product_version: 5.5.0
[20110628 04:08:08.028|debug|VH04|0 thread_zero|Logging xapi version info D:523d18dfdb50|xapi] product_brand: XenServer
[20110628 04:08:08.028|debug|VH04|0 thread_zero|Logging xapi version info D:523d18dfdb50|xapi] build_number: 25727p
[20110628 04:08:08.028|debug|VH04|0 thread_zero|Logging xapi version info D:523d18dfdb50|xapi] hg changeset: 0e5cce9f5289
[20110628 04:08:08.028|debug|VH04|0 thread_zero|Logging xapi version info D:523d18dfdb50|xapi] version: 1.3
[20110628 04:08:08.028|debug|VH04|0 thread_zero|Logging xapi version info D:523d18dfdb50|xapi] DB schema path: /etc/xensource/db_schema.sql

This implicated that the host server (VH04, Citrix XenServer) experienced an unexpected shutdown.Was the problem caused by a power failure?

 

Integrated Lights Out (iLO)

Since this is a HP Proliant server I could check the Integrated Lights Out (iLO) log files to check if a power failure was the cause of the problem.

The iLO2 Event Log showed an event around the same time the problems where caused. At 06/28/2011 04:05 an event was logged with the description ‘BMC IPMI Watchdog Timer Timeout: Action=System Power Reset.‘ followed by a reboot.

So the reboot is issued by the IPMI watchdog. But why is this reboot issued and how can I prevent it?

HP iLO - BMC IPMI Watchdog Timer Timeout: Action = System Power Reset.

 

HP Insight Manger logfiles

A HP Insight Manager Agent is installed on every virtualisation host. The software writes log information to /var/log/hp-health/hpasmd.log. This logfile showed no evidence of any event, all it shows is information about the startup process, so these logfiles are useless.

The /var/log/messages logfile does (however) show some information from the HP Insight Manager around the same timeframe as the reboot occurred. Although no event is found (the BMC IPMI Watchdog event that issued the reboot) there is an interesting message from the hpasmxld daemon ‘OsKcsExecCmd:  IPMI NetFN  0x4   CMD: 0x2d has timed out! ‘.

Jun 28 04:04:32 VH04 snmpd[6458]: Received SNMP packet(s) from UDP: [127.0.0.1]:46486 
Jun 28 04:04:36 VH04 hpasmxld[8718]: OsKcsExecCmd:  IPMI NetFN  0x4   CMD: 0x2d has timed out! 
Jun 28 04:04:46 VH04 hpasmxld[8718]: OsKcsExecCmd:  IPMI NetFN  0x4   CMD: 0x2d has timed out! 
Jun 28 04:04:49 VH04 snmpd[6458]: Received SNMP packet(s) from UDP: [127.0.0.1]:55169 
Jun 28 04:04:56 VH04 hpasmxld[8718]: OsKcsExecCmd:  IPMI NetFN  0x4   CMD: 0x2d has timed out! 
Jun 28 04:05:04 VH04 snmpd[6458]: Received SNMP packet(s) from UDP: [127.0.0.1]:36592 
Jun 28 04:05:06 VH04 hpasmxld[8718]: OsKcsExecCmd:  IPMI NetFN  0x4   CMD: 0x2d has timed out! 
Jun 28 04:07:58 VH04 syslogd 1.4.1: restart.
Jun 28 04:07:58 VH04 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jun 28 04:07:58 VH04 kernel: abled.
Jun 28 04:07:58 VH04 kernel:   MEM window: disabled.
Jun 28 04:07:58 VH04 kernel:   PREFETCH window: disabled.
Jun 28 04:07:58 VH04 kernel: PCI: Bridge: 0000:00:06.0
Jun 28 04:07:58 VH04 kernel:   IO window: disabled.
Jun 28 04:07:58 VH04 kernel:   MEM window: disabled.
Jun 28 04:07:58 VH04 kernel:   PREFETCH window: disabled.

HP Business Support Center

In the HP Business Support Center there are some support documents describing issues with the message ‘OsKcsExecCmd’. Document c01330219 describes the following issue

“Advisory: ProLiant Servers with ILO 2 Running Red Hat Enterprise Linux 5 (AMD64/EM64T) May Intermittently ASR when a Certain System Health Application and Insight Management Agents and the OpenIPMI Device Driver Version is Installed”

The description matches the symptoms, but the HP System Health Application installed (8.2.5) is higher then speciied (7.90, 7.91 or 7.92). The issue should be solved by either faling back to the native driver or by upgrading to a newer version of the agent (and OpenIPMI driver).

Another support document (c01891068) described more problems with the OpenIPMI driver. Although the sympoms are different, the scope of the problem is HP OpenIPMI Device Driver for Linux Version 8.30 (or earlier).

VH04 - SSH - HP Health version

 

Solution (or workaround)

The solution (or workaround) to the problem is to either install the latest version of the HP OpenIPMI Device Driver for Linux OR to remove the HP OpenIPMI Device Driver for Linux and falling back to the native driver found in the Linux OS.

Removing the driver is described in HP Support Document c01833268. The steps involves stopping the HP Insight Management Agents, removing the driver and restarting the agents.

Service hp-snmp-agents stop. 
Service hp-health stop 
Service hp-ilo stop 
Service hpsnmpd stop 

rpm -e hp-OpenIPMI 

Service hpsnmpd start 
Service hp-ilo start 
Service hp-health start 
Service hp-snmp-agents start.

Another workaround could be to disable ASR (Automatic Server Recovery) completely. This would prevent a bad driver to initiate an unexpected reboot, but also when there is a valid reason to do so. Disabeling ASR (on a HP Proliand DL380 G5) is done via the BIOS.

Reboot the server.
Press F9 during POST to enter the RBSU.
Select 'Server Availablility' and press Enter.
Select 'ASR Status'and press Enter.
Select 'Disabled'and press Enter.
Press Esc to close menu
Press Esc to exit utility.
Press F10 to confirm exit and save changes.
Server will automatically reboot.

Or when your OS in Linux you can disable it via the following command:

hpasmcli -s "disable asr" 

 

C-States

Although no known problems are found describing the same issue on Citrix XenServer 5.5, the cause of the problem can be a known issue with certain Intel Nehalem and Westmere processors.

Citrix article CTX127395  describes an issue where Citrix XenServer 5.6 freezes when certain processor features are enabled. Since the HP Proliant DL380G6 is equipped with an Intel E5540 processor, which is in scope for this problem, it is adviced to disable the C-states feature in the HP BIOS.

VH04 - SSH - Processor details

Disable C-States in HP BIOS (CTX127395)

 

 

 

 

 

 

 

Ingmar Verheij

2 Reacties

  1. I’m curious to find out what blog system you have been using? I’m having some minor security issues
    with my latest site and I’d like to find something more safeguarded. Do you have any suggestions?

Geef een reactie

Het e-mailadres wordt niet gepubliceerd. Vereiste velden zijn gemarkeerd met *

Deze site gebruikt Akismet om spam te verminderen. Bekijk hoe je reactie-gegevens worden verwerkt.

nl_NLNederlands