HP Proliant DL380G6 unexpected reboot

Written by Ingmar Verheij on July 1st, 2011. Posted in XenServer

Citrix XenServer 5.5.0 is installed on an HP Proliant DL380 G6 server. The customer complained that VMs experienced unexpected shutdowns and could not find the cause of the problem.

After spending some time troubleshooting the virtual machines I couldn’t find the source of the problem. There was no BSOD on the VMs, the UPS showed no loss of power and there are no scheduled mechanisms that could cause the issue on the specified dates.

Eventually I found that the cause of the issue is a bad driver that is part of the HP Insight Manager Agent.

Symptoms

Usually after the weekend all servers on the same XenServer host were powered down, resulting in a loss of functionality. One of the servers, FS01, is the fileserver, so whenever the problem arose the impact was clearly visible to users. The event log showed an event with event ID 6008 from the EventLog source describing an unexpected shutdown on 28-06-2011 at 04:04:48.

Unfortunately the (Nagios) monitoring server (MT03) was on the same host as the fileserver. Since no HA capabilities are available, the servers are never restarted automatically and no administrator is warned.

XenCenter - VH04

Event 6008, Source EventLog: The previous system shutdown at 4:04:48 on 28-6-2011 was unexpected.

Caused by the host

Since all virtual machines were shut down, I expected this to be a problem caused by the host. The first place to check would be the UPS. I knew there was a script active that would shut down the VMs and the Citrix XenServer when power was lost for 5 minutes (300 seconds). However, no evidence was found that power had been lost. Since the power supply is redundant (and connected to two separate UPS devices), this was more or less ruled out (although testing it is advised).

I wanted to check whether the shutdown was issued by a command, which should be visible in the logfiles, and whether the XenServer host had been shut down as well. So I started checking the logfiles on the VH04 virtual host. Since I had a limited timeframe (around 04:04:48), this could be done in an acceptable time.

Citrix XenServer logfiles are found in /var/log/messages and /var/log/xensource.log.
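Narrowing a logfile down to the minutes around the incident can be scripted. The sketch below is a minimal Python example, assuming the classic syslog timestamp prefix (for example `Jun 28 04:04:59`) at the start of each line in /var/log/messages. The function name `lines_in_window` and the sample lines are hypothetical and only serve as illustration.

```python
from datetime import datetime, timedelta

def lines_in_window(lines, center, minutes=5, year=2011):
    """Return the lines whose syslog timestamp lies within +/- `minutes` of `center`."""
    margin = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        # Classic syslog lines start with e.g. "Jun 28 04:04:59" (15 chars, no year),
        # so the year has to be supplied by the caller.
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if abs(stamp - center) <= margin:
            selected.append(line)
    return selected

# Hypothetical sample lines; on the host you would read /var/log/messages instead.
sample = [
    "Jun 28 03:10:00 vh04 example: well before the incident",
    "Jun 28 04:04:59 vh04 example: last message before the gap",
    "Jun 28 04:08:08 vh04 example: first message after boot",
]
incident = datetime(2011, 6, 28, 4, 4, 48)
for line in lines_in_window(sample, incident):
    print(line)
```

With a window of five minutes on either side, only the second and third sample lines are printed.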

 

Citrix XenServer logfiles

The logfile showed no evidence of a shutdown command. I could find messages about machines rebooting, but these were all issued from within the VM itself (the terminal servers are rebooted on a 2-day schedule). No external commands, for instance from the UPS, were found.

In /var/log/xensource.log.2, the logfile from 28-06-2011, I saw that the XenServer booted at 04:08:08.010 without any evidence of a shutdown procedure. The previous message in the logfile, a ‘regular’ message, was logged at 04:04:59.079: a gap of more than 3 minutes, which is a lot in a XenServer logfile.
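Spotting such a silence by eye is tedious in a busy logfile. As a rough sketch, assuming the timestamps have already been extracted from the log lines, the following Python snippet reports every pause longer than a threshold. The helper `find_gaps` and the first timestamp in the list are made up for illustration; the other two timestamps are the ones mentioned above.

```python
from datetime import datetime, timedelta

def find_gaps(stamps, threshold=timedelta(minutes=1)):
    """Yield (previous, current, gap) for consecutive timestamps further apart than `threshold`."""
    for previous, current in zip(stamps, stamps[1:]):
        gap = current - previous
        if gap > threshold:
            yield previous, current, gap

FMT = "%H:%M:%S.%f"
# 04:04:58.500 is an invented filler; the other two values come from the article.
stamps = [datetime.strptime(s, FMT) for s in ("04:04:58.500", "04:04:59.079", "04:08:08.010")]

for previous, current, gap in find_gaps(stamps):
    print(f"silence of {gap.total_seconds():.3f} s after {previous.time()}")
```

For the two timestamps from the logfile the gap works out to 188.931 seconds, a little over three minutes.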

 

Integrated Lights Out (iLO)

Since this is an HP Proliant server I could check the Integrated Lights Out (iLO) log files to see whether a power failure was the cause of the problem.

The iLO2 Event Log showed an event around the same time the problems occurred. At 06/28/2011 04:05 an event was logged with the description ‘BMC IPMI Watchdog Timer Timeout: Action=System Power Reset.’ followed by a reboot.

So the reboot is issued by the IPMI watchdog. But why is this reboot issued and how can I prevent it?

HP iLO - BMC IPMI Watchdog Timer Timeout: Action = System Power Reset.

 

HP Insight Manager logfiles

An HP Insight Manager Agent is installed on every virtualisation host. The software writes log information to /var/log/hp-health/hpasmd.log. This logfile showed no evidence of any event; all it shows is information about the startup process, so it is of no use here.

The /var/log/messages logfile, however, does show some information from the HP Insight Manager around the same time the reboot occurred. Although the event that issued the reboot (the BMC IPMI Watchdog event) is not found, there is an interesting message from the hpasmxld daemon: ‘OsKcsExecCmd:  IPMI NetFN  0x4   CMD: 0x2d has timed out!’.
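To see how often this timeout occurs, the messages logfile can be scanned for it. The snippet below is a minimal sketch in Python, assuming the message text has exactly the form quoted above. The function name `find_ipmi_timeouts` and the syslog prefix in the sample line are hypothetical. In the IPMI specification, NetFn 0x04 is the Sensor/Event function and command 0x2D is Get Sensor Reading, so the timeout suggests a sensor read through the driver did not complete.

```python
import re

# Matches the hpasmxld message quoted in the article; whitespace runs may vary.
PATTERN = re.compile(
    r"OsKcsExecCmd:\s+IPMI NetFN\s+(0x[0-9a-fA-F]+)\s+CMD:\s+(0x[0-9a-fA-F]+)\s+has timed out"
)

def find_ipmi_timeouts(lines):
    """Return (netfn, cmd) integer pairs for every OsKcsExecCmd timeout line."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append((int(match.group(1), 16), int(match.group(2), 16)))
    return hits

# Message text as quoted in the article; the syslog prefix is a hypothetical example.
sample = [
    "Jun 28 04:00:00 vh04 hpasmxld[1234]: OsKcsExecCmd:  IPMI NetFN  0x4   CMD: 0x2d has timed out! ",
    "Jun 28 04:00:01 vh04 example: unrelated message",
]
print(find_ipmi_timeouts(sample))
```

On the host itself the lines would come from `open("/var/log/messages")` rather than from the sample list.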

HP Business Support Center

In the HP Business Support Center there are some support documents describing issues with the message ‘OsKcsExecCmd’. Document c01330219 describes the following issue:

“Advisory: ProLiant Servers with ILO 2 Running Red Hat Enterprise Linux 5 (AMD64/EM64T) May Intermittently ASR when a Certain System Health Application and Insight Management Agents and the OpenIPMI Device Driver Version is Installed”

The description matches the symptoms, but the installed HP System Health Application (8.2.5) is newer than the versions specified (7.90, 7.91 or 7.92). The issue should be solved by either falling back to the native driver or upgrading to a newer version of the agent (and OpenIPMI driver).

Another support document (c01891068) describes more problems with the OpenIPMI driver. Although the symptoms are different, the scope of the problem is HP OpenIPMI Device Driver for Linux Version 8.30 (or earlier).
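Checking an installed version against the versions listed in an advisory is easy to get wrong when done by eye. As a small sketch, assuming the HP version strings can be compared component by component as integers (an assumption, not something HP documents here), the comparison could look like this in Python. The helper `parse_version` is a hypothetical name.

```python
def parse_version(text):
    """Turn a dotted version string such as '8.2.5' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

installed = parse_version("8.2.5")
affected = [parse_version(v) for v in ("7.90", "7.91", "7.92")]

# Is the installed version one of the versions named in the advisory?
print(installed in affected)
# Is it newer than every version named in the advisory?
print(installed > max(affected))
```

Tuples are compared element by element, so `(8, 2, 5)` sorts after `(7, 92)` regardless of how many components each version has.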

VH04 - SSH - HP Health version

 

Solution (or workaround)

The solution (or workaround) to the problem is to either install the latest version of the HP OpenIPMI Device Driver for Linux or remove the HP OpenIPMI Device Driver for Linux and fall back to the native driver found in the Linux OS.

Removing the driver is described in HP Support Document c01833268. The steps involve stopping the HP Insight Management Agents, removing the driver and restarting the agents.

Another workaround could be to disable ASR (Automatic Server Recovery) completely. This would prevent a bad driver from initiating an unexpected reboot, but it would also prevent a reboot when there is a valid reason for one. Disabling ASR (on an HP Proliant DL380 G5) is done via the BIOS.

Or, when your OS is Linux, you can disable it via the following command:

 

C-States

Although no known problems are found describing the same issue on Citrix XenServer 5.5, the cause of the problem could also be a known issue with certain Intel Nehalem and Westmere processors.

Citrix article CTX127395 describes an issue where Citrix XenServer 5.6 freezes when certain processor features are enabled. Since the HP Proliant DL380 G6 is equipped with an Intel E5540 processor, which is in scope for this problem, it is advised to disable the C-states feature in the HP BIOS.
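To confirm which processor a host actually has, the model name can be read from /proc/cpuinfo. The following Python sketch assumes the usual `model name : ...` lines of a Linux /proc/cpuinfo. The function name `cpu_model_names` and the sample excerpt are hypothetical, and whether a given model is in scope still has to be looked up in CTX127395 itself.

```python
def cpu_model_names(cpuinfo_text):
    """Return the distinct 'model name' values from /proc/cpuinfo-style text."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "model name" and value.strip() not in names:
            names.append(value.strip())
    return names

# Hypothetical excerpt; on the host use open("/proc/cpuinfo").read() instead.
sample = (
    "processor\t: 0\n"
    "model name\t: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz\n"
    "processor\t: 1\n"
    "model name\t: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz\n"
)
names = cpu_model_names(sample)
print(names)
print(any("E5540" in name for name in names))
```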

VH04 - SSH - Processor details

Disable C-States in HP BIOS (CTX127395)

Ingmar Verheij

At the time Ingmar wrote this article he worked for PepperByte as a Senior Consultant (up to May 2014). His work consisted of designing, migrating and troubleshooting Microsoft and Citrix infrastructures. He was working with technologies like Microsoft RDS, user environment management and (performance) monitoring. Ingmar is User Group leader of the Dutch Citrix User Group (DuCUG). RES Software named Ingmar RES Software Valued Professional in 2014.
