Debugging XC130003 FDCs with MQS_ACTION_ON_EXCEPTION
Problem (Abstract)
You have an application that fails with an FDC from the xehExceptionHandler component because the process received a signal such as SIGBUS or SIGSEGV. You want to know how to find the cause of the failure using the MQS_ACTION_ON_EXCEPTION environment variable.
Cause
The MQ xehExceptionHandler function catches synchronous terminating signals such as SIGBUS (bus error), SIGSEGV (segmentation fault), SIGILL (illegal instruction) and SIGFPE (floating-point exception). These signals indicate a serious fault, such as dereferencing a NULL pointer, which needs to be investigated by the program developer.
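For illustration, a NULL pointer dereference of the following form is enough to raise SIGSEGV. This is a generic C sketch, not MQ code:

#include <stdio.h>

int main(void)
{
    int *p = NULL;   /* address 0 is not mapped in the process */
    *p = 42;         /* this store raises SIGSEGV */
    printf("never reached\n");
    return 0;
}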
Resolving the problem
The WebSphere MQ Application Programming Guide describes how WebSphere MQ handles signals on UNIX systems. The MQS_ACTION_ON_EXCEPTION environment variable controls what the MQ signal handler will do when it is called. MQS_ACTION_ON_EXCEPTION must be assigned one of the following four values:
HANG
Hang the process if MQ caused the fault
HANG_ALL
Hang the process no matter who is at fault
ABORT
Terminate and dump core if MQ caused the fault
ABORT_ALL
Terminate and dump core no matter who is at fault
Each of these four values can be followed by a plus sign, for example HANG_ALL+. The plus sign forces the handler to act even when MQ expects a signal, so you should not use it unless requested by IBM support. In most cases you should use the following value:
export MQS_ACTION_ON_EXCEPTION=HANG_ALL
You should set this variable in the environment where you start your queue manager and any failing applications. Monitor the /var/mqm/errors directory for a new FDC file from the xehExceptionHandler component, often with Probe Id XC130003 and a Comment1 value showing SIGBUS or SIGSEGV. When you see a matching FDC, the offending program will be waiting in an idle state for you to analyze it. You can do so with a debugger such as dbx or gdb, or on some systems using the IBM stackit script.
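For example, once the FDC header gives you the process identifier, you might attach to the suspended process with one of the following commands (dbx on AIX, gdb on Linux; the PID here is taken from the walkthrough below):

dbx -a 311466
gdb -p 311466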
The HANG_ALL option tells MQ to stop the process regardless of whether MQ or application code failed. This is usually the most useful option, since it lets you capture debugging information for IBM support if MQ is at fault, or for your own developers if your application code failed.
If you prefer to capture a core dump for later analysis, choose the value ABORT_ALL instead. The core dump is only created if the process can write to its working directory and the "ulimit -c" core file size limit is greater than zero.
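For example, you might remove the core file size limit in the shell used to start the queue manager before reproducing the problem:

ulimit -c unlimited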
Exception Handler Diagnostic Walkthrough
Here is an FDC file that was created when a client application tried to connect to a queue manager running on AIX:
+---------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Tuesday June 12 15:32:04 EDT 2007 |
| Host Name :- aemaix4 (AIX 5.2) |
| PIDS :- 5724H7201 |
| LVLS :- 6.0.1.1 |
| Product Long Name :- WebSphere MQ for AIX |
| Vendor :- IBM |
| Probe Id :- XC130003 |
| Application Name :- MQM |
| Component :- xehExceptionHandler |
| SCCS Info :- lib/cs/unix/amqxerrx.c, 1.214.1.4 |
| Line Number :- 1341 |
| Build Date :- May 5 2006 |
| CMVC level :- p600-101-060504 |
| Build Type :- IKAP - (Production) |
| UserID :- 00007100 (jtf) |
| Program Name :- amqrmppa |
| Addressing mode :- 64-bit |
| Process :- 311466 |
| Thread :- 4 |
| QueueManager :- JTF |
| ConnId(1) IPCC :- 40 |
| Major Errorcode :- STOP |
| Minor Errorcode :- OK |
| Probe Type :- HALT6109 |
| Probe Severity :- 1 |
| Probe Description :- AMQ6109: An internal WebSphere MQ |
| error has occurred. |
| FDCSequenceNumber :- 0 |
| Arith1 :- 11 b |
| Comment1 :- SIGSEGV: address not mapped(539) |
| |
+---------------------------------------------------------+
MQM Function Stack
ccxResponder
rrxResponder
rriAcceptSess
rriInitExits
rriCALL_EXIT
xcsFFST
This FDC shows both the XC130003 Probe Id and the xehExceptionHandler component name, and the Comment1 value at the bottom of the header shows that this process received a SIGSEGV signal. The amqrmppa program is responsible for handling channel activity, and the functions listed in the MQM Function Stack show the failing thread was responding to an incoming channel request.
Since amqrmppa is one of the MQ queue manager processes, most problems within it need to be investigated by IBM support. In this case, the functions rriInitExits and rriCALL_EXIT suggest this channel failed while calling a channel exit, which is typically not provided by IBM. At the bottom of the FDC some of the channel details are printed, including its name (IPVAUTH.SVRCONN) and a security exit name.
The MQS_ACTION_ON_EXCEPTION=HANG_ALL variable was set before this queue manager was started, so after writing the FDC file the MQ signal handler suspended this amqrmppa process. The FDC shows it is running under process identifier 311466, so it would be possible to attach a debugger to this process. Alternatively, running 'stackit -p 311466' provides the following information:
Running stackit against process 311466 Tue Jun 12 15:32:43 2007
=================================================================
UID PID PPID C STIME TTY TIME CMD
jtf 311466 4202 0 15:32:03 - 0:00 /usr/mqm/bin/amqrmppa -m JTF
Stack Trace
311466: /usr/mqm/bin/amqrmppa -m JTF
---------- tid# 2982001 ----------
0x0900000001603a60 cccJobMonitor() + ??
0x090000000160184c rppPoolMain() + 0x44c
0x0000000100000490 UnknownData() + 0xffffffff00000298
0x0000000100000288 __start() + 0x90
---------- tid# 3428471 ----------
0x090000000002b688 _ptrgl() + ??
0x090000000003b334 nsleep(??, ??) + 0xac
0x0900000000046be4 sleep(??) + 0x58
0x090000000117ca00 xcsSleep() + 0x90
0x0900000001128f0c xehInterpretSavedSigaction() + 0x9c
0x090000000112e1dc xehExceptionHandler() + 0xb1c
0x0900000004bd3504 ipvQueryLdapServer() + 0x10
0x07000000500007dc ????????(??, ??) + ??
0x0900000004bd3588 ipvAnalyzeAddress() + 0xc
0x0900000004bd3328 ipvAuth(0x110099f10, 0x1100c79b8...
0x0900000001641f50 rriInitExit() + 0x250
0x090000000164d87c rriInitExits() + 0xf3c
0x090000000172b7e4 rriAcceptSess() + 0x1b34
0x0900000001729334 rrxResponder() + 0x164
0x09000000015f1aa0 ccxResponder() + 0x190
0x09000000015ee070 cciResponderThread() + 0xc0
0x090000000112f1d4 ThreadMain() + 0xd44
0x090000000036e508 _pthread_body(??) + 0xbc
...
Address Space Map
311466 : /usr/mqm/bin/amqrmppa -m JTF
100000000 2K read/exec amqrmppa
110000540 0K read/write amqrmppa
9fffffff0000000 33K read/exec /usr/ccs/bin/usla64
9fffffff00084a7 0K read/write /usr/ccs/bin/usla64
900000004bd3000 4K read/exec /scratch/ipverify/ipverify
8001000a001ac08 0K read/write /scratch/ipverify/ipverify
900000001899000 86K read/exec /usr/mqm/lib64/amqcltca_r
...
The stack information printed here is more detailed since it includes operating system functions, not just those functions belonging to WebSphere MQ. Unlike the stack in the FDC file, this one reads from the bottom up. It shows that the MQ rriInitExit function passed control to the ipvAuth function, which is the security exit function configured on the channel:
display channel(IPVAUTH.SVRCONN)
1 : dis chl(IPVAUTH.SVRCONN)
AMQ8414: Display Channel details.
CHANNEL(IPVAUTH.SVRCONN) CHLTYPE(SVRCONN)
ALTDATE(2007-06-12) ALTTIME(15.28.52)
COMPHDR(NONE) COMPMSG(NONE)
DESCR( ) HBINT(300)
KAINT(AUTO) MAXMSGL(4194304)
MCAUSER( ) MONCHL(OFF)
RCVDATA( ) RCVEXIT( )
SCYDATA( )
SCYEXIT(/scratch/ipverify/ipverify(ipvAuth))
SENDDATA( ) SENDEXIT( )
SSLCAUTH(REQUIRED) SSLCIPH( )
SSLPEER( ) TRPTYPE(TCP)
Because this error happened inside the security exit function, the developer of the exit needs to investigate. Fortunately, the stack provides more information to debug the security exit. It shows that the ipvAuth function called ipvAnalyzeAddress, which indirectly called ipvQueryLdapServer. This is the last function entered before the signal was raised, so it appears the ipvQueryLdapServer function is at fault. The stackit output also confirms that the ipvQueryLdapServer function address is inside the /scratch/ipverify/ipverify module.
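For reference, a channel security exit such as ipvAuth is an ordinary C function with the standard MQ channel exit signature declared in cmqxc.h. The following sketch shows the general shape only; the function name comes from this walkthrough, but the body is an assumption, not the actual exit code:

#include <cmqc.h>    /* MQI definitions */
#include <cmqxc.h>   /* channel exit definitions: MQCXP, MQXR_*, MQXCC_* */

void MQENTRY ipvAuth(PMQVOID pChannelExitParms,  /* MQCXP exit parameter block */
                     PMQVOID pChannelDefinition, /* MQCD channel definition */
                     PMQLONG pDataLength,
                     PMQLONG pAgentBufferLength,
                     PMQVOID pAgentBuffer,
                     PMQLONG pExitBufferLength,
                     PMQPTR  pExitBufferAddr)
{
    PMQCXP pParms = (PMQCXP)pChannelExitParms;

    if (pParms->ExitReason == MQXR_INIT)
    {
        /* Channel initialization: the stack above shows the failure */
        /* occurred on this path, called from MQ's rriInitExit.      */
    }

    pParms->ExitResponse = MQXCC_OK;   /* accept the connection */
}

A fault anywhere inside this function, or in anything it calls, is caught by xehExceptionHandler and reported against the channel process, exactly as seen above.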
A developer could use a debugger to investigate more closely, or perhaps add debugging or logging code to help identify the fault. Sometimes the problem is easy to see when reading the code. In this case the failure was due to some intentionally bad code added to the exit. The following statements try to store the integer value 42 at address 0x0000000000000539 (0x539 is the hex value of decimal 1337):
int *badaddr = (int *)1337;   /* decimal 1337 == hex 0x539, an unmapped address */
*badaddr = 42;                /* this store raises SIGSEGV */
Looking back at the FDC header, the comment not only identifies the fault as a SIGSEGV, but also gives the address that caused the failure: "SIGSEGV: address not mapped(539)". If the exit had instead dereferenced a NULL pointer, the comment would have read "SIGSEGV: address not mapped(0)", at least on AIX; the exact wording varies slightly between operating systems.
To clean up the problem it is first necessary to stop the queue manager. Because this amqrmppa process is deliberately held in a hang by the signal handler, endmqm will not complete until the process is killed, for example by running 'kill -9 311466'. Once the queue manager and its listener have stopped, the rebuilt exit program should be installed. On AIX, root must then run the 'slibclean' command to flush the old exit code out of the kernel library cache.
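Under the assumptions of this walkthrough (queue manager JTF, hung process 311466), the cleanup sequence might look like this:

endmqm JTF          (waits while amqrmppa is hung)
kill -9 311466      (allows endmqm to complete)
slibclean           (run as root, AIX only)
strmqm JTF          (after installing the rebuilt exit)

Finally, after restarting the queue manager, the new exit handles incoming client connections without any problem: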
aemaix4> amqsputc TEST JTF
Sample AMQSPUT0 start
target queue is TEST
Hello, world.
While this sample client application is active, runmqsc shows the SVRCONN channel is running:
display chstatus(IPVAUTH.SVRCONN)
1 : dis chs(IPVAUTH.SVRCONN)
AMQ8417: Display Channel Status details.
CHANNEL(IPVAUTH.SVRCONN) CHLTYPE(SVRCONN)
CONNAME(127.0.0.1) CURRENT
RQMNAME( ) STATUS(RUNNING)
SUBSTATE(RECEIVE) XMITQ( )
These tests and the absence of any new FDC files confirm that the correction to the channel exit resolved the SIGSEGV exception in the WebSphere MQ channel process.