Thursday, September 4, 2008

WebSphere Application Server V6.1.0.15 on AIX Issue

We were developing a web application using Richfaces and managing the beans through Spring's variable resolver. Richfaces was chosen because its features matched the business requirements, including file upload. Development was done on Tomcat 6.0, and since we had to use Richfaces 3.2.1 GA, we had to run against JSF 1.2. We even had to use nightly build 10 for an issue which I'll mention in the coming days. By corporate decision, the production and UAT environments were to be WAS. However, IBM had not yet released WAS version 7 with Java EE 5 support, and we had a tough time with parent-last class loading on version 6.1.0.11. Finally, we decided to step down one version, and JSF went down with it. For the upload functionality, we used Tomahawk (MyFaces). After a short but tight struggle, we were able to meet the deadline. The corporate systems team also upgraded to WAS 6.1.0.15. Had this been done earlier, I wouldn't have had these issues, as a tweak would have been enough to use JSF 1.2 and ultimately Richfaces 3.2.1.
In parallel with UAT, we planned to run a load test during off hours. The sample 5-user test with one iteration ran fine, so with high hopes we started 20 users with continuous iterations for 30 minutes. The server crashed the first time after 20 minutes, then again after 15 minutes, for a total of 5 crashes. Since it was towards the end of the lifecycle, we almost peed in our pants.
The core dump gave the following details:
Could not create the Java virtual machine.
Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000004 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000033
Handler1=F113E110 Handler2=F113814C
R0=00000188 R1=3DCE23E8 R2=F117A7C0 R3=35AB1558
R4=00002AA8 R5=35AB4000 R6=000000B0 R7=1015B07D
R8=0015B07D R9=00000000 R10=00000000 R11=00000000
R12=59005335 R13=359D6800 R14=36658ACC R15=59FEBDF0
......................
......................
FPR29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Target=2_30_20071004_14218_bHdSMR (AIX 5.3)
CPU=ppc (4 logical CPUs) (0x200000000 RAM)
JVMDUMP006I Processing Dump Event "gpf", detail "" - Please Wait.

Since nothing was very clear in the dump file, we started investigating all possible avenues. One pattern we noticed during the first two crashes was log file rotation, but that turned out to be a false alarm when we tried simulating it. We tried monitoring the memory but didn't find anything suspicious either.
Then we took the native_stderr.log that was being generated and also turned on verbose GC logging to check for any abnormal GC activity, but everything shown was within control. Before each crash, the server took a dump, and the reason cited was "Type=Segmentation error vmState=0x00000000". Only one crash showed "Type=Segmentation error vmState=0x00050000", with more information such as "Method_being_compiled=javax/faces/component/html/HtmlPanelGrid.getOnkeypress()Ljava/lang/String;". This made us believe it was a problem in the Just-In-Time (JIT) compiler; a workaround was to disable JIT compilation at runtime. However, IBM docs warn of performance degradation if that's done. This made us look into it in a little more detail, and I was able to lay my hands on this: http://www-01.ibm.com/support/docview.wss?rs=0&uid=swg24019205
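Comparing the dump headers by hand across five crashes gets tedious. Here is a minimal Python sketch of how the fields quoted above could be pulled out of a dump's text. It is not something we ran at the time; the function name summarize_crash and the jit_suspected flag are made up for illustration, and the flag follows only the reasoning in this post (a Method_being_compiled line suggests the JIT was at work), not any official IBM rule.

```python
import re

# Patterns for the fields quoted in this post's dump excerpts.
FIELD_PATTERNS = {
    "type": re.compile(r"Type=(.+?)\s+vmState="),
    "vm_state": re.compile(r"vmState=(0x[0-9A-Fa-f]+)"),
    "method_being_compiled": re.compile(r"Method_being_compiled=(\S+)"),
}


def summarize_crash(dump_text):
    """Return the crash type, vmState and method being compiled (if any).

    Hypothetical helper: it only looks for the first occurrence of each
    field and returns None for fields that are absent.
    """
    summary = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(dump_text)
        summary[name] = match.group(1) if match else None
    # Assumption from the post: a method being compiled at crash time
    # points at the JIT compiler rather than application code.
    summary["jit_suspected"] = summary["method_being_compiled"] is not None
    return summary
```

Feeding it the text of each dump in turn would make the odd one out (the crash with Method_being_compiled) stand out right away.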

The gist: some changes introduced in WebSphere 6.1.0.15 made WAS prone to crashes that show up readily under any stress/load testing. PK64529 resolves this issue, and it is available from the 2nd link provided above. The fix is also included in Fix Pack 6.1.0.17, released on 3-Jun-2008.


We didn't upgrade to 6.1.0.17, but installed the patch, and the load testing continued with no issues. In fact, there were no failures and no errors, JVM heap usage hovered around 250M (min heap was set at 512M), process CPU usage was about 6%, and the response time was also satisfactory.


So, if you've installed WAS 6.1.0.15 on AIX 5.3, please make sure you have the patch installed as well.
