In parallel with UAT, we planned to run a load test during off hours. The sample 5-user test with one iteration ran fine, so with high hopes we started 20 users with continuous iterations for 30 minutes. The server crashed after 20 minutes the first time, then again after 15 minutes, for a total of 5 crashes. Since we were towards the end of the lifecycle, we almost peed our pants.
The core dump gave the following details:
Could not create the Java virtual machine.
Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000004 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000033
Handler1=F113E110 Handler2=F113814C
R0=00000188 R1=3DCE23E8 R2=F117A7C0 R3=35AB1558
R4=00002AA8 R5=35AB4000 R6=000000B0 R7=1015B07D
R8=0015B07D R9=00000000 R10=00000000 R11=00000000
R12=59005335 R13=359D6800 R14=36658ACC R15=59FEBDF0
......................
......................
FPR29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Target=2_30_20071004_14218_bHdSMR (AIX 5.3)
CPU=ppc (4 logical CPUs) (0x200000000 RAM)
JVMDUMP006I Processing Dump Event "gpf", detail "" - Please Wait.
Since nothing in the dump file was very clear, we started investigating all possible avenues. One pattern we noticed during the first two crashes was log file rotation, but that turned out to be a false alarm when we tried simulating it. We also tried monitoring memory but didn't find anything suspicious either.
Then we looked at the native_stderr.log that was being generated and also enabled verbose GC to check for any abnormal GC activity, but everything it showed was under control. Before each crash, the server took a dump, and the reason cited was "Type=Segmentation error vmState=0x00000000". For only one crash, I saw "Type=Segmentation error vmState=0x00050000" with more information, like "Method_being_compiled=javax/faces/component/html/HtmlPanelGrid.getOnkeypress()Ljava/lang/String;". This made us believe that it was a problem in the Just-In-Time (JIT) compiler; a workaround was to disable JIT compilation at runtime. However, the IBM docs warn of performance degradation if that's done. This made us look in a little more detail, and I was able to lay my hands on this: http://www-01.ibm.com/support/docview.wss?rs=0&uid=swg24019205
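With five dumps to compare, it helps to pull out just the fields that differed between crashes. The following is a minimal Python sketch, not something we ran at the time: it scans dump or native_stderr.log text for the "Type=... vmState=..." line and any "Method_being_compiled=" line that follows it. The function name summarize_crashes and the record layout are made up for illustration.

```python
import re

# Patterns for the two dump fields quoted in this post.
HEADER = re.compile(r"Type=(?P<type>.+?)\s+vmState=(?P<vmstate>0x[0-9A-Fa-f]+)")
METHOD = re.compile(r"Method_being_compiled=(?P<method>\S+)")


def summarize_crashes(text):
    """Return one record per 'Type=... vmState=...' line found in text.

    A Method_being_compiled line is attached to the most recent crash
    record, since it appears after the Type/vmState line in the dump.
    """
    crashes = []
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            crashes.append({
                "type": header.group("type"),
                "vm_state": header.group("vmstate"),
                "method": None,  # filled in only if the dump names one
            })
        method = METHOD.search(line)
        if method and crashes:
            crashes[-1]["method"] = method.group("method")
    return crashes


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize.py native_stderr.log
    with open(sys.argv[1], errors="replace") as handle:
        for crash in summarize_crashes(handle.read()):
            print(crash["type"], crash["vm_state"], crash["method"] or "-")
```

A crash record that names a method being compiled is the one worth a closer look, since that is what pointed us towards the JIT.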
The gist: some changes introduced in WebSphere 6.1.0.15 made WAS vulnerable to crashes that are easily reproduced during any stress/load testing. PK64529 resolves this issue and is available from the link provided above. The fix is also available with Fix Pack 6.1.0.17, released on 3-Jun-2008.
We didn't upgrade to build 17, but installed the patch, and the load testing continued with no issues. In fact: no failures, no errors, JVM heap usage hovering around 250M (min heap was set at 512M), process CPU usage at about 6%, and satisfactory response times.
So, if you've installed WAS 6.1.0.15 on AIX 5.3, please make sure you have the patch installed as well.
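If you look after more than one server, the check can be scripted. This is only a sketch under assumptions: it treats every level from 6.1.0.15 up to, but not including, 6.1.0.17 as affected (IBM's note only names 6.1.0.15 as where the change was introduced and 6.1.0.17 as where the fix ships), and the function names and the list of installed APARs are hypothetical inputs you would have to collect yourself.

```python
def parse_level(level):
    """Turn a level string such as '6.1.0.15' into a comparable tuple."""
    return tuple(int(part) for part in level.split("."))


# Assumption: affected from 6.1.0.15 until the fix ships in 6.1.0.17.
FIRST_AFFECTED = parse_level("6.1.0.15")
FIRST_FIXED = parse_level("6.1.0.17")


def needs_pk64529(level, installed_apars=()):
    """Return True if this WAS level is in the assumed affected range
    and PK64529 is not among the already installed fixes."""
    in_range = FIRST_AFFECTED <= parse_level(level) < FIRST_FIXED
    return in_range and "PK64529" not in installed_apars


if __name__ == "__main__":
    for level, apars in [("6.1.0.15", []), ("6.1.0.15", ["PK64529"]), ("6.1.0.17", [])]:
        print(level, apars, "-> patch needed:", needs_pk64529(level, apars))
```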