Siteminder R12 Performance with a Java Bug – Resolved6 min read
This is my story about the most complex, deep Siteminder performance issues I have dealt with in my 10 years of working with the product. I’ll detail this out and layout the solution.
We built a brand new Siteminder R12 environment on new hardware last year. 4 policy servers built, 2 separate environments, 2 servers running each environment.
- Siteminder SP3 CR 2
- Java JRE 1.6
- Windows 2008 64-bit
- Dual Hex Core processors, 32 GB RAM
Between our logs and theirs they were seeing that the delays were coming from the Policy Server trying to connect to the JVM. In the example below, 7 second delay.
So we had to do the next thing and hit Google to find the answer ourselves. Once again CA stepped up and figured out what the problem was.
In Java 1.6 there is a memory leak with Ljava.lang.ThreadGroup. From programming forums on the Internet this Java code does not exit properly and Garbage Collection never cleans it up. Over time the leak will slowly kill the processes and any program using it, i.e. Siteminder. This problem will exist in any R12 version you are running if you have the Option Pack installed and running JRE 1.6 (which is the required version anyway).
The solution on the Internet is that since Oracle says they will not fix this, you have to manually put in the Garbage Collection code yourself to clear it out if you touch this class. CA has informed us they will be adding this fix into a not-so-distant future CR. As of Feb 2012, it’s not GA yet.
BUT, you don’t have to wait for CA to release the CR that will fix this, you can fix this now in Siteminder yourself as long as you are not using the features below.
In Siteminder the two classes that are touching the leaky Java code are –
These two classes are part of the Option Pack and would only be enabled if you are using Java code for any active expressions or outside processing. If you are not doing either then you should be OK. By default the 3 classes in the properties file are enabled. Even though we are not using the Web Service or Federation variables, the class initializes the leaky Java class on startup. In our case we were only using Static User Context variables and to resolve the leak, we just had to comment out the other two classes.
As soon as we did that we noticed a huge change in the internal processing of the Policy Server through Wily. Response times got faster, traffic was more efficient, and it has been over 2 weeks and we haven’t touched the servers since.
End to end I spent 6 1/2 months tracking this problem which turned out to be a Java Memory Leak. In the end my environment is far more tuned and streamlined than any I have worked with in the past. We ripped everything from the OS to Siteminder apart a dozen or more times.
I have worked with big vendors all my career and CA once again showed to me that they are the most customer focused, solution driven vendor out there. If it wasn’t for them I would still have my daily service recycles and no clue to why my super charged systems couldn’t handle the traffic.
I can assure you my use of Java in the future will be heavily pulled back.
Hopefully this solution will help others out there that may be chasing a Siteminder performance issue and have no idea where it’s coming from.
End of Line.
UPDATE 2/23/12 –
Additional Information from the Java GC.
Just wanted to add the snippet from Garbage Collector log that shows the delays in GC that was halting all Java threads.
2012-01-29T22:05:28.419-0500: 159635.785: [Full GC [PSYoungGen: 32K->0K(58112K)] [PSOldGen: 268600K->233429K(262208K)] 268632K->233429K(320320K) [PSPermGen: 5608K->5595K(16384K)], 13.2658978 secs] [Times: user=13.03 sys=0.14, real=13.27 secs]
Total time for which application threads were stopped: 13.3080858 seconds
2012-01-29T23:34:23.029-0500: 164970.402: [GC [PSYoungGen: 57376K->96K(57472K)] 300602K->243330K(319680K), 1.2167912 secs] [Times: user=1.25 sys=0.08, real=1.22 secs]
Total time for which application threads were stopped: 1.2173181 seconds
2012-01-29T23:38:34.716-0500: 165222.090: [GC [PSYoungGen: 47056K->96K(56128K)] 290770K->243818K(318336K), 1.3679557 secs] [Times: user=1.44 sys=0.11, real=1.38 secs]
Total time for which application threads were stopped: 1.3849995 seconds
Application time: 27.9614079 seconds
2012-01-29T23:39:04.060-0500: 165251.428: [GC [PSYoungGen: 56096K->176K(56704K)] 299818K->243926K(318912K), 1.0524796 secs] [Times: user=1.00 sys=0.19, real=1.05 secs]
Total time for which application threads were stopped: 1.0549411 seconds
Application time: 22.8069970 seconds
2012-01-30T01:06:19.372-0500: 170486.746: [GC [PSYoungGen: 50224K->96K(55488K)] 303207K->253079K(317696K), 5.5103140 secs] [Times: user=3.91 sys=0.31, real=5.52 secs]
Total time for which application threads were stopped: 5.5591853 seconds
Application time: 28.1509650 seconds
2012-01-30T01:06:53.044-0500: 170520.417: [GC [PSYoungGen: 55456K->96K(55616K)] 308439K->253099K(317824K), 0.1145933 secs] [Times: user=0.08 sys=0.09, real=0.13 secs]
Total time for which application threads were stopped: 0.1189481 seconds
Application time: 24.6126854 seconds
2012-01-30T01:07:17.779-0500: 170545.144: [GC [PSYoungGen: 55456K->48K(57984K)] 308459K->253071K(320192K), 0.0745779 secs] [Times: user=0.08 sys=0.00, real=0.08 secs]
Total time for which application threads were stopped: 0.0750492 seconds
Application time: 90.4199492 seconds
2012-01-30T01:08:48.529-0500: 170635.903: [GC [PSYoungGen: 57776K->80K(57856K)] 310799K->253127K(320064K), 34.3953746 secs] [Times: user=16.61 sys=5.58, real=34.42 secs]
Total time for which application threads were stopped: 34.6831882 seconds
Application time: 84.4596699 seconds
2012-01-30T01:10:47.482-0500: 170754.841: [GC [PSYoungGen: 57808K->48K(57920K)] 310855K->253111K(320128K), 5.5792954 secs] [Times: user=2.63 sys=1.88, real=5.61 secs]
Total time for which application threads were stopped: 5.8194036 seconds
Application time: 89.5306510 seconds
2012-01-30T01:12:22.794-0500: 170850.161: [GC [PSYoungGen: 57776K->176K(57920K)] 310839K->253407K(320128K), 2.3820469 secs] [Times: user=1.97 sys=0.01, real=2.39 secs]
Total time for which application threads were stopped: 2.4228286 seconds
Application time: 22.0421249 seconds
2012-01-30T01:54:14.372-0500: 173361.732: [Full GC [PSYoungGen: 128K->0K(45760K)] [PSOldGen: 258100K->258215K(286656K)] 258228K->258215K(332416K) [PSPermGen: 5598K->5598K(16384K)], 74.6088031 secs] [Times: user=62.50 sys=1.83, real=74.61 secs]
Total time for which application threads were stopped: 74.6484953 seconds
Binary Blogger has spent 20 years in the Information Security space currently providing security solutions and evangelism to clients. From early web application programming, system administration, senior management to enterprise consulting I provide practical security analysis and solutions to help companies and individuals figure out HOW to be secure every day.
Follow Me On Twitter