Wed. Jun 3rd, 2020

Binary Blogger

Are you a 1 or a 0? News, Thoughts and Reviews

Siteminder R12 Performance with a Java Bug – Resolved

7 min read

This is my story about the most complex, deep Siteminder performance issues I have dealt with in my 10 years of working with the product. I’ll detail this out and layout the solution.

We built a brand new Siteminder R12 environment on new hardware last year. 4 policy servers built, 2 separate environments, 2 servers running each environment.

  • Siteminder SP3 CR 2
  • Java JRE 1.6
  • Windows 2008 64-bit
  • Dual Hex Core processors, 32 GB RAM
This is not an under-powered environment. We also put in CA Wily to monitor the Siteminder environment.
Our approach was a parallel setup meaning the new R12 environment would be built along side the existing R6 environment and we would move applications over a few at a time instead of a throw the switch. This way we have far more quality control in case we can into something bad, we could revert back to the other environment very easily.
The R12 environment was running for 4 months perfectly, no issues, nothing. Then we put our biggest application on the environment and they migrated 100,000 users over. When that happened about every 7 to 8 days we would have an unexpected climb in Response Times on the policy server. We go from an average of 20ms to over 500ms in a matter of hours and usually when we were not at peak traffic. Odd. We would recycle the services and everything would return to normal. 7 to 8 days later it would happen again. When the application continued to add users the rate of the issue reoccurring would get more frequent. When the application hit their 100% migration mark, 1.5 million users, we would have to recycle about every 32 hours.
Because of the days between issues it took a very, very long time to figure out the logs we needed, changes to be made, and it was mostly a waiting game between tweaks to see if things worked. Since it appeared to be related to the high traffic, we were never able to recreate it in the lower environments because we were never able to get the load high enough and we were never able to run a test long enough.
We engaged CA and they stepped up to help us out. It took a while, due to the time between issues, but eventually CA was able to rebuild our unique environment in their lab and they were able to successfully recreate the issue. It took CA over 60 hours of stress testing to trigger it, but they did.

Between our logs and theirs they were seeing that the delays were coming from the Policy Server trying to connect to the JVM. In the example below, 7 second delay.

7 second delay connecting to the JVM
[01/12/2012][13:40:51.876][12116][6816][      Look up a cached object.][SmObjCache.cpp:918][][][][CSmObjCache::Lookup]
[01/12/2012][13:40:58.850][12116][6816][      SmJVMSupport, Successfully attached JVM to thread][SmJVMSupport.cpp:112][][][][GetJVMEnv]
CA and our team worked over the next couple weeks getting more log tuning, debug traces, garbage collection details, and config changes trying to figure out why the delay was happening. We tried to open a case with Oracle, the proud new owners of Java, and that was a miserable failure. 3 power house companies with hundreds of millions invested into Oracle technologies all received the same answer when we tried to open a case for Oracle to answer a question on Java. We didn’t have a paid Java support contract and Oracle would not take the case. My company, CA and IBM all got the same response.

So we had to do the next thing and hit Google to find the answer ourselves. Once again CA stepped up and figured out what the problem was.

In Java 1.6 there is a memory leak with Ljava.lang.ThreadGroup. From programming forums on the Internet this Java code does not exit properly and Garbage Collection never cleans it up. Over time the leak will slowly kill the processes and any program using it, i.e. Siteminder. This problem will exist in any R12 version you are running if you have the Option Pack installed and running JRE 1.6 (which is the required version anyway).

The solution on the Internet is that since Oracle says they will not fix this, you have to manually put in the Garbage Collection code yourself to clear it out if you touch this class. CA has informed us they will be adding this fix into a not-so-distant future CR. As of Feb 2012, it’s not GA yet.


BUT, you don’t have to wait for CA to release the CR that will fix this, you can fix this now in Siteminder yourself as long as you are not using the features below.

In Siteminder the two classes that are touching the leaky Java code are –

class com.netegrity.policyserver.requester.
FederationAttributeResolver$FederationAttributeVariableThreadGroup          

 
class com.netegrity.scriptevaluation.scriptactiveexpression.
WebServiceVariableResolver$WebServiceVariableThreadGroup 

These two classes are part of the Option Pack and would only be enabled if you are using Java code for any active expressions or outside processing. If you are not doing either then you should be OK. By default the 3 classes in the properties file are enabled. Even though we are not using the Web Service or Federation variables, the class initializes the leaky Java class on startup. In our case we were only using Static User Context variables and to resolve the leak, we just had to comment out the other two classes.

Here the file that should be changed in order to resolve this memory leak
C:\Program Files\CA\siteminder\config\properties\scriptActiveExpConfig.properties.
Comment out 2 resolvers
         WebServiceVariableResolver
         FederationAttributeResolver

As soon as we did that we noticed a huge change in the internal processing of the Policy Server through Wily. Response times got faster, traffic was more efficient, and it has been over 2 weeks and we haven’t touched the servers since.

End to end I spent 6 1/2 months tracking this problem which turned out to be a Java Memory Leak. In the end my environment is far more tuned and streamlined than any I have worked with in the past. We ripped everything from the OS to Siteminder apart a dozen or more times.

I have worked with big vendors all my career and CA once again showed to me that they are the most customer focused, solution driven vendor out there. If it wasn’t for them I would still have my daily service recycles and no clue to why my super charged systems couldn’t handle the traffic.

I can assure you my use of Java in the future will be heavily pulled back.

Hopefully this solution will help others out there that may be chasing a Siteminder performance issue and have no idea where it’s coming from.

End of Line.

UPDATE 2/23/12 –
Additional Information from the Java GC.

Just wanted to add the snippet from Garbage Collector log that shows the delays in GC that was halting all Java threads.

2012-01-29T22:05:28.419-0500: 159635.785: [Full GC [PSYoungGen: 32K->0K(58112K)] [PSOldGen: 268600K->233429K(262208K)] 268632K->233429K(320320K) [PSPermGen: 5608K->5595K(16384K)], 13.2658978 secs] [Times: user=13.03 sys=0.14, real=13.27 secs]
Total time for which application threads were stopped: 13.3080858 seconds
2012-01-29T23:34:23.029-0500: 164970.402: [GC [PSYoungGen: 57376K->96K(57472K)] 300602K->243330K(319680K), 1.2167912 secs] [Times: user=1.25 sys=0.08, real=1.22 secs]
Total time for which application threads were stopped: 1.2173181 seconds
2012-01-29T23:38:34.716-0500: 165222.090: [GC [PSYoungGen: 47056K->96K(56128K)] 290770K->243818K(318336K), 1.3679557 secs] [Times: user=1.44 sys=0.11, real=1.38 secs]
Total time for which application threads were stopped: 1.3849995 seconds
Application time: 27.9614079 seconds
2012-01-29T23:39:04.060-0500: 165251.428: [GC [PSYoungGen: 56096K->176K(56704K)] 299818K->243926K(318912K), 1.0524796 secs] [Times: user=1.00 sys=0.19, real=1.05 secs]
Total time for which application threads were stopped: 1.0549411 seconds
Application time: 22.8069970 seconds

2012-01-30T01:06:19.372-0500: 170486.746: [GC [PSYoungGen: 50224K->96K(55488K)] 303207K->253079K(317696K), 5.5103140 secs] [Times: user=3.91 sys=0.31, real=5.52 secs]
Total time for which application threads were stopped: 5.5591853 seconds
Application time: 28.1509650 seconds
2012-01-30T01:06:53.044-0500: 170520.417: [GC [PSYoungGen: 55456K->96K(55616K)] 308439K->253099K(317824K), 0.1145933 secs] [Times: user=0.08 sys=0.09, real=0.13 secs]
Total time for which application threads were stopped: 0.1189481 seconds
Application time: 24.6126854 seconds
2012-01-30T01:07:17.779-0500: 170545.144: [GC [PSYoungGen: 55456K->48K(57984K)] 308459K->253071K(320192K), 0.0745779 secs] [Times: user=0.08 sys=0.00, real=0.08 secs]
Total time for which application threads were stopped: 0.0750492 seconds
Application time: 90.4199492 seconds
2012-01-30T01:08:48.529-0500: 170635.903: [GC [PSYoungGen: 57776K->80K(57856K)] 310799K->253127K(320064K), 34.3953746 secs] [Times: user=16.61 sys=5.58, real=34.42 secs]
Total time for which application threads were stopped: 34.6831882 seconds
Application time: 84.4596699 seconds
2012-01-30T01:10:47.482-0500: 170754.841: [GC [PSYoungGen: 57808K->48K(57920K)] 310855K->253111K(320128K), 5.5792954 secs] [Times: user=2.63 sys=1.88, real=5.61 secs]
Total time for which application threads were stopped: 5.8194036 seconds
Application time: 89.5306510 seconds
2012-01-30T01:12:22.794-0500: 170850.161: [GC [PSYoungGen: 57776K->176K(57920K)] 310839K->253407K(320128K), 2.3820469 secs] [Times: user=1.97 sys=0.01, real=2.39 secs]
Total time for which application threads were stopped: 2.4228286 seconds
Application time: 22.0421249 seconds

2012-01-30T01:54:14.372-0500: 173361.732: [Full GC [PSYoungGen: 128K->0K(45760K)] [PSOldGen: 258100K->258215K(286656K)] 258228K->258215K(332416K) [PSPermGen: 5598K->5598K(16384K)], 74.6088031 secs] [Times: user=62.50 sys=1.83, real=74.61 secs]
Total time for which application threads were stopped: 74.6484953 seconds

Copyright © All rights reserved. | Newsphere by AF themes.