Copy link to clipboard
Copied
Recently installed CF11 on a Windows 2012R2 server running IIS 8.5. There are currently 9 sites configured in IIS, all under separate application pools.
Several times a day, for reasons I have not yet determined, requests for CF pages under any one of the sites are returned very slowly (>10 seconds). Typically, they are returned in <500 ms. Recycling the application pool (or pools) appears to address this issue, which made me think the issue was with IIS. However, requests for static resources from the same IIS sites load fine, so I suspect it is not strictly an IIS issue.
I applied the changes to worker.properties and server.xml recommended by Anit Kumar Panda in ColdFusion 11 IIS Connector Tuning — Adobe ColdFusion Blog:
I'm still seeing slow responsiveness several times a day. I'll continue troubleshooting but thought I'd reach out to the community for ideas/troubleshooting tips.
After additional analysis of the times the slowdowns occurred, it became apparent that the majority of occurrences were between 7 P.M. and midnight, and that response times increased immediately at 7 P.M. A backup process running on another machine that started at 7 P.M. daily was copying data from the CIFS share containing the document roots of each site. The network interface was the bottleneck: due to the high volume of data transferred the server was slower to respond to requests. One of my
...Copy link to clipboard
Copied
Hi,
The CF tomcat components run on Java. You many need to tune some of the Java heap or non heap values. Could be a full garbage collection taking 10 or so seconds.
Java reminders. When a full garbage collection occurs that is a Java stop all process. Everything running on Java ie CF tomcat will appear paused or not responding for the GC duration.
HTH, Carl.
Copy link to clipboard
Copied
Thanks. I am currently using the defaults for JVM heap size:
I can definitely try increasing those values.
I added -Xloggc:{application.home}/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps to the JVM arguments, I'll see how long the GC is taking and if those times correspond to the slowdowns.
Thanks for the suggestions! I'll update this post when I have more info.
Copy link to clipboard
Copied
Be mindful of the GC types.
Full GC (Metadata GC Threshold) - because non heap needs collecting
Full GC (Ergonomics) - because heap needs collecting
Full GC (Last ditch collection) - not good. Could be it takes half second but if 20 are occurring one after the other that's 10 seconds of Full GC
Regards, Carl.
Copy link to clipboard
Copied
We've had a couple additional instances of IIS/CF11 being slow or failing to respond. Checking the GC logs, there are no Full GCs corresponding to the slow/no response from the server. I do see both Full GC (Metadata GC Threshold) and Full GC (Ergonomics) events but nowhere near the times of the problems. The GC (Allocation Failure) events are taking 0.1 seconds at most.
In the most recent case, only one application pool failed to respond. All the others were fine. Restarting that one application pool fixed the problem.
Copy link to clipboard
Copied
Hi,
So does not appear like a Full GC is causing slow down. If nothing else it is good to discount something and move on.
Interesting recycling webserver application pool relieved problem. Perhaps you could do well to enable JMX (Java management extensions) then use JDK tools like JMC (Java mission control) or jconsole to graphically see what is happening with the tomcat connector AJP.
Here is an example of JMC visualising AJP:
Here is Jconsole monitoring same:
Note tomcat “native” installed so AJP-APR rather than AJP-BIO which you will likely see. Those examples are not particularly busy right now, nor do they show tomcat having an issue. Having said that I think you will notice the difference upon tomcat from when the system is operating ok versa going slow. Given a difference can be detected then one can go about figuring out why that is happening or make an adjustment to Java and tomcat to cope.
HTH again, Carl.
Copy link to clipboard
Copied
Thanks for the advice, I'm running JConsole now. Mine does look a little different than yours, presumably because I'm using AJP-BIO. For reference, here are some of my baseline numbers under the ajp-bio-8014 attributes:
I'll post updated values once we experience the slowness/unresponsiveness again. Any other values you'd recommend looking at?
Copy link to clipboard
Copied
Hi,
Will need to wait your feedback on values however I feel the displayed value currentThreadCount 10 is not enough. This is the default tomcat value which corresponds to:
workers.properties worker.cfusion.connection_pool_minsize=10
server.xml AJP section minSpareThreads="10"
When pool minsize / thread minspare are not specified then 10 is configured. Sure the system can spin up more threads into memory to reach your maximum defined value but there will be a cost to performance to make the threads available and subsequently release them to go back to the desired minsize minspare setting.
One other value to know. What is being shown for keepAliveCount?
Regards, Carl.
Copy link to clipboard
Copied
Good to know re: connection_pool_minsize and minSpareThreads. I think with 9 sites, that number should definitely be higher! I'll try 100 and see if that helps.
I didn't find a keepAliveCount attribute under ajp-bio-8014. There was a keepAliveTimeout attribute of 60,000.
I saw on a different forum post you had suggested the AJP-APR connector as a replacement for the AJP-BIO. Any particular reason?
Copy link to clipboard
Copied
>I didn't find a keepAliveCount attribute under ajp-bio-8014
Interesting. I have not monitored CF tomcat with BIO for a long time, preferring APR. Perhaps with BIO jconsole does not see keepAliveCount. Unfortunately I don't have time to alter a system to BIO and check.
Copy link to clipboard
Copied
On July 16th I made a couple of changed recommended by a ColdFusion consultant:
wsconfig had not been run after applying the CF updates. The consultant mentioned that the update instructions include running wsconfig, but I must have overlooked it. I was also unaware that wsconfig made changes other than to the settings I had been adjusting manually. The JVM arguments have changed from Java 7 to Java 8, hence the JVM argument change.
Since that time we have had one similar incident/alert of slow response from the server, but in that case it recovered the following minute without intervention. I will continue to monitor the server responsiveness and update again next week.
Copy link to clipboard
Copied
Interesting, it just happened again (I haven't restarted CF yet, so still at 10 minSpareThreads). The ajp-bio-8014 attribute values were normal when I looked at them (although maybe it had spontaneously recovered by that point), but the threads graph shows a spike:
The peak is 153 threads (live threads is 108), which still seems well within reasonable limits, but that there is a peak at all corresponding with the behavior is very interesting.
Copy link to clipboard
Copied
Some short AJP APR reading:
https://tomcat.apache.org/tomcat-7.0-doc/apr.html
Partly says:
Tomcat can use the Apache Portable Runtime to provide superior scalability, performance
Read into that what you like. On a forum it can be hard to say definitely do this or that. I would prefer to offer APR as a suggestion in replace of BIO but as always folks should do their own dev testing. Having said that you are actively monitoring production so that can be helpful to know if APR assists or not.
Agree with your assessment to change pool minsize / thread minspare to 100. That is two changes, properties which requires IIS restart and xml which requires CF restart – to apply.
That is helpful to notice a difference in monitored values when system went slow. At least you know about something that is measurable.
Those 108 to 153 threads are not necessarily all tomcat AJP.
When that went slow was "Old" part of heap near maximum?
Regards, Carl.
Copy link to clipboard
Copied
Before things get too complicated. Which update to CF 11 are you running? They are currently on HF5 and a number of fixes have been made to previous hotfixes that resolve issues similar to those you experience.
Copy link to clipboard
Copied
Thanks, I should have mentioned that: ColdFusion 11 Update 5.
It's also using the JVM from jdk1.8.0_45, rather than the JVM that is packaged with CF.
Copy link to clipboard
Copied
After additional analysis of the times the slowdowns occurred, it became apparent that the majority of occurrences were between 7 P.M. and midnight, and that response times increased immediately at 7 P.M. A backup process running on another machine that started at 7 P.M. daily was copying data from the CIFS share containing the document roots of each site. The network interface was the bottleneck: due to the high volume of data transferred the server was slower to respond to requests. One of my colleagues added an additional network interface to the file server and the backup process uses the new network interface exclusively. We have since noticed no increases in response times at 7 P.M.
I think that the spontaneous recovery, and recovery due to IIS app pool recycling, was likely due to regression towards the mean--I received Nagios alerts only when response times exceeded 10 seconds, after which response times likely dropped below 10 seconds. One thing that led me to believe this was a ColdFusion issue is that IIS was able to deliver static assets (e.g. image files) during the same slowdowns without any noticeable delay. I am not sure why that would be the case--perhaps it was actually slower, as I did not benchmark response times for static assets (i.e. maybe static assets loaded in 300 ms during slowdowns rather than 100 ms during normal operation).
I can't say that it had nothing to do with IIS 8.5 and CF 11, as the adjustments made based on carl type3's suggestions certainly seemed to improve performance and reduce the number of warnings and alerts! Thanks for all your help. The additional insight into the JVM will definitely be useful.