Windows server 2003
CF10 enterprise - update 11 installed
Apache 2.2 with mod_ssl
After about 8 hours of use eventually the logs will start showing this message and no further requests to coldfusion are possible. Static assets are still served by apache however.
[Sun Jul 21 01:00:11 2013] [1796:1092] [error] ajp_service::jk_ajp_common.c (2470): Failed allocating AJP message buffer
[Sun Jul 21 01:00:11 2013] [1796:1092] [info] jk_handler::mod_jk.c (2748): Service error=-5 for worker=cfusion
[Sun Jul 21 01:01:36 2013] [1796:3492] [error] ajp_service::jk_ajp_common.c (2455): Failed allocating AJP message buffer
[Sun Jul 21 01:01:36 2013] [1796:3492] [info] jk_handler::mod_jk.c (2748): Service error=-5 for worker=cfusion
[Sun Jul 21 01:01:37 2013] [1796:2488] [error] ajp_service::jk_ajp_common.c (2455): Failed allocating AJP message buffer
[Sun Jul 21 01:01:37 2013] [1796:2488] [info] jk_handler::mod_jk.c (2748): Service error=-5 for worker=cfusion
Steps taken to try and address
I think I am out of ideas short of swtiching to IIS. But have reservations based on the fact CF10 had workers problems with IIS as well and this is win2k3 meaning the horrible iis6 and no native rewrite rules.
I don't know what is going wrong for you there and CF with apache is not my environment however some things come to mind you can try.
Are those errors reported in "mod_jk" log (cf10\config\wsconfig\N\) file? Any java errors in "cf-out" or "cf-error" logs (cf10\cfusion\logs)? Reason I ask is tomcat connector is reporting a buffer issue, tomcat uses java and sometimes one of the java memory buffers (heap new perm) could be reaching a limit.
Something else that might be worth trying is add tomcat native library to use tomcat AJP apr rather than AJP bio. Weekend for me so if you want some more details on that reply and I will respond when I am at work.
I favour applying similar values as those listed in the link you mention. One point of contention with those values is the worker.cfusion.connection_pool_timeout=60 in mod_jk properties and connectionTimeout= "60000" in server.xml. Some references suggest a value of 10 minutes not 1 minute so that would be pool_timeout=600 and connectionTimeout= "600000". Note you need to restart CF and apache to apply those tomcat changes.
The errors are in mod_jk.log and there are no errors in out and error logs on any of the instances. The cf logs just stop at the time of the outage in cf (meaning no more requests are making it to server).
Memory seems fine in fusion reactor. But your mention of it being a memory issue aligns with what I have seen when googling the error. I could try increasing permgen to see if that helps, the main heap seems ok though.
Will try the 10 minute timeout.
And yeah would be interested in the native tomcat library option. I have a mixture of railo and acf clients and have no issues with railo clients on mod_jk/w2k3/apache 2.2 (though the publically released mod_jk, not the custom adobe one). So I find this very odd.
With Permanent Generation it can be a good idea to set an initial setting. Windows 2003 32 bit I guess? Suggest use -XX:PermSize=192m -XX:MaxPermSize=256
Something else I neglected to mention about tomcat settings. Like java sometimes the initial or minimum settings are more important than maximum. To that end in mod_jk properties and server.xml and thus CF stability and performance can benifit from:
worker.cfusion.connection_pool_minsize="max"/2 (half maximum setting)
minSpareThreads=""max"/2" (ditto half maximum)
Yeah I always set a min permsize....many years experience tuning and troubleshooting jvm (i also manually assign the size of newgen etc) but this is the first time I've had an issue with the connector. To be honest the heap has been fine, and I would have expected if it was a permgen issue I would have seen the usual signs in the logs. I increased more because I didn't have any better options so willing to try anything
minSpareThreads do you mean in the executor? It's not listed in the documentation as an attribute for the connector.
On that note the docs seem to indicate the maxThreads defined by the executor will override any setting in the connector. So maybe by updating the connector but not the executor based on that blog post, it's actually not doing anything at all?
The maximum number of request processing threads to be created by this Connector, which therefore determines the maximum number of simultaneous requests that can be handled. If not specified, this attribute is set to 200. If an executor is associated with this connector, this attribute is ignored as the connector will execute tasks using the executor rather than an internal thread pool.
First beg my pardon I notice a typo in earlier post, should read:
Suggest use -XX:PermSize=192m -XX:MaxPermSize=256m
Regarding - Unless specified maxthreads=200. What I have suggested is to increase minsparethreads since you can realise a default setting of 10 can be not very many threads waiting in pool and forces tomcat to not reach it's potential always removing threads from memory to reduce it's foot print.
Here is the references for tomcat 7, CF10 using a customised 7 not 6:
Keep in mind while I have many CF10 production servers they are all IIS and my CF10 apache knowledge is little more that setup configure just to see what it looks like.
To apply a minimum thread setting on IIS tomcat properties file could look similar to EG:
worker.cfusion.connection_pool_size = 400
worker.cfusion.connection_pool_timeout = 600
Correspondingly server.xml AJP portion would have this syntax EG:
<Connector port="8012" protocol="AJP/1.3"
Something important to mention because I lack enough apache. The tomcat connector and configuration references have some red apache caveats. Applying thread changes that have worked well in an IIS may not suit apache.
You mention your familiar with JVM tuning monitoring. If you are savvy with using JVM tool jconsole (and lesser so jvisualvm) you could use that to monitor graphically what is happening with the tomcat threads. A bit hard to quickly write that up tho if you are interested I did a talk and demo on that and can find the link to recording.
I will post some tomcat native details soon tho I discuss that as well in above mentioned talk.
Re promised tomcat native library details.
What you could try here is to see if APR relieves the problem. Now don't get me wrong I am just throwing idea's out there for you to try and could be going in the wrong direction. The way I see it is there are not many other log details to go on and you are tackling the issue via1)tomcat adjustments 2)java parameters and I have also suggested 3)monitoring. APR is a tomcat alteration.
When CF starts CF10\cfusion\logs\coldfusion-error.log reports this output EG:
INFO: The APR based Apache Tomcat Native library which allows optimal
performance in production environments was not found on the java.library.path:
INFO: Initializing ProtocolHandler ["ajp-bio-8012"]
INFO: Starting service Catalina
INFO: Starting Servlet Engine: Apache Tomcat/7.0.23
Here is the APR documentation
Download the Windows binary distribution here (or other mirror)
The downloaded ZIP contains two tcnative-1.dll files for 32 and 64 bit. Depending on your Windows 2003 bit copy appropriate tcnative-1.dll to CF10\cfusion\lib . Now when you restart CF the coldfusion-error.log reports:
INFO: Loaded APR based Apache Tomcat Native library version.
INFO: APR capabilities: IPv6 [true], sendfile [true], accept filters [false],
INFO: Initializing ProtocolHandler ["ajp-apr-8012"]
INFO: Starting service Catalina
INFO: Starting Servlet Engine: Apache Tomcat/7.0.23
Perhaps with tomcat AJP switched to use APR you might find successful outcome or even different log details which in turn might lead to a solution.
Thanks Carl, I'll give the native tomcat libraries a shot.
I'll also try the minsparethreads, thanks for pointing me in right direction regarding tomcat version cf ships with. That was a bit of google fail on my part when looking up the doco.
I've used visualvm for troubleshooting in the past, but will give jconsole a try.
As you know from jvisualvm once CF java has JMX parameters applied and jconsole connects, select MBeans (managed beans) tab open Catalina (aka tomcat) - threadpool - ajp-bio-port (or ajp-apr-port depending if native library installed) – attributes then you can open thread count busy timeout to see what is happening with tomcat connector at the CF end.
So still running the cf shipped tomcat library, however with jconsole hooked up.
Crashed with the following:
Permgen is using 25%
Heap is using 25%
currentThreadBusy is 0
currentThreadCount is 10
Clearly not a resource starvation issue.
I guess next step is to try the native tomcat library.
Thanks for your observations. Well indeed that would appear not overly stressed at all.
Using RDP or console Windows TASKMANGER | Processes tab, how much memory Commit Size (you might have to add that column and tick show process for all users) does the coldfusion.exe grow to?
Also might be interesting to know is CF10 using java 6 or 7 (CFadmin > System Information | Java Version and Java VM Name)?
I hope the native tomcat library (APR) helps. Carl.
Last successful request 18:03:11.40 Memory 34%
Server uptime was approx 6 hours 30 mins at crash time
Java Version 1.7.0_15
Coldfusion out log stops at 16:25 (approx 96 mins before requests cease) a rotate occurred but new file is 0 byte.
Painful I agree. At least the error details are consistent tho do not know yet what the errors mean.
Heap is larger than 1.4Gb so I guess Win 03 and CF10 64 bit.
With monitoring the AJP threads double click the currentthreadcount currentthreadbusy and keepalivetimeout numbers (10, 0 & 0) to get a graph. There could still be something going on there unnoticed since those details would not be updating unless you were to keep pressing refresh button.
Further CPU usage drops off at 6pm when crash occurred yet jconsole has not lost connection to CF so java would still seem to be responding something connector or tomcat wise has terminated.
The way I read the error is the same as before trying tomcat native library?
Interesting cf-error log commences having issues 20 minutes before crash. Are those errors new details since a change was applied as I read before no errors were occurring except what noticed in mod_jk log?
This detail (-XX:PermSize=192m -XX:MaxPermSize=256m) mentioned earlier applies to 32 bit it would perhaps not suit 64 bit tho I don't think dealing with a java perm problem here. Just thought I should say for interested readers.
The bufferoverflow is only new error since switching to native (looks like a tomcat bug fixed in latest versions).
Scheduled tasks continue to run
I can still query with jconsole etc.
Fusion reactor is still connected, I can still get metrics via it.
permgen is currently at 512M which is complete overkill but plenty of memory on box and I was merely experimenting to no effect.
cpu drops off because traffic has switched to failover server of the cluster (load balanced by a dedicated hardware load balancer).
I was refreshing the ajp threads in console but i never cause the crash at that specific time. Due to the length of time it takes to crash and the fact it happens at inconvenient times I haven't been able to catch it at that specific time. But I don't really think it would help. If it was a thread overload I can understand, but surely once the threads are free it would be able to resume. But it's a hard unrecoverable crash.
To be honest I really wish adobe didn't cludge the tomcat and give us a modified version. It really obfuscates potential root causes and even limits my ability to work around the problem (e.g mod_proxy).
So altering to AJP to APR has not stopped crash but has given some extra log details. While that does not solve immediate problem it does provide some more leads to follow.
Threads monitoring – that is where turning that in to a graph can help because you can see history if you miss pressing refresh.
Nothing unusual, no pattern.
I've turned on historical graph but don't hold out much hope of it helping.
i think I am left with 2 choices now, convert to railo or upgrade the box to win2k12 and switch to iis.
Railo, not because I am taking my bat and ball and going home but simply because I have I don't exprience this problem working with the standard tomcat/mod_jk install on other clients. And additionally I have more control over the tomcat which gives me more options to solve this problem (e.g changing to mod_proxy_ajp)
IIS/W2k12 This is the other option.
I am not really happy with having to use either option and now I need to weigh up which will give me stability faster.
The AJP graph will be interesting from a technical view point tho it will not stop the outage.
Railo hard for me to say. I know IIS8/W2k12/CF10 works very well for me with some connector modifications applied.