Unresponsive server over time

Report · Jul 03, 2010

I'm trying to track down why our servers become unresponsive over time. After about 3-5 hours CF will start to queue requests and the only thing I can do is restart the services. I'm not seeing any hints in all of our hardware monitoring that it is a hardware related issue. Our database servers are fine, the load on the systems are fine but after about 3-5 hours, they'll just start queuing and are unable to recover on their own without a restart. The only hint I've been able to gather is there will be a large number of locked threads.

Report · Jul 05, 2010

Dustin, you simply need to understand what those "locked threads" are doing. To do that, you need a "stack trace" of a given running request, which reports exactly what (java method) the request is doing at that moment. Actually, you want to take a couple, some time apart (seconds, minutes), to see if it shows being hung on the same java method (which will generally underlie a given CFML tag). The stack trace will often report the line number of a specific line of CFML code, and if that's the same for a given request over multiple stack traces, then that's your smoking gun. It could be any of several things, and it could be in one or several CF templates that has the root cause problem.

Indeed, CF may be only a victim, not the cause, if it's waiting for a reply from something outside of the JVM (which underlies CF), such as if it's doing a CFHTTP, a CFQUERY, a file i/o, or any of a few other rather common tags where, if they don't get a response, the thread will be left in a state where the JVM (and CF) cannot interrupt it, and as you note, the thread is then "hung", which may be for a while or indefinitely.

So how do you get a stack trace? Well, the easiest way is if you have one of the monitoring tools, like the CF Server Monitor in CF 8/9 Enterprise/Developer editions. If you're on an earlier release (or the Standard edition, which has no such monitor), then the next easiest way is using either of the two commercial CF monitoring tools: FusionReactor (fusion-reactor.com) or SeeFusion (seefusion.com). All 3 of these make getting a stack trace just a click of a button.

I offer more details on obtaining (and understanding) stack traces in either of these 3 tools in a presentation I recorded for the CFMeetup recently, available at http://www.carehart.org/presentations/.

There are still other ways to get stack traces for free, at least if you start CF from the command line, or using Java tools (if you install a JDK, though again it may require that you start CF a certain way even to use such tools). The tools above are just a lot easier (and both the commercial ones have free trials).

They also all have an option to allow you to obtain a stack trace automatically (when a certain condition arises that would warrant it) and will email a thread dump (stack traces of all running threads), so that you need not be online at the time of the problem to see the stack traces. They're a little harder to work with than the features above, which look at just one request, because you'd need to wade through the thread dump of all threads to find just those you're interested in (typically threads whose id's are prefixed with jrpp, which refer to CF requests that come in through an external web server). Again, you'd want to compare two such stack traces for a given req2uest over a brief period of time.

Bottom line: there's no real mystery to why a CF server hangs, if you know where to look and how to interpret the info. Besides the resources offered above, I will add as well that some of us do such CF server troubleshooting for a living. Because we do it often, we can often help solve such problems quite quickly. I list several such companies or consultants, including myself, at http://www.cf411.com/#cfassist. Hope that helps.

/charlie

/Charlie (troubleshooter, carehart.org)

Report · Jul 05, 2010

Thanks for the input. Just so happens I've pulled some snapshots while one of the servers was having issues. I see a lot of threads blocked with this at the top:

"jrpp-201" prio=5 tid=357968 BLOCKED
     at coldfusion.server.j2ee.sql.pool.JDBCPool.checkOut(JDBCPool.java:346)
     at coldfusion.server.j2ee.sql.pool.JDBCPool.requestConnection(JDBCPool.java:791)
     at coldfusion.server.j2ee.sql.pool.JDBCManager.requestConnection(JDBCManager.java:125)
     at coldfusion.server.j2ee.sql.JRunDataSource.getConnection(JRunDataSource.java:138)

I'll also see some like this:

"jrpp-327" prio=5 tid=420399 BLOCKED
     at coldfusion.server.j2ee.pool.ObjectPool.checkIn(ObjectPool.java:263)
     at coldfusion.server.j2ee.sql.pool.JDBCPool.returnConnection(JDBCPool.java:817)
     at coldfusion.server.j2ee.sql.pool.JDBCPool.connectionClosed(JDBCPool.java:299)
     at coldfusion.server.j2ee.sql.JRunConnection.sendCloseEvent(JRunConnection.java:307)
     at coldfusion.server.j2ee.sql.JRunConnectionHandle.close(JRunConnectionHandle.java:56)

and a few of these
"jndi-4" prio=5 tid=25 BLOCKED
     at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:382)
     at java.net.ServerSocket.implAccept(ServerSocket.java:453)
     at java.net.ServerSocket.accept(ServerSocket.java:421)
     at jrun.servlet.network.NetworkService.accept(NetworkService.java:356)

The first two make me believe it is a database connection issue, yet even when the server is having issues I can connect to our database servers without issue and see there are few if any running queries.

If it is a database issue I'm not sure how I can go resolving the issue. We have an extensive amount of monitoring in place and nothing shows anything out of the ordinary.

Report · Jul 06, 2010

Right, these suggest it may not be an issue with the DB itself but rather with your connection pool (from CF to the DB), which is why you'd have no problem connecting to the DB on your own. Indeed, it may be that you could connect to the DB from CF using a different datasource pointing at the same DB (may be worth a test, next time it happens). You could also turn off the "maintain connections" option in the CF DSN config, which stops it using connection pooling. I'm pretty sure that takes effect instantly, not requiring a restart, so again worth a test.

As for resolution, that depends on a number of factors. First, as for that "maintain connections" option, there are debates about whether one should or should not use that option. There are also settings there for how many connections there should be, and how long they should last. Of course, sometimes the problem isn't so much the settings but instead some activity on your server that over-taxes the connection. It could be that your site is being hit by a wave of spider/bot requests, that could overwhelm the pool unexpectedly (and only intermittently). This is again where monitoring (and better yet, logging and analyzing the results) can be helpful.

Sticking with the "why is the connection pool overwhelmed" for a moment, you may also be suffering due to problems caused by using a database for Client Variable storage. While it's of course generally recommended, what's often not explained is the impact of something called "global variable updates", which could lead to reads and updates to that DSN on every page request. I've written and spoken about that in resources also offered on my site (the presentations and blog sections of carehart.org.)

Finally, there may also be issues related to what version of CF you're running and what hotfixes (since for various releases of CF there have been hotfixes addressing connection pooling issues). It can also matter what version of the DB drivers (if there may be more updated ones offered by Adobe that address known issues).

I'll note as well that you have shown only a couple of seeming random stack traces. These may be victims, waiting behind some other running request that is really at the root cause (usually the longest-running one). This again is where it's so valuable having a monitor to show you how long requests have been hanging, so you could stack trace the oldest. You refer to these as "snapshot", so I sense you are running the CF Server Monitor (8 or 9 Enterprise). In that snapshot, you can see what request has run the longest, and then look at the stack trace for that. Is it different in any way?

Let us know if any of this helps. (Apologies to those of the twitter generation who may feel this note is 20 times longer than you prefer. The sad truth is that problems like this can't always be answered in just a single short phrase. If it was that easy, people wouldn't struggle so to resolve such problems. There's a reason people like myself can make a living providing support for such problems. There are often several possible explanations and resolutions for any problem, and often on solving one, yet another pops up that was masked. Bottom line, though, CF Servers can indeed run well. Mine runs indefinitely, and yes on Windows, with CF only being restarted when the box needs to be, for Windows updates.)

/charlie

/Charlie (troubleshooter, carehart.org)

Report · Nov 30, 2013

Dustin, I am starting to see something similar with my CF10 setup. Did you ever solve this problem? I have been troublshooting random 503 server downtime issues for about a month yet with no luck, but now I am focusing on my system snapshots that take place whenever I get an Unresponsive Server alert. They are filling with log entries that look like this:

"ajp-bio-8012-exec-397" prio=5 tid=1289 BLOCKED

at coldfusion.server.j2ee.sql.pool.JDBCPool.checkOut(JDBCPool.java:371)

at coldfusion.server.j2ee.sql.pool.JDBCPool.requestConnection(JDBCPool.java:825)

at coldfusion.server.j2ee.sql.pool.JDBCManager.requestConnection(JDBCManager.java:125)

at coldfusion.server.j2ee.sql.JRunDataSource.getConnection(JRunDataSource.java:135)

at coldfusion.sql.CFDataSource.getConnection(CFDataSource.java:42)

at coldfusion.sql.DataSrcImpl.getCachedConnection(DataSrcImpl.java:156)

at coldfusion.sql.DataSrcImpl.getConnection(DataSrcImpl.java:110)

at coldfusion.sql.SqlImpl.execute(SqlImpl.java:363)

at coldfusion.tagext.sql.QueryTag.executeQuery(QueryTag.java:1059)

at coldfusion.tagext.sql.QueryTag.doEndTag(QueryTag.java:688)

ETC.

I can't seem to recreate, but it happens fairly reliably most evenings, and I have yet to trace it back to anything else in terms of the CF10 Tomcat connector (been there, done that!) or other issues such as JVM. So now I am focusing on the database connection and possible code issues related to CFLOCK and that sort of thing. Any ideas where I should look to debug this?? I am on CF10 (update 12) 64-bit Win2008 IIS7 clean install. Medium site.

So where or what should I be looking at? My datasource? My code? My Oracle datasource has default settings: 100 Max Pooled Statements, 20 min timeout, 7 min interval and Maintain Connections = yes.

Btw, in the coldfusion-error-1.log at the exact same time as the snapshot above was taken I see stuff like this, but I think it's just telling me my Server Monitor has killed the threads, which is exactly what I have it set to do during an Unresponsive Server alert, so not sure this is really where I should be looking:

Nov 29, 2013 9:55:35 PM org.apache.catalina.core.StandardWrapperValve invoke

SEVERE: Servlet.service() for servlet [CfmServlet] in context with path [/] threw exception

coldfusion.monitor.event.MonitoringServletFilter$StopThreadException: The request has been canceled by the administrator or by the server.

at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:65)

at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)

at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)

at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)

at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)

at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)

at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)

at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)

at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)

at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928)

at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)

at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:414)

at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:204)

at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)

at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

Exception thrown by error-handling template:

coldfusion.monitor.active.ActiveRequestMonitor$RequestAbortedException: Your request was aborted by the server.

at coldfusion.monitor.event.RequestMonitorEventProcessor.onError(RequestMonitorEventProcessor.java:551)

at coldfusion.filter.ExceptionFilter.handleException(ExceptionFilter.java:179)

at coldfusion.filter.ExceptionFilter.invoke(ExceptionFilter.java:108)

at coldfusion.filter.ClientScopePersistenceFilter.invoke(ClientScopePersistenceFilter.java:28)

at coldfusion.filter.BrowserFilter.invoke(BrowserFilter.java:38)

at coldfusion.filter.NoCacheFilter.invoke(NoCacheFilter.java:58)

at coldfusion.filter.GlobalsFilter.invoke(GlobalsFilter.java:38)

at coldfusion.filter.DatasourceFilter.invoke(DatasourceFilter.java:22)

at coldfusion.filter.CachingFilter.invoke(CachingFilter.java:62)

at coldfusion.CfmServlet.service(CfmServlet.java:219)

at coldfusion.bootstrap.BootstrapServlet.service(BootstrapServlet.java:89)

at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)

at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)

at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:42)