Unstable session replication in a HA cluster (CF10)

Report · Apr 11, 2014

Hi,

We have tried to create an HA cluster with requests distributed round robin to N instances of ColdFusion. We are NOT using sticky sessions, as we are replicating session state to all CF instances. What we are seeing is that all is fine under low to moderate load; however, under heavy load and at random times the replication fails, and things in session scope stop working. This manifests as users not being able to log in to our application (we store a token in session scope to record logged-in status).

Again, the key point: under low to moderate load it all works fine. Users are directed to random nodes in the cluster and their session is picked up fine, as the session is distributed to all nodes, so I'm pretty confident the config is right.

Linux servers using CF10 with update 12 applied. FusionReactor 5.04 is also running on all instances. Each instance has a 64GB heap, Java 7.0.15 (latest certified).

Firstly, the Apache setup.

workers.properties

worker.list=balancer, jkstatus
worker.jkstatus.type=status

worker.balancer.type=lb
worker.balancer.balance_workers=cfusion_master,cfusion_slave2,cfusion_slave1
worker.balancer.method=R
worker.balancer.sticky_session=False
worker.balancer.ping_mode=A

worker.cfusion_master.type=ajp13
worker.cfusion_master.host=localhost
worker.cfusion_master.port=8012
worker.cfusion_master.max_reuse_connections=250
worker.cfusion_master.lbfactor=100

worker.cfusion_slave2.reference=worker.cfusion_master
worker.cfusion_slave2.port=8014

worker.cfusion_slave1.reference=worker.cfusion_master
worker.cfusion_slave1.port=8013

Now the server.xml from 2 nodes (as an example if I run a 2 node cluster)

One of the configs from a server in the cluster

</Listener>
</Listener>
</Listener>
</Listener>
</Resource>
</GlobalNamingResources>
</Executor>
</Connector>
</Realm>
</Valve>
</Host>
</Manager>
</Membership>
</Receiver>
</Transport>
</Sender>
</Interceptor>
</Interceptor>
</Channel>
</Valve>
</Valve>
</ClusterListener>
</ClusterListener>
</Cluster>
</Engine>
</Connector>
</Service>
</Server>

Config from one of the other nodes

</Listener>
</Listener>
</Listener>
</Listener>
</Resource>
</GlobalNamingResources>
</Executor>
</Connector>
</Realm>
</Valve>
</Host>
</Manager>
</Membership>
</Receiver>
</Transport>
</Sender>
</Interceptor>
</Interceptor>
</Channel>
</Valve>
</Valve>
</ClusterListener>
</ClusterListener>
</Cluster>
</Engine>
</Connector>
</Service>
</Server>

So what do I see in the logs? Well, sometimes I see exceptions like this:

Mar 05, 2014 9:55:19 PM org.apache.catalina.ha.session.DeltaManager messageReceived
SEVERE: Manager [localhost#/]: Unable to receive message through TCP channel
java.lang.IllegalStateException: removeAttribute: Session already invalidated
    at org.apache.catalina.ha.session.DeltaSession.removeAttribute(DeltaSession.java:617)
    at org.apache.catalina.ha.session.DeltaRequest.execute(DeltaRequest.java:171)
    at org.apache.catalina.ha.session.DeltaManager.handleSESSION_DELTA(DeltaManager.java:1347)
    at org.apache.catalina.ha.session.DeltaManager.messageReceived(DeltaManager.java:1293)
    at org.apache.catalina.ha.session.DeltaManager.messageDataReceived(DeltaManager.java:1014)
    at org.apache.catalina.ha.session.ClusterSessionListener.messageReceived(ClusterSessionListener.java:92)
    at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:897)
    at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:878)
    at org.apache.catalina.tribes.group.GroupChannel.messageReceived(GroupChannel.java:278)
    at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:84)
    at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.messageReceived(TcpFailureDetector.java:113)
    at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:84)
    at org.apache.catalina.tribes.group.ChannelCoordinator.messageReceived(ChannelCoordinator.java:253)
    at org.apache.catalina.tribes.transport.ReceiverBase.messageDataReceived(ReceiverBase.java:287)
    at org.apache.catalina.tribes.transport.nio.NioReplicationTask.drainChannel(NioReplicationTask.java:212)
    at org.apache.catalina.tribes.transport.nio.NioReplicationTask.run(NioReplicationTask.java:101)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

I'm unsure why this happens, as Tribes uses certified messaging, so it should have resent, right? In any case, I believe I can change it so messages are not sent asynchronously, which should sort this out.

I see (good) messages like this

Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager startInternal
INFO: Register manager localhost#/ to cluster element Engine with name Catalina
Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager startInternal
INFO: Starting clustering manager at localhost#/
Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager getAllClusterSessions
INFO: Manager [localhost#/], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=68824148, securePort=-1, UDP Port=-1, id={123 126 89 39 96 -59 69 8 -113 79 51 122 25 108 -11 -110 }, payload={}, command={}, domain={}, ]. This operation will timeout if no session state has been received within 60 seconds.
Mar 05, 2014 9:42:20 PM org.apache.catalina.ha.session.DeltaManager waitForSendAllSessions
INFO: Manager [localhost#/]; session state send at 3/5/14 9:42 PM received in 929 ms.
Mar 05, 2014 9:42:20 PM org.apache.catalina.ha.session.JvmRouteBinderValve startInternal
INFO: JvmRouteBinderValve started

So session state does appear to be flying around the cluster. I do nightly restarts of some of the nodes due to another issue I have with an ever-growing heap (separate issue). Interestingly, I also see nodes leave and join the cluster; again this is good (it shows the multicast is working, and also that replication should be working).

Mar 05, 2014 2:30:16 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Verification complete. Member disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=18629101, securePort=-1, UDP Port=-1, id={-2 65 10 -79 53 -75 76 52 -99 63 -90 -120 34 -89 -14 100 }, payload={}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={}, ]]
Mar 05, 2014 2:30:16 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberDisappeared
INFO: Received member disappeared:org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=18629101, securePort=-1, UDP Port=-1, id={-2 65 10 -79 53 -75 76 52 -99 63 -90 -120 34 -89 -14 100 }, payload={}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={}, ]
Mar 05, 2014 2:35:16 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=1083, securePort=-1, UDP Port=-1, id={123 126 89 39 96 -59 69 8 -113 79 51 122 25 108 -11 -110 }, payload={}, command={}, domain={}, ]

So I'm stuck now on how to proceed to establish why, at random times, the replication fails, leading to cluster collapse. Could it be the size of the session? I have a few CFCs stuffed into session scope, but perhaps when the load is high there are too many? Things fail even with a cluster of 2 on one server. Initially I had an 8 node cluster on 2 separate machines, but when it failed I rolled it back to a cluster of 2 instances on the one server to see if that was stable (it's not 100%, which is what I need).

Any advice or pointers gratefully received.

Report · Apr 11, 2014

This is just a wild guess, but is it possible with huge heaps (64GB) that during heavy loads the JVM simply can't do effective garbage collection? Or a side effect of that? And yes, putting too much stuff in session scope with lots of simultaneous users can have adverse side effects; hopefully someone more "in the know" on the inner workings of the JVM can answer that more effectively.

-Carl V.

Report · Apr 13, 2014

A 64GB heap size seems to me to be excessive. What are your maximum (Xmx) and minimum (Xms) heap sizes? How much RAM does your server have, and how much is available to each instance?

Setting Xmx equal to Xms minimizes the frequency of garbage collection. As Carl has suggested, you may be generating lots of session garbage, requiring collection. If that is the case, then you will have to set Xms to a value lower than Xmx, say, to half of Xmx. That would increase the frequency of garbage collection. However, there is a catch.

Increasing the garbage collection frequency comes at a cost. It results in a decrease in performance. You should therefore test various Xmx, Xms combinations to see which one helps. If it turns out that garbage collection frequency is not the cause of the problem, you will get optimal performance when you set the Xmx and Xms values equal.

Another critical factor is the value of the maximum heap size in relation to the available amount of RAM. That is, in relation to the amount of free RAM, which is typically less than the total amount of RAM.

The maximum heap size should be much less than the available RAM. Remember that the Java Virtual Machine uses more memory than the maximum heap size you allocate. It uses extra memory, for example, to maintain its internal administration of libraries, classes and processes such as garbage collection. Therefore, if the maximum heap size is close in value to the available RAM, a situation will eventually arise where it exceeds the RAM.

When that happens, it compels the Operating System to begin paging. The result is a drastic drop in performance. I would therefore use, for each instance, an Xmx value of about half the available amount of RAM.

Report · Apr 13, 2014

I appreciate you posting, but I believe my issue has nothing to do with the amount of RAM. I have hundreds of GB of real RAM, so 64 is nothing. Also, there is little to no garbage collection happening, as I never use the full heap.

It's the session replication issue that's hurting me, not the RAM used by the instance.

Report · Apr 13, 2014

Lynux wrote:
"... I believe my issue has nothing to do with the amount of RAM. I have hundreds of GB of real RAM, so 64 is nothing."

Thanks to this information, we can rule out memory as a likely cause.

"Also, there is little to no garbage collection happening, as I never use the full heap."

We never can tell. It's all down to the dark arts of the Virtual Machine!

"It's the session replication issue that's hurting me, not the RAM used by the instance."

You yourself may have hit the bull's eye. The issue may be caused by you using asynchronous replication.

An asynchronous request, or fire-and-forget request as it is sometimes called, has a performance advantage over a synchronous request, as the latter waits until the request returns, before proceeding. However, one of the side-effects of asynchronous requests is that, with them, you cannot keep track of a before-and-after sequence of events.

Asynchronous replication is the default operation in session replication, and corresponds to the setting channelSendOptions = "8". It will result in requests that return even before the replicated session has been sent over the wire and reinstantiated on all the other cluster nodes. See the last paragraph in the section 'Cluster Information' of the Apache documentation. That would explain why an attempt was made to process a session even after it had been invalidated.

Report · Apr 14, 2014

"We never can tell. It's all down to the dark arts of the Virtual Machine!"

Well, that's not strictly true. If you turn garbage collection on for the JVM you use in jvm.config, you can indeed see garbage collection by the database, as well as what "generations" are being collected (perm, young etc). We can also visualize the heap, because we are using FusionReactor to see memory usage, and the highest I've seen it is around the 40GB mark. In fact I have a heap dump that's 40GB large that I need to investigate to find out why it's so big (seemingly not collecting).

I'm going to try running with synchronised messaging to see if it helps. I can't do that right now, as we are quite bruised from the rollout and client confidence needs to be addressed first.

In the meantime if anyone else has any suggestions, I'd be happy to hear them.

Report · Apr 14, 2014

I see what you mean. However, when I say 'We never can tell', I am not referring to the amount of garbage collection, but to its frequency. For, the frequency of collection is more likely to impact performance than the amount.

About the 40GB heap dump, you might want to play Sherlock Holmes and investigate whether it is related to the asynchronous replication. Who knows, invalidated sessions might be accumulating, awaiting garbage collection.

There is yet another possible cause of the issue. The package org.apache.catalina may have a bug.

Report · Apr 19, 2014

Lynux, I would agree that the jvm discussion may be a bit of a red herring. I have a couple of broad thoughts for you.

First, one thing you’ve not told us is how many sessions you have (on each instance). Have you viewed that yet, whether with the CF Server Monitor (if indeed you are using CF Enterprise, which you must be if you are using CF’s clustering and replication), or using FusionReactor 5 (which reports it in Metrics>Custom Metrics), or with a tool like the classic ServerStats from learnosity.com?

The thing is, you may be surprised to find that you have many more sessions than you may expect, whether across all the instances or perhaps even within each instance. (And you mentioned having 40g of heap that could not be GC’ed. I’d suspect sessions to be using most of that.)

In my CF server troubleshooting consulting, I help people every week discover an unexpectedly high session count, usually created by some combination of spiders, bots, load balancer pings, monitoring tools, scheduled tasks, security scanners, and so on. That is, any kind of automated request that does not present cookies, so that CF creates a new session on EVERY REQUEST of that sort. And naturally, the longer your session timeout, the longer the memory will be held.
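To make the point about cookie-less clients concrete, here is a minimal sketch in Python (not CF, and not how CF or Tomcat actually track sessions). It only assumes that a server issues a fresh session whenever a request arrives without a session cookie; the function name and the request tuples are made up for illustration.

```python
import uuid


def count_sessions(requests):
    """Count sessions created for a stream of requests.

    requests: iterable of (client_id, returns_cookies) tuples, in arrival order.
    A client that returns cookies reuses its session; one that does not
    (spider, bot, ping tool) gets a brand new session on every request.
    """
    sessions = set()
    cookie_jar = {}  # client_id -> session id that client will send back
    for client_id, returns_cookies in requests:
        session_id = cookie_jar.get(client_id) if returns_cookies else None
        if session_id is None:
            # No cookie presented: the server creates a new session.
            session_id = uuid.uuid4().hex
            sessions.add(session_id)
            if returns_cookies:
                cookie_jar[client_id] = session_id
    return len(sessions)


if __name__ == "__main__":
    browser = [("browser", True)] * 5   # real user, 5 page views
    bot = [("bot", False)] * 5          # spider, 5 page views
    print(count_sessions(browser + bot))
```

With five page views each, the browser accounts for one session and the bot for five, which is why automated traffic can dominate the session count.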

And the more you do in your code at session start to put stuff into the session (trying to be helpful for real users), the more memory will be held in each session. (You mentioned putting CFCs into the session scope. If that’s part of the session initialization, you may want to reconsider whether you really want to do that for such automated requests, which will never use them in subsequent requests.)

Second, and perhaps most important to your main issue of problems with session replication, I find that one of the biggest challenges is that this problem is multiplied! All the sessions on all the instances must be replicated to each other. So for instance if you had 3 servers in a balanced cluster which might (on their own) each get 10k sessions (for the reasons I describe above), if you replicate them to each other, now there will be 30k sessions on each server (because all 10k of sessions on each must be replicated to the other in case a user might get swapped to it).

Well now what if you found that your application and traffic really would cause upwards of 100k sessions on each instance (I’ve seen it more times than most would believe)? Well, if 3 such load balanced instances (getting that traffic and having that session count) are now set to replicate amongst each other, that’s now 300k sessions!

Even if you really had just 30k sessions (heck, perhaps even if just 9k, if 3k would have otherwise existed on each of the servers if not replicated), that’s still a LOT of network traffic being sent back and forth among the instances. Keep in mind that the replication is whenever a session is created, changed, or invalidated. (And if you may be replicating across machines rather than just among instances within a machine, that’s even more potentially significant in terms of network load.)
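The arithmetic above can be sketched in a few lines of Python. This assumes all-to-all replication, where every node ends up holding a copy of every other node's sessions and each change is sent to every other node; the function names are just for illustration.

```python
def sessions_held_per_node(nodes, own_sessions_per_node):
    """Sessions each node holds once all nodes replicate to each other."""
    return nodes * own_sessions_per_node


def copies_sent_per_change(nodes):
    """Copies sent over the wire each time a session is created, changed, or invalidated."""
    return nodes - 1


if __name__ == "__main__":
    for own in (3_000, 10_000, 100_000):
        held = sessions_held_per_node(3, own)
        print(f"3 nodes x {own} own sessions -> {held} held on each node")
    print("copies sent per session change with 3 nodes:", copies_sent_per_change(3))
```

So the memory held on each node grows with the total session count across the cluster, and the network traffic per session change grows with the number of nodes.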

And the replication is also a lot of work for the underlying jee engine which is responsible for the replication, which in your case is Tomcat as you say you are using CF10, which could also put stress on the JVM, on CF, on the OS, on the CPU, etc.

But I appreciate that your issue is not so much a seeming issue with “performance” of replication, but rather a failure of it. Still, perhaps this is the real root cause. Just thought I’d share it for your consideration, or for future readers of the thread.

/charlie

PS You also mention using FusionReactor 5.0.4. I’ll note that there have been a few free point releases since then, to 5.0.9 which is rather important to update to for various reasons. Then they also since came out with 5.1 (and are now up to 5.1.1). I’m pretty sure an upgrade from 5.0.x to 5.1.x would be free. But even if you may regard that as too big a jump for some reason, I would recommend you update to 5.0.9. See the release notes at their site for more.

/Charlie (troubleshooter, carehart.org)

Report · Apr 20, 2014

Charlie, broad thoughts indeed. Quite valuable.

I should like to say that there was really no JVM red-herring. There was simply insufficient information at the outset about the amount of JVM and RAM. That naturally made them the prime suspects.

Once Lynux supplied the information about memory, the discussion immediately moved away from the JVM, RAM and garbage-collection. Asynchronous replication or bug is where it's now at.

In fact, I have just googled tomcat catalina "java.lang.IllegalStateException: removeAttribute: Session already invalidated". The search-results seem to support my suspicions. We may be dealing with an old, unresolved issue in Tomcat, a side-effect of asynchronous replication in Tomcat or both.

Report · Apr 22, 2014

Thank you VERY much for the response, gents. I didn't reply sooner as it was a bank holiday weekend here.

We are running enterprise CF10, and we are running FR Revision: 5.0.9, Build: fusionreactor.3944.37772. I didn't know about the session thing in FR, and it's working a treat. We don't use the enterprise manager thing in CFIDE because we use FR; it seemed redundant to run both.

I'm going to watch that screen in FR like a hawk. It's early in the morning and already we're up to 1.5K sessions. Considering real actual users start at 9AM, I find it very interesting that we are already at this figure. We do run a considerable number of scheduled tasks throughout the day (and night), so I'm going to be VERY interested to see how many sessions we create during the day.

Report · Jul 10, 2014

I'm sorry, I don't have anything constructive to add, other than the fact that I am experiencing this as well. Four physical boxes, each with two child instances. We just get a ton of "java.lang.IllegalStateException: removeAttribute: Session already invalidated" errors, as well as "java.io.StreamCorruptedException: unexpected end of block data". Running latest CF update 13, and out of the box JRE, 1.6.0_29. Hoping someone eventually figures this out.

Report · Jul 11, 2014

I believe you can sort out the "Session already invalidated" messages by changing the

to

I've not tried it, but I think it would work. So disappointed this is not something Adobe seems keen on fixing.

Report · Jul 12, 2014

So Lynux, that’s an interesting sounding solution. Would be great if it made some difference for Guitsboy. We’ll see. I notice that you say you’ve not yet tried it, though, and fair enough. Thanks for offering it.

But I’m curious: did you ever resolve your original problem? And if not, hopefully you saw the note I just wrote to Guitsboy, asking him something that may well interest you if you still have your problem. On rereading this thread, from back in April, I’ve also had some new thoughts come to mind which I’ll share, if it may help either of you, or others with this seeming same issue.

To remind readers who may not want to review the whole thread, you had said originally that “all is fine with low to moderate load, however under heavy load and at random times the replication fails”, and that this failure “manifests in users not being able to login to our application (we store a token in session scope to store logged in status)”. Then it seems you may have concluded that things were down to the error you were seeing in the logs:

Mar 05, 2014 9:55:19 PM org.apache.catalina.ha.session.DeltaManager messageReceived
SEVERE: Manager [localhost#/]: Unable to receive message through TCP channel
java.lang.IllegalStateException: removeAttribute: Session already invalidated

And now guitsboy reports seeing the same error.

But here’s the thing that came to mind for me tonight as I read this: you know, there can be a lot of other reasons that users can feel that they “lose their session”, even without using clustering and replication.

There are issues related sometimes to folks having duplicate session tokens (which can happen for various reasons, including perhaps ones in your code, and maybe only when people visit pages in a certain pattern, so that it happens only occasionally and not always).

Then there is an issue that can arise if you are supporting both http and https requests, where Tomcat (not CF) balks at that (see http://www.petefreitag.com/item/817.cfm, and though he shows a solution in IIS you should be able to implement a similar one in mod_rewrite if that was indeed perhaps your issue).

So I’d be curious if either of you may be in a position to have a failing client use any sort of client tool (like Chrome’s dev tools, or Firebug or Firefox’s new builtin tools, or IE’s f12 dev tools) to watch the communication between the client and the server, and especially to watch the cookies being sent. You guys both mention using jsessionid. Is it the same cookie value on each request? And/or is there more than one jsessionid? I’ve seen it happen. There could be differences in the domain property reported for the cookie, the httponly property, the secure property, and so on. And you really do want to view the value sent from the client to the server, because if you view the cookie scope on the server a) it may show values set ON the server rather than sent TO the server, and b) it won’t show these additional cookie properties that were in play on the client. CF only sees the cookie name and value.
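If you capture the raw Cookie request headers from the browser's dev tools, a quick check along these lines may help spot the two situations I describe: a duplicated jsessionid within one request, or a jsessionid whose value changes between requests. This is only a rough Python sketch; the function names are hypothetical, and comparing cookie names case-insensitively is my assumption for convenience (cookie names are strictly case-sensitive).

```python
from collections import Counter


def parse_cookie_header(header):
    """Split a raw Cookie request header into (name, value) pairs, keeping duplicates."""
    pairs = []
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        pairs.append((name.strip(), value.strip()))
    return pairs


def duplicate_cookie_names(header):
    """Return cookie names that appear more than once in a single request."""
    counts = Counter(name.lower() for name, _ in parse_cookie_header(header))
    return sorted(name for name, n in counts.items() if n > 1)


def session_ids_seen(headers, cookie_name="jsessionid"):
    """Return the distinct values of cookie_name across several requests."""
    seen = set()
    for header in headers:
        for name, value in parse_cookie_header(header):
            if name.lower() == cookie_name:
                seen.add(value)
    return seen


if __name__ == "__main__":
    captured = [
        "JSESSIONID=abc; CFID=1",
        "JSESSIONID=abc; CFID=1; JSESSIONID=def",
    ]
    print(duplicate_cookie_names(captured[1]))
    print(session_ids_seen(captured))
```

More than one distinct value from session_ids_seen for the same client, or any name reported by duplicate_cookie_names, would be worth chasing before blaming replication.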

I’ve helped many people find out that this was the reason for the seeming session loss (and sometimes it was not all requests by all clients but perhaps only some requests for some clients, all on the same server). At least if this is the crux of the problem, you can then tackle WHY it’s happening. There can be many reasons, from code to configuration, so I won’t belabor them now.

But if either of you may be able to confirm this, perhaps we can help you both get a little closer to a real explanation and solution for your problem. Again, I’m just guessing a bit based on what you’ve written. I realize it may be that none of this is the problem and you have hit some other real unrelated bug. But I really feel confident that you ought to try to check this out first, as it’s indeed been the crux of problems for others, without respect to clustering. It seems worth ruling out, so that you don’t get misled chasing the problem on the assumption that it is about clustering.

As always, hope that helps.

/charlie

/Charlie (troubleshooter, carehart.org)

Report · Jul 12, 2014

GuitsBoy, do those errors really correlate to an observable problem, like users being timed out unexpectedly?

It’s just that as I review this thread (opened in April), I’m left wondering now if it could be that the exception message (reported by you and Lynux, the OP) might simply mean that by the time one instance tries to replicate its session data to another instance, the session (for that sessionid) on that other instance has simply timed out. That’s what “invalidated” means. I’d never really considered it before. But since the error says it’s trying to do a “removeAttribute”, indicating that the one instance is telling the other instance to remove a session or key in it, maybe that’s all that this message means.

But I see that Lynux has in fact followed up your note. And I’d like to now ask them this same question (I hope they’re reading this), but I’d also follow up with just a little more back to their original post. And maybe what I ask them there may help you, too, Guitsboy, in resolving whatever may be your real problem.

As always, just trying to help.

/charlie

/Charlie (troubleshooter, carehart.org)

Report · Jul 14, 2014

I'm not noticing any readily observable problem, aside from filling the log files up very quickly. I will keep poking around and researching the issue. It may not be a problem at all, but I hate to see these logs fill up so quickly with such a foreboding error.

Thanks for the response,

-T

Report · Jul 14, 2014

Ah, OK, so not quite the same problem as Lynux, at least in that they had indicated that users were losing sessions, etc.

As for the log messages, perhaps there could be something done to resolve those. I wonder if those may be reflective of more a warning level rather than an error. Do either of you see any log level indicator for these? Is it prefixed with either info, warn, or error? If so, perhaps some log setting tweaking could temper their occurrence.

I realize of course that some may worry that the message indicates a problem, but I’m just saying that if you both might report that there is no seeming “real problem”, perhaps these are just indications of something inconsequential like sessions being replicated to another instance when they have timed out there.

BTW, as for the number of them, it could be that that’s driven by spiders and bots. Many never realize or notice that every page visit by a spider or bot (or ping tool, including your own load balancers) creates a new session. One for every page visit (unlike for a real user, which causes only one session across all visits within a session lifespan). I’m just pointing out that the high level of messages may be due to high levels of sessions, due to high level of bot traffic.

There’s not too much one can do, but knowing it’s an explanation for high session count does sometimes bring at least some relief in the knowledge. (I will say that I’ve helped many change their ping tool URLs so that at least THOSE do not create sessions, by giving the page that’s called its own directory and its own blank or substantially emptier application.cfm/cfc, so that each such request either doesn’t create sessions at all and/or doesn’t do quite as much as the regular application.cfm/cfc for the app.)

But it would be nice to find out if we could maybe temper these session invalidated messages. Will look forward to what any may have to say/report/try.

/charlie

/Charlie (troubleshooter, carehart.org)

Report · Nov 06, 2014

A coworker of mine found the fix, it was a matter of synchronous vs asynchronous session replication in tomcat.

We needed to change channelSendOptions=8 to channelSendOptions=6. This eliminated the overwhelming majority of errors in our logs.

Apache Tomcat 6.0 (6.0.41) - Clustering/Session Replication HOW-TO
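For anyone wondering what the 8 and the 6 actually stand for: channelSendOptions is a bit mask. A small Python sketch that decodes it is below. The flag values are the ones I recall from Tomcat's org.apache.catalina.tribes.Channel constants (only the lowest four bits are listed), so treat them as an assumption and check them against the HOW-TO for your Tomcat version.

```python
# Flag values as recalled from Tomcat's org.apache.catalina.tribes.Channel
# (assumption: verify against your Tomcat version). Only the low four bits are listed.
SEND_OPTIONS = {
    1: "BYTE_MESSAGE",
    2: "USE_ACK",
    4: "SYNCHRONIZED_ACK",
    8: "ASYNCHRONOUS",
}


def decode_channel_send_options(value):
    """Return the names of the known flags set in a channelSendOptions integer."""
    return [name for bit, name in sorted(SEND_OPTIONS.items()) if value & bit]


if __name__ == "__main__":
    for value in (8, 6):
        print(value, decode_channel_send_options(value))
```

Read that way, 8 is the asynchronous flag on its own, while 6 is the two acknowledgement flags combined (2 + 4) with the asynchronous bit cleared, which matches the synchronous vs asynchronous change described above.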
