We first saw this in the production environment during a database failover (MSSQL Server cluster), and later reproduced it in testing: processes get 'stuck' on a seemingly random operation. The admin console shows no stalled operations or branches, and the process is running. All operations are completed except the last one, which in several cases was still in status Running; in one case it was Completed, but the next operation was never started.
When the operation is still Running, one would expect to select it and click 'Retry' in the hope that the process will continue, but there seems to be a bug in the "Process Instance Detail" part of the admin console. It gives an error: "This operation cannot be completed without the selection of one or more operations from the list below!"
We can try using 'Terminate' on this operation, and probably on several that follow it (because they would likely stall without data from the first terminated operation). The hope is that the process, which runs in a big user loop, reaches a step where it can recover using the existing data. But this is messy and doesn't always work.
When the operation is in status Completed, the admin cannot do anything at all. The completed operation cannot be retried or terminated because its selection box is grayed out. The process and its data are irrecoverably lost, and the process must be initiated from the start.
I must say we had bigger problems before whenever the database connection was lost, even for a few seconds (an intentional failover from one DB node to another). Now that we have filled in check-valid-connection-sql in the datasource configuration, LiveCycle behaves a lot better, but the problems described above remain.
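For anyone unfamiliar with the setting mentioned above: in JBoss datasource files the element is named check-valid-connection-sql, and it makes the pool run a test query on a connection before handing it out. Below is a minimal sketch of such an entry, assuming a JBoss `*-ds.xml` file with a local-tx-datasource. The JNDI name, host, database name and credentials are placeholders for illustration, not our actual values.

```xml
<datasources>
  <local-tx-datasource>
    <!-- Placeholder JNDI name and connection details (assumptions, not real values) -->
    <jndi-name>ExampleDS</jndi-name>
    <connection-url>jdbc:sqlserver://dbhost:1433;databaseName=exampledb</connection-url>
    <driver-class>com.microsoft.sqlserver.jdbc.SQLServerDriver</driver-class>
    <user-name>example_user</user-name>
    <password>example_password</password>
    <!-- Test query run on a pooled connection before it is handed out;
         a connection that fails the query is discarded by the pool -->
    <check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
  </local-tx-datasource>
</datasources>
```

This only protects against handing out dead connections from the pool. It does nothing for a statement that is already in flight when the failover happens, which may be relevant to the stuck operations.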
How can LC stop executing a process workflow and hang on some operation just because it could not get a valid DB connection for several seconds? Is it a problem with transactions, or something else? What can I do? Where should I look?
We are still on LC ES2 SP3, with JBoss on Windows Server 2003 and MSSQL 2005.
Any help is appreciated.