Sunday, July 30, 2006

J2EE JCA Resource Adapters: Poisonous pools

Introduction

As a developer of a JCA Resource Adapter (RA), you're responsible for everything that happens between the EIS and the EJB, including the handling of connection failures so that poisoned pools are avoided.

Wait! Too many acronyms in one sentence? Poisoned pools? What am I talking about? Here's a short refresher. JCA is the Java Connector Architecture and defines how Enterprise Java Beans (your application) can communicate with Enterprise Information Systems (EIS). Examples of Enterprise Information Systems are ERP systems, CRM systems, as well as other enterprise systems such as databases and JMS. The "conduit" between the EJB and the EIS is the Resource Adapter (RA). The communication can originate both from the EIS and from the EJB. The former is called inbound, the latter is outbound. In this write-up I'm looking at outbound only.

Creating a connection from your application to an EIS is often expensive. That is why the container (the application server) provides connection pooling, so that connections can be reused rather than recreated. The downside is a risk that faulty connections accumulate in the pool, causing a poisoned pool. In such a situation the application can no longer communicate with the EIS.

Seems simple enough, doesn't it? Let's look a bit closer...

Typical time scales

One of the services that an application server provides to applications is the pooling of resources. As such, when your application uses an outbound connection of a resource adapter, the application server will maintain a pool of connections. When your application needs a connection, the application server tries to satisfy this request first by checking the pool of idle connections; if there are no idle connections, a new connection is created or the application is blocked until a connection is returned to the pool. When the application closes a connection, it is returned to the pool.
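
The borrow-and-return cycle described above can be sketched in a few lines of Java. This is a deliberately minimal, hypothetical pool (the class SimplePool and its method names are made up for illustration); real application servers add validation, timeouts, transaction association and much more.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Hypothetical, heavily simplified pool; not how any particular server implements it. */
class SimplePool<C> {
    private final Deque<C> idle = new ArrayDeque<>();
    private final Supplier<C> factory;   // the expensive "create a connection" step
    private final int maxSize;
    private int total;                   // connections created so far

    SimplePool(Supplier<C> factory, int maxSize) {
        this.factory = factory;
        this.maxSize = maxSize;
    }

    /** Prefer an idle connection; otherwise create one, or block until one is returned. */
    synchronized C borrow() {
        while (idle.isEmpty() && total >= maxSize) {
            try {
                wait();                  // pool exhausted: block the caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a connection", e);
            }
        }
        if (!idle.isEmpty()) {
            return idle.pop();           // cheap path: reuse an idle connection
        }
        C created = factory.get();       // expensive path: create a new connection
        total++;
        return created;
    }

    /** Called when the application closes its connection. */
    synchronized void giveBack(C connection) {
        idle.push(connection);
        notify();                        // wake up one blocked borrower, if any
    }

    synchronized int created() {
        return total;
    }
}
```

Note that nothing in this sketch checks whether a returned connection still works, which is exactly how faulty connections end up back in the pool.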

Creating a new connection typically involves creating one or more new TCP/IP connections, authentication by the EIS, creating an internal session in the EIS and its associated memory structures and data, etc. This makes creating a new connection expensive: it usually takes on the order of 50-300 ms. These expensive operations are avoided when an idle connection is reused: the time scale of reuse is measured in microseconds rather than milliseconds. Next to consider is how long your application uses a connection, typically in the range of 1-5 ms.

To show the effects of connection pooling on throughput, let's assume that your application takes 3 ms to process a request, and that the time it takes to create a new connection is 300 ms, while re-using a connection takes 0.03 ms. The processing time is 3.03 ms with pooling, and 303 ms without pooling. A difference of a factor of one hundred! Sure, I made up the numbers in this example, but they are likely close to what you'll encounter in everyday practice.
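
The arithmetic behind these numbers is simple enough to put in code. The sketch below uses the made-up figures from the text; the class and method names are hypothetical, and the requests-per-second figure assumes a single thread handling requests one after another.

```java
class PoolingCost {
    /** Time per request in ms: obtaining a connection plus processing the request. */
    static double perRequestMillis(double acquireMillis, double processMillis) {
        return acquireMillis + processMillis;
    }

    /** Requests per second for one thread that handles requests back to back. */
    static double requestsPerSecond(double perRequestMillis) {
        return 1000.0 / perRequestMillis;
    }

    public static void main(String[] args) {
        double withPool = perRequestMillis(0.03, 3.0);      // reuse an idle connection
        double withoutPool = perRequestMillis(300.0, 3.0);  // create a connection every time
        System.out.println("with pooling (ms):    " + withPool);
        System.out.println("without pooling (ms): " + withoutPool);
        System.out.println("slowdown factor:      " + withoutPool / withPool);
        System.out.println("req/s with pooling:   " + requestsPerSecond(withPool));
    }
}
```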

In addition to the sub-second time scale of connection use and creation, there is another time scale to consider: the typical duration of a failure. Communication failures are most often caused by the EIS becoming temporarily unavailable. This can have two causes: a loss of network connectivity, or a restart of the EIS. The latter is more probable and will be considered here as the typical failure scenario. Restarting an EIS typically takes from half a minute to several minutes. It is important to keep this time scale in mind when considering error handling strategies.

Mechanics of connection pooling

An outbound connection from the application to the EIS is represented by a ManagedConnection. The ManagedConnection holds the physical connection to the EIS. The lifecycle of a ManagedConnection is under control of the application server: a resource adapter creates a ManagedConnection when the application server tells it to; likewise the resource adapter destroys a connection only when instructed to do so by the application server.

A problem occurs when there is a communication failure with the EIS. For example, if a resource adapter connects to an external EIS, and this external server is restarted, the connections in the pool are all invalid. If the application were to use one of these connections, a failure would certainly occur. The failure would be propagated to your application code through an exception. A likely result would be that the transaction would be rolled back, and that the operation would be attempted again. The application server may not be able to distinguish this communication failure due to a faulty connection from other errors, so it may use the same faulty connection again on the next attempt, thereby ensuring that the same problem will happen for the next transaction. Effectively, the whole application has become inoperable because of the “poisoned pool”. To break out of this cycle, the resource adapter should let the application server know that connections are faulty, so that the application server can make the resource adapter create a new connection and avoid putting faulty connections back in the pool. There are two ways to do this:

  • Signal to the application server that a connection is no longer valid
  • Respond negatively when the application server asks whether a connection is valid

Signalling the application server that a ManagedConnection is faulty

After the application server instructs the Resource Adapter to create a new ManagedConnection, it calls the following method on that ManagedConnection:

public interface ManagedConnection
{
    void addConnectionEventListener(ConnectionEventListener listener);

    // other methods omitted for clarity
}

The managed connection uses this ConnectionEventListener object to notify the application server of connections being closed. This object can also be used to let the application server know that the connection is faulty using the CONNECTION_ERROR_OCCURRED event. Upon receiving this event, the application server will typically destroy the connection immediately.

This approach can only be used if the resource adapter has some way of finding out when the connection is broken. In practice this turns out to be quite difficult: most resource adapters are not written from scratch but rather make use of some client jar that takes care of the communication with the EIS. Often, the vendor of the resource adapter is not the vendor of the EIS, and even if this were the case, the vendor of the EIS most likely needs to make a client jar available independent of a resource adapter anyway. It is not uncommon that these client jars don't provide any mechanism to propagate connection failures to the caller. For instance, if the EIS is JMS, the client jar will expose only the JMS API and there is no way in the JMS API to tell the caller of a method that the physical connection is faulty.

Because the application server will destroy the connection immediately upon receiving the CONNECTION_ERROR_OCCURRED event and will roll back the transaction, it is important that the ManagedConnection fires this event only for a faulty connection, and never as a result of an application error.
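
To make the distinction concrete, here is a minimal sketch of a managed connection that reports only communication failures. So that it compiles without the connector API jar, it uses a reduced stand-in interface (ErrorListener) instead of javax.resource.spi.ConnectionEventListener; the names SketchManagedConnection, EisCall and isCommunicationFailure() are hypothetical, and treating IOException as the only sign of a broken connection is an assumption that will differ per EIS.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Stand-in for javax.resource.spi.ConnectionEventListener, reduced to one callback. */
interface ErrorListener {
    void connectionErrorOccurred(Exception cause);
}

/** Hypothetical call into the EIS client runtime. */
interface EisCall<T> {
    T run() throws Exception;
}

class SketchManagedConnection {
    private final List<ErrorListener> listeners = new CopyOnWriteArrayList<>();

    void addConnectionEventListener(ErrorListener listener) {
        listeners.add(listener);
    }

    /** Assumption: in this sketch only I/O errors indicate a broken physical connection. */
    static boolean isCommunicationFailure(Exception e) {
        return e instanceof IOException;
    }

    /** Runs a client call; reports communication failures, then rethrows every exception. */
    <T> T invoke(EisCall<T> call) throws Exception {
        try {
            return call.run();
        } catch (Exception e) {
            if (isCommunicationFailure(e)) {
                // With the real API this would be roughly:
                //   listener.connectionErrorOccurred(
                //       new ConnectionEvent(this, ConnectionEvent.CONNECTION_ERROR_OCCURRED, e));
                for (ErrorListener listener : listeners) {
                    listener.connectionErrorOccurred(e);
                }
            }
            throw e;   // the application still sees the original exception
        }
    }
}
```

Application errors (an IllegalArgumentException in this sketch) pass through without notifying the listener, so the application server keeps the connection in the pool.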

Fortunately there are alternatives to the CONNECTION_ERROR_OCCURRED mechanism.

The ValidatingManagedConnectionFactory

Rather than telling the application server that the connection is faulty, the managed connection can also wait for the application server to ask the managed connection whether it is still a valid connection. To make that work, the managed connection factory needs to implement the ValidatingManagedConnectionFactory interface:

public interface ValidatingManagedConnectionFactory
{
    Set getInvalidConnections(Set connectionSet) throws ResourceException;
}

The application server will check if the managed connection factory implements this interface, and if so, it will call the getInvalidConnections() method periodically.

Again it is often a problem for the managed connection factory to know if a connection is valid or not. For example, if the resource adapter wraps a database connection, there is often no way to find out if a connection is still "live" or not. E.g. on a JDBC connection the isClosed() method does not return any status information on the physical connection but only returns whether the close() method was called.

Passive negative checks
One way for a managed connection to keep track of possible connection problems is to monitor exceptions being thrown from the client runtime to the application. If there is a way for the ManagedConnection to discern application errors (e.g. a syntax error in a prepared statement) from communication failures, the managed connection can assume that it may be faulty if the count of communication failures is greater than zero. For example, if the resource adapter wraps a JMS provider, it could reasonably assume that exceptions from methods like send(), publish() etc. indicate connectivity problems.
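
A minimal sketch of this bookkeeping, assuming the adapter can already tell communication failures apart from application errors. CountingConnection and NegativeCheckFactory are hypothetical names; the static method only mirrors the shape of getInvalidConnections() and uses generics for readability, whereas the real interface works with raw Sets.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical managed connection that counts suspected communication failures. */
class CountingConnection {
    private final AtomicInteger failures = new AtomicInteger();

    /** Called by the adapter when the client runtime throws from e.g. send() or publish(). */
    void recordCommunicationFailure() {
        failures.incrementAndGet();
    }

    /** A connection is suspect as soon as one communication failure was seen. */
    boolean isSuspect() {
        return failures.get() > 0;
    }
}

class NegativeCheckFactory {
    /** Returns the subset of the given connections that have seen a failure. */
    static Set<CountingConnection> getInvalidConnections(Set<CountingConnection> candidates) {
        Set<CountingConnection> invalid = new HashSet<>();
        for (CountingConnection connection : candidates) {
            if (connection.isSuspect()) {
                invalid.add(connection);
            }
        }
        return invalid;
    }
}
```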

Passive positive checks
If it is not possible to passively monitor connection failures, perhaps it is possible to keep track of when a connection was used without any problem for the last time. If a connection was not used for more than say 30 seconds, you could mark that connection as invalid. Of course there is a risk that the application uses the connection less often than once every 30 seconds; if that is the case, the expense of recreating a connection may not be that bad.
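
As a sketch of this idea, the tracker below remembers the time of the last successful use and reports the connection as stale once that lies too far in the past. The class name LastSuccessTracker is hypothetical, and the clock is passed in so that the behavior can be tried out without waiting 30 seconds; in an adapter it would typically be System::currentTimeMillis.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper that a managed connection could hold on to. */
class LastSuccessTracker {
    private final LongSupplier clockMillis;   // injectable clock
    private final long maxIdleMillis;
    private volatile long lastSuccessMillis;

    LastSuccessTracker(LongSupplier clockMillis, long maxIdleMillis) {
        this.clockMillis = clockMillis;
        this.maxIdleMillis = maxIdleMillis;
        this.lastSuccessMillis = clockMillis.getAsLong();   // creation counts as a success
    }

    /** Call after every EIS operation that completed without an exception. */
    void recordSuccess() {
        lastSuccessMillis = clockMillis.getAsLong();
    }

    /** True if the last successful use is longer ago than the configured limit. */
    boolean isStale() {
        return clockMillis.getAsLong() - lastSuccessMillis > maxIdleMillis;
    }
}
```

A getInvalidConnections() implementation could then report every connection whose tracker says isStale().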

Active validity check
Another way is for the managed connection to actively check the connection validity. If the managed connection uses an Oracle connection underneath, it could do a select on the DUAL table. However, it is important that this check is not very time consuming.
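
For a JDBC-based adapter, such a probe could look roughly like the sketch below. It uses only standard java.sql calls; the class name ActiveCheck is hypothetical, the query assumes an Oracle database, and whether a query timeout is honored depends on the driver.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class ActiveCheck {
    /** Returns true if a trivial query succeeds; any SQLException counts as "suspect". */
    static boolean isAlive(Connection connection, int timeoutSeconds) {
        try (Statement statement = connection.createStatement()) {
            statement.setQueryTimeout(timeoutSeconds);   // keep the probe bounded in time
            try (ResultSet resultSet = statement.executeQuery("SELECT 1 FROM DUAL")) {
                return resultSet.next();
            }
        } catch (SQLException e) {
            return false;
        }
    }
}
```

Each call costs a round trip to the database, which is why the frequency of such probes matters.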

Keep an eye on expenses!
Above it was mentioned that the application server will call the getInvalidConnections() method periodically. How often does the application server call this method? The application server may have a timer thread that goes over all the idle connections in the connection pool and checks whether they are still valid. There is a serious problem with this approach if it is the only time that the application server calls this method: when the system is processing at or near capacity, the application server will hardly ever find idle connections in the pool.

That's why application servers typically call the getInvalidConnections() method before they give out a connection to the application. A simple but expensive approach is for the application server to call this method every time a connection is given to the application. A smarter approach is to do this no more often than once every so many seconds, a value that is configurable for the server. This value is chosen based on the expected failure duration. As was mentioned earlier, the expected failure duration is likely greater than 30 seconds. Hence, it makes little sense for the application server to call getInvalidConnections() more often than every 30 seconds.

Keep in mind, however, that there is no standard for what application servers do, so it is important to make sure that the getInvalidConnections() method is fast on average. If calling an expensive method is the only way to find out if a connection is valid, the managed connection factory could keep track of when it last called that method, so that it does not call it more often than once every so many seconds. A reasonable time span can be estimated by looking at how expensive the check is and keeping the time scales of connection failures in mind, with 30 seconds again being a reasonable ballpark number.
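
One way to keep getInvalidConnections() cheap on average is to cache the outcome of the expensive check and rerun it only after a minimum interval. The sketch below is an illustration under that assumption; ThrottledValidator is a hypothetical name, the expensive check is passed in as a BooleanSupplier, and the clock is injectable for the same reason as before.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical per-connection helper that rate-limits an expensive validity check. */
class ThrottledValidator {
    private final BooleanSupplier expensiveCheck;   // e.g. a round trip to the EIS
    private final LongSupplier clockMillis;
    private final long minIntervalMillis;
    private long lastCheckMillis;
    private boolean lastResult;
    private boolean checkedOnce;

    ThrottledValidator(BooleanSupplier expensiveCheck, LongSupplier clockMillis,
                       long minIntervalMillis) {
        this.expensiveCheck = expensiveCheck;
        this.clockMillis = clockMillis;
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Runs the expensive check at most once per interval; otherwise returns the cached result. */
    synchronized boolean isValid() {
        long now = clockMillis.getAsLong();
        if (!checkedOnce || now - lastCheckMillis >= minIntervalMillis) {
            lastResult = expensiveCheck.getAsBoolean();
            lastCheckMillis = now;
            checkedOnce = true;
        }
        return lastResult;
    }
}
```

The trade-off is that a connection that breaks right after a probe is still reported as valid until the interval has passed, which is acceptable when the interval is short compared to the expected outage.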

Desperate measures
If there's really no way for the managed connection to find out anything about the validity of the connection, it could resort to a crude but effective workaround: it can set a limit on the lifetime of the connection, e.g. one minute. Again, this time interval is based on the timescales of connectivity failures. This will have a small adverse effect on performance: connections are destroyed and recreated more often than they need to be. This effect will not be very big however: most of the time the application can in fact reuse an existing connection. In the example above with a connection time of 300 ms, the throughput goes down by 0.5% when the maximum lifetime of a connection is 1 minute.
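
Both the lifetime limit and the size of its cost fit in a few lines. In the sketch below the class and method names are hypothetical, and the overhead is approximated as creation time divided by maximum lifetime, which is one way to arrive at a figure of about 0.5% for 300 ms and one minute.

```java
class LifetimeLimit {
    /** True once the connection is older than the configured maximum lifetime. */
    static boolean isExpired(long createdAtMillis, long nowMillis, long maxLifetimeMillis) {
        return nowMillis - createdAtMillis > maxLifetimeMillis;
    }

    /** Approximate fraction of time spent recreating a connection once per lifetime. */
    static double recreateOverhead(double createMillis, double maxLifetimeMillis) {
        return createMillis / maxLifetimeMillis;
    }

    public static void main(String[] args) {
        // 300 ms creation time, 60 s maximum lifetime
        System.out.println("overhead fraction: " + recreateOverhead(300.0, 60_000.0));
    }
}
```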

Should a connection failure occur, the faulty connection will be reused for less than one minute, so the problem will eventually correct itself. If the application is used continuously, and if the expected downtime is more than one minute, this will not make any difference to the application, because during the one minute in which the connection is faulty the EIS is unavailable anyway.

Complicating factor: transaction enlistment

Resource adapters declare in the ra.xml what level of transactions they support. There are three levels: XATransaction, LocalTransaction and NoTransaction. If a resource adapter supports XATransaction, this means that the resource adapter supports XA; the application server will call getXAResource() on the ManagedConnection to get hold of the XAResource object to control the transaction. Resource adapters that support LocalTransaction return an object that implements the LocalTransaction interface when the application server calls getLocalTransaction() on the ManagedConnection. This interface has methods begin(), commit() and rollback(). Resource adapters that only support NoTransaction don’t participate at all in transactions.

If a resource adapter supports XATransaction, the managed connection will have to be enlisted in every transaction. The transaction manager in the application server will call start() on the XAResource. The start() call is the very first method that the application server calls on the managed connection after getting it out of the pool. The start() method will typically call into the EIS, causing an exception if the connection is faulty. The best way of dealing with this is for the application server to discard the connection, i.e. call destroy() and remove the connection from the pool. Some application servers (e.g. the Integration Server in Java CAPS) do that. So for these application servers it may suffice to do nothing in your resource adapter and still avoid poisoned connection pools. However, there are plenty of other application servers that will propagate the exception to the application and return the connection to the pool. And you do want your resource adapter to work well with any application server, don't you?

For application servers that don't destroy connections when the enlistment fails, it is critical that the resource adapter provides a fault detection strategy. Unfortunately, an exception on the start() method is difficult to detect for most resource adapters, because resource adapters often expose the XAResource from the client runtime directly to the application server's transaction manager. There's good reason for this, because there's an inherent problem with XAResource wrappers, as I noted in my previous blog. In these situations the passive positive check that I outlined above may be useful.

Conclusion

When developing a resource adapter, it's crucial to provide for connection failure detection. Keep in mind:

  • Different application servers behave differently, e.g. different frequency of calling getInvalidConnections(), different behavior when the enlistment of a connection fails.
  • Transaction enlistment failures may be the only failures that occur; can you detect them?
  • There are different ways of guessing if a connection is valid, even if the monitoring of failures doesn't work:
    • track when a connection was used without failure
    • assign a maximum lifetime
  • Keep an eye on expenses! Make sure that connections are not recreated every time, and make sure that active health checks don't happen too often.

With all this, keep in mind the different time scales:
  • how long it takes to create a new connection
  • how long a typical connection failure lasts
  • how many requests an application is likely to process per second

6 comments:

Michel said...

Very good article and so truth

Vinayagam Kulandaivel said...

Frank,

Very nice article, Hope detecting invalid connections has been implemented in J2CA 1.5. Is there any possible to detect and destroy the underlying faulty connections in J2CA 1.1 versions. Myself using BEA Weblogic 8.1 SP5 with Neon Shadow Adapter v6.

Regards

Vinayagam. K

Frank Kieviet said...

Re Vinayagam,

It's been a long time since I looked at JCA 1.1; I'm not aware of any changes in faulty connection handling.

Either way, it's mainly up to the connector to implement a sensible faulty connection detection strategy.

Frank

vick said...

Hi,

Does a JCA connection pool use 'object pooling' ?

The reason I ask is that I have heard that a connection pool can either be constructed by using object pooling or without using it.

thanks

vatsal

Frank Kieviet said...

Re Vick:

I'm not sure what you mean with object-pooling. The connection object is pooled in the connection pool, i.e. it's being reused time and time again to avoid creating new connections.

Frank

Sivaraman said...

Hi Frank,

I have a JCA Resource Adapter for one of the billing products for telcos. The adapter provides a WSDL with JCA bindings. The endpoint in the WSDL is the JNDI name.

Can you guide is there any generic samples on generating a WS client for a WSDL with JCA bindings?