Tuesday, October 10, 2006

Transactions, disks, and performance

So far I've written only a handful of blogs. My blogs are technical and dry and consequently the hit count of my blog is low. However, it has generated some interest from people in far away places and generated some interesting discussions through email. One of the things that came up is data integrity and disk performance. And that's the topic of this blog...

Maintaining data integrity

Consider a simple scenario: an MDB reads from a JMS queue, and for each message that it receives, it writes a record to a database. Let's say that you want to minimize the chance that you would lose or duplicate data in the case of failure. What kind of failure? Let's say that power failures or system crashes are your biggest worries.

How do we minimize the chance of losing or duplicating data? First of all, you want to make sure that receiving the message from the queue and the writing to the database happens in one single transaction. One of the easiest ways of doing that is to use an application server like Glassfish. How does Glassfish help? It starts a distributed (two phase) transaction. The transactional state is persisted by the JMS server and database, and by Glassfish through its transaction log. In the event of a system crash, the system will come back up and by using the persisted transactional state, Glassfish can make sure that either the transaction is fully committed or fully rolled back. Ergo, no data is lost or duplicated. The details of this are fascinating, but let's look at that some other time.

Thus, reliable persistence that is immune to system crashes (e.g. the Blue Screen of Death (BSOD)) or power failures is important. That means that when a program thinks that the data has been persisted on the disk, it in fact should have been written in the magnetic media of the disk. Why do I stress this? Because it's less trivial than you might think: the operating system and drive may cache the data in the write cache. Therefore you will have to disable the write cache in the OS on the one hand; in your application you will have to make sure that the data is indeed sync-ed to disk on the other hand.

Turning off the write cache, syncing to disk... no big deal, right? Unfortunately there are many systems in which the performance drops like a stone when you do that. Let's look at why that is, and how good systems solve this performance problem.

The expense of writing to disk

As you know, data on a disk is organized in concentrical circles. Each circle (track) is divided up in sectors. When the disk controller writes data to the disk, it has to position the write head to the right track and wait until the disk has rotated such that the beginning of the sector is under the write head. These are mechanical operations; they are not measured in nano seconds or micro seconds, but rather in milliseconds. As it turns out, a drive can do about one hundred of these operations per second. So, as a rule of thumb, a drive can perform one hundred writes per second. And this hasn't changed much in the past few decades, and it won't change in the next decade.

Writes per second is one aspect. The number of bytes that can be written per second is another.  This is measured in megabytes per second. Fortunately over the past decades this rate has steadily increased and it likely will continue to do so. That's why formatting a 300 Gb drive today doesn't take 10000 times longer than it took to format a 30 Mb drive 15 years ago.

Key is that the more you can send to the disk in one write operation, the more data you can write.  To demonstrate this, I've done some measurements on my desktop machine.
writes per second

This chart shows that the number of write operations per second remains constant up to approximately 32 kb. (Note that the x-axis is logarithmic) That means that whether you try to write 32768 bytes or just one single byte per write operation, the maximum number of writes per second remains the same. Of course the amount of data you can write to the disk goes up if you put more data in a single write operation. Put another way, if you need to write 1 kb chunks of data, you can process 32 times faster if you combine your chunks into one 32 kb chunk.
bytes per second

What you can see in this chart is that the amount of data you can write to the disk per second (data rate) increases with the number of bytes per write. Of course that can not increase infinitely: the data rate eventually becomes constant. (Note that the x-axis and y-axis are both logarithmic). The data rate at a write size of 32 kb is approx 3.4 Mb/sec and levels off at a write size of approx 1 Mb to 6 Mb/sec.

These measurements were done on a $300 PC with a cheap hard drive running Windows. I've done the same measurements on more expensive hardware with more expensive drives and found similar results: most drives perform slightly better, but the overall picture is the same.

The lesson: if you have to write small amounts of data, try to combine these in one write operation.

Multi threading

Back to our example of JMS and a database. As I mentioned, for each message picked up from the queue and written to the database, there are several write operations. For now, let's assume that the data written for the processing of single message cannot be combined into one write operation. However, if we process the messages concurrently, we can combine the data being written from multiple threads in one write operation!

Good transaction loggers, good databases that are optimize for transaction processing, and good JMS servers, all try to combine the data from multiple concurrent transactions into as few write operations as possible. As an example, here are some measurements I did against STCMS on the same hardware as above. STCMS is the JMS server that ships with Java CAPS.
messages per second

As you can see, for small messages the throughput of the system increases almost linearly with the number of threads. This happens because one of the design principles of STCMS is to consolidate as much data as possible in a single write operation. Of course when you increase the size of the messages, or if you increase the number of threads indefinitely, you will eventually hit the limit on the data rate. This is why the performance for large messages does not scale like it does for small messages. Of course when increasing the number of threads to large numbers, you will also hit other limits due to the overhead in thread switching.

Note that the measurements in this chart were done on a queue-to-queue and topic-to-topic scenario about three years ago. The STCMS shipping today performs a lot better, e.g. there's no difference in performance anymore between queues and topics, so don't use these numbers as benchmark numbers for STCMS; I'm just using them to prove the point of the power of write-consolidation.

Other optimizations in the transaction manager

Combining data that needs to be written to the disk from many threads into one write operation is definitely a good thing for both resource managers (the JMS server and database server in our example) and the transaction manager.  Are there other things that can be done specifically in the area of the transaction manager? Yes, there are: let's see if we can minimize the number of times data needs to be written for each message.

Last agent commit The first thing that comes to mind is last-agent commit optimization. That means that in our example, one of the resource managers will not do a two phase commit, but instead only do a single phase commit, thereby saving another write. Most transaction managers can do this, e.g. the transaction manager in Glassfish does this.

Piggy backing on another persistence store Instead of writing to its own persistence store, the transaction log could write its data to the persistence store of one of the resource managers participating in the transaction. By doing so, it can push its data into the same write operation that the resource manager needs to do anyway. For instance, the transaction manager can write its transaction state into the database. An extra table in the database is all it takes. Alternatively, the JMS server could provide some extensions so that the transaction log can write its data to the persistence store of the JMS server.

Logless transactions If the resource manager (e.g. JMS) doesn't have an API that allows the transaction manager to use its persistence store, the transaction manager could sneak in some data of its own in the XID. The XID is just a string of bytes that the resource manager treats as an opaque entity. The transaction manager could put some data in the XID that identifies the last Resource Manager; in the case of a system crash, the transaction manager will query all known resource managers and obtain a list of in-doubt XIDs. The presence of the in-doubt XID of the last resource in the transaction signifies that all resources were successfully prepared and that commit should be called; the absence signifies that the prepare phase was incomplete and that rollback should be called. This mechanism was invented by SeeBeyond a few years ago (patent pending). There are some intricacies in this system; perhaps a topic for a future blog? Leave a comment to this blog if you're interested.

Beyond the disk

The scenario we've been looking at is one in which data needs to be persisted so that no data gets lost in case of a system crash or power failure.  What about a disk failure? We could use a RAID system. Are there other ways? Sure! We could involve multiple computers: clustering! If each machine that needs to safeguard transaction data would also write its data to another node in a cluster, that data would survive a crash of the primary node. Assuming that a crash of two nodes at the same time is very unlikely, this forms a reliable solution. On both nodes, data can be written to the disk using a write caching scheme so that the number of writes per second is no longer the limiting factor.

We can go even further for short-lived objects. Consider this scenario:
    insert into A values ('a', 'b')
    update A set value = 'c' where pk = 'a'
    delete A where pk = 'a'
why write the value ('a', 'b') to the disk at all? Hence, a smart write cache can avoid any disk writes for short lived objects. Of course typical data that the transaction manager generates are short lived objects. JMS messages are another example of objects that may have a short life span.

A pluggable transaction manager... and what about Glassfish?

It would be nice if transaction managers had a pluggable architecture so that depending on your specific scenario, you could choose the appropriate transaction manager log persistence strategy.

Isn't clustering something difficult and expensive? Difficult to develop yes, but not difficult to use. Sun Java System Application Server 8.x EE already has strong support for clustering. Glassfish will soon have support for clustering too. And with that clustering will become available to everybody!


neel said...

Nice blog! Similar things are done by ZFS. IO aggregation is done by the zfs intent log to reduce number of IO operations. ZFS turns on the write cache and flushes it after every synchronous write to maintain data integrity.

Sridatta said...

Clustering is already available in Milestone 2 build of GlassFish v2! Checkout

edwardchou said...

Can you give an example on how to optimize disk writes on short-lived object with JMS messages? Is this implemented in anyway currently in STCMS or SunJMQ?

Frank Kieviet said...

STCMS has this log write consolidation (it's one of its strong points). Sun JMQ does not have this feature, and will rely on clustering afaik. Similarly, JMS Grid relies on multiple nodes for its reliability.

Aji said...

Frank, interesting blog! About this topic, we have similar scenario at work with Java CAPS where we audit/archive all messages received from external application, by way of branching the msg to another JMS queue which will be then persisted into a database (using an Oracle eWay). We recently asked the same question that you raised. "How do we minimize the chance of losing or duplicating data?" You mentioned that "..you want to make sure that receiving the message from the queue and the writing to the database happens in one single transaction". My question is how do I ensure that this is happening in our scenario? Is it to do with the Transaction Mode (XA/Transacted) configured in the queue?

Frank Kieviet said...

Hi Aji,

Indeed, when choosing XA for both the JMS OTD and the Oracle eWay, the chance of message loss or duplication is minimal.

The meaning of "transacted" is that the affected operation happens outside of the main transaction.