So far I've written only a handful of blogs. My blogs are
technical and dry and consequently the hit count of my blog is low.
However, it has generated some interest from people in far away places
and generated some interesting discussions through email. One of the
things that came up is data integrity and disk performance. And that's
the topic of this blog...
Maintaining data integrity
Consider a simple scenario: an MDB reads from a JMS queue, and for each message that it receives, it writes a record to a database. Let's say that you want to minimize the chance that you would lose or duplicate data in the case of failure. What kind of failure? Let's say that power failures or system crashes are your biggest worries.
How do we minimize the chance of losing or duplicating data? First
of all, you want to make sure that receiving the message from the
queue and the writing to the database happens in one single
transaction. One of the easiest ways of doing that is to use an
application server like Glassfish. How does Glassfish help? It starts a
distributed (two phase) transaction. The transactional state is
persisted by the JMS server and database, and by Glassfish through its
transaction log. In the event of a system crash, the system will come
back up and by using the persisted transactional state, Glassfish can
make sure that either the transaction is fully committed or fully
rolled back. Ergo, no data is lost or duplicated. The details of this
are fascinating, but let's look at that some other time.
Thus, reliable persistence that is immune to system crashes (e.g.
the Blue Screen of Death (BSOD)) or power failures is important. That
means that when a program thinks that the data has been persisted on
the disk, it in fact should have been written in the magnetic media of
the disk. Why do I stress this? Because it's less trivial than you
might think: the operating system and drive may cache the data in the
write cache. Therefore you will have to disable the write cache in the
OS on the one hand; in your application you will have to make sure that
the data is
indeed sync-ed to disk on the other hand.
Turning off the write cache, syncing to disk... no big deal, right?
Unfortunately there are many systems in which the performance drops
like a stone when you do that. Let's look at why that is, and how good
systems solve this performance problem.
The expense of writing to disk
As you know, data on a disk is organized in concentrical circles.
Each circle (track) is divided up in sectors. When the disk controller
writes data to the disk, it has to position the write head to the right
track and wait until the disk has rotated such that the beginning of
the sector is under the write head. These are mechanical operations;
they are not measured in nano seconds or micro seconds, but rather in
milliseconds. As it turns out, a drive can do about one hundred of
these operations per second. So, as a rule of thumb, a drive can
perform one hundred writes per second. And this hasn't changed much in
the past few decades, and it won't change in the next decade.
Writes per second is one aspect. The number of bytes that can be
written per second is another. This is measured in megabytes per
second. Fortunately over the past decades this rate has steadily
increased and it likely will continue to do so. That's why formatting a
300 Gb drive today doesn't take 10000 times longer than it took to
format a 30 Mb drive 15 years ago.
Key is that the more you can send to the disk in one write
operation, the more data you can write. To demonstrate this, I've
done some measurements on my desktop machine.
This chart shows that the number of write operations per second
remains constant up to approximately 32 kb. (Note that the x-axis is
logarithmic) That means that whether you try to write 32768 bytes or
just one single byte per write operation, the maximum number of writes
second remains the same. Of course the amount of data you can write to
the disk goes up if you put more data in a single write operation. Put
another way, if you need to write 1 kb chunks of data, you can
process 32 times faster if you combine your chunks into one 32 kb chunk.
What you can see in this chart is that the amount of data you can
write to the disk per second (data rate) increases with the number of
bytes per write. Of course that can not increase infinitely: the data
rate eventually becomes constant. (Note that the x-axis and y-axis are
both logarithmic). The data rate at a write size of 32 kb is approx 3.4
Mb/sec and levels off at a write size of approx 1 Mb to 6 Mb/sec.
These measurements were done on a $300 PC with a cheap hard drive
running Windows. I've done the same measurements on more expensive
hardware with more expensive drives and found similar results: most
drives perform slightly better, but the overall picture is the same.
The lesson: if you have to write small amounts of data, try to
combine these in one write operation.
Back to our example of JMS and a database. As I mentioned, for each
message picked up from
the queue and written to the database, there are several write
operations. For now, let's assume that the data written for the
of single message cannot be combined into one write operation. However,
if we process the messages
concurrently, we can combine the data being written from multiple
threads in one write operation!
Good transaction loggers, good databases that
are optimize for transaction processing, and good JMS servers, all try
to combine the data
from multiple concurrent transactions into as few write operations as
possible. As an example, here are some measurements I did against STCMS
on the same hardware as above. STCMS is the JMS server that ships with
As you can see, for small messages the throughput of the system
increases almost linearly with the number of threads. This happens
because one of the design principles of STCMS is to consolidate as much
data as possible in a single write operation. Of course when you
increase the size of the messages, or if you increase the number of
threads indefinitely, you will eventually hit the limit on the
data rate. This is why the performance for large messages does not
it does for small messages. Of course when increasing the number of
threads to large numbers, you
will also hit other limits due to the overhead in thread switching.
Note that the measurements in this chart were done on a
queue-to-queue and topic-to-topic scenario about three years ago. The
STCMS shipping today performs a lot better, e.g. there's no difference
in performance anymore between queues and topics, so don't use these
numbers as benchmark numbers for STCMS; I'm just using them to prove
the point of the power of write-consolidation.
Other optimizations in the transaction manager
Combining data that needs to be written to the disk from many
threads into one write operation is definitely a good thing for both
resource managers (the JMS server and database server in our example)
and the transaction manager. Are there
other things that can be done specifically in the area of the
transaction manager? Yes, there are: let's see if we can minimize
the number of times data needs to be written for each message.
Last agent commit The first
thing that comes to mind is last-agent commit optimization.
That means that in our example, one of the resource managers will not
do a two phase commit, but instead only do a single phase commit,
another write. Most transaction managers can do this, e.g. the
transaction manager in Glassfish does this.
Piggy backing on another
persistence store Instead of writing to its own persistence
store, the transaction log could write its data to the persistence
store of one of the resource managers participating in the transaction.
By doing so, it can push its data into the same write operation that
the resource manager needs to do anyway. For instance, the transaction
manager can write its transaction state into the database. An extra
table in the database is all it takes. Alternatively, the JMS server
could provide some extensions so that the transaction log can write its
data to the persistence store of the JMS server.
Logless transactions If the
resource manager (e.g. JMS) doesn't have an API that allows the
transaction manager to use its persistence store, the transaction
manager could sneak in some data of its own in the XID. The XID is just
a string of bytes that the resource manager treats as an opaque entity.
The transaction manager could put some data in the XID that identifies
the last Resource Manager; in the case of a system crash, the
transaction manager will query all known resource managers and obtain a
list of in-doubt XIDs. The presence of the in-doubt XID of the last
resource in the transaction signifies that all resources were
successfully prepared and
that commit should be
called; the absence signifies that the prepare phase was incomplete and
that rollback should be
called. This mechanism was invented by SeeBeyond a few years ago
(patent pending). There are some intricacies in this system; perhaps a
topic for a future blog? Leave a comment to this blog if you're
Beyond the disk
The scenario we've been looking at is one in which data needs to be
persisted so that no data gets lost in case of a system crash or power
failure. What about a disk failure? We could use a RAID system.
Are there other ways? Sure! We could involve multiple computers:
clustering! If each machine that needs to safeguard transaction data
would also write its data to another node in a cluster, that data would
survive a crash of the primary node. Assuming that a crash of two nodes
at the same time is very unlikely, this forms a reliable solution. On
both nodes, data can be written to the disk using a write caching
scheme so that the number of writes per second is no longer the
We can go even further for short-lived objects. Consider this
insert into A values ('a', 'b')
update A set value = 'c' where pk = 'a'
delete A where pk = 'a'
why write the value ('a', 'b') to the disk at all? Hence, a smart write cache can avoid any disk writes for short lived objects. Of course typical data that the transaction manager generates are short lived objects. JMS messages are another example of objects that may have a short life span.
A pluggable transaction manager... and what about Glassfish?
It would be nice if transaction managers had a pluggable
architecture so that depending on your specific scenario, you could
choose the appropriate transaction manager log persistence strategy.
Isn't clustering something difficult and expensive? Difficult to
develop yes, but not difficult to use. Sun Java System Application
Server 8.x EE already has strong support for clustering. Glassfish will
soon have support for clustering too. And with that clustering will
become available to everybody!