Friday, March 21, 2008

An Implementation of Pessimistic Locking with JPA

I've worked for several years with a Java application that has relied upon pessimistic locking since before I was hired back in 1999. That was a time before Hibernate and certainly before the newer Java Persistence API. I have rewritten the data layer for some of our transactional processing a few times since then as the application changed from a dot-com to a B2B dot-com to a startup with a successful business, and now to a department acquired by a large company. The data layer that existed when I joined was an interesting XML object database stored in SQL Server 7. All who touched it will not forget the powerful lesson we learned, so I will not say anything more about it here. Actually, I do need to mention that it used pessimistic locking for doing transaction processing on those XML clobs. That is the last mention of it. At some point we got a bit smarter and decided that we ought to be using our relational database to store records structured in tables defined by a schema.

After moving to a more typical object-relational mapping in Java, our system still depended upon pessimistic locking to implement transaction processing. This was accomplished in SQL Server 7, and later SQL Server 2000, using REPEATABLE READ isolation for read-only connections and READ COMMITTED isolation for read-write connections. In a read-only connection the REPEATABLE READ isolation ensured that shared locks in the database were held for the duration of the database transaction. Read-write connections relied on doing an UPDATE on rows before reading them to get an exclusive lock that was held for the duration of the database transaction. This effectively allowed concurrent reads, but did not allow writes to be concurrent with reads or other writes. It provided consistent reading of a header object and its line items by locking the header row with either a shared read lock (REPEATABLE READ and a SELECT) or an exclusive write lock (READ COMMITTED and an UPDATE). It worked, but it meant that locks were held on data that was only being read.
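
To make that pattern concrete, here is a rough JDBC sketch of the two kinds of connections. It is not our actual data layer; the table and column names (order_header, lock_counter) are made up for illustration:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class HeaderLocking {
    // Read-only path: REPEATABLE READ holds shared locks on the rows we read
    // until commit, so the header and its line items stay consistent.
    static void readHeader(Connection conn, long headerId) throws SQLException {
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
        PreparedStatement select = conn.prepareStatement(
                "SELECT * FROM order_header WHERE id = ?"); // hypothetical table
        select.setLong(1, headerId);
        select.executeQuery();
        // ... read the header and then its line items ...
        select.close();
        conn.commit(); // shared locks are released here
    }

    // Read-write path: READ COMMITTED, but an UPDATE on the header row first
    // takes an exclusive lock that is held until the transaction ends.
    static void lockHeaderForWrite(Connection conn, long headerId) throws SQLException {
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        PreparedStatement lock = conn.prepareStatement(
                "UPDATE order_header SET lock_counter = lock_counter + 1 WHERE id = ?");
        lock.setLong(1, headerId);
        lock.executeUpdate(); // exclusive row lock held until commit or rollback
        lock.close();
        // ... now read and modify the header and line items, then commit ...
    }
}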

SQL Server 2005 provides a feature called row-level versioning that is similar to multiversion concurrency control (MVCC) in PostgreSQL and other databases. SNAPSHOT isolation and READ COMMITTED isolation with row-level versioning are features of SQL Server 2005 that allow reading without taking any locks. SNAPSHOT isolation ensures that all of the data read in the transaction is unaffected by updates that occur during the read-only transaction. This means we no longer need to take read locks in REPEATABLE READ isolation in order to have consistency. When writing we still just need to do an UPDATE in order to get an exclusive lock on the database row.
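
For anyone setting this up, here is a minimal sketch of the SQL Server 2005 side of it, issued through JDBC; it assumes a database named MyAppDb, and the ALTER DATABASE statements are one-time administrative steps:
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SnapshotSetup {
    // One-time administrative step: enable row versioning for the database.
    // (READ_COMMITTED_SNAPSHOT requires that no other connections are using it.)
    static void enableRowVersioning(Connection admin) throws SQLException {
        Statement s = admin.createStatement();
        s.execute("ALTER DATABASE MyAppDb SET ALLOW_SNAPSHOT_ISOLATION ON");
        s.execute("ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON");
        s.close();
    }

    // Per read-only transaction: snapshot reads take no shared locks and see
    // only data committed before the transaction started.
    static void beginSnapshotRead(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        Statement s = conn.createStatement();
        s.execute("SET TRANSACTION ISOLATION LEVEL SNAPSHOT");
        s.close();
        // ... SELECT the header and its line items without blocking writers ...
    }
}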

Microsoft has a lot of information on the row versioning concurrency that is new in SQL Server 2005. When reading in SNAPSHOT isolation the database reads "all data that was committed before the start of each transaction." Updates are a little more complex:
Uses row versions to select rows to update. Tries to acquire an exclusive lock on the actual data row to be modified, and if the data has been modified by another transaction, an update conflict occurs and the snapshot transaction is terminated.
This is just what we want. (In PostgreSQL the SERIALIZABLE isolation is nearly identical to SNAPSHOT isolation in SQL Server. The READ COMMITTED isolation of PostgreSQL is also nearly identical to that of SQL Server when SQL Server's READ_COMMITTED_SNAPSHOT setting is turned on.) Even if you read the Microsoft link, it may not be obvious that this SNAPSHOT mode is optimistic locking implemented on the database server. The JPA specification (section 3.4.3) requires this behavior of optimistic locking:

If transaction T1 calls lock(entity, LockModeType.READ) on a versioned object, the entity manager must ensure that neither of the following phenomena can occur:

  • P1 (Dirty read): Transaction T1 modifies a row. Another transaction T2 then reads that row and obtains the modified value, before T1 has committed or rolled back. Transaction T2 eventually commits successfully; it does not matter whether T1 commits or rolls back and whether it does so before or after T2 commits.

  • P2 (Non-repeatable read): Transaction T1 reads a row. Another transaction T2 then modifies or deletes that row, before T1 has committed. Both transactions eventually commit successfully.
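
"Versioned object" here means an entity with a version attribute that the persistence provider increments on every update. A minimal, hypothetical example looks like this:
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class OrderHeader {      // hypothetical entity
    @Id
    private Long id;

    @Version                    // the provider increments this on each update and
    private int version;        // uses it to detect conflicting concurrent writes

    // ... other persistent fields, getters and setters ...
}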

Now I shall explain pessimistic locking using these tools and the Java Persistence API (JPA), which is really pretty simple. JPA does not specify direct support for pessimistic locking; it has separate find() and lock() methods on the EntityManager. One could do this:
public <T> T findAndLock(EntityManager em, Class<T> c, Object key) {
    T e = em.find(c, key);
    em.lock(e, LockModeType.WRITE);
    return e;
}

This is pretty good, but it is only an approximation of an atomic find-and-lock method. In a system with high concurrency, we will encounter JPA OptimisticLockExceptions. Let us presume we have an application with a legacy data layer (not JPA) that has this high concurrency, and let us also presume that we cannot remove all of this concurrency. (Often I find the serenity prayer useful for constrained design.) The problem above is that the find() loads the row before we lock() it. If another thread also tries to find() and then lock(), only one can succeed, and the other will fail, either from the locking rules defined in the JPA specification or from the rules defined for SNAPSHOT isolation. The result of the failure is an exception, and the application code would have to retry the entire transaction, but that is one of the things we must accept we cannot change, at least not immediately.

There is something we can do to avoid this race condition: do the lock() first rather than the find(). The lock() method takes an entity as a parameter, but we have a way out of the catch-22 because the entity need not be read from the database yet. The EntityManager.getReference() method gives us an object that looks like the entity we would get from find(), but it is allowed to return "an instance, whose state may be lazily fetched." So we use this code:
public <T> T findAndLock(EntityManager em, Class<T> c, Object key) {
    T e = em.getReference(c, key);
    em.lock(e, LockModeType.WRITE);
    return e;
}

The reference returned is a proxy instance created with java.lang.reflect.Proxy.newProxyInstance(). The proxy holds just the primary key that was passed to getReference(), so the lock() method can obtain the entity's key and lock the row for the entity without first loading it. This removes the race and gives us true pessimistic locking.
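
Roughly, such a lazy reference can be built like this. This is only a sketch of the idea, not the actual code, and it assumes the entity is accessed through an interface with a getId() method; a real provider would more likely generate a subclass of the entity instead:
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class LazyReference implements InvocationHandler {
    private final Object primaryKey;

    private LazyReference(Object primaryKey) {
        this.primaryKey = primaryKey;
    }

    // Create a proxy that knows only the primary key of the entity.
    @SuppressWarnings("unchecked")
    public static <T> T create(Class<T> entityInterface, Object primaryKey) {
        return (T) Proxy.newProxyInstance(
                entityInterface.getClassLoader(),
                new Class<?>[] { entityInterface },
                new LazyReference(primaryKey));
    }

    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if ("getId".equals(method.getName())) {
            return primaryKey; // answered from the key alone, no database read
        }
        // Any other property would trigger loading the entity state here.
        throw new UnsupportedOperationException("not loaded: " + method.getName());
    }
}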

I hope that some JPA implementations will offer a solution for pessimistic locking without requiring special API extensions. This is the solution I've implemented in my own partial JPA implementation, and it works very well. I have not used any other JPA implementations or even Hibernate, so there may be a better way to do this. I have only seen references to changing the transaction isolation level, and I hope what I have written explains why that alone is not a solution.


Wednesday, February 13, 2008

Consider the Source

I have been told many times: "Don't believe everything you read." Sometimes I still believe things that I simply should not. For instance, I've been trying to figure out why the javax.print.attribute.standard.MediaSize.findMedia() method starts doing strange things after my application has been running for a while. According to the documentation for findMedia(), it tries to find the best match among the "standard" media sizes:
The specified dimensions are used to locate a matching MediaSize instance from amongst all the standard MediaSize instances. If there is no exact match, the closest match is used.

Indeed it seems to work when I try it, so I accept what I've read, until someone complains that the application starts failing to find the right media size. It seems that the definition of "standard MediaSize instances" changes after the application has been running for a while. I solve this by changing my code not to rely on MediaSize.findMedia(), and I'm happy that the bug is fixed, but what was really going on to cause the problem?

Fortunately, Sun has always (or at least as far as I can remember) released source code for the standard library of Java, and as a Java developer I of course have a copy of their source code installed on my system, so I can take a look at what my call to MediaSize.findMedia() is really doing. The findMedia(float x, float y, int units) method simply loops through a list of MediaSize entries and looks for the one that is closest to the given dimensions. It then returns the "name" of the closest fit. Where this contradicts my assumptions is that the set of "standard MediaSize instances" is actually the entire set of MediaSize instances EVER INSTANTIATED! This means that every time I call new MediaSize(x, y, units), I'm adding a new entry to the "standard" set of instances, and those instances do not have names on them, so findMedia(), which returns the name of the closest match, starts returning null, the "name" of one of my instances. Okay, now I know why my code was failing, but after looking at the code a bit more, I see a couple of curious things. First I notice from the style that the code was written by former C programmers. Then I notice that it unnecessarily uses some local variables of type double when it only needs float. Neither of these things matters, but I notice such things.
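
A small sketch reproduces the surprise; the exact output depends on the JDK, but on the implementation described above it behaves like this:
import javax.print.attribute.Size2DSyntax;
import javax.print.attribute.standard.MediaSize;
import javax.print.attribute.standard.MediaSizeName;

public class FindMediaSurprise {
    public static void main(String[] args) {
        // Before any custom instances exist, US Letter dimensions match NA_LETTER.
        MediaSizeName before = MediaSize.findMedia(8.5f, 11f, Size2DSyntax.INCH);
        System.out.println(before); // expected: na-letter

        // Constructing a MediaSize adds it to the same internal list that
        // findMedia() searches, and this instance has no MediaSizeName.
        new MediaSize(8.4f, 10.9f, Size2DSyntax.INCH);

        MediaSizeName after = MediaSize.findMedia(8.4f, 10.9f, Size2DSyntax.INCH);
        System.out.println(after); // null: the nameless custom instance is now closest
    }
}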

However, when I think about it a little more, I realize that the "standard" list of media sizes is a java.util.Vector that gets an entry for every instance of MediaSize ever made. This list is never, ever cleared, which means it is a memory leak, and one that is triggered every time users of the site perform a particular task. It is a good thing that memory is cheap and that I do not create these objects in a loop that would allocate many of them that never get garbage collected. It does mean that I need to remove all MediaSize instantiations from user-initiated actions.

At least I feel I can trust the source code.


Tuesday, May 08, 2007

The Difference Between Premature Optimization and Negligent Design

Computer programmers have all heard the Donald Knuth quote (ed. paraphrasing something by C.A.R. Hoare) that "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." But just what constitutes "premature optimization" can be a subjective matter. Some coworkers and I were talking about this, and I think we mostly agreed on that point. First, what is "premature"? If I have implemented a similar solution ten times, and experience tells me that the current situation is the same as those, then I may choose to use the optimized implementation that I have used before. If I am a conscientious developer, this will not involve a cut-and-paste of the solution, but a call to an appropriate abstraction for this special case.

Secondly, what is "optimization"? When designing code, good developers will use data structures and other abstractions that semantically fit the problem. This is not optimization; it is simply appropriate design. The opposite of good design is negligence. A good characterization of "optimization" comes from Wikipedia: (premature) optimization is "a situation where a programmer lets performance considerations affect the design of a piece of code." In that sense, the design of the code can be quite low level.

Here is a simple example that clarifies the difference between premature optimization and good design. Let us assume that we wish to make a collection of unique objects, such as the names of subscribers on ten mailing lists. Some people may be subscribed to more than one of the lists, but we wish to have just one instance of each person in the collection. The code in Java might look like this:
Set<String> extractSubscribers(Collection<MailingList> mailingLists)
{
    Set<String> subscriberNames = new HashSet<String>();

    for (MailingList mailingList : mailingLists) {
        for (Subscriber subscriber : mailingList.getSubscribers()) {
            subscriberNames.add(subscriber.getName());
        }
    }
    return subscriberNames;
}
The HashSet is a Set, which means that it enforces that each of the names it holds is unique. That is what we wanted, so the design of the code has clear semantics that match the functional requirement. I could have used a TreeSet in most situations, but that adds a requirement of a total ordering on the data items (which strings do have, but that is not important right now). HashSet is nice because its lookup and insertion complexity is O(1) on average, but that is also not too important right now.

Now let us look at a less desirable approach. Today I came across this quote from a computer programmer.
For most of my programming, I just use ArrayList or Vector and call Contains before inserting. It's lazy, but I've never had to deal with a big enough list to make it worth my time.
I would imagine his code looks something like this:
Collection<String> extractSubscribers(Collection<MailingList> mailingLists)
{
    Collection<String> subscriberNames = new ArrayList<String>();

    for (MailingList mailingList : mailingLists) {
        for (Subscriber subscriber : mailingList.getSubscribers()) {
            if (!subscriberNames.contains(subscriber.getName())) {
                subscriberNames.add(subscriber.getName());
            }
        }
    }
    return subscriberNames;
}
There is only one extra check in this case, but it is needed because the ArrayList holding our unique subscriber names does not have the semantics we want. It also means that we would have to write even more code to make the method signature reflect the semantics of the requirements. If we wanted the return type to be Set, we would have to put all of the subscriber names in a Set anyway, though I shall stipulate that literate programming is not important to the person quoted above. It would, however, make putting everything in an ArrayList first look pretty strange.

The code in the second example loses some semantic meaning because the return type does not indicate that the names in the collection are unique, and it is also considerably slower than the code using a HashSet. Because the time complexity of the first example is O(n) and that of the second is O(n^2), finding the unique set of subscriber names in the second example will be roughly a thousand times slower than in the first example when there are 1000 subscribers (ignoring constant factors).
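
A crude micro-benchmark, with made-up subscriber names, shows the growth; the exact numbers depend on the JVM and hardware, but the quadratic blow-up of the contains() approach is hard to miss:
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

public class UniqueNamesTiming {
    public static void main(String[] args) {
        int[] sizes = { 1000, 10000, 50000 };
        for (int n : sizes) {
            long setTime = time(n, new HashSet<String>(), false);
            long listTime = time(n, new ArrayList<String>(), true);
            System.out.println("n=" + n + "  HashSet: " + setTime
                    + " ms  ArrayList+contains: " + listTime + " ms");
        }
    }

    // Adds n names (about half of them duplicates) to the given collection,
    // optionally checking contains() first, and returns the elapsed time.
    static long time(int n, Collection<String> names, boolean checkContains) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            String name = "subscriber-" + (i % (n / 2));
            if (!checkContains || !names.contains(name)) {
                names.add(name);
            }
        }
        return System.currentTimeMillis() - start;
    }
}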

This complexity issue may not seem important in the age of gigahertz computing, but a coworker of mine found that the implementers of Apache Ant, a build system for Java, did not seem to appreciate it. The software I develop at my job results in about 15,000 Java class files, and those all have to be copied by an Ant "copy" task into the build target directory. Ant 1.6 and Ant 1.7 both build the list of unique file names to copy using a Vector (like an ArrayList) collection. In our build, this results in about 30 seconds to one minute of 100% CPU utilization on a 2GHz Core Duo! That is 25% to 40% of the total build time!

Fortunately, Ant is an open source project, so changing the code to use a Set rather than a Vector is almost trivial, and we have done so for our own use. It makes our builds about 25% faster in most cases. All of that CPU time was wasted because the code was written using a datatype that was not appropriate and was also not semantically correct. Optimization could now be done if we decided that the semantically correct code had performance issues. In our example, the HashSet does have more overhead than the ArrayList or Vector, but that overhead is nearly constant for each name, so the performance does not degrade horribly as the number of names increases. For small data sets the per-name overhead, which is negligible in our example, could be measurable compared to the cost of adding each name to the collection: the HashSet has to compute the hash code for each name and do some comparisons that take more time than simply setting the next element of an array. But for a good implementation of HashSet this overhead is not meaningful, and when the number of names is small the whole process runs quickly anyway. It is only large numbers of names that cause problems, and they cause much worse problems when the wrong design is chosen. Avoiding the overhead of HashSet is one of the "small efficiencies" that Knuth meant.

Start with a correct design, and then optimize as needed, but do not mistake correct design for premature optimization.


Thursday, January 18, 2007

I want a new language

C was great. Then C++ was great. Then Java was great. (Along the way Lisp was neat, too.) Imperative programming, however, just isn't cutting it anymore. I need to make myself do some functional programming, the wave of the future, I'm told. New wave or not, functional programming is interesting, and functional features are finding their way into Java and C#. I propose that I should do something simple in Erlang, Haskell, SML or OCaml, Scheme, and maybe Ruby or Python. My friend René said he always wrote a Mandelbrot set generator on a new platform to get the feel of programming on it, so I'll do that.


Monday, September 18, 2006

The Little Schemer

I just finished reading The Little Schemer for the first time. It's a wonderful little book for an introduction to the Scheme programming language.

The book starts off slowly and is somewhat tedious, but the exercises are good, and the book quickly requires the reader to think a lot about the questions it asks. The penultimate chapter develops a derivation of the applicative-order Y combinator and requires several readings to get the gist of it. I'm still not sure that I understand it. The final chapter gives an implementation of a simple Scheme interpreter written in Scheme.
