Tuesday, May 08, 2007

The Difference Between Premature Optimization and Negligent Design

Computer programmers have all heard the Donald Knuth quote (ed. paraphrasing something by C.A.R. Hoare) that "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." But just what constitutes "premature optimization" can be a subjective matter. Some coworkers and I were talking about this, and I think we mostly agreed on where the line falls. First, what is "premature"? If I have implemented a similar solution ten times, and experience tells me that the current situation is the same as those, then I may choose to use the optimized implementation that I have used before. If I am a conscientious developer, this will not involve a cut-and-paste of the solution, but a call to an appropriate abstraction for this special case.
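As a rough sketch of what I mean (the method name and signature here are invented purely for illustration), the tuned implementation lives in one place, and the eleventh similar situation becomes a one-line call:

// Illustrative sketch only: the optimized solution is written once, behind a
// named method, instead of being pasted into every call site that needs it.
static Set<String> uniqueNames(Collection<String> names)
{
    return new HashSet<String>(names);
}

// A new, similar situation then reuses the abstraction, for example:
//     Set<String> unique = uniqueNames(allSubscriberNames);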

Secondly, what is "optimization"? When designing code, good developers will use data structures and other abstractions that semantically fit the problem. This is not optimization; it is simply appropriate design. The opposite of good design is negligence. A good characterization of "optimization" comes from Wikipedia: (premature) optimization is "a situation where a programmer lets performance considerations affect the design of a piece of code." The "design of a piece of code" in that definition can be quite low level, as in the example that follows.

Here is a simple example that clarifies the difference between premature optimization and good design. Let us assume that we wish to make a collection of unique objects, such as the names of subscribers on ten mailing lists. Some people may be subscribed to more than one of the lists, but we wish to have just one instance of each person in the collection. The code in Java might look like this:
Set<String> extractSubscribers(Collection<MailingList> mailingLists)
{
    Set<String> subscriberNames = new HashSet<String>();

    for (MailingList mailingList : mailingLists) {
        for (Subscriber subscriber : mailingList.getSubscribers()) {
            // Duplicate names are silently ignored by the Set.
            subscriberNames.add(subscriber.getName());
        }
    }
    return subscriberNames;
}
The HashSet is a Set, which means it enforces that each of the names it holds is unique. That is exactly what we wanted, so the design of the code has clear semantics that match the functional requirement. I could have used a TreeSet in most situations, but that adds a requirement of a total ordering on the data items; strings do have one, but that is not important right now. HashSet is also nice because its lookup and insertion complexity is O(1) on average, but that is not too important right now either.
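For comparison, the same method written with a TreeSet would look roughly like this (a sketch, using the same MailingList and Subscriber types as above), for a case where sorted iteration order actually mattered:

Set<String> extractSortedSubscribers(Collection<MailingList> mailingLists)
{
    // Identical logic, but a TreeSet keeps the names in their natural
    // (alphabetical) order at a cost of O(log n) per insertion.
    Set<String> subscriberNames = new TreeSet<String>();

    for (MailingList mailingList : mailingLists) {
        for (Subscriber subscriber : mailingList.getSubscribers()) {
            subscriberNames.add(subscriber.getName());
        }
    }
    return subscriberNames;
}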

Now let us look at a less desirable approach. Today I came across this quote from a computer programmer.
For most of my programming, I just use ArrayList or Vector and call Contains before inserting. It's lazy, but I've never had to deal with a big enough list to make it worth my time.
I would imagine his code looks something like this:
Collection<String> extractSubscribers(Collection<MailingList> mailingLists)
{
    Collection<String> subscriberNames = new ArrayList<String>();

    for (MailingList mailingList : mailingLists) {
        for (Subscriber subscriber : mailingList.getSubscribers()) {
            // Linear scan of the whole list for every subscriber.
            if (!subscriberNames.contains(subscriber.getName())) {
                subscriberNames.add(subscriber.getName());
            }
        }
    }
    return subscriberNames;
}
There is only one extra line of code in this case, but that is because the ArrayList holding our unique subscriber names does not have the semantics we want. It also means that we would have to write even more code to make the method signature reflect the semantics of the requirements: if we wanted the return type to be Set, we would have to copy all of the subscriber names into a set at the end anyway, which would make putting everything in an ArrayList first look pretty strange. But I shall stipulate that expressive, self-documenting code is not important to the person quoted above.
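That extra copy would look something like this (a sketch; the loop is elided because it is identical to the code above):

Set<String> extractSubscribers(Collection<MailingList> mailingLists)
{
    Collection<String> subscriberNames = new ArrayList<String>();

    // ... the same nested loops and contains() checks as above ...

    // The extra copy needed just to make the signature say Set. At this point
    // the manual contains() checks were wasted work, because the HashSet
    // would have rejected the duplicates on its own.
    return new HashSet<String>(subscriberNames);
}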

The code in the second example loses some semantic meaning because the return type does not indicate that the names in the collection are unique, but it is also considerably slower than the code using a HashSet. Each call to contains() scans the whole list, so the first example runs in O(n) time while the second runs in O(n^2). Ignoring constant factors, finding the unique set of subscriber names in the second example will be roughly 1000 times slower than in the first example when there are 1000 subscribers.
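To make the difference concrete, here is a small, self-contained timing sketch of my own (it uses synthetic names rather than the mailing-list code above); the exact numbers will vary by machine and JVM, but the ArrayList version's running time grows roughly with the square of the number of names:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

public class UniqueNamesTiming
{
    public static void main(String[] args)
    {
        int n = 20000;  // total names to process, half of them duplicates

        // ArrayList with a contains() check: an O(n) scan per insertion.
        long start = System.currentTimeMillis();
        Collection<String> list = new ArrayList<String>();
        for (int i = 0; i < n; i++) {
            String name = "subscriber-" + (i % (n / 2));
            if (!list.contains(name)) {
                list.add(name);
            }
        }
        long listMillis = System.currentTimeMillis() - start;

        // HashSet: duplicates are rejected in O(1) time on average.
        start = System.currentTimeMillis();
        Collection<String> set = new HashSet<String>();
        for (int i = 0; i < n; i++) {
            String name = "subscriber-" + (i % (n / 2));
            set.add(name);
        }
        long setMillis = System.currentTimeMillis() - start;

        System.out.println("ArrayList: " + listMillis + " ms, HashSet: " + setMillis + " ms");
    }
}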

This complexity issue may not seem important in the age of gigahertz computing, but a coworker of mine found that the implementers of Apache Ant, a build system for Java, did not seem to appreciate this. The software I develop at my job results in about 15,000 Java class files, and those all have to be copied by an Ant "copy" task into the build target directory. Ant 1.6 and Ant 1.7 both build a list of unique file names to copy using a Vector (like an ArrayList) collection. In our build, this results in about 30 seconds to one minute of 100% CPU utilization on a 2GHz Core Duo! That is 25% to 40% of the total build time!
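The shape of the problem is essentially the second example above, applied to roughly 15,000 file names. Something like this hypothetical fragment (the names here are invented; this is not the actual Ant source) sits at the heart of it:

Vector<String> filesToCopy = new Vector<String>();

for (String fileName : candidateFileNames) {
    // Every addition scans all of the names collected so far.
    if (!filesToCopy.contains(fileName)) {
        filesToCopy.add(fileName);
    }
}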

Fortunately, Ant is an open source project, so changing the code to use a Set rather than a Vector is almost trivial, and we have done so for our own use. It makes the builds about 25% faster in most cases. All of that CPU time was wasted because the code was written using a datatype that was neither appropriate nor semantically correct. Optimization could now be done if we decided that the semantically correct code had performance issues.

In our example, the HashSet does have more overhead than the ArrayList or Vector, but that overhead is nearly constant for each name, so the performance does not degrade horribly when the number of names increases. However, for small data sets the overhead per name, which is negligible in our example, could be measurably significant compared to the cost of adding each name to the collection. The HashSet has to compute the hash code for each name and do some comparisons that take more time than simply setting the next element of an array, but for a good implementation of HashSet this overhead is not meaningful, and when the number of names is small, the whole process runs quickly anyway. It is only large numbers of names that cause problems, and they cause much worse problems when the wrong design is chosen. Avoiding the overhead of HashSet is one of the "small efficiencies" that Knuth meant.

Start with a correct design, and then optimize as needed, but do not mistake correct design for premature optimization.
