Tuesday, January 5, 2010

Concatenating Strings Efficiently - When to use StringBuilder, and when not to.

One of the first pieces of efficiency advice most .NET developers learn is "use StringBuilder to concatenate strings". A little bit like "exceptions are expensive" this is a misunderstood piece of received wisdom. (Fortunately it's not nearly as harmful as the exception performance myth, but it comes up about as often...)
Before reading the rest of this page, you should be aware of the basics of the String type. For the sake of readability, I'll stick to "string" rather than "String" or "string" from here onwards. Please let me know if you find this confusing, and I'll change it.
I've included this in my list of general framework articles rather than in the list of C#-specific articles because I suspect every language targetting .NET is likely to use the same way of concatenating strings under the hood.

The Problem We're Trying To Avoid

There is a very real problem that this wisdom correctly addresses - that of building a load of strings (usually of increasing size) which are never used apart from contributing to the creation of other strings. Here's an example:
using System;

public class Test
{
static void Main()
{
DateTime start = DateTime.Now;
string x = "";
for (int i=0; i < 100000; i++)
{
x += "!";
}
DateTime end = DateTime.Now;
Console.WriteLine ("Time taken: {0}", end-start);
}
}
On my (fairly fast) laptop, that takes nearly 10 seconds. Double the number of iterations, and it takes over a minute. The results are slightly better on .NET 2.0 beta 2, but not hugely. The problem is that strings are immutable - just because we're using "+=" here doesn't mean the runtime actually appends to the end of the existing string. In fact, x += "!"; is absolutely equivalent to x = x+"!";. The concatenation here is creating an entirely new string, allocating enough memory for everything, copying all the data from the existing value of x and then copying the data from the string being appended ("!"). As the string grows, the amount of data it has to copy each time grows too, which is why the time taken didn't just double when I doubled the number of iterations.
This is clearly inefficient. If someone asked you to add something to a shopping list, you wouldn't write a new copy of the shopping list first, would you? Enter StringBuilder...

The StringBuilder Solution

Here is an equivalent (in terms of the final value of x) program, which is much, much faster:
using System;
using System.Text;

public class Test
{
static void Main()
{
DateTime start = DateTime.Now;
StringBuilder builder = new StringBuilder();
for (int i=0; i < 100000; i++)
{
builder.Append("!");
}
string x = builder.ToString();
DateTime end = DateTime.Now;
Console.WriteLine ("Time taken: {0}", end-start);
}
}
On my laptop, this executes too fast for the rudimentary timing mechanism I'm using to give me any sensible results. Changing to a million iterations (i.e. ten times what the first program took nearly ten seconds to do), it takes about 30-40ms. The time taken is roughly linear in the number of iterations (i.e. double the iterations and it takes twice as long). It does this by avoiding unnecessary copying - only the data we're actually appending gets copied. StringBuilder maintains an internal buffer and appends to that, only copying its buffer when there isn't room for any more data. (In fact, the internal buffer is just a string - strings are immutable from a public interface perspective, but not from within the mscorlib assembly.) We could make the above code even more efficient by passing the final size of the string (which we happen to know in this case) to the constructor of StringBuilder to make it use a buffer of the right size to start with - then there'd be no unnecessary copying at all. Unless you're in a situation where you have that information readily to hand though, it's usually not worth worrying about - StringBuilder doubles its buffer size when it runs out of room, so it doesn't end up copying the data very many times anyway.

So I Should Use StringBuilder Everywhere, Right?

No, quite simply. The above is an explanation of why the received wisdom of "use StringBuilder for concatenation" is right some of the time. However, many people take it at face value without understanding the reasoning behind it. They start turning code like this:
string name = firstName + " " + lastName;
Person person = new Person (name);
into code like this:
// Bad code! Do not use!
StringBuilder builder = new StringBuilder();
builder.Append (firstName);
builder.Append (" ");
builder.Append (lastName);
string name = builder.ToString();
Person person = new Person (name);
All that in the name of inefficiency. Now, on a broader point, even if the second version were more efficient than the first, it probably wouldn't be significantly more efficient - unless that code ended up being called vast, vast numbers of times, the amount of time spent in the concatenation is likely to be tiny. Making something less readable (and I think you'll agree the second version takes far longer to understand) for a small performance gain is a really bad idea in the first place.
However, the second version is actually less efficient than the first! Not much less efficient - and if the second version were more readable, I'd go with it for the reasons above - but when the entire point of using StringBuilder is to improve efficiency, using the above is just plain nuts.
The first version (assuming that firstName and lastName are real variables, and not constants - I'll come onto that later) compiles to a call to String.Concat, like this:
string name = String.Concat (firstName, " ", lastName);
Person person = new Person (name);
String.Concat takes a bunch of strings (or objects) and concatenates them together, plain and simple. There are various overloads - some take strings, some take objects (which are just converted into strings), some take arrays of objects or arrays of strings. They all do the same thing though. Now, String.Concat can work out the lengths of all the strings involved before it concatenates them together (at least if you pass it strings - if you pass it objects, it needs to create temporary strings and then concatenate those together). This means that no extra copying is involved - the data is copied once into the new string, which is of exactly the right length.
Compare this with the StringBuilder version. It doesn't know at construction time how big to make the buffer (because we haven't told it - doing so would make the code even less readable). That means it may have to copy the buffer, and it's likely to end up with a buffer which is actually larger than it needs to be. Oh, and there's the overhead of an extra object (the StringBuilder itself). Remind me why this was meant to be a good idea?
The important difference between this example and the previous one is that we can easily present all the strings which need to be concatenated together in one call to String.Concat. That means that no intermediate strings are needed. StringBuilder is efficient in the first example because it acts as a container for the intermediate result without having to copy that result each time - when there's no intermediate result anyway, it has no advantage.

Constants

Things get even crazier when it comes to string constants (literals, const string members). What do you suppose string x = "hello" + " " + "there"; is compiled to? It would be reasonable to expect it to be another call to String.Concat - but it isn't. It's actually compiled to the exact same code as string x = "hello there";. The compiler knows that all the parts are constant, so it does all the concatenation at compile time, storing the full string in the compiled code. Converting that to use StringBuilder is inefficient in both memory and speed, as well as reducing readability.

Rules Of Thumb

So, when should you use StringBuilder, and when should you use the string concatenation operators?
  • Definitely use StringBuilder when you're concatenating in a non-trivial loop - especially if you don't know for sure (at compile time) how many iterations you'll make through the loop. For example, reading a file a character at a time, building up a string as you go using the += operator is potentially performance suicide.
  • Definitely use the concatenation operator when you can (readably) specify everything which needs to be concatenated in one statement. (If you have an array of things to concatenate, consider calling String.Concat explicitly - or String.Join if you need a delimiter.)
  • Don't be afraid to break literals up into several concatenated bits - the result will be the same. You can aid readability by breaking a long literal into several lines, for instance, with no harm to performance.
  • If you need the intermediate results of the concatenation for something other than feeding the next iteration of concatenation, StringBuilder isn't going to help you. For instance, if you build up a full name from a first name and a last name, and then add a third piece of information (the nickname, maybe) to the end, you'll only benefit from using StringBuilder if you don't need the (first name + last name) string for other purpose (as we do in the example which creates a Person object).
  • If you just have a few concatenations to do, and you really want to do them in separate statements, it doesn't really matter which way you go. Which way is more efficient will depend on the number of concatenations the sizes of string involved, and what order they're concatenated in. If you really believe that piece of code to be a performance bottleneck, profile or benchmark it both ways.

1 comment:


  1. It is nice article to improve knowledge.thank you for sharing useful info
    web programming tutorial
    welookups

    ReplyDelete