The Great C++ Wordament: Meet Windows

This article is a reaction to some code that appeared on the Visual C++ blog a few days ago. You might say that we, James McNellis and Kenny Kerr, are addicted to writing code—preferably, fast code.

On Friday, Eric Battalio wrote about his adventures jumping back into C++. It involved a simple program to count the frequency of words contained within text files. In the comments, Stephan T. Lavavej contributed his own version that was both simpler and more effective at matching words. To be fair to Eric, Stephan is one of the foremost experts in the Standard Template Library.

Eric’s objective was to demonstrate some modern C++ techniques and to show how to complete some simple tasks using the C++ Standard Library.  Stephan’s solution built on that by replacing the word parser with the new C++11 <regex> library.  While writing the fastest possible implementation was not one of their objectives, it struck us that neither implementation was particularly fast.  We didn’t need profilers or benchmarks to determine this—it was evident from looking at the code that both implementations left buckets of performance sitting on the table.  We’ll use Stephan’s implementation as a reference implementation, the essence of which is as follows:

#include <fstream>
#include <map>
#include <regex>
#include <string>

int main(int argc, char * argv[])
{
    std::regex const word_regex("\\w+");
    std::map<std::string, unsigned> result;

    for (int argi = 1; argi < argc; ++argi)
    {
        std::ifstream file(argv[argi]);

        if (!file)
        {
            continue;
        }

        for (std::string line; std::getline(file, line); )
        {
            for (std::sregex_token_iterator it(line.begin(), line.end(), word_regex),
                                            end; it != end; ++it)
            {
                ++result[*it];
            }
        }
    }
}

Performance analysis is an art and something that only comes with experience. It’s very tempting to assume you know what the performance bottlenecks are, but such assumptions are often wrong and one should always use a profiler to be sure. Still, looking at this code, a few things immediately jump out:

  • As much as we love regular expressions, for a match as simple as \w+ you will almost certainly achieve better performance with a trivial, hand-written implementation.
  • The Standard C++ I/O library—ios and friends—is a thing of beauty but it is not particularly fast. It is designed primarily for convenience and is most effective when handling console I/O rather than I/O in general.
  • In this implementation, the text of the file is scanned twice. First, it scans the file in the call to std::getline looking for the newline character. It then scans each line looking for words. The result is that each character in each file is visited twice.
  • Since each line is copied into a string and each word is then further copied into its own string, there is a great deal of copying and a whole lot of memory allocations.
  • Although the files obviously have no overlapping data, they are scanned in sequence. The data set could easily be spread across available cores for greater throughput.

More improvements could be made. The trick is to identify some of the low-hanging fruit and consider whether each will provide a meaningful improvement. It’s time to break the rules.

Mapped Files

The first step is realizing that the files in question are relatively small. We struggled to find any novels on Project Gutenberg that amounted to more than 2MB of text and settled on a handful of Dickens novels.  There’s no reason to stream such small files off the disk when you can simply map them into the address space of the process. This avoids repeated calls to the Windows API ReadFile function and the need to copy the data into user-mode buffers. This is in fact the way Windows loads executables and DLLs, so presumably it’s going to be fast. File mapping objects are provided by the Windows Memory Manager, one of the most complex and efficient components of the operating system. Known to the kernel as section objects and exposed as file mapping objects in the Windows API, they represent all or part of an open file on disk.

wrl::FileHandle const file(CreateFile(name,
                                      GENERIC_READ,
                                      FILE_SHARE_READ,
                                      nullptr, // default security
                                      OPEN_EXISTING,
                                      FILE_ATTRIBUTE_NORMAL,
                                      nullptr)); // no template

if (!file.IsValid())
{
    // Possibly doesn’t exist – check with GetLastError.
}

In addition to the possibility of a nonexistent file, the file may in fact be empty. It’s important to check this, as the kernel won’t map an empty file. It’s also obviously useful to have the file size, and we’ll need that in a moment.

LARGE_INTEGER size = { };

VERIFY(GetFileSizeEx(file.Get(),
                     &size));

if (!size.QuadPart)
{
    // file is empty
}

Given a handle to a file, the CreateFileMapping function creates a file mapping object and returns a handle to represent it. Such objects can actually refer to files that are much larger than what might fit into the address space of a process or even physical memory. Still, in this case it is simpler to assume, and request through the Windows API, that the files will be mapped in their entirety. Given a handle to the file mapping object, the MapViewOfFile function actually maps it into the address space of the process. This function returns a pointer to the start of the mapped view.

typedef wrl::HandleT<wrl::HandleTraits::HANDLENullTraits> MapHandle;

MapHandle const map(CreateFileMapping(file.Get(),
                                      nullptr, // default security
                                      PAGE_READONLY,
                                      0, 0, // match file size
                                      nullptr)); // no name

VERIFY(map.IsValid());

auto view = static_cast<char const *>(MapViewOfFile(map.Get(),
                                                    FILE_MAP_READ,
                                                    0, 0, // offset
                                                    0)); // match file size

Keep in mind that unlike CreateFile, the CreateFileMapping function does not return INVALID_HANDLE_VALUE on failure. Instead, it simply returns nullptr, hence the need for the alternative WRL traits class above. Of course, the primary reason that this function might fail is if the file is empty. It does not actually commit any memory, so it is reasonably inexpensive and unlikely to fail if used correctly.

It helps to keep in mind that these Windows API functions create kernel objects that are reference counted. This is not unlike COM’s intrusive reference counting, at least conceptually. Unlike COM, the kernel keeps track of outstanding references held by each process and automatically releases them if the process fails to do so. When you call the CreateFile function, you effectively hold one reference to the file object. To release the reference you must call the CloseHandle function. The same goes for the CreateFileMapping function. The file mapping object however also holds a reference to the file object. You can thus release your reference to the file object and it will remain open as long as you still hold a reference to the file mapping object. It follows that the MapViewOfFile function works the same way. Holding onto just this reference will keep both the file and the file mapping object alive until you call the UnmapViewOfFile function.

if (view)
{
    VERIFY(UnmapViewOfFile(view));
}
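
To make that ownership chain concrete, here is a minimal sketch (error handling elided, reusing the wrl alias and the MapHandle typedef from above) showing that both handles can be released as soon as the view exists:

wrl::FileHandle file(CreateFile(name,
                                GENERIC_READ,
                                FILE_SHARE_READ,
                                nullptr,
                                OPEN_EXISTING,
                                FILE_ATTRIBUTE_NORMAL,
                                nullptr));

MapHandle map(CreateFileMapping(file.Get(),
                                nullptr,
                                PAGE_READONLY,
                                0, 0,
                                nullptr));

auto view = static_cast<char const *>(MapViewOfFile(map.Get(),
                                                    FILE_MAP_READ,
                                                    0, 0,
                                                    0));

file.Close(); // the file mapping object keeps the file object alive
map.Close();  // the mapped view keeps the file mapping object alive

// ... scan the view ...

VERIFY(UnmapViewOfFile(view)); // releases the last reference to both objects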

In the complete program attached to this post, we use a simple class type to represent the mapped view of the file.  This class serves as an RAII container that automatically unmaps the view when it goes out of scope and provides access to the view as a range of bytes, via begin() and end() member functions.
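
A rough sketch of the shape of that class, assuming the view pointer and file size obtained above (the version in the attached words.cpp differs in its details):

class file_view
{
public:
    file_view(char const * view, size_t size) : m_view(view), m_size(size)
    {
    }

    ~file_view()
    {
        if (m_view)
        {
            // automatically unmap the view when the object goes out of scope
            VERIFY(UnmapViewOfFile(m_view));
        }
    }

    file_view(file_view const &) = delete;
    file_view & operator=(file_view const &) = delete;

    // expose the mapped file as a range of bytes
    char const * begin() const { return m_view; }
    char const * end() const   { return m_view + m_size; }

private:
    char const * m_view;
    size_t m_size;
};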

Word Matching

Given that the text files are now mapped into memory, the program can simply scan the entire contents of each file as if it were one long string. Because the data is already contiguous in memory there is no need to break it up into lines, so we avoid scanning each file twice. The two for loops in the original example now become a single loop, scanning each file just once.

for (std::cregex_token_iterator it(file.begin(), file.end(), word_regex),
                                end; it != end; ++it)
{
    ++result[*it];
}

Here file.begin() and file.end() are simply pointers delimiting the range of characters in the file (just as the begin() and end() members of a std::vector delimit the range of elements stored in the vector).  This implementation already performs much better for several reasons:

  1. we avoid the high overhead of the C++ I/O library
  2. we scan the file only once and do not make unnecessary copies of the text
  3. the characters are contiguous in memory and are scanned from beginning to end, and modern CPUs are designed to maximize performance of forward iteration over an array

However, we can still improve performance even further:  regular expressions are a wonderful tool, but there is some overhead to using them.  This overhead is especially acute with very simple expressions like \w+, which can be matched by a trivial hand-written lexer that performs far better than the regular expression.  Here is the implementation:

char const * word_first = nullptr;

for (auto it = file.begin(); it != file.end(); ++it)
{
    if (*it < 0 || (!std::isalnum(*it) && *it != '_'))
    {
        if (word_first)
        {
            ++result[std::string(word_first, it)];
        }

        word_first = nullptr;
    }
    else if (!word_first)
    {
        word_first = it;
    }
}

if (word_first)
{
    ++result[std::string(word_first, file.end())];
}

This lexer makes a single pass over the text using the iterator named it.  When the lexer is scanning a word, the word_first iterator points to the initial character of the word; when it is not scanning a word (e.g. when it is scanning punctuation or whitespace), word_first is null.

The if statement tests true if the current character is not a word character.  We define word characters as letters, numbers, and the underscore (this is the same set of characters matched by \w).  For each character, we do one of three things depending on what the character is and what the current state of the lexer is:

If the character *it is a word character and…

  • …we are not currently scanning a word, then *it is the first character of a word and we set word_first = it
  • …we are currently scanning a word, then we do nothing (we simply continue with the next character)

If the character *it is not a word character and…

  • …we are not currently scanning a word, then we do nothing (we simply continue with the next character)
  • …we are currently scanning a word, then [word_first, it) is a word:  we increment the number of times we have seen it then reset word_first

Finally, if the file ends with a word, we process the last word.

Some might argue, “oh, you’re using pointers—that’s not modern C++ at all!”  Note, however, that the pointers used here are really just iterators into an array.  There are no raw pointers owning dynamically allocated objects.  There is no arithmetic that wouldn’t otherwise be valid for, say, std::vector<T>::iterator.  This particular function could be nicely wrapped up into a more generic “lexing” algorithm.
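
As a sketch of that idea (the generic version below is ours and is not part of the attached source), the loop can be hoisted into an algorithm that hands each [first, last) word range to a caller-supplied function:

// Generic tokenizing sketch: is_word_char classifies characters and
// handle_word is invoked once per maximal run of word characters.
template <typename FwdIt, typename Pred, typename Callback>
void for_each_word(FwdIt first, FwdIt last, Pred is_word_char, Callback handle_word)
{
    FwdIt word_first = last; // `last` doubles as the "not in a word" marker

    for (auto it = first; it != last; ++it)
    {
        if (!is_word_char(*it))
        {
            if (word_first != last)
            {
                handle_word(word_first, it);
            }

            word_first = last;
        }
        else if (word_first == last)
        {
            word_first = it;
        }
    }

    if (word_first != last)
    {
        handle_word(word_first, last);
    }
}

The word-counting loop above then becomes a single call to this algorithm, passing a predicate that wraps the isalnum test and a lambda that performs ++result[std::string(first, last)].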

Concurrency

The final area worth exploring is whether a significant performance improvement could be made by scanning the files concurrently. It’s a reasonable assumption since the algorithm itself is compute-bound and the data is already partitioned into separate files and thus independent sections of memory. The challenge of course is to produce a single map of the results while avoiding the locking overhead required for updating it from different threads. Fortunately, the Visual C++ Concurrency Runtime (ConcRT) provides everything we might need. Still, there are many ways to solve the problem, such as with explicit locks or by replacing std::map with a concurrency-safe container. However, we prefer to avoid locks entirely.  The PPL’s combinable type is just what we need.

typedef std::map<std::string, unsigned> word_map;
ppl::combinable<word_map> shared;

The combinable type is sort of like an encapsulated thread-local storage wrapper.  We can then refer to shared from within a parallel algorithm such as the PPL’s parallel_for_each, and each thread will receive its own local copy of the map.

ppl::parallel_for_each(argv + 1, argv + argc, [&](char const * name)
{
    auto & result = shared.local();

    // map and scan file …
});

Note that this is not necessarily one local copy per file but rather one per runtime thread. If the computer has only two cores and the threads rarely block, which they won’t in this case, then it’s unlikely that more than two maps will be used no matter how many files are processed. Of course, the results still need to be combined, which can be done using the combine_each function.

word_map result;

shared.combine_each([&] (word_map const & source)
{
    for (auto const & w : source)
    {
        result[w.first] += w.second;
    }
});

The lambda will be called for each local copy of the map, thereby making it easy to combine the results in one simple step.  With about ten lines of code, we’ve made our program concurrent.
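
For reference, the overall shape of the concurrent program is roughly the following sketch (simplified; the attached words.cpp adds the timing code, error handling and output):

#include <map>
#include <ppl.h>
#include <string>

namespace ppl = concurrency; // alias assumed to match the snippets above

typedef std::map<std::string, unsigned> word_map;

int main(int argc, char * argv[])
{
    ppl::combinable<word_map> shared;

    ppl::parallel_for_each(argv + 1, argv + argc, [&](char const * name)
    {
        auto & result = shared.local();

        // open, size and map the file as shown in the Mapped Files section,
        // then run the lexer from the Word Matching section over the view,
        // incrementing result[word] for each word found
    });

    word_map result;

    shared.combine_each([&](word_map const & source)
    {
        for (auto const & w : source)
        {
            result[w.first] += w.second;
        }
    });

    // result now holds the combined word counts, sorted by word
}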

The End Result

Our final implementation is roughly 45x faster than the reference implementation when processing eight files on a quad core Intel i7.  Its single-threaded throughput is substantially better as well:  it is 12x faster when processing a single file, in part due to the use of memory-mapped I/O, and in part due to the use of a custom lexer.  Four large text files totaling 243MB in size were used for measurement.

Does this implementation offer the best possible performance?   Hardly.  We’ve identified a few opportunities where we might be able to improve performance:

  1. Our implementation still uses std::string, which stores a copy of the character data it represents.  Each time we construct a new std::string, a new copy is made.  These copies are all unnecessary:  since we map the entire file into memory, we could simply store pointers to the words within the mapped range.
  2. Words are only removed from the map when the map is destroyed, so a custom allocator could be used that offers better performance for this particular usage scenario than the default heap allocator.
  3. We scan the text linearly, so it is possible to build a hash of each word as we read them, without making an additional scan over the text.  This could allow use of unordered containers or could be used to offer better ordered container insertion performance (e.g. by sorting first by hash, then by text).

We prototyped (1) and (2) and found that both yielded small performance improvements (on the order of about 7% for each of them).  (1) yielded less of a benefit than we expected because most words are small, so the small string optimization (SSO) allows most string constructions not to require heap allocation.  However, both of these changes resulted in code that was substantially more complex than our final implementation, so we chose not to include them.
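
We did not prototype (3), but to illustrate the idea, here is a minimal sketch of folding an FNV-1a hash into the lexer pass so that each word’s hash is ready the moment the word ends (illustrative only; not part of the attached implementation):

unsigned long long const fnv_offset = 14695981039346656037ULL;
unsigned long long const fnv_prime  = 1099511628211ULL;

char const * word_first = nullptr;
unsigned long long hash = fnv_offset;

for (auto it = file.begin(); it != file.end(); ++it)
{
    if (*it < 0 || (!std::isalnum(*it) && *it != '_'))
    {
        if (word_first)
        {
            // hash now covers [word_first, it); it could drive a lookup in an
            // unordered container or be stored alongside the word for sorting
        }

        word_first = nullptr;
        hash = fnv_offset;
    }
    else
    {
        if (!word_first)
        {
            word_first = it;
        }

        hash = (hash ^ static_cast<unsigned char>(*it)) * fnv_prime;
    }
}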

One of our goals was to demonstrate that it’s not particularly difficult to write high performance code in C++, though it does take a bit more work and some knowledge of the platform.  Our final implementation totals 148 lines, including #includes and timing code for performance measurement.  This is just under 3x longer than the reference implementation, though much of that is because we chose to implement a proper RAII container—file_view—to own a mapped view of a file.  The algorithm is more complex, but not substantially more so.

While some platform-specific code is required, it would be quite straightforward to encapsulate this platform-specific code in a library that has a platform-neutral API.  Most platforms support mapping files into memory; they just expose it differently.
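
To illustrate, a POSIX implementation of the mapping step might be sketched along these lines, with open, fstat and mmap standing in for CreateFile, CreateFileMapping and MapViewOfFile (this is not part of the attached, Windows-only source):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps the named file read-only and returns a pointer to its first byte, or
// nullptr if the file is missing or empty. The caller must munmap(view, size).
char const * map_file(char const * name, size_t & size)
{
    int const fd = open(name, O_RDONLY);

    if (fd == -1)
    {
        return nullptr;
    }

    struct stat info = {};

    if (fstat(fd, &info) == -1 || info.st_size == 0)
    {
        close(fd);
        return nullptr;
    }

    size = static_cast<size_t>(info.st_size);

    void * const view = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);

    // as with the Windows file mapping object, the mapping keeps the
    // underlying file alive, so the descriptor can be closed right away
    close(fd);

    return view == MAP_FAILED ? nullptr : static_cast<char const *>(view);
}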

Finally, no commas were harmed in the making of this documentary.

The source code: words.cpp

9 thoughts on “The Great C++ Wordament: Meet Windows”

  1. Eric Battalio MSFT

    Wow, who knew such a “simple” exercise would yield such a valuable (for me) discussion of C++ and the best way to implement the word counting requirements.

    It is easy to see here, completely in public, the progression that most applications make in private. The original code was geared toward the newbie (me) trying out C++ features on a real project. Subsequent updates show the evolution of the code into a better performing, more correct application, with detailed explanation of tradeoffs, traps and other details.

    I’ve learned quite a bit. Thanks Kenny and James.

  2. Thomas Petit

    Another optimization opportunity, which can be tested very quickly, is to simply replace the std::map by std::unordered_map and recompile.
    On my machine, the code with std::unordered_map is about 3x faster than std::map.

    1. Kenny Kerr (post author)

      Yes. The original goal was to reproduce the sorted output exactly. Having said that, it’s easy enough to collect into a set of unordered_maps and then combine into the resulting (sorted) map.

