Monthly Archives: January 2013

The API behind the API

Despite what anyone might tell you, the Windows Runtime API is not a clean break from the past. Like .NET before it, WinRT includes a backdoor without which it would be practically useless. The Common Language Runtime’s backdoor was called Platform Invocation Services or P/Invoke for short. It was amazingly powerful, but also complex and troublesome. WinRT’s backdoor is a lot simpler. It’s called reinterpret_cast.

As I’ve already illustrated in The Road to Windows 8 and Windows 8, where’d you put my HWND?!, WinRT is projected into C++ through a set of extensions that, among other things, allows the compiler to insert code to automatically manage reference counts as if a COM smart pointer class were used. For example, every WinRT application (or at least those that run within an app container) has a CoreWindow:

auto w = CoreWindow::GetForCurrentThread();

If you were feeling verbose, you might write:

CoreWindow ^ cw = CoreWindow::GetForCurrentThread();

The ^, pronounced “hat”, defines a handle-to-object, which the compiler treats as a pointer-to-object with intrusive reference counting provided by the object’s IUnknown interface and administered by the compiler.

Without WinRT’s backdoor however, no Windows Store or Windows Phone app would be able to function. In the case of CoreWindow, the application ultimately needs to bind the application’s swap chain or composition target to the HWND that the CoreWindow represents. Without reinterpret_cast, this would not be possible. I’m obviously speaking of reinterpret_cast in the proverbial sense. You don’t need to tell me that a C-style cast will do or that it could all be done without /ZW.

CoreWindow is not alone. While ICoreWindowInterop may not be documented, another WinRT type, namely SwapChainBackgroundPanel, openly flaunts its “other” API. It is after all, the only reason for this XAML type’s existence. The ISwapChainBackgroundPanelNative interface provides the only panel method that you really need to use.

SwapChainBackgroundPanel ^ panel = …
IUnknown * unknown = reinterpret_cast<IUnknown *>(panel);
ComPtr<ISwapChainBackgroundPanelNative> native;

There are other examples, but the point is that there is again an API behind the API.

As developers begin to use C++/CX more, I felt it would be useful to offer a littler helper to make access to this backdoor a little more convenient, and safe.

template <typename To>
ComPtr<To> winrt_cast(Object ^ from)
ComPtr<To> to;
HR(reinterpret_cast<IUnknown *>(from)->QueryInterface(to.GetAddressOf()));
return to;

With the winrt_cast function template, you can quickly and easily reach in and retrieve the native or interop interface that a particular WinRT type might provide.

auto native = winrt_cast<ISwapChainBackgroundPanelNative>(panel);

auto interop = winrt_cast<ICoreWindowInterop>(window);

The winrt_cast function ensures that the resulting “hat-less” COM pointer is safely wrapped inside WRL’s excellent smart pointer. Since ComPtr is move-aware, returning it in this way is guaranteed not to introduce an unnecessary reference-counting heartbeat. Error handling is also not an issue. Although QueryInterface is traditionally analogous to dynamic_cast in the sense that it allows feature discovery at run-time, in this case the WinRT types are guaranteed to provide the particular interfaces. Without this capability, they would be useless. Still, it’s up to you to decide how best to deal with errors in your application or library. I tend to define HR as follows.

#ifdef DEBUG
#define HR(expression) ASSERT(S_OK == (expression))
inline void HR(HRESULT hr) { if (S_OK != hr) throw Exception::CreateException(hr); }

If I’ve done something wrong, I’m treated to an assertion during development. If something goes horribly wrong at run-time, my application is quickly torn down. Obviously, as with P/Invoke, this is not something to be used lightly or without thinking. Don’t go using winrt_cast on a String^, which holds an HSTRING rather than IUnknown pointer. You can for example disable this helper for strings as follows:

template <typename To> ComPtr<To> winrt_cast(String ^);

Indeed, this solution is for those cases where you know a particular WinRT type exposes a particular COM interface outside of the discoverable type system. James McNellis offers a slightly more verbose solution, but one that is more widely applicable here. He also discusses hats in a lot more detail here.

Hope this helps.


The Great C++ Wordament: Meet Windows

This article is a reaction to some code that appeared on the Visual C++ blog a few days ago. You might say that we, James McNellis and Kenny Kerr, are addicted to writing code—preferably, fast code.

On Friday, Eric Battalio wrote about his adventures jumping back into C++. It involved a simple program to count the frequency of words contained within text files. In the comments, Stephan T. Lavavej contributed his own version that was both simpler and more effective at matching words. To be fair to Eric, Stephan is one of the foremost experts in the Standard Template Library.

Eric’s objective was to demonstrate some modern C++ techniques and to show how to complete some simple tasks using the C++ Standard Library.  Stephan’s solution built on that by replacing the word parser with the new C++11 library.  While writing the fastest possible implementation was not one of their objectives, it struck us that neither implementation was particularly fast.  We didn’t need profilers or benchmarks to determine this—it was evident from looking at the code that both implementations left buckets of performance sitting on the table.  We’ll use Stephan’s implementation as a reference implementation, the essence of which is as follows:

int main(int argc, char * argv[]) 
    std::regex const word_regex("\\w+"); 
    std::map<std::string, unsigned> result;

    for (int argi = 1; argi < argc; ++argi) 
        std::ifstream file(argv[argi]);

        if (!file) 

        for (std::string line; std::getline(file, line); ) 
            for (std::sregex_token_iterator it(line.begin(), line.end(), word_regex), 
                                            end; it != end; ++it) 

Performance analysis is an art and something that only comes with experience. It’s very tempting to assume you know what the performance bottlenecks are, but such assumptions are often wrong and one should always use a profiler to be sure. Still, looking at this code, a few things immediately jump out:

  • As much as we love regular expressions, for a match as simple as \w+ you will almost certainly achieve better performance with a trivial, hand-written implementation.
  • The Standard C++ I/O library—ios and friends—is a thing of beauty but it is not particularly fast. It is designed primarily for convenience and is most effective when handling console I/O rather than I/O in general.
  • In this implementation, the text of the file is scanned twice. First, it scans the file in the call to std::getline looking for the newline character. It then scans each line looking for words. The result is that each character in each file is visited twice.
  • Since each line is copied into a string and each word is then further copied into its own string, there is a great deal of copying and a whole lot of memory allocations.
  • Although the files obviously have no overlapping data they are scanned in sequence. The data set could easily be spread across available cores for greater throughput.

More improvements could be made. The trick is to identify some of the low-hanging fruit and consider whether they will provide meaningful improvements. It’s time to break the rules.

Mapped Files

The first step is realizing that the files in question are relatively small. We struggled to find any novels on Project Gutenberg that amounted to more than 2MB of text and settled on a handful of Dickens novels.  There’s no reason to stream such small files off the disk when you can simply map them into the address space of the process. This avoids repeated calls to the Windows API ReadFile function and the need to copy the data into user-mode buffers. This is in fact the way Windows loads executables and DLLs so presumably it’s going to be fast. File mapping objects are provided by the Windows Memory Manager, one of the most complex and efficient components of the operating system. Internally called section objects and mapped files in the Windows API, a file mapping object represents sections of an open file on disk.

wrl::FileHandle const file(CreateFile(name, 
                                      nullptr, // default security 
                                      nullptr)); // no template

if (!file.IsValid()) 
    // Possibly doesn’t exist – check with GetLastError. 

In addition to the possibility of a nonexistent file, the file may in fact be empty. It’s important to check this, as the kernel won’t map an empty file. It’s also obviously useful to have the file size and we’ll need that in a moment.

LARGE_INTEGER size = { };


if (!size.QuadPart) 
    // file is empty 

Given a handle to a file, the CreateFileMapping function creates a file mapping object and returns a handle to represent it. Such objects can actually refer to files that are much larger than what might fit into the address space of a process or even physical memory. Still, in this case it simpler to assume, and request through the Windows API, that the files will be mapped in their entirety. Given a handle to the file mapping object, the MapViewOfFile function actually maps it into the address space of the process. This function returns the address, a pointer, of the mapped view.

typedef wrl::HandleT<wrl::HandleTraits::HANDLENullTraits> MapHandle;

MapHandle const map(CreateFileMapping(file.Get(), 
                                      nullptr, // default security 
                                      0, 0, // match file size 
                                      nullptr)); // no name


auto view = static_cast<char const *>(MapViewOfFile(map.Get(), 
                                                    0, 0, // offset 
                                                    0)); // match file size

Keep in mind that unlike CreateFile, the CreateFileMapping function does not return INVALID_HANDLE_VALUE on failure. Instead, it simply returns a nullptr value, thus the need for the alternative WRL traits class above. Of course, the primary reason that this function might fail is if the file is empty. It does not actually commit any memory so it is reasonably inexpensive and unlikely to fail if used correctly.

It helps to keep in mind that these Windows API functions create kernel objects that are reference counted. This is not unlike COM’s intrusive reference counting, at least conceptually. Unlike COM, the kernel keeps track of outstanding references held by each process and automatically releases them if the process fails to do so. When you call the CreateFile function, you effectively hold one reference to the file object. To release the reference you must call the CloseHandle function. The same goes for the CreateFileMapping function. The file mapping object however also holds a reference to the file object. You can thus release your reference to the file object and it will remain open as long as you still hold a reference to the file mapping object. It follows that the MapViewOfFile function works the same way. Holding onto just this reference will keep both the file and the file mapping object alive until you call the UnmapViewOfFile function.

if (view) 

In the complete program attached to this post, we use a simple class type to represent the mapped view of the file.  This class serves as an RAII container that automatically unmaps the view when it goes out of scope and provides access to the view as a range of bytes, via begin() and end() member functions.

Word Matching

Given that the text files are now mapped into memory, the program can simply scan the entire contents as if it were one long string. We can thus avoid scanning each file twice since there is no need to break it up into lines that are contiguous in memory. The two for loops in the original example now becomes a single loop, scanning each file just once.

for (std::cregex_token_iterator it(file.begin(), file.end(), word_regex), 
                                end; it != end; ++it) 

Here file.begin() and file.end() are simply pointers delimiting the range of characters in the file (just as the begin() and end() members of a std::vector delimit the range of characters stored in the vector).  This implementation already performs much better for several reasons:

  1. we avoid the high overhead of the C++ I/O library
  2. we scan the file only once and do not make unnecessary copies of the text
  3. the characters are contiguous in memory and are scanned from beginning to end, and modern CPUs are designed to maximize performance of forward iteration over an array

However, we can still improve performance even further:  regular expressions are a wonderful tool, but there is some overhead to using them.  This overhead is especially acute with very simple expressions, like \w+.  This regular expression can be trivially implemented using a simple lexer that performs far better than the regular expression.  Here is the implementation:

char const * word_first = nullptr;

for (auto it = file.begin(); it != file.end(); ++it) 
    if (*it < 0 || (!std::isalnum(*it) && *it != ‘_’)) 
        if (word_first) 
            ++result[std::string(word_first, it)]; 

        word_first = nullptr; 
    else if (!word_first) 
        word_first = it; 

if (word_first) 
    ++result[std::string(word_first, file.end())]; 

This lexer makes a single pass over the text using the iterator named it.  When the lexer is scanning a word, the word_first iterator points to the initial character of the word; when it is not scanning a word (e.g. when it is scanning punctuation or whitespace), word_first is null.

The if statement tests true if the current character is not a word character.  We define word characters as letters, numbers, and the underscore (this is the same set of characters matched by \w).  For each character, we do one of three things depending on what the character is and what the current state of the lexer is:

If the character *it is a word character and…

  • …we are not currently scanning a word, then *it is the first character of a word and we set word_first = it
  • …we are currently scanning a word, then we do nothing (we simply continue with the next character)

If the character *it is not a word character and…

  • …we are not currently scanning a word, then we do nothing (we simply continue with the next character)
  • …we are currently scanning a word, then [word_first, it) is a word:  we increment the number of times we have seen it then reset word_first

Finally, if the file ends with a word, we process the last word.

Some might argue, “oh, you’re using pointers—that’s not modern C++ at all!”  Note, however, that the pointers used here are really just iterators into an array.  There are no raw pointers owning dynamically allocated objects.  There is no arithmetic that wouldn’t otherwise be valid for, say, std::vector::iterator.  This particular function could be nicely wrapped up into a more generic “lexing” algorithm.


The final area worth exploring is whether a significant performance improvement could be made by scanning the files concurrently. It’s a reasonable assumption since the algorithm itself is compute-bound and the data is already partitioned into separate files and thus independent sections of memory. The challenge of course is to produce a single map of the results while avoiding the locking overhead required for updating it from different threads. Fortunately, the Visual C++ Concurrency Runtime (ConcRT) provides everything we might need. Still, there are many ways to solve the problem such as with explicit locks or replacing std::map with a currency-safe container. However, we prefer to avoid locks entirely.  The PPL’s combinable type is just what we need.

typedef std::map<std::string, unsigned> word_map; 
ppl::combinable<word_map> shared;

The combinable type is sort of like an encapsulated thread-local storage wrapper.  We can then refer to the shared from within a parallel algorithm such as the PPL’s parallel_for_each and each thread will receive a local copy of the map.

ppl::parallel_for_each(argv + 1, argv + argc, [&](char const * name) 
    auto & result = shared.local();

    // map and scan file … 

Note that this is not necessarily one local copy per file but rather one per runtime thread. If the computer only has two cores and the scheduler decides that those threads are rarely blocking, which they won’t be in this case, then it’s unlikely that more than two maps will be used no matter how many files are processed. Of course, the results still need to be combined, which can be done using the combine_each function.

word_map result;

shared.combine_each([&] (word_map const & source) 
    for (auto const & w : source) 
        result[w.first] += w.second; 

The lambda will be called for each local copy of the map thereby making it easy to combine the results in one simple step.  With about ten lines of code, we’ve made our program concurrent.

The End Result

Our final implementation is roughly 45x faster than the reference implementation when processing eight files on a quad core Intel i7.  Its single-threaded throughput is substantially better as well:  it is 12x faster when processing a single file, in part due to the use of memory-mapped I/O, and in part due to the use of a custom lexer.  Four large text files totaling 243MB in size were used for measurement.

Does this implementation offer the best possible performance?   Hardly.  We’ve identified a few opportunities where we might be able to improve performance:

  1. Our implementation still uses std::string, which stores a copy of the string that it represents.  Each time we construct a new std::string, a new copy is made.  These copies are all unnecessary:  since we are mapping the entire file in memory, we could simply store pointers to the words represented in the mapped range.
  2. Words are only removed from the map when the map is destroyed, so a custom allocator could be used that offers better performance for this particular usage scenario than the default heap allocator.
  3. We scan the text linearly, so it is possible to build a hash of each word as we read them, without making an additional scan over the text.  This could allow use of unordered containers or could be used to offer better ordered container insertion performance (e.g. by sorting first by hash, then by text).

We prototyped (1) and (2) and found that both yielded small performance improvements (on the order of about 7% for each of them).  (1) yielded less of a benefit than we expected because most words are small, so the small string optimization (SSO) allows most string constructions not to require heap allocation.  However, both of these changes resulted in code that was substantially more complex than our final implementation, so we chose not to include them.

One of our goals was to demonstrate that it’s not particularly difficult to write high performance code in C++, though it does take a bit more work and some knowledge of the platform.  Our final implementation totals 148 lines, including #includes and timing code for performance measurement.  This is just under 3x longer than the reference implementation, though much of that is because we chose to implement a proper RAII container—file_view—to own a mapped view of a file.  The algorithm is more complex, but not substantially more so.

While some platform-specific code is required, it would be quite straightforward to encapsulate this platform-specific code in a library that has a platform-neutral API.  Most platforms support mapping of files into memory, they just support it differently.

Finally, no commas were harmed in the making of this documentary.

The source code: words.cpp

Upcoming Training Material

Here’s what you can expect from me in terms of training material this year.


Last year I published C Programming Language Fundamentals and Direct2D Fundamentals. The latter provides a comprehensive introduction to the original version of Direct2D that first shipped with Windows 7. Here’s a preview.

Later this month I will release Direct2D Fundamentals – Part 2. This new course focuses on the version of Direct2D that first shipped with Windows 8. It includes the necessary information about Direct3D and DXGI that you will need to leverage Direct2D 1.1 as well as everything you need to know in order to use Direct2D on the Windows desktop, with the Windows Runtime, with XAML, on the Windows Phone, and much more.

If this sounds interesting then get yourself ready by working through Direct2D Fundamentals – Part 1. That will give you the necessary groundwork for part 2.

MSDN Magazine

I first wrote about Direct2D in the magazine back when Windows 7 was in beta.

Introducing Direct2D

Drawing with Direct2D

Layered Windows with Direct2D

Direct2D and the Desktop Window Manager

Windows Phone and Direct2D (more recently)

I decided to spend a bit more time this year on Direct2D, both to bring some more attention to this great API as well as to introduce some of the Direct2D-related innovation in the Windows 8 family of operating systems. As such, here is an outline of topics for my Windows with C++ column over the next few months.

February: Creating Desktop Apps with Visual C++ 2012 – A desktop application primer

March: Rendering in a Desktop App with Direct2D – From USER/GDI to Direct2D 1.0

April: Introducing Direc2D 1.1 – Where Direct2D embraces Direct3D and DXGI (on the desktop)

May: Direct2D and the Windows Runtime

June: Direct2D and XAML

What Next?

I am planning to do a Pluralsight course on Classic COM and WinRT using modern C++ and WRL. I may also spend a few months later this year focusing on WRL in my Windows with C++ column.

I know Kate Gregory is also planning some more excellent C++ and Visual Studio material for Pluralsight.

Beyond that, what would you like to see in terms of training material?


You can also find me on Twitter at @kennykerr


SyncTools for Sysinternals

SyncTools is a meta-tool that keeps a folder on your computer up-to-date with all the latest tools from Sysinternals. Simply pick a folder where you would like to keep the Sysinternals tools and run SyncTools.exe in that folder. It will download all of the tools and check for updates on tools it previously downloaded. Any time Mark Russinovich publishes an updated version or even a completely new tool, simply rerun SyncTools.exe to download it for you.

Using SyncTools

Download SyncTools.exe and copy it to a local folder such as C:\Tools.

The first time you run SyncTools.exe it will download all of the tools.

When new tools or updates are available, simply rerun SyncTools.exe to check for updates and it will download updates as necessary.

A * prefix means it’s a new file while a u prefix mean it’s an update. If there’s a problem with a download then a ! prefix is used and a description hints at the problem.

In the screenshot above, I didn’t have a copy of Autoruns, LiveKd and PsKill were updated, but I left Process Explorer running and SyncTools was not able to update it. The solution is simple: simply close the tool in question and rerun SyncTools and it will try to download the update again.

How it works

I wrote this little tool a few years back as an excuse to use asynchronous WinHTTP and it has served me well. Last night I rewrote it from scratch as I had some new techniques I wanted to explore. This version is simpler, although not quite as fast as it only downloads one file at a time.

The first thing it does is download the Sysinternals directory listing. If SyncTools was previously run it will then check the file signatures from the directory listing with the information from the last run. Only new or updated files will be download.

By default, it ignores the following files either because the web server won’t serve them up anyway or because they just aren’t of any use.


If you would like to tailor this list, simply create a text file in your tools folder called .SyncIgnore and place any file name patterns in it, each on its own line.

Finally, you can use the -u and -d arguments to tweak the tool’s behavior. The -u argument lets you provide an alternative URL and the -d argument lets you specify something other than the current directory to sync to.

SyncTools.exe -d C:\Tools -u


The Evolution of Threads and I/O in Windows

My latest column for MSDN Magazine is now available online as well as in print.

When starting a new project, do you ask yourself whether your program will be compute-bound or I/O-bound? You should. I’ve found that in most cases it’s either one or the other. You might be working on an analytics library that’s handed a pile of data and keeps a bunch of processors busy while crunching it all down to a set of aggregates. Alternatively, your code might spend most of its time waiting for stuff to happen, for data to arrive over the network, a user to click on something or perform some complex six-fingered gesture. In this case, the threads in your program aren’t doing very much. Sure, there are cases where programs are heavy both on I/O and computation. The SQL Server database engine comes to mind, but that’s less typical of today’s computer programming. More often than not, your program is tasked with coordinating the work of others. It might be a Web server or client communicating with a SQL database, pushing some computation to the GPU or presenting some content for the user to interact with. Given all of these different scenarios, how do you decide what threading capabilities your program requires and what concurrency building blocks are necessary or useful? Well, that’s a tough question to answer generally and something you’ll need to analyze as you approach a new project. It’s helpful, however, to understand the evolution of threading in Windows and C++ so you can make an informed decision based on the practical choices that are available.

This article carries on where I left off in November with The Evolution of Synchronization in Windows and C++.

You can find links to more of my articles here.