Kenny Kerr

The Essentials of COM – Now on Pluralsight!


My latest Pluralsight course is now available: The Essentials of COM

The short version: Learn the essentials of Microsoft’s Component Object Model. Understand its history and design, its relationship to classic C++, and how to write COM code effectively with modern C++.

Need more convincing?

This is the first in a series of courses exploring the essentials of COM and the Windows Runtime. In this first course, you are going to learn everything you need to know to use COM confidently and effectively. This course explores the history of COM. You will learn about how C++ shaped the design of COM. You will learn the essentials of compilation and linking, fundamental to the idea of dynamically linking applications and components together. By the end of this course you will have a thorough understanding of COM’s IUnknown interface, how and why it works the way it does, how to implement it, how to use it, and how to wrap it up in a safe and efficient manner with modern C++ libraries.

Here’s a brief description of the modules in The Essentials of COM.

Microsoft’s Component Object Model: the essence of COM, the core concepts, where it came from, and the key players.

COM and C++: COM as a better C++98, dynamic linking, exporting objects, managing lifetime, and extending objects.

IUnknown and modern C++: HRESULTs, GUIDs, implementing IUnknown, IUnknown and C#, smart pointers, and C++11 as a better COM.

Libraries: Windows Runtime C++ Template Library and the Active Template Library

Stay tuned for “The Essentials of WinRT” as well as part 2 of this course, covering activation, apartments, and security!

I realize that many developers are captivated by managed code, but if you’re one of the lucky ones who still use C++ then please help spread the word regarding my Pluralsight courses. I love producing these courses, but it’s only possible with your support. So go out there and tell everyone you know that Pluralsight is the place to go for hardcore C++ developer training.

 

Written by Kenny Kerr

14 May 2013 at 6:24 pm

Posted in Uncategorized

Introducing Direct2D 1.1 (and DirectX 11.1)


My latest column for the May 2013 issue of MSDN Magazine is now available online.

Direct2D 1.1 might sound like a minor version update, and in some ways, it is. It doesn’t fundamentally change the API. Everything you know about Direct2D continues to be every bit as relevant today. It’s still modeled around device-specific and device-independent resources, render targets, geometries, brushes and so on. But in version 1.1, Direct2D grows up. The original version of Direct2D that launched with Windows 7 was in some ways an outsider to DirectX. It lagged behind, being tied to DirectX 10 rather than 11, the version of DirectX that it launched with. Even though it provided an excellent interop story for GDI and the Windows Imaging Component, it didn’t provide the best possible experience for working with DirectX itself. It wasn’t bad, but in Direct2D 1.1 things get a whole lot better. Direct3D and Direct2D are now in many ways siblings in the DirectX family. Thanks to this greater parity, even more of the graphics processing unit (GPU) is now available to the Direct2D developer without the need to jump out of the 2D abstraction. Moreover, when you do need to make the leap, it’s both simple and efficient.

Check out my latest course on Pluralsight for a crash course on DirectX 11.1 and Direct2D 1.1!

You can find links to more of my articles here.

 

Written by Kenny Kerr

1 May 2013 at 2:17 pm

Posted in Uncategorized

Rendering in a Desktop Application with Direct2D


My latest column for the March 2013 issue of MSDN Magazine is now available online.

One of the key design underpinnings of Direct2D is that it focuses on rendering and leaves the other aspects of Windows application development to you or other libraries that you might employ. Although Direct2D was designed to render in a desktop window, it’s up to you to actually provide this window and optimize it for Direct2D rendering. So this month, I’m going to focus on the unique relationship between Direct2D and the desktop application window. You can do many things to optimize the window handling and rendering process. You want to reduce unnecessary painting and avoid flicker, and just provide the best possible experience for the user.

This is the last installment before I dive into Direct2D 1.1 and everything related to Windows 8.

You can find links to more of my articles here.

 

Written by Kenny Kerr

1 March 2013 at 2:41 pm

Posted in Uncategorized

Direct2D Fundamentals – Part 2 – Now on Pluralsight!


My latest course is now available on Pluralsight!

Direct2D Fundamentals – Part 2 is so much more than Direct2D. In many ways, it’s an intense crash course in DirectX programming. You’re going to learn about Direct3D, DXGI, devices, swap chains, but it’s more than that.

This course is about developing stunning applications for Windows. You’re going to learn how to develop applications that can run cross-platform, reaching the Windows desktop, Windows Store, and Windows Phone environments. You’re going to learn about performance and efficiency. You’re going to learn how to produce high-quality animation. Moreover, you’re going to do it all in C++.

If you haven’t done so already, start by working through part 1 of Direct2D Fundamentals. This will prepare you by laying the foundation for everything you will learn in this new course.

Here’s a brief description of the modules in Direct2D Fundamentals – Part 2.

Introduction
What is Direct2D 1.1? Where can it be found? Introduction to the course demo.

Common Foundation
Building the foundation, error handling, and a common set of abstractions for a cross-platform application.

Desktop Window
The reference platform, optimizing window plumbing, and resource handlers.

Direct3D
Understanding and creating the underlying Direct3D device, hardware and software drivers, and the Direct2D device and device context.

Swap Chain
What is DXGI? What is a swap chain? Creating and resizing a swap chain. Targeting, rendering, and presentation with Direct2D and DXGI.

The Clock
How to draw an analog clock with Direct2D primitives and transforms. Adding a background image with DPI awareness. Using image effects to add a shadow.

Animation
Revisiting the message loop. Understanding swap chain occlusion and its impact on performance. Power management and efficiency. Using the Windows Animation Manager for scheduled animation.

Windows Runtime
Understanding the CoreWindow abstraction. Optimizing with visibility and the message loop. Using DirectX in the Windows Runtime.

Windows Phone
Limitations in the Windows Phone SDK. Porting the clock application, and unleashing the phone’s awesome power with the restricted API.

 

Written by Kenny Kerr

16 February 2013 at 8:00 am

Posted in Uncategorized

Creating Desktop Apps with Visual C++ 2012


It’s been a busy few weeks and I’m just now catching up with some announcements.

My latest column for the February 2013 issue of MSDN Magazine is available online as well as in print.

With all the hype over Windows 8 and what are now known as Windows Store apps, I’ve received some questions about the relevance of desktop apps and whether Standard C++ is still a viable choice going forward. These questions are sometimes hard to answer, but what I can tell you is that the Visual C++ 2012 compiler is more committed than ever to Standard C++ and it remains the best toolchain, in my humble opinion, for building great desktop apps for Windows whether you’re targeting Windows 7, Windows 8 or even Windows XP.

You can find links to more of my articles here.

Written by Kenny Kerr

15 February 2013 at 8:07 am

Posted in Uncategorized

The API behind the API


Despite what anyone might tell you, the Windows Runtime API is not a clean break from the past. Like .NET before it, WinRT includes a backdoor without which it would be practically useless. The Common Language Runtime’s backdoor was called Platform Invocation Services or P/Invoke for short. It was amazingly powerful, but also complex and troublesome. WinRT’s backdoor is a lot simpler. It’s called reinterpret_cast.

As I’ve already illustrated in The Road to Windows 8 and Windows 8, where’d you put my HWND?!, WinRT is projected into C++ through a set of extensions that, among other things, allows the compiler to insert code to automatically manage reference counts as if a COM smart pointer class were used. For example, every WinRT application (or at least those that run within an app container) has a CoreWindow:

auto w = CoreWindow::GetForCurrentThread();

If you were feeling verbose, you might write:

CoreWindow ^ cw = CoreWindow::GetForCurrentThread();

The ^, pronounced “hat”, defines a handle-to-object, which the compiler treats as a pointer-to-object with intrusive reference counting provided by the object’s IUnknown interface and administered by the compiler.
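To make the idea of intrusive reference counting concrete, here is a minimal portable sketch of the bookkeeping that a hat implies. The names counted and handle are hypothetical and are not part of C++/CX or WRL; a real WinRT object keeps its count behind IUnknown's AddRef and Release and updates it atomically, which this single-threaded sketch does not attempt.

```cpp
#include <utility>

// Hypothetical stand-in for an object that carries its own reference count,
// much as a COM object does behind IUnknown. Not thread-safe.
class counted
{
    unsigned m_references = 1; // the creator holds the first reference
    bool & m_destroyed;        // lets a caller observe destruction

public:
    explicit counted(bool & destroyed) : m_destroyed(destroyed) {}
    ~counted() { m_destroyed = true; }

    void add_ref() { ++m_references; }
    void release() { if (--m_references == 0) delete this; }
    unsigned references() const { return m_references; }
};

// Hypothetical handle type: copying adds a reference, destruction releases one.
// This is roughly the bookkeeping the compiler administers for a hat.
template <typename T>
class handle
{
    T * m_ptr = nullptr;

public:
    handle() = default;
    explicit handle(T * ptr) : m_ptr(ptr) {} // adopts the creator's reference

    handle(handle const & other) : m_ptr(other.m_ptr)
    {
        if (m_ptr) m_ptr->add_ref();
    }

    // Copy-and-swap: 'other' is a copy that releases the old pointer on exit.
    handle & operator=(handle other)
    {
        std::swap(m_ptr, other.m_ptr);
        return *this;
    }

    ~handle()
    {
        if (m_ptr) m_ptr->release();
    }

    T * operator->() const { return m_ptr; }
};
```

The count lives inside the object rather than in a separate control block, which is why a raw IUnknown pointer obtained through a cast can still take part in the same lifetime management.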

Without WinRT’s backdoor however, no Windows Store or Windows Phone app would be able to function. In the case of CoreWindow, the application ultimately needs to bind the application’s swap chain or composition target to the HWND that the CoreWindow represents. Without reinterpret_cast, this would not be possible. I’m obviously speaking of reinterpret_cast in the proverbial sense. You don’t need to tell me that a C-style cast will do or that it could all be done without /ZW.

CoreWindow is not alone. While ICoreWindowInterop may not be documented, another WinRT type, namely SwapChainBackgroundPanel, openly flaunts its “other” API. It is, after all, the only reason for this XAML type’s existence. The ISwapChainBackgroundPanelNative interface provides the only panel method that you really need to use.

SwapChainBackgroundPanel ^ panel = …
IUnknown * unknown = reinterpret_cast<IUnknown *>(panel);
ComPtr<ISwapChainBackgroundPanelNative> native;
HR(unknown->QueryInterface(native.GetAddressOf()));

There are other examples, but the point is that there is again an API behind the API.

As developers begin to use C++/CX more, I felt it would be useful to offer a little helper to make access to this backdoor a little more convenient and safe.

template <typename To>
ComPtr<To> winrt_cast(Object ^ from)
{
ComPtr<To> to;
HR(reinterpret_cast<IUnknown *>(from)->QueryInterface(to.GetAddressOf()));
return to;
}

With the winrt_cast function template, you can quickly and easily reach in and retrieve the native or interop interface that a particular WinRT type might provide.

auto native = winrt_cast<ISwapChainBackgroundPanelNative>(panel);

auto interop = winrt_cast<ICoreWindowInterop>(window);

The winrt_cast function ensures that the resulting “hat-less” COM pointer is safely wrapped inside WRL’s excellent smart pointer. Since ComPtr is move-aware, returning it in this way is guaranteed not to introduce an unnecessary reference-counting heartbeat. Error handling is also not an issue. Although QueryInterface is traditionally analogous to dynamic_cast in the sense that it allows feature discovery at run-time, in this case the WinRT types are guaranteed to provide the particular interfaces. Without this capability, they would be useless. Still, it’s up to you to decide how best to deal with errors in your application or library. I tend to define HR as follows.

#ifdef DEBUG
#define HR(expression) ASSERT(S_OK == (expression))
#else
inline void HR(HRESULT hr) { if (S_OK != hr) throw Exception::CreateException(hr); }
#endif

If I’ve done something wrong, I’m treated to an assertion during development. If something goes horribly wrong at run-time, my application is quickly torn down. Obviously, as with P/Invoke, this is not something to be used lightly or without thinking. Don’t go using winrt_cast on a String^, which holds an HSTRING rather than an IUnknown pointer. You can, for example, disable this helper for strings as follows:

template <typename To> ComPtr<To> winrt_cast(String ^);

Indeed, this solution is for those cases where you know a particular WinRT type exposes a particular COM interface outside of the discoverable type system. James McNellis offers a slightly more verbose solution, but one that is more widely applicable here. He also discusses hats in a lot more detail here.

Hope this helps.

 

Written by Kenny Kerr

26 January 2013 at 11:34 pm

Posted in Uncategorized

The Great C++ Wordament: Meet Windows


This article is a reaction to some code that appeared on the Visual C++ blog a few days ago. You might say that we, James McNellis and Kenny Kerr, are addicted to writing code—preferably, fast code.

On Friday, Eric Battalio wrote about his adventures jumping back into C++. It involved a simple program to count the frequency of words contained within text files. In the comments, Stephan T. Lavavej contributed his own version that was both simpler and more effective at matching words. To be fair to Eric, Stephan is one of the foremost experts in the Standard Template Library.

Eric’s objective was to demonstrate some modern C++ techniques and to show how to complete some simple tasks using the C++ Standard Library.  Stephan’s solution built on that by replacing the word parser with the new C++11 <regex> library.  While writing the fastest possible implementation was not one of their objectives, it struck us that neither implementation was particularly fast.  We didn’t need profilers or benchmarks to determine this—it was evident from looking at the code that both implementations left buckets of performance sitting on the table.  We’ll use Stephan’s implementation as a reference implementation, the essence of which is as follows:

int main(int argc, char * argv[])
{
    std::regex const word_regex("\\w+");
    std::map<std::string, unsigned> result;

    for (int argi = 1; argi < argc; ++argi)
    {
        std::ifstream file(argv[argi]);

        if (!file)
        {
            continue;
        }

        for (std::string line; std::getline(file, line); )
        {
            for (std::sregex_token_iterator it(line.begin(), line.end(), word_regex),
                                            end; it != end; ++it)
            {
                ++result[*it];
            }
        }
    }
}

Performance analysis is an art and something that only comes with experience. It’s very tempting to assume you know what the performance bottlenecks are, but such assumptions are often wrong and one should always use a profiler to be sure. Still, looking at this code, a few things immediately jump out:

  • As much as we love regular expressions, for a match as simple as \w+ you will almost certainly achieve better performance with a trivial, hand-written implementation.
  • The Standard C++ I/O library—ios and friends—is a thing of beauty but it is not particularly fast. It is designed primarily for convenience and is most effective when handling console I/O rather than I/O in general.
  • In this implementation, the text of the file is scanned twice. First, it scans the file in the call to std::getline looking for the newline character. It then scans each line looking for words. The result is that each character in each file is visited twice.
  • Since each line is copied into a string and each word is then further copied into its own string, there is a great deal of copying and a whole lot of memory allocations.
  • Although the files obviously have no overlapping data they are scanned in sequence. The data set could easily be spread across available cores for greater throughput.

More improvements could be made. The trick is to identify some of the low-hanging fruit and consider whether they will provide meaningful improvements. It’s time to break the rules.

Mapped Files

The first step is realizing that the files in question are relatively small. We struggled to find any novels on Project Gutenberg that amounted to more than 2MB of text and settled on a handful of Dickens novels.  There’s no reason to stream such small files off the disk when you can simply map them into the address space of the process. This avoids repeated calls to the Windows API ReadFile function and the need to copy the data into user-mode buffers. This is in fact the way Windows loads executables and DLLs so presumably it’s going to be fast. File mapping objects are provided by the Windows Memory Manager, one of the most complex and efficient components of the operating system. Internally called section objects and mapped files in the Windows API, a file mapping object represents sections of an open file on disk.

wrl::FileHandle const file(CreateFile(name,
                                      GENERIC_READ,
                                      FILE_SHARE_READ,
                                      nullptr, // default security
                                      OPEN_EXISTING,
                                      FILE_ATTRIBUTE_NORMAL,
                                      nullptr)); // no template

if (!file.IsValid())
{
    // Possibly doesn’t exist – check with GetLastError.
}

In addition to the possibility of a nonexistent file, the file may in fact be empty. It’s important to check this, as the kernel won’t map an empty file. It’s also obviously useful to have the file size and we’ll need that in a moment.

LARGE_INTEGER size = { };

VERIFY(GetFileSizeEx(file.Get(),
                     &size));

if (!size.QuadPart)
{
    // file is empty
}

Given a handle to a file, the CreateFileMapping function creates a file mapping object and returns a handle to represent it. Such objects can actually refer to files that are much larger than what might fit into the address space of a process or even physical memory. Still, in this case it is simpler to assume, and request through the Windows API, that the files will be mapped in their entirety. Given a handle to the file mapping object, the MapViewOfFile function actually maps it into the address space of the process. This function returns the address, a pointer, of the mapped view.

typedef wrl::HandleT<wrl::HandleTraits::HANDLENullTraits> MapHandle;

MapHandle const map(CreateFileMapping(file.Get(),
                                      nullptr, // default security
                                      PAGE_READONLY,
                                      0, 0, // match file size
                                      nullptr)); // no name

VERIFY(map.IsValid());

auto view = static_cast<char const *>(MapViewOfFile(map.Get(),
                                                    FILE_MAP_READ,
                                                    0, 0, // offset
                                                    0)); // match file size

Keep in mind that unlike CreateFile, the CreateFileMapping function does not return INVALID_HANDLE_VALUE on failure. Instead, it simply returns a nullptr value, thus the need for the alternative WRL traits class above. Of course, the primary reason that this function might fail is if the file is empty. It does not actually commit any memory so it is reasonably inexpensive and unlikely to fail if used correctly.

It helps to keep in mind that these Windows API functions create kernel objects that are reference counted. This is not unlike COM’s intrusive reference counting, at least conceptually. Unlike COM, the kernel keeps track of outstanding references held by each process and automatically releases them if the process fails to do so. When you call the CreateFile function, you effectively hold one reference to the file object. To release the reference you must call the CloseHandle function. The same goes for the CreateFileMapping function. The file mapping object however also holds a reference to the file object. You can thus release your reference to the file object and it will remain open as long as you still hold a reference to the file mapping object. It follows that the MapViewOfFile function works the same way. Holding onto just this reference will keep both the file and the file mapping object alive until you call the UnmapViewOfFile function.

if (view)
{
    VERIFY(UnmapViewOfFile(view));
}

In the complete program attached to this post, we use a simple class type to represent the mapped view of the file.  This class serves as an RAII container that automatically unmaps the view when it goes out of scope and provides access to the view as a range of bytes, via begin() and end() member functions.

Word Matching

Given that the text files are now mapped into memory, the program can simply scan the entire contents as if it were one long string. We can thus avoid scanning each file twice since there is no need to break it up into lines that are contiguous in memory. The two for loops in the original example now become a single loop, scanning each file just once.

for (std::cregex_token_iterator it(file.begin(), file.end(), word_regex),
                                end; it != end; ++it)
{
    ++result[*it];
}

Here file.begin() and file.end() are simply pointers delimiting the range of characters in the file (just as the begin() and end() members of a std::vector delimit the range of elements stored in the vector).  This implementation already performs much better for several reasons:

  1. we avoid the high overhead of the C++ I/O library
  2. we scan the file only once and do not make unnecessary copies of the text
  3. the characters are contiguous in memory and are scanned from beginning to end, and modern CPUs are designed to maximize performance of forward iteration over an array

However, we can still improve performance even further:  regular expressions are a wonderful tool, but there is some overhead to using them.  This overhead is especially acute with very simple expressions, like \w+.  This regular expression can be trivially implemented using a simple lexer that performs far better than the regular expression.  Here is the implementation:

char const * word_first = nullptr;

for (auto it = file.begin(); it != file.end(); ++it)
{
    if (*it < 0 || (!std::isalnum(*it) && *it != '_'))
    {
        if (word_first)
        {
            ++result[std::string(word_first, it)];
        }

        word_first = nullptr;
    }
    else if (!word_first)
    {
        word_first = it;
    }
}

if (word_first)
{
    ++result[std::string(word_first, file.end())];
}

This lexer makes a single pass over the text using the iterator named it.  When the lexer is scanning a word, the word_first iterator points to the initial character of the word; when it is not scanning a word (e.g. when it is scanning punctuation or whitespace), word_first is null.

The if statement tests true if the current character is not a word character.  We define word characters as letters, numbers, and the underscore (this is the same set of characters matched by \w).  For each character, we do one of three things depending on what the character is and what the current state of the lexer is:

If the character *it is a word character and…

  • …we are not currently scanning a word, then *it is the first character of a word and we set word_first = it
  • …we are currently scanning a word, then we do nothing (we simply continue with the next character)

If the character *it is not a word character and…

  • …we are not currently scanning a word, then we do nothing (we simply continue with the next character)
  • …we are currently scanning a word, then [word_first, it) is a word:  we increment the number of times we have seen it then reset word_first

Finally, if the file ends with a word, we process the last word.

Some might argue, “oh, you’re using pointers—that’s not modern C++ at all!”  Note, however, that the pointers used here are really just iterators into an array.  There are no raw pointers owning dynamically allocated objects.  There is no arithmetic that wouldn’t otherwise be valid for, say, std::vector<T>::iterator.  This particular function could be nicely wrapped up into a more generic “lexing” algorithm.
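As a rough illustration of what such a generic algorithm might look like, here is a minimal sketch. The names for_each_word and count_words are hypothetical and do not come from the attached program. The sketch assumes forward iterators over char and, like the loop above, treats bytes outside the ASCII range as non-word characters.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: calls on_word(first, last) for every maximal run of
// word characters (letters, digits, underscore) found in [first, last).
template <typename Iterator, typename Callback>
void for_each_word(Iterator first, Iterator last, Callback on_word)
{
    auto is_word_char = [](char c)
    {
        // Convert to unsigned char before calling std::isalnum, and reject
        // anything outside ASCII, mirroring the "*it < 0" test above.
        unsigned char const u = static_cast<unsigned char>(c);
        return u < 0x80 && (std::isalnum(u) != 0 || u == '_');
    };

    // 'last' doubles as the "not currently scanning a word" marker.
    Iterator word_first = last;

    for (Iterator it = first; it != last; ++it)
    {
        if (!is_word_char(*it))
        {
            if (word_first != last)
            {
                on_word(word_first, it);
            }

            word_first = last;
        }
        else if (word_first == last)
        {
            word_first = it;
        }
    }

    // The input may end in the middle of a word.
    if (word_first != last)
    {
        on_word(word_first, last);
    }
}

// Example use: the same word count, expressed through the generic algorithm.
std::map<std::string, unsigned> count_words(std::string const & text)
{
    std::map<std::string, unsigned> result;

    for_each_word(text.begin(), text.end(),
        [&](std::string::const_iterator first, std::string::const_iterator last)
        {
            ++result[std::string(first, last)];
        });

    return result;
}
```

Because the algorithm only needs a pair of iterators, the same template works over the pointers returned by a mapped view, a std::string, or a std::vector<char>.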

Concurrency

The final area worth exploring is whether a significant performance improvement could be made by scanning the files concurrently. It’s a reasonable assumption since the algorithm itself is compute-bound and the data is already partitioned into separate files and thus independent sections of memory. The challenge of course is to produce a single map of the results while avoiding the locking overhead required for updating it from different threads. Fortunately, the Visual C++ Concurrency Runtime (ConcRT) provides everything we might need. Still, there are many ways to solve the problem such as with explicit locks or replacing std::map with a concurrency-safe container. However, we prefer to avoid locks entirely.  The PPL’s combinable type is just what we need.

typedef std::map<std::string, unsigned> word_map;
ppl::combinable<word_map> shared;

The combinable type is sort of like an encapsulated thread-local storage wrapper.  We can then refer to shared from within a parallel algorithm such as the PPL’s parallel_for_each and each thread will receive a local copy of the map.

ppl::parallel_for_each(argv + 1, argv + argc, [&](char const * name)
{
    auto & result = shared.local();

    // map and scan file …
});

Note that this is not necessarily one local copy per file but rather one per runtime thread. If the computer only has two cores and the scheduler decides that those threads are rarely blocking, which they won’t be in this case, then it’s unlikely that more than two maps will be used no matter how many files are processed. Of course, the results still need to be combined, which can be done using the combine_each function.

word_map result;

shared.combine_each([&] (word_map const & source)
{
    for (auto const & w : source)
    {
        result[w.first] += w.second;
    }
});

The lambda will be called for each local copy of the map thereby making it easy to combine the results in one simple step.  With about ten lines of code, we’ve made our program concurrent.
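For readers without the PPL at hand, the same lock-free pattern can be approximated with nothing but the C++11 standard library. The sketch below is not how combinable is implemented; it is a hand-rolled analogue that gives each worker thread a private map and merges them afterwards. The scan function is a hypothetical stand-in for the "map and scan file" step and simply counts space-separated words in an in-memory string.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <thread>
#include <vector>

typedef std::map<std::string, unsigned> word_map;

// Hypothetical stand-in for mapping and scanning one file.
void scan(std::string const & text, word_map & result)
{
    std::size_t pos = 0;

    while (pos < text.size())
    {
        std::size_t const space = text.find(' ', pos);
        std::size_t const stop = (space == std::string::npos) ? text.size() : space;

        if (stop > pos)
        {
            ++result[text.substr(pos, stop - pos)];
        }

        pos = stop + 1;
    }
}

word_map count_concurrently(std::vector<std::string> const & texts,
                            unsigned thread_count)
{
    if (thread_count == 0)
    {
        thread_count = 1;
    }

    // One private map per worker. No worker touches another worker's map,
    // so no locks are needed while scanning.
    std::vector<word_map> locals(thread_count);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t != thread_count; ++t)
    {
        workers.emplace_back([&, t]
        {
            // Worker t handles every thread_count-th text.
            for (std::size_t i = t; i < texts.size(); i += thread_count)
            {
                scan(texts[i], locals[t]);
            }
        });
    }

    for (auto & worker : workers)
    {
        worker.join();
    }

    // The counterpart of combine_each: merge on a single thread.
    word_map result;

    for (auto const & local : locals)
    {
        for (auto const & w : local)
        {
            result[w.first] += w.second;
        }
    }

    return result;
}
```

Unlike the PPL version, this sketch assigns work statically and creates its own threads, so it gives up the load balancing that the Concurrency Runtime's scheduler provides.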

The End Result

Our final implementation is roughly 45x faster than the reference implementation when processing eight files on a quad core Intel i7.  Its single-threaded throughput is substantially better as well:  it is 12x faster when processing a single file, in part due to the use of memory-mapped I/O, and in part due to the use of a custom lexer.  Four large text files totaling 243MB in size were used for measurement.

Does this implementation offer the best possible performance?   Hardly.  We’ve identified a few opportunities where we might be able to improve performance:

  1. Our implementation still uses std::string, which stores a copy of the string that it represents.  Each time we construct a new std::string, a new copy is made.  These copies are all unnecessary:  since we are mapping the entire file in memory, we could simply store pointers to the words represented in the mapped range.
  2. Words are only removed from the map when the map is destroyed, so a custom allocator could be used that offers better performance for this particular usage scenario than the default heap allocator.
  3. We scan the text linearly, so it is possible to build a hash of each word as we read it, without making an additional scan over the text.  This could allow use of unordered containers or could be used to offer better ordered container insertion performance (e.g. by sorting first by hash, then by text).

We prototyped (1) and (2) and found that both yielded small performance improvements (on the order of about 7% for each of them).  (1) yielded less of a benefit than we expected because most words are small, so the small string optimization (SSO) allows most string constructions not to require heap allocation.  However, both of these changes resulted in code that was substantially more complex than our final implementation, so we chose not to include them.
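To give a flavor of ideas (1) and (3), here is a minimal sketch, and not the prototype described above. It assumes C++17's std::string_view, which was not available when this was written, and the names word_ref, word_ref_hash, count_words, and lookup are hypothetical. Each key refers to characters in the scanned buffer rather than owning a copy, and an FNV-1a hash is accumulated during the same pass that finds the word.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <unordered_map>

// 64-bit FNV-1a parameters.
constexpr std::uint64_t fnv_offset = 14695981039346656037ull;
constexpr std::uint64_t fnv_prime = 1099511628211ull;

// A word that refers to characters owned elsewhere (e.g. the mapped file),
// together with the hash computed while the word was being scanned.
struct word_ref
{
    std::string_view text;
    std::size_t hash;

    bool operator==(word_ref const & other) const { return text == other.text; }
};

// Reuses the precomputed hash instead of rescanning the word.
struct word_ref_hash
{
    std::size_t operator()(word_ref const & w) const { return w.hash; }
};

typedef std::unordered_map<word_ref, unsigned, word_ref_hash> word_counts;

// The buffer behind 'text' must outlive the returned map.
word_counts count_words(std::string_view text)
{
    word_counts result;
    char const * word_first = nullptr;
    std::uint64_t hash = fnv_offset;

    auto flush = [&](char const * last)
    {
        if (word_first)
        {
            std::size_t const length = static_cast<std::size_t>(last - word_first);
            ++result[word_ref{ std::string_view(word_first, length),
                               static_cast<std::size_t>(hash) }];
        }

        word_first = nullptr;
        hash = fnv_offset;
    };

    char const * const end = text.data() + text.size();

    for (char const * it = text.data(); it != end; ++it)
    {
        unsigned char const c = static_cast<unsigned char>(*it);

        if (c < 0x80 && (std::isalnum(c) != 0 || c == '_'))
        {
            if (!word_first) word_first = it;
            hash = (hash ^ c) * fnv_prime; // one FNV-1a step per character
        }
        else
        {
            flush(it);
        }
    }

    flush(end); // the text may end in the middle of a word
    return result;
}

// Looks up a word by hashing it the same way the scanner does.
unsigned lookup(word_counts const & counts, std::string_view word)
{
    std::uint64_t hash = fnv_offset;

    for (unsigned char c : word)
    {
        hash = (hash ^ c) * fnv_prime;
    }

    auto const found = counts.find(word_ref{ word, static_cast<std::size_t>(hash) });
    return found == counts.end() ? 0 : found->second;
}
```

The obvious cost is the lifetime constraint: the map is only valid while the underlying view remains mapped, which is part of the added complexity mentioned above.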

One of our goals was to demonstrate that it’s not particularly difficult to write high performance code in C++, though it does take a bit more work and some knowledge of the platform.  Our final implementation totals 148 lines, including #includes and timing code for performance measurement.  This is just under 3x longer than the reference implementation, though much of that is because we chose to implement a proper RAII container—file_view—to own a mapped view of a file.  The algorithm is more complex, but not substantially more so.

While some platform-specific code is required, it would be quite straightforward to encapsulate this platform-specific code in a library that has a platform-neutral API.  Most platforms support mapping of files into memory, they just support it differently.

Finally, no commas were harmed in the making of this documentary.

The source code: words.cpp

Written by Kenny Kerr

21 January 2013 at 11:02 pm

Posted in Uncategorized
