C9 Lectures: Stephan T. Lavavej - Standard Template Library (STL), 2 of n


The Discussion

• Great stuff, but the predicate "e % 2 == 1" does not work correctly with negative numbers. I would replace it with "e % 2 != 0" or "e % 2" or "e & 1".

Here are my proposed solutions to the homework, minus the map part -- not sure about the exact spec.

```
#include <algorithm>
#include <vector>

template <typename T, typename A>
void erase(std::vector<T, A>& container, const T& element) {
    container.erase(
        std::remove(container.begin(), container.end(), element),
        container.end());
}

template <typename T, typename A, typename P>
void erase_if(std::vector<T, A>& container, P predicate) {
    container.erase(
        std::remove_if(container.begin(), container.end(), predicate),
        container.end());
}

#include <deque>

template <typename T, typename A>
void erase(std::deque<T, A>& container, const T& element) {
    container.erase(
        std::remove(container.begin(), container.end(), element),
        container.end());
}

template <typename T, typename A, typename P>
void erase_if(std::deque<T, A>& container, P predicate) {
    container.erase(
        std::remove_if(container.begin(), container.end(), predicate),
        container.end());
}

#include <list>

template <typename T, typename A>
void erase(std::list<T, A>& container, const T& element) {
    container.remove(element);
}

template <typename T, typename A, typename P>
void erase_if(std::list<T, A>& container, P predicate) {
    container.remove_if(predicate);
}

#include <forward_list>

template <typename T, typename A>
void erase(std::forward_list<T, A>& container, const T& element) {
    container.remove(element);
}

template <typename T, typename A, typename P>
void erase_if(std::forward_list<T, A>& container, P predicate) {
    container.remove_if(predicate);
}

#include <set>

template <typename T, typename C, typename A>
void erase(std::set<T, C, A>& container, const T& element) {
    auto it = container.find(element);
    if (it != container.end()) container.erase(it);
}

template <typename T, typename C, typename A, typename P>
void erase_if(std::set<T, C, A>& container, P predicate) {
    for (auto it = container.begin(); it != container.end(); ) {
        if (predicate(*it)) container.erase(it++);
        else ++it;
    }
}

template <typename T, typename C, typename A>
void erase(std::multiset<T, C, A>& container, const T& element) {
    auto range = container.equal_range(element);
    container.erase(range.first, range.second);
}

template <typename T, typename C, typename A, typename P>
void erase_if(std::multiset<T, C, A>& container, P predicate) {
    for (auto it = container.begin(); it != container.end(); ) {
        if (predicate(*it)) container.erase(it++);
        else ++it;
    }
}
```

• Great Video. Thanks.

I'd love to see a talk about using the STL with multi-threading. Is there a way to insert elements into a std::map-like container in parallel?

• Great article! Thank you for the great technical C++ 0x content on Channel9!!

I'd love to get even more on those.

Have you considered the following topics?

- Parallel programming with STL

- MC++ adapters to STL containers.

Keep posting

• Sooner or later, I'd love to hear about rvalue references

• Great stuff!

Here's the solution I came up with for what NotFredSafe called erase_if().

This does not work for set, multiset, unordered_set, or unordered_multiset!

```
#include <algorithm>
#include <iterator>
#include <utility>

using namespace std;

template <class IterTag, class ValueType>
struct remover {
    template <class Container, class IterType, class Predicate>
    static void do_remove_if(Container& c, IterType begin, IterType end, Predicate pred) {
        c.remove_if(pred);
    }
};

template <class ValueType>
struct remover<random_access_iterator_tag, ValueType> {
    template <class Container, class IterType, class Predicate>
    static void do_remove_if(Container& c, IterType begin, IterType end, Predicate pred) {
        c.erase(remove_if(begin, end, pred), end);
    }
};

template <class IterTag, class Key, class Value>
struct remover<IterTag, pair<Key, Value>> {
    template <class Container, class IterType, class Predicate>
    static void do_remove_if(Container& c, IterType begin, IterType end, Predicate pred) {
        for (auto i = c.begin(); i != c.end(); ) {
            if (pred(*i)) c.erase(i++);
            else ++i;
        }
    }
};

template <class Container, class Predicate>
void erase_if(Container& c, Predicate pred) {
    typedef iterator_traits<Container::iterator>::iterator_category iter_cat_t;
    typedef remover<iter_cat_t, Container::value_type> remover_t;
    remover_t::do_remove_if(c, c.begin(), c.end(), pred);
}
```

The remover struct provides the template machinery so that there only needs to be one definition of erase_if(). I prefer this to having an overload for each container type, although the fact that it doesn't work for the set containers shows that this isn't the perfect solution either.

It would be nice to have a more comprehensive set of traits classes in the STL to make writing this kind of code easier.

• great lecture and i want more much much more STL videos with c++0x goodness

btw, will the initializer list be implemented in visual studio soon ? like in a coming service pack or small update or do we have to wait several years for the next visual studio release ? I'm very tempted to go over to gcc that have most of  the c++0x implemented already, see http://gcc.gnu.org/gcc-4.6/cxx0x_status.html, yet another reason of many others for open source, it's much faster releases and updates.

*hint  hint* visual studio dev teams, more frequent releases / updates please!

There's lots of new c++0x goodness I want to test and play with but can't.

• [NotFredSafe]
> the predicate "e % 2 == 1" does not work correctly with
> negative numbers. I would replace it with "e % 2 != 0"

Good catch. I was going to write != 0, but while I was talking I thought that removing the negation would make things easier to understand. Fail!

> Here are my proposed solutions to the homework

Your erase(vector<T, A>&, const T&) requires the vector's element type and the given value's type to be identical. If they're even slightly different, template argument deduction will fail - for example, when you try to erase 0 from a vector<unsigned char>, 0 is of type int. std::remove(first, last, value) doesn't require *first and value to have the same type; in fact, they can be radically different, as long as elem == value works.

This applies to erase(deque<T, A>&, const T&) too.

erase([forward_]list<T, A>&, const T&) isn't quite the same, because [forward_]list<T, A>::remove() does take const T&. However, because of template argument deduction, you still want to permit different types here.

Your erase([multi]set<T, C, A>&, const T&) could be more efficient. [multi]set<T, C, A>::erase(const T&) is the dedicated member function for this. Again, you'll want to permit different types.

Good work on avoiding iterator invalidation in erase_if([multi]set<T, C, A>&, P). Many people get that wrong.

[benyaboy]
> I'd love to see a talk about using the STL with multi-threading.

I may be able to cover this, but I'd need to introduce the Parallel Patterns Library (or the lower-level boost::thread, since std::thread won't be available until VC11).

> Is there a way to insert elements into a std::map like container in parallel.

Not directly. map, like everything else in the STL, supports "the usual thread safety guarantee". That is, multiple threads can simultaneously read a single object, and they can simultaneously read/write different objects. Const operations are reads and non-const operations are writes (construction and destruction count as writes).

[JeanGa]
> Great article! Thank you for the great technical C++ 0x content on Channel9!!
> I'd love to get even more on those.

You're welcome, and more are on their way.

> Have you considered the following topics?
> - Parallel programming with STL

Looks like there's some demand for this. It's relatively advanced (first you have to be able to use the STL in serial programming) but I'll try to cover it in a few more parts.

> - MC++ adapters to STL containers.

I don't do managed code, sorry.

[NotFredSafe]
> Sooner or later, I'd love to hear about rvalue references

I'll probably refer to them occasionally, but first I have to cover the basics of using the STL.

[philgates]
> Here's the solution I came up with for what NotFredSafe called erase_if().

Random-access containers with elements that are pairs, like vector<pair<A, B>>, will confuse your partial specializations of remover. Additionally, bidi-or-weaker containers with elements that are pairs, like list<pair<A, B>>, will select your second partial specialization, when you really want your primary template to be selected.

(Also, you're missing a few typenames in erase_if(), although VC isn't very good about requiring their presence.)

> It would be nice to have a more comprehensive set of traits classes in the STL to make writting this kind of code easier.

This homework problem actually highlights one of the least generic parts of the STL: its containers have several different erasure behaviors.

[Mr Crash]
> great lecture and i want more much much more STL videos with c++0x goodness

> btw, will the initializer list be implemented in visual studio soon ?

Magic 8 Ball says: "Ask again later."

(I really want initializer lists, but they're actually not in my top three list of desired features. Also, please ignore the <initializer_list> header that I accidentally left in VC10 RTM. It doesn't do anything.)

> like in a coming service pack or small update

You shouldn't expect new features to appear between major releases. We make exceptions to this rule on a case-by-case basis, but they are rare. Doubly rare for the compiler.

Stephan T. Lavavej, Visual C++ Libraries Developer

• Great lectures!
Looking forward to your next "n" STL sessions

• Thanks for your feedback, here is my updated solution:

```
#include <algorithm>
#include <vector>

template <typename T, typename A, typename V>
void erase(std::vector<T, A>& container, const V& value) {
    container.erase(
        std::remove(container.begin(), container.end(), value),
        container.end());
}

template <typename T, typename A, typename P>
void erase_if(std::vector<T, A>& container, P predicate) {
    container.erase(
        std::remove_if(container.begin(), container.end(), predicate),
        container.end());
}

#include <deque>

template <typename T, typename A, typename V>
void erase(std::deque<T, A>& container, const V& value) {
    container.erase(
        std::remove(container.begin(), container.end(), value),
        container.end());
}

template <typename T, typename A, typename P>
void erase_if(std::deque<T, A>& container, P predicate) {
    container.erase(
        std::remove_if(container.begin(), container.end(), predicate),
        container.end());
}

#include <list>

template <typename T, typename A, typename V>
void erase(std::list<T, A>& container, const V& value) {
    container.remove(value);
}

template <typename T, typename A, typename P>
void erase_if(std::list<T, A>& container, P predicate) {
    container.remove_if(predicate);
}

#include <forward_list>

template <typename T, typename A, typename V>
void erase(std::forward_list<T, A>& container, const V& value) {
    container.remove(value);
}

template <typename T, typename A, typename P>
void erase_if(std::forward_list<T, A>& container, P predicate) {
    container.remove_if(predicate);
}

#include <set>

template <typename T, typename C, typename A, typename V>
void erase(std::set<T, C, A>& container, const V& value) {
    container.erase(value);
}

template <typename T, typename C, typename A, typename P>
void erase_if(std::set<T, C, A>& container, P predicate) {
    for (auto it = container.begin(); it != container.end(); ) {
        if (predicate(*it)) container.erase(it++);
        else ++it;
    }
}

template <typename T, typename C, typename A, typename V>
void erase(std::multiset<T, C, A>& container, const V& value) {
    container.erase(value);
}

template <typename T, typename C, typename A, typename P>
void erase_if(std::multiset<T, C, A>& container, P predicate) {
    for (auto it = container.begin(); it != container.end(); ) {
        if (predicate(*it)) container.erase(it++);
        else ++it;
    }
}
```

• Great lectures, much better presentation than other MSDN videos I've seen.

Never having used much STL, I'm surprised that map doesn't have a const operator[], returning a const reference. Is this because things would get tricky if the supplied key isn't present in the map? Would an exception not suffice here? It could be considered undefined behaviour, just like the out-of-range vector subscript you highlighted in the first video. Perhaps the intention is to guide programmers down a more robust path, since find() can indicate 'not found' through its return value without resorting to exceptions or asserts.

• Another channel9 classic!

Thanks again Stephen!

• Thanks for such a good talk. The introduction to map was really good, I had mostly used sets before this.

A lot of folks come from C or do not use deep C++ (like templates & STL). If you can talk about some useful idioms that can pull them into STL (and why they should), that would be cool! IMO this would also reach out to a huge audience, all those who do C-like or C-with-classes style of programming in C++.

One example is how C-style arrays can be used almost everywhere in STL-land as vectors and thus gain the ability to be sorted and searched for free by using STL.

• [NotFredSafe]
> Thanks for your feedback, here is my updated solution:

Looks good to me. My solution will be functionally equivalent to this, although I'll handle a few more containers.

[Andrew McDonald]
> Great lectures, much better presentation than other MSDN videos I've seen.

Thanks - that means a lot to me.

> Perhaps the intention is to guide programmers down a more robust path,
> since find() can indicate 'not found' through its return value without resorting to exceptions or asserts.

Bingo.

[ashwinn]
> Thanks for such a good talk. The introduction to map was really good, I had mostly used sets before this.

The great thing about map and set is that because they're backed by the same data structure and they have almost the same interface, after you learn how to use one, all of your knowledge transfers over to the other.

Similarly, after you've learned how to iterate through one STL container, you've learned how to iterate through them all (as long as you're using != instead of < of course; I'll use < for indices but never for iterator loops).

In math and physics, the realization that "all of these different things are really the same thing" is absurdly powerful. The STL, as designed by Stepanov, was strongly influenced by this concept. Explained poorly*, the STL can seem too abstract and "too mathy", but my aim in these videos has been to demonstrate that this mathematical influence is actually a good thing. Almost all of the stuff in the STL behaves consistently, so it's easy to learn and easy to remember.

(* For example, starting off with iterator categories. I've mentioned them in passing, but understanding them isn't necessary to begin using vectors, which have iterators equivalent in power to pointers. I love abstraction, but the way my brain works, I want to start off by understanding something concrete before generalizing. This is also why I explain lambdas differently than other people do. Other people start ranting about closures and captures and blah blah blah. I repeatedly hammer just one idea: lambdas stamp out function objects, which are classes with data members (sometimes), and everybody knows how classes and data members work.)

> One example is how C-style arrays can be used almost everywhere in STL-land
> as vectors and thus gain the ability to be sorted and searched for free by using STL.

As I recall, I mentioned and even demonstrated this in Part 1.

Stephan T. Lavavej, Visual C++ Libraries Developer

• Shouldn't code from your assignment be part of the STL?

If some generic, helper code gets written by many (most) users of some library just to make using this library safer and easier it's a sign the library is missing this code...

I'd like to see episode on how and why the STL has changed over the last couple of years (I think of LWG's work) and what's planned in the future.

Thanks for popularizing C++.

• Hi,

why would vector need a linear time (38:10) to re-condense itself after erasure of an element? It is contiguous, and all it holds are pointers - why couldn't memmove() be used just to move them all back one notch at once, in constant time? Of course this is only for the case of non-inlined values, where you have an extra indirection level necessary for that, just like for the rvalue references. I guess in its quest for "efficiency" STL stores "light" values inside vector cells themselves, instead of boxing them up and storing just a pointer to the actual value on heap. Still, the implementation could distinguish between the two cases, right?

• > why couldn't memmove() be used just to move them all back one notch at once, in constant time?

Erm... "one notch at a time" and "in constant time" contradict each other. The time required to move bytes around depends on the number of bytes, thus it cannot be constant time. If you remove one pointer from a vector of 1000 pointers, 999 pointers have to be moved, right? That's linear time, not constant time.

• "move them all back one notch [,] at once". I always thought of memmove() as an atomic operation (unless we move really big chunks of memory). Just like Stephan says, CPUs like to deal with contiguous blocks of memory, which fit in their caches - then moving one or a 1000 takes the same time. Supposedly. The key here is we're not moving the 999 ptrs one by one - they all live in one contiguous block of memory, of size 4000 or 8000 bytes (32- or 64-bit ptrs), so we just need to move one memory block of size 999*4 , 4 bytes down. The target and source regions overlap, but that's OK (I think).

• Unless you have very specialized hardware, I fail to see how an operation involving an arbitrary amount of bytes can happen in constant time, but then again, I am not an expert on hardware.

• Here's how erase() is actually implemented in VS2008:

```
for (; _First != _Last; ++_Dest, ++_First)
    *_Dest = *_First;
return (_Dest);
```

So you see, here each element is copied, one by one - and this involves copy constructor in C++ which will actually create new object to be copied (the problem which move semantics came to address). This is obviously unimaginably worse than just calling memmove() once for each 100,000-long block of pointers or so (provided that the vector does actually store pointers to the actual objects somewhere on the heap) - even if pointers get copied one by one inside memmove(), still no temporary object creation/destruction is going on at all. That's exactly what move semantics does. And that's just what memmove() does, isn't it?

• [Piotr Dobrogost]
> Shouldn't code from your assignment be part of the STL?

Maybe. Historically, the STL has favored minimal interfaces with maximal genericity. There's no technical reason why it couldn't provide convenience functions for lots of things.

> I'd like to see episode on how and why the STL has changed over the last
> couple of years (I think of LWG's work) and what's planned in the future.

I'll be introducing shared_ptr soon.

> Thanks for popularizing C++.

Thanks for watching.

[Will48]
> why would vector need a linear time (38:10) to re-condense itself after erasure of an element?

Because it obeys the laws of physics.

> and all it holds are pointers

vector's representation looks like this (I would link to my BoostCon 2010 presentation, but their website is down):

```
T * m_begin;
size_t m_size;
size_t m_capacity;
```

m_begin points to a contiguous chunk of memory that has m_size initialized elements and ultimately has space for m_capacity elements (at that point, the chunk of memory will be full, and the vector must undergo reallocation in order to store any more).

Our vector has been implemented slightly differently (our T * _Myfirst is m_begin, our T * _Mylast is m_begin + m_size, and our T * _Myend is m_begin + m_capacity; this representation is equivalent and more convenient for us), but this model is perfectly valid and it's the one you should keep in your head.

> why couldn't memmove() be used just to move them all back one notch at once, in constant time?

memcpy() and memmove() take O(N) time (i.e. linear time) to copy N bytes. (They do the same thing, except that memcpy() forbids overlapping source and destination, while memmove() permits that.) They can be implemented with ultrafast assembly code, but they have the same asymptotic complexity as a handwritten for-loop. It's not possible for them to be faster than O(N) with our computer architecture.

> Of course this is only for the case of non-inlined values, where you have an extra indirection level necessary for that,

There appears to be serious confusion here. Please watch Part 1 carefully - I'm pretty sure I sketched all of this out with sufficient clarity there.

> I guess in its quest for "efficiency"

The STL cares about wall-clock efficiency. No need for scare quotes.

> STL stores "light" values inside vector cells themselves, instead of boxing them up and storing just a pointer to the actual value on heap.

vector does no such thing. (You may be thinking of the Small String Optimization.)

> I always thought of memmove() as an atomic operation

It's not constant time. If you graph the number of cycles versus the number of bytes, it'll look like a straight line (possibly jagged, since it's strongly influenced by processor behavior).

Please note that the word "atomic" has a very special meaning in multithreading, and memmove() is certainly not atomic.

> Just like Stephan says, CPUs like to deal with contiguous blocks of memory, which fit in their caches

This is true, but I should clarify: computers love sequential memory access, even when the memory being accessed is bigger than the caches. If everything fits into cache, that's really awesome, but even if it doesn't, things will still be really fast. (The reason is latency versus bandwidth. By the standards of CPUs, main memory has terrible latency but reasonable bandwidth. When accessing memory sequentially, you're limited by bandwidth. When accessing memory randomly, you're limited by latency. Caches, which have good latency and insanely high bandwidth, attempt to hide how bad main memory is.)

> then moving one or a 1000 takes the same time.

This is not true.

> Here's how erase() is actually implemented in VS2008:

vector<T>'s smarter than you think. When T is scalar (int yes, double yes, std::string no), our implementation uses memmove() instead of for-loops. In VC10, the callstack looks like this:

```
std::_Move<int *,int *>(int * _First, int * _Last, int * _Dest, std::_Scalar_ptr_iterator_tag __formal)
std::_Move<int *,int *>(int * _First, int * _Last, int * _Dest)
std::vector<int,std::allocator<int> >::erase(std::_Vector_const_iterator<std::_Vector_val<int,std::allocator<int> > > _Where)
main()
```

```
template <class _InIt, class _OutIt> inline
_OutIt _Move(_InIt _First, _InIt _Last, _OutIt _Dest, _Scalar_ptr_iterator_tag)
{   // move [_First, _Last) to [_Dest, ...), pointers to scalars
    ptrdiff_t _Count = _Last - _First;
    _CSTD memmove(&*_Dest, &*_First, _Count * sizeof(*_First));
    return (_Dest + _Count);
}
```

The _Scalar_ptr_iterator_tag machinery is how scalars (int/double/etc.), for which this optimization is safe, are distinguished from things for which it is not (std::string/etc.). As I may have mentioned, we could be slightly more aggressive here, but currently we aren't.

(This machinery is advanced, so I won't be explaining it in the videos any time soon. If anyone really wants extra credit, try to figure out where we're spending unnecessary cycles and instructions in the above function. Fixing this is on my todo list for VC11.)

> And that's just what memmove() does, isn't it?

memmove() is faster than a for-loop, but it's still linear time.

Stephan T. Lavavej, Visual C++ Libraries Developer

• Stephan, couldn't erasing the first element from a vector be done in constant time like this?

```
template <typename T>
void vector<T>::erase(vector<T>::iterator it) {
    if (it == m_begin) {
        it->~T();
        ++m_begin;
        --size;
        --capacity;
    } else {
        // normal erase
    }
}
```

(Of course, we would need another member m_physical_begin to delete the block correctly later.)

• Such a scheme would have an odd effect: erasing the first element would reduce the vector's capacity, but erasing any other element would not. (It would also increase the size of vector objects by 33%.)

The Standard forbids std::vector from being implemented like that. You could always write your own container that does that, but it would be much better to use std::deque or std::list when you need push_front() and pop_front() (of course, they are not contiguous). boost::circular_buffer may also be appropriate.

If you simply have to talk about contiguous subranges (i.e. you don't want to look at some elements, but you also don't need to physically erase them), you can use pair<vector<T>::iterator, vector<T>::iterator> or pair<T *, T *>.

• Great video - very clear, informative and well presented!  Good job!

As for part 3, an interesting thing to cover might be unique_ptr - I've found it very useful, especially in a vector<unique_ptr<T>> where T is a complex type.  The guarantees are interesting - you get random access to elements, exception safety, and the elements never move memory address even if the vector reallocates.  (A common bug I've seen is someone takes the raw address of an element in a vector, does something which reallocates the vector, then the pointer is invalid - but only sometimes, when the vector happens to reallocate.)  Perhaps that could tie in to rvalue references if you mention it.

• yes memmove() is linear, but with incredibly smaller constant factor (to the point, I was hoping, of insignificance) than assign/copying objects one by one in the loop. My point was to use indirection always (with T** m_begin), thus enabling to call memmove() instead (i.e. T* is scalar just like int, and std::string* is scalar too). Then each erase() will take very small amount of time hopefully. I'm guessing but I think for vectors of lengths in low thousands it would make the "dumb" approach acceptably fast. Of course the smart algo with two iterators doing all the work in one pass is always better. Thanks for the explanations.

• [AshleysBrain]
> As for part 3, an interesting thing to cover might be unique_ptr

Thanks for the reminder - shared_ptr and unique_ptr definitely go together.

[Will48]
> My point was to use indirection always

Then the vector's elements wouldn't be stored contiguously. That would significantly reduce its performance and break a useful guarantee.

• > yes memmove() is linear, but with incredibly smaller constant factor

Constant factors do not matter in complexity theory. 0.01n is still linear.

Yes, but in practice they do. But Stephan did clarify something to me (that should've been altogether clear anyway) - with indirection the objects indeed won't be stored contiguously. So adding/erasing elements at random positions would improve, but access would become slower. So I guess it depends what a given vector is mainly used for. If its elements get shuffled a lot by a given algorithm, maybe my scheme would still be beneficial.

• "(This machinery is advanced, so I won't be explaining it in the videos any time soon. If anyone really wants extra credit, try to figure out where we're spending unnecessary cycles and instructions in the above function."

Isn't that micro/cycle optimization, that's what you're talking about right ?

Ex. removing size calculation, perhaps remember it (when allocating) and keep it in memory.

(i'm just guessing, this isn't my field of expertise so be nice)

But i do want to hear about how you're going to optimize it ( optimization stories/talks are very interesting )

• >std::thread won't be available until VC11

For those who can't wait: my (commercial) just::thread library provides std::thread and the rest of the C++0x thread library for VC10:

• This is not a free library. In this case, given that it's directly related to this topic, I do not consider this to be spam, but you should make it clear that this is a pay-for-use library, Anthony.

C

• Is boost::thread a valid free alternative?

• Thanks for posting this great 2nd lecture, Stephan.  You're giving us all a fantastic look at the STL.

• [Will48]
> If its elements get shuffled a lot by a given algorithm, maybe my scheme would still be beneficial.

One of the great things about the STL is that it lets you be very precise when requesting data structure representations. In this case, you're thinking about vector<unique_ptr<T>> or vector<shared_ptr<T>>.

[Mr Crash]
> Isn't that micro/cycle optimization, that's what you're talking about right ?

Yes, it's micro-optimization. In the absence of profiling data proving that you're looking at a bottleneck, micro-optimization is usually a bad idea, but the Standard Library is one of the few places where it's reasonable. Because it has so many users, it's likely that someone somewhere will really care about each individual chunk of code.

> But i do want to hear about how you're going to optimize it ( optimization stories/talks are very interesting )

If you look at the generated assembler, it's performing unnecessary pointer arithmetic. memmove() works on bytes, so we can use unsigned char * internally. My usual saying is "you know what they say about casts, they conceal something broken", but when wielded by experts they are useful tools.

[Sys64738]
> Is boost::thread a valid free alternative?

Yes - it served as the basis for std::thread, and is actively maintained again.

[BigHands]
> Thanks for posting this great 2nd lecture, Stephan. You're giving us all a fantastic look at the STL.

I'm glad you find it useful. We're filming Part 3 on Friday, August 6.

• "it's likely that someone somewhere will really care about each individual chunk of code."

well i think that every single chunk of code should be as perfect as possible. "Every piece of code should get its fair share of love"

Standard Library is such a vital and important part that it should be micro-optimized.

"it's performing unnecessary pointer arithmetic"

How did you find it ?

Do you use a (special ?) profiler or do you go over the assembler code manually (hardcore style) ?

Perhaps make a video of finding and optimizing (hard to find) bottlenecks ?  (And perhaps teach how to avoid common mistakes)

• [Mr Crash]
> well i think that every single chunk of code should be as perfect as possible.

> "Every piece of code should get its fair share of love"

All code should be correct. Most code should not be micro-optimized. Premature optimization is the root of all evil, and constantly worrying about cycles and bytes tends to damage code's correctness and make it harder to understand.  Avoiding gratuitous pessimization is also important, but that's different.

> How did you find it ?

I was looking at that code for another reason, and wondered how efficient it was, so I dumped the generated assembly (compile with /FAsc).

> Do you use a (special ?) profiler or do you go over the assembler code manually (hardcore style) ?

I've used VS's profiler, and I've looked at assembly, but I'm not a profiling expert, and I can barely read assembly.

> Perhaps make a video of finding and optimizing (hard to find) bottlenecks ?

> (And perhaps teach how to avoid common mistakes)

I've already talked about asymptotic complexity (getting that right is very important - all of the micro-optimization in the world won't save you if you're using a quadratic algorithm when a linear one with reasonable constants is possible).  When it comes to shared_ptr I'll be talking about make_shared<T>()'s performance impact.  However, because this is an introductory series, I will be strongly focusing on writing correct code. When you can do that quickly and easily, you have more time left to improve the performance of your application.

I wonder if we could get one of our performance gurus, Bruce Dawson (from IE Perf, formerly from Xbox Perf) to do a Channel 9 video.  Maybe he already has and I haven't heard about it.

• Good idea re Bruce.

C


• I'm curious, what is the notepad clone you use called?
Also, I'm guessing you use the command line to compile?
To me that seems cumbersome; I'm thinking of lots of code files, project files, etc.
Or maybe you use makefiles instead of VS2010 project files, since manually writing project files wastes time and isn't really practical.
> I've already talked about asymptotic complexity (getting that right is very important - all of the micro-optimization in the world won't save you if you're using a quadratic algorithm when a linear one with reasonable constants is possible).
Since you mentioned it, perhaps you could teach how to measure/analyze algorithms to figure out whether they are linear, quadratic, log N, etc.
When I hear log N, linear, quadratic, I get a bit confused, since I don't come from a computer science background. Can you explain a bit more and go into how to analyze algorithms, or is this too complex a topic for this series? A hint for future videos, perhaps.
> I was looking at that code for another reason, and wondered how efficient it was, so I dumped the generated assembly (compile with /FAsc).
That's good to know, thanks.
Is this how you normally analyze bottlenecks, and is it a good way (compiling and going over the assembly)? Or are there only special cases where this is necessary? What are the hints to look for that will tell you this technique is needed?
> I wonder if we could get one of our performance gurus, Bruce Dawson (from IE Perf, formerly from Xbox Perf) to do a Channel 9 video.  Maybe he already has and I haven't heard about it.
- Warning for Microsoft lovers, beware
Is this one of the guys that worked on IE 8?
I know you understood me, but let's just talk about the elephant in the room. I meant good performance as in fast, not how to make the worst kind of bloatware possible that you can't uninstall completely (there's always some crap left behind because it's "needed") and that you can't remove anyway, since you can't get to the update site without IE (Microsoft really knows how to force people to use their products: just make it impossible for the competition to actually compete; *hint hint* EU, here's your next target).
Oh my, this elephant is big, but let's stop here.
So, to the point: is he the right guy? A real performance guru? If he's an IE performance guru, aka bloatware guru, then I think I'll pass.
No offense meant, just pointing at the obvious elephant in the room.
- Warning End
Yes, I am bitter about IE and the IE-exclusive update site (to name just a few Microsoft failures).
• > I'm curious, what is the notepad clone you use called?

Metapad. It wraps a Rich Edit control, so it's more powerful than Notepad which wraps an Edit control.

> Also, I'm guessing you use the command line to compile?

VS (like Windows and Office) is built from the command line. "build -cZP" kicks off our build (and I don't pretend to understand how it works - I know how to hack bits and pieces of it). When compiling examples or test cases (consisting of a single file, or a small handful at most), I also use the command line. "cl /EHsc /nologo /W4 meow.cpp" is my minimal command line; it enables conformant exception handling, silences a noisy banner, and uses warning level 4. At home, I use handwritten makefiles (which are smarter than they sound, since they automatically build new files and automatically generate dependencies). I can't tolerate build system complexity (I have a limited amount of patience for complexity, and I need to spend it all on C++), so I avoid it as much as possible.

> Since you mentioned it, perhaps you could teach how to measure/analyze algorithms to figure out whether they are linear, quadratic, log N, etc.

Computational complexity is a vast field, but the basics are pretty easy to learn, and the advanced stuff almost never comes up in day-to-day programming.

You've got a problem of size N. N is the amount of stuff you have to deal with. For example, you've got a vector of N elements. Then, you write an algorithm to do stuff to the stuff. That algorithm is going to take some amount of time to do its stuff. In the real world, we care about wall clock time (the actual amount of time that it takes). That's messy, because some fundamental operations are faster than others. (For example, incrementing a register takes a cycle or less, while dereferencing a pointer may take many cycles if the destination isn't cached and we have to go all the way to main memory.) Fortunately, when analyzing asymptotic complexity, we get to ignore the details. All we have to do is count the fundamental operations that the algorithm performs.

It gets better - we get to ignore almost everything. Imagine that we can prove that the algorithm will perform at most 3 N^2 + 17 N + 42 operations (we're typically interested in the worst case scenario, but it's also possible to analyze the "expected" scenario for things like hash containers). We keep the highest power of N, and we throw away its coefficient too. So, that algorithm would be O(N^2). (This is http://en.wikipedia.org/wiki/Big_O_notation .)

In practice, if the algorithm isn't especially twisty, you can usually look at what it's doing and figure out the complexity without doing anything like counting operations.

O(1), constant time - This means that the time taken (number of operations performed) doesn't depend on the problem size N. It could still be big, but it won't get bigger as N gets bigger. Fundamental operations (increments, dereferences, everything in the core language) are obviously constant time. Using op[]() to get the Ith element from a vector of size N is constant time - the vector takes a pointer, adds I to it (constant time), then dereferences it (also constant time), and gives you the result. Because we throw away coefficients, we don't call this O(2) or anything like that, it's just O(1), because it doesn't depend on how big N is.

Coefficients matter, but constant time can be thought of as blazingly fast.

O(log N), logarithmic time - This often comes up when dealing with balanced binary trees. If you put N elements in a balanced binary tree, it's going to be log N nodes tall. (A bit of math: because we're throwing away coefficients, it turns out that we don't care about the base of the logarithm. That's why we don't bother writing it out, although usually the base is 2.) map/multimap/set/multiset are balanced binary trees, so their find() is O(log N), because it has to walk from the root to a node, and that node might be at the bottom of the tree. Their data structure also guarantees O(log N) insertion and erasure - this part is magic (you have to carefully follow the algorithm to prove that it works and has that complexity), but you can treat it as a given since the STL has already been implemented for you.

Binary searches (e.g. std::lower_bound() on a sorted vector) are also logarithmic time.

Logarithmic time is still very fast. If you have a map with 1000 elements, it's about 10 nodes tall. If the map grows to 1,000,000 elements, it's only about 20 nodes tall. Your problem got 1,000 times bigger, but your algorithm got 2 times slower.

O(N), linear time - This one should be the easiest to understand. It's what happens when you take N elements and do something that takes constant time to each of them. std::count_if() is one of many examples. It looks at each element, and counts how many satisfy some predicate (a function object taking an element and returning bool).

O(N log N), somewhat jokingly called linearithmic time - In general, sorting N elements takes O(N log N) time. It can't be faster (although you can write algorithms that are slower, obviously), unless you know something special about the elements (like, they're the suffixes of a string). std::sort() is now guaranteed to be O(N log N). The algorithm quicksort is usually O(N log N) (and its coefficients are small, meaning that it's actually quite fast in wall clock time), but it sometimes takes O(N^2) on "bad" or pathological inputs. In C++98/03, std::sort() was allowed to display that behavior, but it is now required by C++0x to never be slower than O(N log N). (VC has done this since at least VC8.)

Because log N is cheap, N log N is comparable to linear. It's not quite as good, of course.

O(N^2), quadratic time - This happens when you take N elements, and do something that requires O(N) time to each of them. For example, naively scanning for duplicates has this behavior.

Quadratic time is generally bad (in some applications, quadratic is the best you can do, but if you can avoid it you generally should). It has the undesirable property of making algorithms take forever for seemingly small inputs. For example, suppose N grows from 1000 to 10,000 (maybe it's the number of mail messages in my account). The problem became 10 times bigger, but a quadratic algorithm would be 100 times slower. Users will notice that.

There are other notable complexities (O(2^N), exponential time, makes quadratic look fast), but these are the common ones encountered every day.

There's lots on Wikipedia about this - http://en.wikipedia.org/wiki/Time_complexity seems like it knows what it's talking about - but that's a quick summary of my understanding of this area.

> Is this how you normally/a good way to analyze bottlenecks (compile and go over the assembly)?
> Or are there only special cases where this is necessary? What are the hints to look for that will tell you this technique is needed?

Profiling will tell you where your time's going. These days, common offenders are things like the network and the disk, external stuff that takes forever. If the culprit is a poor algorithm, fixing that will be a massive win - that's macro-optimization. Understanding asymptotic complexity makes it easier for you to identify and fix poor algorithms, and especially to avoid writing them in the first place.

If computation is actually your bottleneck (as opposed to non-computational things like the network), and your algorithm is smart, then it's time to look at codegen, and assembly is a good way to do that.

> Is this one of the guys that worked on IE 8 ?

He's working on IE9. I'm not sure when on the IE timeline he joined the team.

> So, to the point: is he the right guy? A real performance guru?

If you think I know what I'm doing, then you should believe me when I say that he knows what he's doing.

• Thank you for taking the time to answer, I highly appreciate it.

Your explanation cleared things up, thank you.

Those handwritten makefiles sound interesting; I would have liked a peek, for the geek of it.

> I can't tolerate build system complexity (I have a limited amount of patience for complexity, and I need to spend it all on C++)

Yes i totally agree, the build system shouldn't get in the way.

> If you think I know what I'm doing

Oh yes I do, you do know what you're talking about.

> then you should believe me when I say that he knows what he's doing

Ok, I'm just checking. I do love a good browser fight; we (the users) are all winners in the end.

• STL: Thanks for your excellent *practical* explanation of computational complexity!

I read several more mathematically oriented explanations, but they were too abstract and more "smoky", more vague; yours, instead, is very clear and effective.

About performance, I'm looking forward to the interview with Bruce Dawson (BTW: I agree the performance of IE8 is not very good - I do hope that his contributions will make a difference in IE9).

Speaking of performance, I used to think that when you really want the top performance, you should abandon STL and use a more C-like programming style. For example, there was an interesting series of blog posts about a Chinese/English Dictionary reader:

https://blogs.msdn.com/ricom/archive/2005/05/10/416151.aspx

The original native C++ version used C++ standard library, and was slower than the original managed C# version.

After some optimizations, the C++ version became faster than the managed C# version, but:

[...] Of course he used available lower level libraries to do this, but that's still a lot of work.  Can you call what's left an STL program?  I don't think so [...]

So, is it correct that when top performance is required (e.g. in a tight loop executed lots of times or in some other critical portion of code) we should abandon STL and use lower level C-like programming styles? Or is it possible to still use STL successfully in these scenarios?

In Microsoft you guys write software used by millions of people worldwide: do you use STL in performance critical parts of your software?

Thanks.

• > STL: Thanks for your excellent *practical* explanation of computational complexity!

You're welcome.

> So, is it correct that when top performance is required (e.g. in a tight
> loop executed lots of times or in some other critical portion of code)
> we should abandon STL and use lower level C-like programming styles?

As I mentioned in Part 1, the STL is a general library for pure computation. Because it was designed by a genius (Stepanov) and implemented by experts (e.g. Dinkumware) who can wield the full power of the Core Language (and the occasional magic compiler hook), it's extremely efficient in a vast range of scenarios (when used properly, as this series is trying to introduce). Whenever it uses algorithms (e.g. search_n()), data structures (e.g. map), or implementation tricks (e.g. make_shared<T>()) that are beyond the reach of non-library programmers, it can easily outperform their handwritten low-level code. However, because the STL is a general library, it can't be perfectly tuned for every scenario. With enough skill and effort, handwritten low-level code can always outperform it. That has costs, though - not only the skill and effort required, but the resulting fragility and resistance to modification of the low-level code. The job of an STL maintainer (like me) is to minimize the frequency with which library users have to resort to low-level code.

The vast majority of a typical C++ program should use the STL. In hotspots where the STL is introducing overhead (the vast majority of code is not a bottleneck, and in some bottlenecks the STL will not be imposing any overhead - as few as possible if your STL maintainers have done their jobs well), or where you have special knowledge that a dedicated algorithm or data structure would do better, you can use handwritten low-level code. Very, very rarely, you have to go all the way to assembly. (One obvious example is the inner loop of a video encoder.)

Note that because the STL is a library, not a framework, it never gets in your way when you need to fall back to lower-level code. It just wants to help you avoid that as much as possible.

In your linked blog posts, Raymond Chen was bottlenecked by iostreams' getline() and code conversions. iostreams (which is part of the C++ Standard Library, but not what I call "the STL proper") is a disgusting mess in terms of implementation and performance. It is widely agreed (even by Standardization Committee members) that the specification of iostreams makes it really, really difficult to write efficient implementations. (The main offenders, as I understand it, are lots of virtuals and lots and lots of customization points.) We fixed some iostreams perf problems in VC10 (replacing single-character operations with multi-character operations) and we're fixing more in VC11 (replacing locks with interlocked operations), but it's like pushing a cubic kilometer of jello.

Another bottleneck has been solved; in part 5, Raymond talked about copying strings. C++0x/VC10's move semantics fixes this, usually automatically, sometimes with a move() or two. (Unnecessary copies were the biggest performance problem with the STL proper in C++98/03.) Even constrained by the C++98/03 Core Language in VC8 and VC9, our STL implementation did the best that it could. When faced with a vector<string> or vector<vector<T>> (or any element type itself from the STL), vector would swap instead of copying during reallocation (which I called the Swaptimization).

> In Microsoft you guys write software used by millions of people worldwide: do you use STL in performance critical parts of your software?

Some teams do. For example, the compiler front-end (C1XX, which parses C++) uses the STL extensively. Other teams have what we politely refer to as "legacy codebases"; for example, the compiler back-end (C2, aka UTC, which optimizes and generates assembly) is essentially C. They're slowly converting it to C++. Then there are teams that are suffering from exception denial - Outlook (my former team) was an example. After I left, though, they did the work necessary to enable exception handling, and are now beginning to use the STL - replacing some of their data structures with the STL significantly increased performance.

In fact, during VC8 and VC9, the major performance headache with the STL was that we enabled our security checks in release mode by default (_SECURE_SCL), which in worst-case scenarios could harm perf by around 2x. That has been solved in VC10, where the security checks default to OFF in release mode (and have been reorganized into the easier-to-understand _ITERATOR_DEBUG_LEVEL setting).

• > In this case, you're thinking about vector<unique_ptr<T>> or vector<shared_ptr<T>>.

Thanks for the pointers!

1. I really dislike the code cout << "(" << i->first << … I would rather read it as cout << (*i), or even cout << (map) for the whole loop, because that is what you are essentially trying to do.
2. Thanks for sharing your thoughts about IOStreams. This is the part designed by His Majesty Bjarne I, and if I did not know that, I would say it was written by an electrical engineer. Isn't it time to do something about that mess?
3. I am missing the possibility to use std::set<int [02], Predicate>.
4. I am missing the possibility to replace the key with another one that compares equal to the original. This is an action that would not damage the structure.
• The key point is: when your input length is a priori limited to 8,000, every algorithm (of a finite set) runs in constant time, except for the algorithms that run forever.

• Please, efficiency is not the reason millions of people run Microsoft software worldwide.  There are various reasons, and efficiency is not one of them (think paperclip).