16 Nov, 2015

True Story: Efficient Packing

Variadic templates made their way into the C++ lands with C++11, in the form of parameter packs. A parameter pack is a parameter that accepts zero or more arguments; a template with at least one parameter pack is called a variadic template. What follows is a story on how to use them efficiently...

The Problem

The spawn function schedules a callable for later execution. Worker threads in the background are constantly looking for new items to execute.

template <typename F>
void spawn(F&& f);

While standard facilities like std::thread and std::async take an arbitrary number of arguments, the spawn interface is less accommodating. It provides only the means to schedule nullary callables, leaving the caller to jump through the hoops of bundling any arguments. Lambdas and std::bind can be a viable approach in simple cases, but they are cumbersome to deal with in a generic context —as exposed in Moving Past Bind—. Enter deferred_call, a utility function that stores a callable and its arguments for later invocation:

template <typename T>
class deferred;

template <typename F, typename ...Args>
class deferred<F(Args...)> {
  std::decay_t<F> _f;
  std::tuple<std::decay_t<Args>...> _args;
  // std::decay_t<Args>... _args; is not valid C++

  using result_type =
    std::result_of_t<std::decay_t<F>(std::decay_t<Args>...)>;

public:
  deferred(F&& f, Args&&... args)
    : _f(std::forward<F>(f))
    , _args(std::forward<Args>(args)...)
  {}

  result_type operator()() && {
    /* unpack arguments somehow */
  }
};

template <typename F, typename ...Args>
deferred<F&&(Args&&...)> deferred_call(F&& f, Args&&... args) {
  return {std::forward<F>(f), std::forward<Args>(args)...};
}

The effects of a deferred call are those of INVOKE(DECAY_COPY(std::forward<F>(f)), DECAY_COPY(std::forward<Args>(args))...), with the calls to DECAY_COPY being evaluated during construction and the invocation happening at a later time —where the original arguments might no longer exist, hence the decay-copies—. It is a frequent pattern in concurrent code, where actual invocation of a callable happens later, possibly in a different execution context.
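
A hypothetical usage sketch (log and its arguments are made up for illustration):

void log(std::string const& tag, int value);

auto d = deferred_call(&log, std::string("answer"), 42);
// the tag and value have been decay-copied into d; the originals may go away
/*...*/
std::move(d)();  // invokes log("answer", 42) now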

With it, a nicer interface for spawn is possible:

// A nicer interface for a simple spawn
template <typename F, typename ...Args>
void nicer_spawn(F&& f, Args&&... args) {
  spawn(deferred_call(std::forward<F>(f), std::forward<Args>(args)...));
}

Since spawn will itself DECAY_COPY the callable, this introduces a tiny bit of extra moves and/or copies. Those extra operations will hopefully be negligible, as they are unavoidable if spawn is to stick to its narrow interface. All that remains now is implementing deferred's function call operator. Packs cannot be stored directly as members, so the arguments are stored in a std::tuple instead; but how to unpack them when the time for invocation comes?

The Solution

In the old days, variadic templates were approximated —often with help from the preprocessor— up to a hard limit. Recursion was then a go-to technique to operate on such faux variadic templates. Each recursive call would unpack the I-th element from the tuple of arguments, and once there was nothing left to unpack, the actual invocation would happen:

template <std::size_t I>
struct index {};

template <typename F, typename ...Args>
class deferred<F(Args...)> {
  /*...*/

  template <std::size_t I, typename ...Ts>
  result_type _invoke(index<I>, Ts&&... vs) {
    // take one down, pass it around
    return _invoke(index<I + 1>{},
      std::forward<Ts>(vs)..., std::get<I>(std::move(_args)));
  }

  template <typename ...Ts>
  result_type _invoke(index<sizeof...(Args)>, Ts&&... vs) {
    // no more, go to the store
    return std::invoke(
      std::move(_f), std::forward<Ts>(vs)...);
  }

public:
  result_type operator()() && {
    return _invoke(index<0>{});
  }
};

This is a fairly complicated task and the end result is not even a very natural way to write the desired operation. So says N3658, which introduced compile-time integer sequences into C++14. They provide the grounds on which a more natural —recursion-free— solution can be built:

template <typename F, typename ...Args>
class deferred<F(Args...)> {
  /*...*/

  template <std::size_t ...Is>
  result_type _invoke(std::index_sequence<Is...>) {
    return std::invoke(
      std::move(_f), std::get<Is>(std::move(_args))...);
  }

public:
  result_type operator()() && {
    return _invoke(std::index_sequence_for<Args...>{});
  }
};

By introducing an index parameter pack, it is possible to treat the tuple of arguments as if it were a pack again. The expansion std::get<Is>(tuple)... conceptually maps a pack of indices into a pack of tuple elements.
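
Spelled out for a call with three stored arguments, so that Is... is 0, 1, 2, the pack expansion inside _invoke conceptually becomes:

// conceptual expansion; _f and _args are the members of deferred
std::invoke(std::move(_f),
  std::get<0>(std::move(_args)),
  std::get<1>(std::move(_args)),
  std::get<2>(std::move(_args)));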

The proposal goes as far as showcasing the invocation of a function with arguments from a tuple as an example of what compile-time integer sequences are good for. A further proposal, N3915, makes this example normative by introducing apply into the Library Fundamentals TS (v1):

template <typename F, typename ...Args>
class deferred<F(Args...)> {
  /*...*/

public:
  result_type operator()() && {
    return std::experimental::apply(
      std::move(_f), std::move(_args));
  }
};

C++ keeps getting better and better!

The case of the index_sequence

If all an index_sequence did was just shift recursive helper infrastructure from user code into library code, it would already be a good thing. However, given the central role it plays, it's worth looking at its implementation in more detail. The standard defines a more general std::integer_sequence first, suitable for any integer type; it then special-cases std::size_t with aliases like std::index_sequence for the one use case the facility was designed to address. Given that only index sequences are of interest here, the existence of general integer sequences can safely be ignored.
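
For reference, before ignoring them, the relationship is just an alias; a rough sketch of the C++14 declarations, eliding the value_type and size() members of the real std::integer_sequence:

// rough sketch of the standard interface, for reference only
template <typename T, T ...Ints>
struct integer_sequence;

template <std::size_t ...Is>
using index_sequence = integer_sequence<std::size_t, Is...>;

The simplified scaffolding used in the rest of this post is then: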

template <std::size_t ...Is>
struct index_sequence {};

template <std::size_t N>
struct _make_index_sequence {
  using type = /*...*/;
};

template <std::size_t N>
using make_index_sequence = typename _make_index_sequence<N>::type;

template <typename ...Ts>
using index_sequence_for = make_index_sequence<sizeof...(Ts)>;

A linear approach

What follows is a naïve linear implementation:

template <std::size_t N, typename Is = index_sequence<>>
struct _make_index_sequence;

template <std::size_t N, std::size_t ...Is>  // <0, 1, 2>
struct _make_index_sequence<N, index_sequence<Is...>>
  : _make_index_sequence<
      N - 1,
      index_sequence<Is..., sizeof...(Is)>  // 0, 1, 2, 3
    >
{};

template <std::size_t ...Is>
struct _make_index_sequence<0, index_sequence<Is...>> {
  using type = index_sequence<Is...>;
};

Like in the old days, each step along the recursion does one bit of the work. This implementation is linear in the number of template instantiations required, which is a good first approximation to compilation times —even though it treats all instantiation costs as equal—. For example, requesting make_index_sequence<7> requires the following template instantiations:

  • _make_index_sequence<7, index_sequence<>>
  • _make_index_sequence<6, index_sequence<0>>
  • _make_index_sequence<5, index_sequence<0, 1>>
  • _make_index_sequence<4, index_sequence<0, 1, 2>>
  • _make_index_sequence<3, index_sequence<0, 1, 2, 3>>
  • _make_index_sequence<2, index_sequence<0, 1, 2, 3, 4>>
  • _make_index_sequence<1, index_sequence<0, 1, 2, 3, 4, 5>>
  • _make_index_sequence<0, index_sequence<0, 1, 2, 3, 4, 5, 6>>

It's simple, it works... it's also as bad an implementation as they come.

When a template specialization is conjured in a context that requires a complete type —as in those T::type scenarios—, the compiler will look for an existing instantiation of that specialization; if it cannot find one, it will manufacture it via implicit instantiation. What this means is that within a translation unit, the compiler will never need to instantiate the same template specialization twice. Doing so would be a waste of time; the results would be identical —they had better be—, as the linker will eventually get around to removing duplicates across translation units anyway. This is known as memoization.
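
As a tiny illustration, repeated mentions of the same specialization within a translation unit only trigger the work once:

// both aliases name the very same specialization; the compiler instantiates
// _make_index_sequence<7, index_sequence<>> (and its whole chain) for the
// first mention and reuses the memoized result for the second
using first_request  = make_index_sequence<7>;
using second_request = make_index_sequence<7>;

static_assert(std::is_same<first_request, second_request>::value, "");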

This naïve implementation is memoization unfriendly. For example, a successive request for make_index_sequence<8> requires the following template instantiations:

  • _make_index_sequence<8, index_sequence<>>
  • _make_index_sequence<7, index_sequence<0>>
  • _make_index_sequence<6, index_sequence<0, 1>>
  • _make_index_sequence<5, index_sequence<0, 1, 2>>
  • _make_index_sequence<4, index_sequence<0, 1, 2, 3>>
  • _make_index_sequence<3, index_sequence<0, 1, 2, 3, 4>>
  • _make_index_sequence<2, index_sequence<0, 1, 2, 3, 4, 5>>
  • _make_index_sequence<1, index_sequence<0, 1, 2, 3, 4, 5, 6>>
  • _make_index_sequence<0, index_sequence<0, 1, 2, 3, 4, 5, 6, 7>>

By doing the work on the way down, memoization gets clogged. Intermediate results, which are distinct for each N, get pushed as arguments in a way that prevents previous computations from being reused. For memoization to do its job, the work has to be done on the way up instead. What follows is a better linear implementation:

template <typename Is>
struct _next;

template <std::size_t ...Is>  // <0, 1, 2>
struct _next<index_sequence<Is...>> {
  using type = index_sequence<Is..., sizeof...(Is)>;  // 0, 1, 2, 3
};

template <std::size_t N>
struct _make_index_sequence
  : _next<typename _make_index_sequence<N - 1>::type>
{};

template <>
struct _make_index_sequence<1> {
  using type = index_sequence<0>;
};

template <>
struct _make_index_sequence<0> {
  using type = index_sequence<>;
};

This implementation is still linear, and it even does a little bit more work than the previous one. For example, requesting make_index_sequence<7> requires the following template instantiations:

  • _make_index_sequence<7>, _next<index_sequence<0, 1, 2, 3, 4, 5>>
  • _make_index_sequence<6>, _next<index_sequence<0, 1, 2, 3, 4>>
  • _make_index_sequence<5>, _next<index_sequence<0, 1, 2, 3>>
  • _make_index_sequence<4>, _next<index_sequence<0, 1, 2>>
  • _make_index_sequence<3>, _next<index_sequence<0, 1>>
  • _make_index_sequence<2>, _next<index_sequence<0>>
  • _make_index_sequence<1>

In exchange, intermediate results are leveraged by further requests. For example, a successive request for make_index_sequence<8> requires the following template instantiations:

  • _make_index_sequence<8>, _next<index_sequence<0, 1, 2, 3, 4, 5, 6>>
  • _make_index_sequence<7> (M)

This is much better, and it is possible to further improve on this implementation —one that generates indices from both ends would require approximately half as many template instantiations—, but any such variant would still be linear, like in the old days.
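
A minimal sketch of one possible reading of that idea (not necessarily the exact variant alluded to), growing the sequence by two indices per step, one at each end:

template <typename Is>
struct _grow_both;

template <std::size_t ...Is>  // <0, 1, 2>
struct _grow_both<index_sequence<Is...>> {
  using type = index_sequence<
    0, (Is + 1)...,      // 0, 1, 2, 3,
    sizeof...(Is) + 1>;  // 4
};

template <std::size_t N>
struct _make_index_sequence
  : _grow_both<typename _make_index_sequence<N - 2>::type>
{};

template <>
struct _make_index_sequence<1> {
  using type = index_sequence<0>;
};

template <>
struct _make_index_sequence<0> {
  using type = index_sequence<>;
};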

A logarithmic approach

With the advent of variadic templates, new approaches become available. What follows is an implementation that leverages them to achieve a logarithmic number of template instantiations:

template <typename Is, typename Js>
struct _join;

template <std::size_t ...Is, std::size_t ...Js>
struct _join<index_sequence<Is...>, index_sequence<Js...>> {  // <0, 1, 2>, <0, 1, 2, 3>
  using type =
    index_sequence<
      Is...,  // 0, 1, 2,
      (sizeof...(Is) + Js)...  // 3 + 0, 3 + 1, 3 + 2, 3 + 3
    >;
};

template <std::size_t N>
struct _make_index_sequence
  : _join<
      typename _make_index_sequence<N / 2>::type,
      typename _make_index_sequence<N - N / 2>::type
    >
{};

template <>
struct _make_index_sequence<1> {
  using type = index_sequence<0>;
};

template <>
struct _make_index_sequence<0> {
  using type = index_sequence<>;
};

This implementation doubles the size of the generated index sequence in each recursion step. For example, requesting make_index_sequence<7> requires the following template instantiations:

  • _make_index_sequence<7>, _join<index_sequence<0, 1, 2>, index_sequence<0, 1, 2, 3>>
  • _make_index_sequence<4>, _join<index_sequence<0, 1>, index_sequence<0, 1>>
  • _make_index_sequence<3>, _join<index_sequence<0>, index_sequence<0, 1>>
  • _make_index_sequence<2>, _join<index_sequence<0>, index_sequence<0>>
  • _make_index_sequence<1>

If that doesn't look impressive in comparison —only 6 and 5 are skipped—, consider the case of make_index_sequence<128> instantiating the machinery just for 128, 64, 32, 16, 8, 4, 2, 1. Although intermediate results are now scattered, memoization is still exploited. For example, a successive request for make_index_sequence<8> requires the following template instantiations:

  • _make_index_sequence<8>, _join<index_sequence<0, 1, 2, 3>, index_sequence<0, 1, 2, 3>>
  • _make_index_sequence<4> (M)

and a request for make_index_sequence<9> after that:

  • _make_index_sequence<9>, _join<index_sequence<0, 1, 2, 3>, index_sequence<0, 1, 2, 3, 4>>
  • _make_index_sequence<5>, _join<index_sequence<0, 1>, index_sequence<0, 1, 2>>
  • _make_index_sequence<4> (M)
  • _make_index_sequence<3> (M)
  • _make_index_sequence<2> (M)

It should be noted that every instantiation follows one of two possible patterns:

template <std::size_t K>
struct _make_index_sequence<K*2> // not valid C++
  : _join<
      typename _make_index_sequence<K>::type,
      typename _make_index_sequence<K>::type
    >
{};
template <std::size_t K>
struct _make_index_sequence<K*2+1> // not valid C++
  : _join<
      typename _make_index_sequence<K>::type,
      typename _make_index_sequence<K+1>::type
    >
{};

For an even N, the same index sequence for K has to be computed twice, but memoization guarantees the hard work is only paid for once. For an odd N, on the other hand, index sequences for both K and K+1 have to be computed. It might be possible to even out the odds. What follows is an implementation that attempts to distribute the workload:

template <typename Is, bool Odd>
struct _dupl;

template <std::size_t ...Is>  // <0, 1, 2>
struct _dupl<index_sequence<Is...>, false> {
  using type =
    index_sequence<
      Is...,    // 0, 1, 2,
      (sizeof...(Is) + Is)...  // 3 + 0, 3 + 1, 3 + 2
    >;
};

template <std::size_t ...Is>  // <0, 1, 2>
struct _dupl<index_sequence<Is...>, true> {
  using type =
    index_sequence<
      Is...,  // 0, 1, 2,
      (sizeof...(Is) + Is)...,  // 3 + 0, 3 + 1, 3 + 2,
      sizeof...(Is) * 2  // 3 * 2
    >;
};

template <std::size_t N>
struct _make_index_sequence
  : _dupl<
      typename _make_index_sequence<N / 2>::type,
      N % 2 != 0
    >
{};

template <>
struct _make_index_sequence<1> {
  using type = index_sequence<0>;
};

template <>
struct _make_index_sequence<0> {
  using type = index_sequence<>;
};

Each recursion step now only has to compute a single index sequence. For example, requesting make_index_sequence<7> requires the following template instantiations:

  • _make_index_sequence<7>, _dupl<index_sequence<0, 1, 2>, true>
  • _make_index_sequence<3>, _dupl<index_sequence<0>, true>
  • _make_index_sequence<1>

Intermediate results are now even more scattered, as there are fewer of them, but memoization is still exploited. For example, a successive request for make_index_sequence<8> requires the following template instantiations:

  • _make_index_sequence<8>, _dupl<index_sequence<0, 1, 2, 3>, false>
  • _make_index_sequence<4>, _dupl<index_sequence<0, 1>, false>
  • _make_index_sequence<2>, _dupl<index_sequence<0>, false>
  • _make_index_sequence<1> (M)

and a request for make_index_sequence<9> after that:

  • _make_index_sequence<9>, _dupl<index_sequence<0, 1, 2, 3>, true>
  • _make_index_sequence<4> (M)

It is still possible to further improve on this implementation —one that unrolls this approach onto a base-8 log has been spotted in the wild—.

A note on compilation times

The number of template instantiations is not the only factor contributing to compilation times —nor is it necessarily a good measure when the costs of the instantiations at play differ greatly—. Memoization, as previously detailed, implies a lookup, and that has a cost too...

During the stone age, when compilers disregarded templates as a fringe feature, the lookup required walking down a chain of instantiations. This meant that the N-th instantiation would walk a chain of N - 1 elements, leading to superlinear lookup times —N*(N-1)/2 lookups in total for N instantiations—. The cost of this lookup was so high that the number of instantiations alone —whatever their cost— was a defining factor of compilation times.

At the time of this writing, compilers have been avoiding such an approach for years. A popular approach is that of using a hash map, keyed on the mangled name of the entity, which gives an average constant-time lookup —with respect to the number of template instantiations—. This, however, means that mangled names have to be kept around for everything ever instantiated, even for things like _make_index_sequence which are unlikely to ever make it into the linking stage. As mangled names get long —really long, like those generated by expression templates—, they put more and more pressure on memory utilization, as well as some on mangling and hashing.

When looking back at those _make_index_sequence implementations, it is important to know that most ABIs have some form of compression or another. When a partial name or symbol is repeated, it will likely be represented by a back reference instead. Conceptually, _join<index_sequence<0, 1>, index_sequence<0, 1>> is mangled as _join<index_sequence<0, 1>, \1>.

When it comes to performance, the responsible thing to do is to measure. Your mileage may vary...

The case of the tuple

Tuple is the last variadic stand. It too was implemented via recursion in the old days: each recursion step would store one element of the tuple and inherit from a tuple holding the remaining ones, roughly like this simplified sketch:
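
// a simplified sketch of the old recursive layout, not any particular
// library's implementation
template <typename ...Ts>
struct old_tuple;

template <>
struct old_tuple<> {};

template <typename T, typename ...Ts>
struct old_tuple<T, Ts...> : old_tuple<Ts...> {
  T _value;  // each level stores one element, the base stores the rest
};

Unsurprisingly, a variadic tuple implementation exploits the same techniques previously explored, but does so in new places: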

template <std::size_t I, typename T>
struct _tuple_leaf { // EBO candidate, http://talesofcpp.fusionfenix.com/post-18/episode-ten-when-size-does-matter
  T _value;
};

template <typename Is, typename ...Ts>
struct _tuple_impl;

template <std::size_t ...Is, typename ...Ts>
struct _tuple_impl<index_sequence<Is...>, Ts...>
  : _tuple_leaf<Is, Ts>... // base-specifier expansion
{
  template <typename ...Us>
  _tuple_impl(Us&&... vs)
    : _tuple_leaf<Is, Ts>(std::forward<Us>(vs))... // member-initializer expansion
  {}

  /*...*/
};

template <typename ...Ts>
class tuple {
  _tuple_impl<index_sequence_for<Ts...>, Ts...> _impl;

  /*...*/
};

Like in previous examples, recursion can be eliminated by the introduction of a helper index pack. Storage for all elements in the tuple is obtained via a single pack expansion in the base-clause. The seemingly redundant index for each tuple leaf is necessary, as there might be duplicated types in Ts..., and duplicate direct bases are invalid.
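
Spelled out for a concrete instantiation (conceptually, with a hypothetical name for illustration), _tuple_impl<index_sequence<0, 1, 2>, int, std::string, int> amounts to:

struct _concrete_impl  // hypothetical name, for illustration only
  : _tuple_leaf<0, int>
  , _tuple_leaf<1, std::string>
  , _tuple_leaf<2, int>  // same type as the first leaf, yet a distinct base thanks to its index
{ /*...*/ };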

Once the index is part of the leaf type, the task of getting ahold of the I-th member can be delegated to overload resolution:

template <std::size_t ...Is, typename ...Ts>
struct _tuple_impl<index_sequence<Is...>, Ts...>
  : _tuple_leaf<Is, Ts>...
{
  /*...*/

  template <std::size_t I, typename T>
  static _tuple_leaf<I, T>& _select(
    _tuple_leaf<I, T>& leaf) noexcept {
    return leaf;
  }

  template <std::size_t I>
  auto& get() noexcept {
    // let overload resolution take its course
    return _select<I>(*this)._value;
  }
};

Since an explicit template argument I is given to _select, only T needs to be deduced. The internal tuple implementation can unambiguously bind a reference to _tuple_leaf<I, T> for an explicitly given I and a deduced T, since it inherits from every leaf and each leaf has a distinct index.
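
A hypothetical usage, assuming tuple exposes a get<I>() member that forwards to its _impl:

tuple<int, int> t = /*...*/;
// _select<0> can only bind a _tuple_leaf<0, int>&; the duplicated int type
// causes no ambiguity because the leaf indices differ
int& first = t.get<0>();
int& second = t.get<1>();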

A similar approach can be taken to implement tuple_element; however, it should not instantiate any of the tuple element types (storing a T in a leaf would require T to be a complete type), so a more general, types-only indexer is needed:

template <std::size_t I, typename T>
struct _indexed {
  using type = T;
};

template <typename Is, typename ...Ts>
struct _indexer;

template <std::size_t ...Is, typename ...Ts>
struct _indexer<index_sequence<Is...>, Ts...>
  : _indexed<Is, Ts>...
{};

template <std::size_t I, typename ...Ts>
struct _at_index {
  template <typename T>
  static _indexed<I, T> _select(_indexed<I, T>);

  using _impl = _indexer<index_sequence_for<Ts...>, Ts...>;
  using type = typename decltype(_select(_impl{}))::type;
};

With this reusable solution, implementing tuple_element is straightforward:

template <std::size_t I, typename Tuple>
struct tuple_element;

template <std::size_t I, typename ...Ts>
struct tuple_element<I, tuple<Ts...>>
  : _at_index<I, Ts...>
{};
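
A quick sanity check (hypothetical usage):

static_assert(std::is_same<
  tuple_element<1, tuple<int, std::string, int>>::type,
  std::string
>::value, "the element at index 1 is std::string");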

The job is done, true variadics all the way down!