Analysis of C++ Programs at Scale

Part 1 - Building a Clang Tool

As C++ projects grow in size, reasoning about their behavior becomes increasingly difficult due to several factors, including:

Language features such as templates, multiple inheritance, and implicit conversions, which make the codebase more difficult to parse and understand, especially for automated tools.
Preprocessor directives, which are commonly used for conditional compilation (e.g. an #ifndef inside a .h file which is pulled in via an #include). This makes the code which is actually being compiled more opaque, and leads to dependence on code which can only be maintained with macros going forward.
Developers must manually reason about memory management, which creates opportunities for efficiency gains, but also adds the complexity of ensuring correct memory allocation and deallocation, and of avoiding memory leaks and dangling pointers. Building tools to automatically refactor such code becomes a minefield.
There is no standardized detection or prevention of thread synchronization errors, race conditions, or deadlocks. Some compilers have optional features for this, but you need to know how to annotate your code to benefit.
Large C++ projects often contain multiple sedimentary layers of legacy code, each following whichever best practice was in vogue at the time the code was written. The difficulty of building correct tooling prevents this code from being automatically refactored. This can also include code which was written to take advantage of some benefit on a particular architecture which historically could not be obtained automatically by compilers (e.g. Duff's Device).
C++ projects can be built and run on a wide range of platforms using different build systems.
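
As a small illustration of the preprocessor point, the sketch below uses a hypothetical macro, USE_LOOKUP_TABLE, to choose between two implementations of the same function. Only the selected branch reaches the compiler, so a tool that works on the compiled code never sees the other branch unless the file is processed again with different definitions. The function name and the macro are made up for this example.

```cpp
// Hypothetical feature switch; it would normally be set on the command
// line with something like -DUSE_LOOKUP_TABLE.
#ifdef USE_LOOKUP_TABLE
// Variant A: table-driven bit count for the low 4 bits of `nibble`.
inline int count_bits(unsigned nibble) {
  static const int kTable[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                 1, 2, 2, 3, 2, 3, 3, 4};
  return kTable[nibble & 0x0Fu];
}
#else
// Variant B: loop-based bit count for the low 4 bits of `nibble`.
inline int count_bits(unsigned nibble) {
  int bits = 0;
  for (unsigned v = nibble & 0x0Fu; v != 0; v >>= 1) {
    bits += static_cast<int>(v & 1u);
  }
  return bits;
}
#endif
```

Both variants return the same values, but each build contains exactly one of them, which is what makes this kind of code hard to analyze or refactor as a whole.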

These challenges necessitate advanced techniques, such as those provided by the LLVM frontend Clang, to effectively analyze and understand C++ programs at scale. This blog post is the first part of a multi-part series aimed at demonstrating how Clang can be used for large-scale analysis of C++.

Building a Clang Tool

The core of using Clang as a library is logic which operates on the Abstract Syntax Tree (AST). This data structure represents the enriched structure of the C++ source code as written by the programmer. This enrichment removes ambiguities and simplifies the job of building tools that work with C++.

The AST itself is built up of a collection of objects which can be grouped into roughly 5 families:

Decl is used to represent any declaration or definition. For example, a VarDecl is any variable declaration, a FunctionDecl is any function declaration, and a RecordDecl is any class, struct, or union.
Type is used to represent abstract types. For example, a PointerType represents a pointer of its child type, and BuiltinType represents any built-in type such as int or float.
Stmt is used to represent statements and expressions inside of a function. For example, an IfStmt has a condition Expr, a mandatory then Stmt, and an optional else Stmt.
TypeLoc is used to connect the abstract concept of a Type to a specific location in the source code where that type is written down. For example, a TypeAliasDecl (which represents an alias declaration such as using Alias = Underlying;) holds a TypeLoc for the underlying type as it was written. The TypeLoc can be used to obtain both the associated Type and the SourceLocation where it appears in the source.
Attr are used to represent any compiler attribute which is annotated onto another node in the AST. These are used to implement various compiler extensions.
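
To make these families a little more concrete, here is a small, self-contained function annotated with the kind of node Clang would typically create for each construct. The function name is made up for illustration, and the comments are a rough guide rather than an exact dump.

```cpp
// Illustrative only: `clamp_positive` is a hypothetical function.

// Decl: the whole function is a FunctionDecl. Its return type `int` is a
// BuiltinType, and the spot where `int` is written is tracked by a TypeLoc.
// Attr: the [[nodiscard]] annotation becomes an Attr attached to the Decl.
[[nodiscard]] int clamp_positive(int value) {  // `value` is a ParmVarDecl
  int result = value;  // Stmt: a DeclStmt wrapping a VarDecl named `result`
  if (value < 0) {     // Stmt: an IfStmt whose condition is an Expr
    result = 0;        // the mandatory "then" branch (a CompoundStmt)
  }                    // no "else" branch, so that child is absent
  return result;       // Stmt: a ReturnStmt
}
```

Even a function this small touches four of the five families, which is part of why the AST dumps shown later in this post grow so quickly.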

The sheer variety of different subclasses which exist in the AST is overwhelming at first. Take a deep breath. This is OK. The organization is highly logical, and it just takes a bit of patience to click around Doxygen. In the future I'd like to put together a more detailed Pokédex of these classes.

Now that we have a rough idea of what the AST is, let's talk about a couple ways to get clang to give it to us.

Running as a Standalone Tool

One approach is to start with classes like clang::CompilerInstance and clang::CompilerInvocation, which set up a totally custom compilation. Doing this is very instructive for understanding how clang works, but it is not particularly approachable. If your goal is to understand clang, this should be high on your to-do list. However, if your goal is to understand C++, you can avoid this for a long time by building a plugin instead.

Thus, we're not going deep into this topic today, but for those who insist on this approach, you can start by looking at the implementation for runToolOnCode.

Running as a Plugin

Instead of getting bogged down with the gritty details of setting up a clang::CompilerInvocation and building the AST by hand, we'll take a shortcut by hooking into the existing compilation process. To do this, we'll create a clang::PluginASTAction and register it with the clang::FrontendPluginRegistry. This makes the plugin available for use on the command line. Then, all it takes is adding the -fplugin=path/to/plugin.so flag onto your Clang command line, and voilà.

Let’s walk through a basic example. Everything can go into a single source file. Let's call it dump_ast.cc. The full example is available on GitHub.

First, a brief interlude on C++ compilation:

C++ follows the same compilation model as C. In this model, each translation unit (i.e. a .cc or .cpp file together with everything it includes) is compiled in isolation from the others into a .o file containing the symbols defined in that file. The contents of header files are included as text in multiple translation units, as if they were written directly in those files.

Once all files are compiled into .o files, they can be combined together by the linker into shared libraries, static libraries, or binaries. For this reason, when we apply our plugin to C++ code, the AST which our process "sees" will correspond to a single translation unit at a time.

Combining ASTs from multiple translation units is possible, but we won't get into that until a later part in this series.

Okay, back to the implementation. First, we include a few headers. We’ll get into what these do in a moment.

#include <clang/AST/ASTConsumer.h>
#include <clang/AST/ASTContext.h>
#include <clang/Frontend/CompilerInstance.h>
#include <clang/Frontend/FrontendPluginRegistry.h>

Next we need a subclass of clang::ASTConsumer. This is the interface which clang uses to represent anything which can be handed a clang::ASTContext. For our purposes here we'll simply override the HandleTranslationUnit method and have it call the dump() method on the ASTContext's TranslationUnitDecl. This will cause a text representation of the AST to be written to standard error. Note: this will create a lot of output, even for tiny example programs.

struct MyASTConsumer : public clang::ASTConsumer {
  void HandleTranslationUnit(clang::ASTContext& ctx) override {
    ctx.getTranslationUnitDecl()->dump();
  }
};

Next, we create a clang::PluginASTAction. Think of it as a factory for MyASTConsumer, with some extras for parsing plugin arguments (done with -fplugin-arg-<plugin_name>-<arg>=...) and specifying how the plugin plays with the rest of the build. For this quick example, it’s all boilerplate:

struct MyPluginAction : public clang::PluginASTAction {
  virtual ~MyPluginAction() {}

  std::unique_ptr<clang::ASTConsumer> CreateASTConsumer(
      clang::CompilerInstance& CI, llvm::StringRef InFile) override {
    return std::make_unique<MyASTConsumer>();
  }

  bool ParseArgs(const clang::CompilerInstance& CI,
                 const std::vector<std::string>& arg) override {
    return true;
  }

  ActionType getActionType() override { return AddAfterMainAction; }
};
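
ParseArgs is left as a no-op above. If the plugin did accept options, one way to keep that logic easy to test is to put it in a plain helper that ParseArgs calls. The sketch below assumes each argument arrives as a key=value string or a bare key; the PluginOptions struct, the helper, and the option names are all hypothetical and not part of Clang.

```cpp
#include <string>
#include <vector>

// Hypothetical options for an AST-dumping plugin.
struct PluginOptions {
  std::string filter;    // e.g. only dump declarations with this name
  bool verbose = false;  // e.g. print extra detail
};

// Fills `out` from `args`. Returns false on an unrecognized argument,
// matching the convention that ParseArgs returns false on failure.
bool parsePluginOptions(const std::vector<std::string>& args,
                        PluginOptions& out) {
  for (const std::string& arg : args) {
    const std::string::size_type eq = arg.find('=');
    const std::string key = arg.substr(0, eq);
    const std::string value =
        (eq == std::string::npos) ? std::string() : arg.substr(eq + 1);
    if (key == "filter") {
      out.filter = value;
    } else if (key == "verbose") {
      out.verbose = true;
    } else {
      return false;  // unknown option
    }
  }
  return true;
}
```

Inside MyPluginAction, ParseArgs could then forward its vector to this helper and store the result in a member, which keeps the Clang-facing class as thin as it is in the example above.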

To wrap things up, we declare a variable using the clang::FrontendPluginRegistry to add our plugin to the registry. When our shared library is loaded, clang will use this machinery to find our plugin.

static clang::FrontendPluginRegistry::Add<MyPluginAction> X(
    "MyPlugin",
    "Does interesting things with an AST.");

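The registration line relies on a common C++ idiom: a static object whose constructor runs when the library is loaded and records a factory in a shared table. The toy registry below illustrates only that idiom. It is not how llvm::Registry is implemented, and every name in it is invented for the example.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Stand-in for a plugin action interface (hypothetical).
struct ToyAction {
  virtual ~ToyAction() = default;
  virtual std::string describe() const = 0;
};

class ToyRegistry {
 public:
  using Factory = std::function<std::unique_ptr<ToyAction>()>;

  // A function-local static avoids static-initialization-order problems.
  static std::map<std::string, Factory>& entries() {
    static std::map<std::string, Factory> table;
    return table;
  }

  // Constructing an Add<T> registers a factory for T under `name`.
  template <typename T>
  struct Add {
    explicit Add(const std::string& name) {
      entries()[name] = [] { return std::unique_ptr<ToyAction>(new T()); };
    }
  };
};

struct ToyDumpAction : ToyAction {
  std::string describe() const override { return "dumps things"; }
};

// Runs during static initialization, before any lookup in main().
static ToyRegistry::Add<ToyDumpAction> registerToyDump("dump");
```

A host program can then look up "dump" in ToyRegistry::entries() and call the stored factory without ever naming ToyDumpAction directly, which is the same decoupling the real plugin registry provides.
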
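That's it! Now we can compile and link dump_ast.cc into a new shared library (depending on how LLVM is installed, you may also need the flags reported by llvm-config --cxxflags):

clang++ -shared -fPIC dump_ast.cc -o dump_ast.so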

And then run it on some arbitrary C++ file (see the GitHub repository for this specific example):

clang++ -fplugin=./dump_ast.so test.cc

You should get some output that looks like this:

...

-FunctionDecl 0x2cd17e0 <test.cc:3:1, line:6:1> line:3:5 main 'int (int, char **)'
  |-ParmVarDecl 0x2cd15d0 <col:10, col:14> col:14 argc 'int'
  |-ParmVarDecl 0x2cd16c0 <col:20, col:31> col:26 argv 'char **':'char **'
  `-CompoundStmt 0x2cd64e8 <col:34, line:6:1>
    `-CXXOperatorCallExpr 0x2cd6480 <line:5:2, col:41> 'std::basic_ostream<char>::__ostream_type':'std::basic_ostream<char>' lvalue '<<'
      |-ImplicitCastExpr 0x2cd6468 <col:33> 'std::basic_ostream<char>::__ostream_type &(*)(std::basic_ostream<char>::__ostream_type &(*)(std::basic_ostream<char>::__ostream_type &))' <FunctionToPointerDecay>
      | `-DeclRefExpr 0x2cd63f0 <col:33> 'std::basic_ostream<char>::__ostream_type &(std::basic_ostream<char>::__ostream_type &(*)(std::basic_ostream<char>::__ostream_type &))' lvalue CXXMethod 0x2c49f98 'operator<<' 'std::basic_ostream<char>::__ostream_type &(std::basic_ostream<char>::__ostream_type &(*)(std::basic_ostream<char>::__ostream_type &))'
      |-CXXOperatorCallExpr 0x2cd5820 <col:2, col:15> 'basic_ostream<char, std::char_traits<char>>':'std::basic_ostream<char>' lvalue '<<' adl
      | |-ImplicitCastExpr 0x2cd5808 <col:12> 'basic_ostream<char, std::char_traits<char>> &(*)(basic_ostream<char, std::char_traits<char>> &, const char *)' <FunctionToPointerDecay>
      | | `-DeclRefExpr 0x2cd5788 <col:12> 'basic_ostream<char, std::char_traits<char>> &(basic_ostream<char, std::char_traits<char>> &, const char *)' lvalue Function 0x2c53448 'operator<<' 'basic_ostream<char, std::char_traits<char>> &(basic_ostream<char, std::char_traits<char>> &, const char *)'
      | |-DeclRefExpr 0x2cd18f8 <col:2, col:7> 'std::ostream':'std::basic_ostream<char>' lvalue Var 0x2cd0fc8 'cout' 'std::ostream':'std::basic_ostream<char>'
      | `-ImplicitCastExpr 0x2cd5770 <col:15> 'const char *' <ArrayToPointerDecay>
      |   `-StringLiteral 0x2cd1928 <col:15> 'const char[15]' lvalue "Hello, world!\n"
      `-ImplicitCastExpr 0x2cd63d8 <col:36, col:41> 'basic_ostream<char, std::char_traits<char>> &(*)(basic_ostream<char, std::char_traits<char>> &)' <FunctionToPointerDecay>
        `-DeclRefExpr 0x2cd63a0 <col:36, col:41> 'basic_ostream<char, std::char_traits<char>> &(basic_ostream<char, std::char_traits<char>> &)' lvalue Function 0x2c4e6d8 'endl' 'basic_ostream<char, std::char_traits<char>> &(basic_ostream<char, std::char_traits<char>> &)' (FunctionTemplate 0x2c313c8 'endl')
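
Several ImplicitCastExpr nodes in this dump are tagged ArrayToPointerDecay or FunctionToPointerDecay. These are ordinary language conversions that the source never spells out, and the AST makes them explicit. As a rough cross-check, the same conversions can be observed with standard type traits. The helper names below are made up for the example.

```cpp
#include <cstddef>
#include <type_traits>

// The literal from the dump has type const char[15]:
// 14 characters plus the terminating '\0'.
constexpr std::size_t literalSize() { return sizeof("Hello, world!\n"); }

// ArrayToPointerDecay: const char[15] decays to const char*.
static_assert(
    std::is_same<std::decay<const char[15]>::type, const char*>::value,
    "an array decays to a pointer to its first element");

// FunctionToPointerDecay: a function type decays to a function pointer.
using UnaryIntFn = int(int);
static_assert(
    std::is_same<std::decay<UnaryIntFn>::type, int (*)(int)>::value,
    "a function decays to a pointer to function");

// The same array-to-pointer check, applied to the literal itself.
constexpr bool literalDecaysToCharPointer() {
  return std::is_same<std::decay<decltype("Hello, world!\n")>::type,
                      const char*>::value;
}
```

The static_asserts pass at compile time, which matches the 'const char[15]' and 'const char *' types printed next to the StringLiteral and its parent ImplicitCastExpr.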

You should now be able to use the Doxygen documentation to figure out what each of these nodes does. Here are some helpful links to the relevant parts: FunctionDecl, ParmVarDecl, CompoundStmt, CXXOperatorCallExpr, ImplicitCastExpr, DeclRefExpr, StringLiteral.

Next Time

In Part 2, we'll discuss ways to serialize this AST data structure so that we can index interesting parts of the data without needing to recompile the code each time we wish to access it.