Project: Callgraph Profiler

This project will help you get acquainted with using infrastructures like LLVM to gather basic information about computer programs. You will also gain some experience recognizing limitations and trade-offs made when designing and constructing a dynamic analysis tool.

For this project, you will construct an LLVM tool that can compute and output a weighted dynamic call graph for an execution of an input program.

As a reminder, a call graph is a directed graph where the nodes represent the functions within a program. An edge exists from foo() to bar() when foo() may call bar(). Such call graphs can be helpful for examining the structure of a program, and they are also a crucial first step in many other analyses. They are especially useful for understanding programs with indirect calls or function pointers. For such programs, a single call site in a program may call many different functions across different program executions or even within a single execution of a program. These graphs can also be made more informative by noting the precise call sites or lines in foo() that make calls to bar().

The dynamic weighted call graphs that you construct for this project shall show the observed call targets for each call site within an execution of the program. In addition, they shall show the weight of an edge in the call graph, the number of times that particular function was the target of that call site. For example, consider the simple program:

example.c
#include <stdio.h>


void
foo(int i) {
  printf("Blimey!\n");
}


void
bar(int i) {
  if (i > 0) {
    bar(i - 1);
  }
}


int
main(int argc, char** argv) {
  for (unsigned i = 0; i < 10; ++i) {
    void (*fptr)(int) = (i % argc) ? foo : bar;
    fptr(i);
  }
  return 0;
}

Then running the commands:

clang -S -emit-llvm -g example.c
callgraph-profiler example.ll -o example
./example 2 3

shall produce a CSV (Comma Separated Value) file called profile-results.csv in the current directory that looks like:

profile-results.csv
foo, example.c, 6, printf, 6
bar, example.c, 13, bar, 18
main, example.c, 22, bar, 4
main, example.c, 22, foo, 6

This CSV format encodes each edge in the dynamic call graph on a separate line. The format of a single line is:

<caller function name>, <call site file name>, <call site line #>, <callee function name>, <(call site,callee) frequency>

The last column on the line is the number of times that the callee (the call target) was called from that particular call site.

The CSV format is easy for you to produce and easier for me to grade, but it is perhaps challenging to understand. You are also provided with a script that can automatically convert the CSV format into a Graphviz formatted output like:

example.gv
digraph {
  node [shape=record];
  "main"[label="{main|<l0>example.c:22}"];
  "main":l0 -> "bar" [label="4",penwidth="1.11",labelfontcolor=black,color="#380000"];
  "main":l0 -> "foo" [label="6",penwidth="1.67",labelfontcolor=black,color="#550000"];
  "foo"[label="{foo|<l0>example.c:6}"];
  "foo":l0 -> "printf" [label="6",penwidth="1.67",labelfontcolor=black,color="#550000"];
  "bar"[label="{bar|<l0>example.c:13}"];
  "bar":l0 -> "bar" [label="18",penwidth="5.0",labelfontcolor=black,color="#ff0000"];
  "printf"[label="{printf}"];
}

which can in turn be passed to the dot command to automatically produce pictures of the dynamic call graph like the following:

%3 printf printf main main example.c:22 bar bar example.c:13 main:l0->bar 4 foo foo example.c:6 main:l0->foo 6 bar:l0->bar 18 foo:l0->printf 6

Examples of how you can automatically produce these pictures to better understand your results are provided to you.

Issues to keep in mind

IR Intrinsics – LLVM inserts calls to some functions in order to represent information within the IR. In particular, some debugging information is anchored by calls to functions that have names starting with llvm.dbg. You should ignore these functions in your callgraph.

Recursion – Both direct and mutual recursion must work correctly. For this project, recursion shouldn't pose any special problems, but it is always a useful corner case to bear in mind.

Pointers – Function pointers allow one call site to have many different potential targets/callees. They also complicate the identification of the call target at the call site.

External Functions – Some functions, like printf(), are implemented elsewhere in a library. Thus, you have access to the call site, but you cannot modify the call target.

What I provide

I have created a basic template for the project. This template can be used to create an LLVM project that compiles using cmake similar to the LLVM demo we saw in class. The tools directory contains source for a simple program called callgraph-profiler that takes in a single bitcode file, runs the instrumentation analysis that you will write, and produces a compiled program that includes any instrumentation you have added. The libs/callgraph-profiler-inst/ directory contains a template in ProfilingInstrumentationPass.cpp for the analysis that you will write in order to consruct a call graph. The header for the analysis is in include/ProfilingInstrumentationPass.h. The runtime support library for any functions to which you wish to insert calls can be found in libs/callgraph-profiler-rt/runtime.cpp. Feel free to modify these sources as much as you wish. Some example tests are provided in the tests/ directory to help you refine your implementation. I encourage you to discuss and share additional test cases in the class discussion forum.

You can find step-by-step instructions for building the project in the README.md document in the project template.

To build this project, you should use LLVM 4.0, available on the LLVM download page. Downloading the pre-built binaries should provide you everything you need. This should already be available in the CSIL lab at /usr/shared/CMPT/wsumner/llvm/. The bin/ subdirectory also contains appropriate versions of clang and other LLVM related tools that you may use for the project.

Building using cmake

First create a build directory separate from your source directory in order to perform an out-of-source build. From within the build directory for the project, run:

cmake -DLLVM_DIR=<installation, build, or extraction path of LLVM>/lib/cmake/llvm <path to project source>

Then you can simply run make to build the project within the build directory. The callgraph-profiler program will reside at bin/callgraph-profiler in the build directory.

On CSIL in particular, you should use a recent version of clang when compiling. You can do this by configuring the build with:

CXX=/usr/shared/CMPT/wsumner/llvm/bin/clang++ \
  CC=/usr/shared/CMPT/wsumner/llvm/bin/clang \
  cmake -DLLVM_DIR=/usr/shared/CMPT/wsumner/llvm/lib/cmake/llvm <path to project source>

What you turn in

You must turn in two things. Make sure that your name is clearly specified in both.

  1. The source files for your project. Compress your project directory into a .tar.gz file in order to submit it via CourSys All projects will be built and tested on a 64-bit machine. Make sure that we can test your project by simply running the callgraph-profiler program to instrument a module and running the resulting program. Grading of the projects will be automatic, so make sure that your output format is correct!

  2. A brief (up to 1 page) write up of your project that explains your basic design as well as the limitations and advantages of your approach. In particular, explain how you dealt with function pointers and external function calls. Are there ways that you did or could have made the analysis more efficient for common cases? Feel free to include any other points of interest, such as trade offs between design complexity and notions of efficiency.

Both of these should be submitted via CourSys by the deadline.

Deadline

Monday, September 18 at 11:59pm

Additional Resources on Static Call Graphs