Project 1: Weighted Call Graphs

This project will help you get acquainted with using infrastructures like LLVM to gather basic information about computer programs. You will also gain experience recognizing limitations and trade-offs made when designing and constructing a static analysis tool.

For this project, you will construct an LLVM tool that can compute and output a static call graph for an input program. You may notice that LLVM already has some functionality for computing and printing call graphs; however, the graphs that LLVM computes do not provide the same information that you will be required to present in this project.

As a reminder, a call graph is a directed graph where the nodes represent the functions within a program. An edge exists from foo() to bar() when foo() may call bar(). Such call graphs can be helpful for examining the structure of a program, and they are also a crucial first step in many other analyses. They are especially useful for understanding programs with indirect calls or function pointers. For such programs, a single call site in a program may call many different functions across different program executions or even within a single execution of a program. These graphs can also be made more informative by noting the precise call sites or lines in foo() that make calls to bar().

The call graphs that you construct for this project shall show the possibly called functions for each call site within a program. In addition, they shall show the weight of a function in the call graph, the number of incoming edges to that particular function. For example, for the simple program:

 1 void foo();
 3 void bar() {
 4   foo();
 5   bar();
 6 }
 8 void baz() {
 9   foo();
10   bar();
11 }
13 int main(int argc, char **argv) {
14   foo();
15   bar();
16   baz();
17   void (*bam)() = 0;
18   switch (argc%3) {
19    case 0: bam = foo; break;
20    case 1: bam = bar; break;
21    case 2: bam = baz; break;
22   }
23   bam();
24   return 0;
25 }

Your program shall produce Graphviz formatted output like:

digraph {
  node [shape=record];
  f0x12b8e70[label="{bar|Weight: 4|<l0>example.c:5|<l1>example.c:6}"];
  f0x12b9630[label="{foo|Weight: 4}"];
  f0x12b8f00[label="{baz|Weight: 2|<l0>example.c:10|<l1>example.c:11}"];
  f0x12b9030[label="{main|Weight: 0|<l0>example.c:15|<l1>example.c:16|<l2>example.c:17|<l3>example.c:24}"];
  f0x12b8e70:l0 -> f0x12b9630;
  f0x12b8e70:l1 -> f0x12b8e70;
  f0x12b8f00:l0 -> f0x12b9630;
  f0x12b8f00:l1 -> f0x12b8e70;
  f0x12b9030:l0 -> f0x12b9630;
  f0x12b9030:l1 -> f0x12b8e70;
  f0x12b9030:l2 -> f0x12b8f00;
  f0x12b9030:l3 -> f0x12b9630;
  f0x12b9030:l3 -> f0x12b8f00;
  f0x12b9030:l3 -> f0x12b8e70;

Instructions for producing this output are provided in the template. a:b -> c means that the call site b in function a may call function c. a[...] specifies the name, weight, and call sites of function a. I used hex strings to identify individual functions, but you may use whatever identifiers you like. That produces the following call graph:

Note that each node in the graph contains the name of the function that it represents along with weight of the function and a list of the line numbers of call sites in the function. Edges connect the call sites to their potential call targets.

Issues to keep in mind

IR Intrinsics – LLVM inserts calls to some functions in order to represent information within the IR. In particular, some debugging information is anchored by calls to functions that have names starting with llvm.dbg. You should ignore these functions in your callgraph.

Recursion – Both direct and mutual recursion must work correctly. For this project, recursion shouldn't pose any special problems, but it is always a useful corner case to bear in mind.

Disconnected graphs – Not every function may be reachable from the main function. As a result, the call graph may form several disconnected components. Your implementation must be able to handle this.

Pointers – Handling indirect function calls inherently leads to imprecision. You must select and implement one approach for constructing a call graph even in the presence of function pointers. On top of learning the basics of LLVM, this issue poses the greatest challenge for the project. There are several different approaches that you may take for trying to compute a more precise call graph even in the presence of function pointers. In (I think) increasing order of difficulty, some possible approaches are:

The only approach you may not use is the naive method of assuming that any indirect call may point to any function that has its address taken. You should make sure that you understand the trade offs of the particular approach that you use. You will be expected to discuss them in a brief document when you turn your project in.

What I provide

I have created a basic template for the project that includes the Graphviz formatted output. This template can be used to create an LLVM project that compiles using either configure or cmake. The tools directory contains source for a simple program called cggen that takes in a single bitcode file, runs the analysis that you will write, and prints the resulting callgraph. The libs directory contains a template in CallGraph.cpp for the analysis that you will write in order to consruct a call graph. The header for the analysis is in include/CallGraph.h. Feel free to modify these sources as much as you wish.

There are a few different ways that you may build the project. I recommend using whichever approach is most familiar to you.

Option 1) Building using autoconf

First, you must build LLVM as described here and skipping steps 5, 6, and 7. Make sure to run configure with the flags --disable-optimized --enable-debug-runtime --enable-assertions. Next, follow the instructions for building LLVM projects:
cd autoconf
Enter the paths to the LLVM source and build directories.
From a separate build directory, run:
<path to project source>/configure --with-llvmsrc=<path to llvm src dir> --with-llvmobj=<path to llvm build dir> --disable-optimized --enable-debug-runtime --enable-assertions
Then you can simple run make to build the project within the build directory. The cggen program will reside at Debug+Asserts/bin/cggen in the build directory.

Option 2) Building using cmake

First, you must build and install LLVM as described here. Make sure to run cmake with -DCMAKE_BUILD_TYPE=Debug. You can use -DCMAKE_INSTALL_PREFIX=<path to local install directory> to install llvm to a local path. From a separate build directory for the project, run:
CC=clang CXX=clang++ cmake -DLLVM_ROOT=<installation path of LLVM> <path to project source>
Then you can simple run make to build the project within the build directory. The cggen program will reside at tools/cggen/cggen in the build directory.

What you turn in

You must turn in two things. Make sure that your name is clearly specified in both.

  1. The source files for your project. Compress your project directory into a .tar.gz file in order to send it to me. I will build and test all projects on a 64-bit machine. Make sure that I can test your project by simply running the cggen program that gets built by the provided template. If you have used an external alias analysis component, include that as well.

  2. A brief (1 page) write up of your project that explains your basic design as well as the limitations and advantages of your approach. In particular, explain how you dealt with function pointers and how this relates to false positives and false negatives with respect to the 'may call' relationship (->) that the call graph captures. Feel free to include any other points of interest, such as trade offs between design complexity and precision.

Both of these should be emailed to me by the deadline.


Monday, January 27 at 11:59pm