This project is intended to help you get experience with simple interprocedural dataflow analysis and the sorts of approximations that arise when balancing precision and scalability.
Inferring the precise value of a variable (constant propagation) can be useful in a wide range of analyses and optimizations. It can allow a program to be partially evaluated at compile time, enable checking for divide by zero or null dereference errors, and even check whether accesses of array elements are within the safe bounds of an array. However, often a variable is not limited to just a single value but rather to a range or an interval of values. Inferring this interval of values that a variable may take on, such as [0, 31], can allow safety checks such as
Is the interval of possible values for an index variable contained within the interval of safe values for accessing the elements of an array?
In this project, you will implement interprocedural interval analysis with k-call-site sensitivity. The analysis shall be context sensitive and flow sensitive using the approach that we talked about in class. Recall that context sensitivity is approximated to a fixed (configurable) depth. For this project, you should use a depth of 2. If you wish to make the depth optionally configurable via the command line, you may. However, the default depth should be 2. You can find a nice algorithmic presentation of interprocedural algorithms in Figure 6 of the ESP paper.
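The depth-bounded contexts described above can be sketched directly. The following is a minimal illustration, not part of the project template; the names (K, CallSite, Context, push) are mine. A context is the sequence of the most recent K call sites, and pushing a new call site drops the oldest once the depth bound is exceeded:

```cpp
#include <deque>

// Depth bound for call-site sensitivity (default 2, per the assignment).
constexpr unsigned K = 2;

using CallSite = unsigned;            // e.g. the line number of the call
using Context  = std::deque<CallSite>;

// Extend a context with a new call site, truncating to the K most recent.
Context push(Context ctx, CallSite cs) {
  ctx.push_back(cs);
  if (ctx.size() > K)
    ctx.pop_front();                  // forget the oldest call site
  return ctx;
}
```

With K = 2, pushing call sites 5, 10, and 12 in turn yields the context 10:12 — exactly the truncation that makes the number of contexts finite.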
A good strategy for this project would be to first implement an intraprocedural version of the analysis. Recall that we looked at an analysis detecting problems with operations on opened or closed files in class. You should find that the core dataflow components for interval analysis are similar, but you will have a very different abstract domain, meet operator, and transfer function. You will also need to use a static call graph. You may assume that there will be no function pointers used in the graded test cases. Example code demonstrating how to create a naive intraprocedural analysis can be found here.
I also like the formulation of the problem presented in [Su 05] & [Gawlitza 09]. Note that these far exceed the requirements of what you must implement. Some of the original formulations are due to Cousot and Cousot. There is also an excellent presentation of the problem in the static analysis handbook by Møller and Schwartzbach.
Consider what the abstract domain might be for this problem. Clearly, each unknown value in the program is associated with a range of possible values such as [0,1024]. What is the maximum possible upper bound? What is the minimum possible lower bound? How do different intervals compose a lattice? How should the meet operator combine the abstract values coming from different predecessors in the control flow graph?
One common approach is to allow the artificial values -∞ and ∞ as possible lower and upper bounds (of ranges, not the lattice!), respectively. Recall that termination is often guaranteed in part by using a lattice of bounded height. What is the height of the lattice if you include -∞ and ∞? Unfortunately, it becomes infinite. Thus, the sequence of values computed over the iterations of fixed point computation may form an infinitely ascending chain up the lattice of our abstract domain. If you include these values in your ranges, you must be careful to ensure that your analysis terminates.
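As one possible shape for the answers to the questions above, here is a minimal sketch of an interval domain in C++. All names are illustrative rather than from the project template, and INT64_MIN/INT64_MAX stand in for -∞/∞ (which makes the infinite-chain concern above concrete):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// One possible abstract value for interval analysis.
struct Interval {
  int64_t lo, hi;   // invariant: lo <= hi for non-bottom values
  bool bottom;      // "no value yet" (e.g. an unreachable path)

  static Interval Bottom() { return {0, 0, true}; }
  static Interval Top() {   // [-inf, +inf], approximated by int64 limits
    return {std::numeric_limits<int64_t>::min(),
            std::numeric_limits<int64_t>::max(), false};
  }

  // Meet operator (in the class's dataflow sense): combine abstract
  // values flowing in from different CFG predecessors by taking the
  // smallest interval enclosing both.
  static Interval meet(const Interval &a, const Interval &b) {
    if (a.bottom) return b;
    if (b.bottom) return a;
    return {std::min(a.lo, b.lo), std::max(a.hi, b.hi), false};
  }
};
```

For example, meeting [0, 3] and [5, 9] yields [0, 9] — precision is lost (4 is now included), which is exactly the kind of approximation the lattice ordering captures.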
Specifically, you can use a widening operator. Applying a widening operator in the fixed point computation allows you to artificially bound the number of iterations required to reach a fixed point. For interval analysis, a simple widening operator can compare the new abstract value for a range against the old one, moving the upper bound to ∞ if it increased and the lower bound to -∞ if it decreased. Better precision can be achieved by explicitly examining loop bounds and including evaluation of guards like if statements [Cousot]. However, widening operators can be avoided through careful construction of the framework [Su 05], [Leroux 08], [Gawlitza 09].
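The simple widening operator described above might be sketched as follows. Interval here is a bare pair of bounds, with the int64 limits again standing in for -∞/∞; this is an illustration, not the required implementation:

```cpp
#include <cstdint>
#include <limits>

struct Interval { int64_t lo, hi; };

// Simple widening: if a bound moved outward since the last iteration,
// jump it straight to the corresponding infinity so that ascending
// chains cannot grow forever and the fixed point computation terminates.
Interval widen(const Interval &oldV, const Interval &newV) {
  Interval r = newV;
  if (newV.lo < oldV.lo)
    r.lo = std::numeric_limits<int64_t>::min();  // stands in for -inf
  if (newV.hi > oldV.hi)
    r.hi = std::numeric_limits<int64_t>::max();  // stands in for +inf
  return r;
}
```

So widening [0, 10] by a new value of [0, 11] gives [0, +∞): the single increase of the upper bound is taken as evidence of an unbounded ascent.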
In our case, we also specifically know what the maximum and minimum possible values are for a variable because we have information about the representation size that LLVM provides. Note that the carefully constructed frameworks presented above can still be more efficient than explicitly using type size information to determine maxima and minima. Why is that?
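For instance, a helper deriving the representable signed range from a bit width (such as the one LLVM reports via IntegerType::getBitWidth) might look like the following sketch; the names Range and signedRange are mine:

```cpp
#include <cstdint>
#include <limits>

struct Range { int64_t lo, hi; };

// Representable signed range for an integer of the given bit width,
// e.g. bits = 32 for an i32. Widths of 64 or more are clamped to the
// int64 limits to avoid shifting by the full word size.
Range signedRange(unsigned bits) {
  if (bits >= 64)
    return {std::numeric_limits<int64_t>::min(),
            std::numeric_limits<int64_t>::max()};
  return {-(int64_t{1} << (bits - 1)),        // e.g. -128 for i8
          (int64_t{1} << (bits - 1)) - 1};    // e.g.  127 for i8
}
```

Initializing unknowns to these type-derived bounds gives you finite extremes without ±∞, though as the question above hints, iterating all the way up to them can be far slower than widening.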
I reserve the right to include a test that requires a more sophisticated widening operation in order to succeed. You may look to Section 5.11 of Møller and Schwartzbach for ideas.
You should use your interval analysis to identify and report a limited class of memory errors. Recall that the getelementptr or gep instruction in LLVM allows you to compute a memory address relative to a base address. This computed address can then be used by a load or a store instruction.

For memory accesses to a locally created buffer (made using an AllocaInst), you should report any loads from or stores to potentially invalid memory locations. You may assume that any loads from or stores to such pointers will either directly use the value computed by the gep instruction in question or that they may use a casted version of it. You do not need to handle cases where the pointer itself is stored into and then loaded from some location in memory. (But you should recall a technique that would help you to do so.)
You may also assume that all local memory buffers are created with a constant size. Thus, you do not need to handle cases where the size of a memory buffer itself depends on the results of your interval analysis. (But what problems would you encounter in doing so?)
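As a concrete illustration (a hypothetical test input of my own, not one of the graded cases), the following function allocates a constant-size local buffer, and the guard lets the analysis bound the index:

```cpp
// `buf` becomes a 16-byte, constant-size AllocaInst in the IR. The guard
// lets the analysis infer i in [0, 20], so the gep feeding the store
// below may compute byte offsets [0, 20] (element size 1) against a
// valid range of [0, 15] -- it should be reported, even though a given
// run may never actually go out of bounds.
int access(int i) {
  char buf[16] = {0};
  if (i >= 0 && i <= 20)
    buf[i] = 1;        // potentially out-of-bounds store: offsets 16..20
  return buf[0];
}
```

Note that the precision of the report hinges on evaluating the guard (as discussed with widening above): without it, i would be unbounded and the reported range far wider.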
For any encountered errors, you should report them in CSV format. The format of each line should be:
<Context>, <Function of access>, <Line of access>, <Size of buffer>, <Possible range for access>
Here, <Context> should be the k-call-site sensitive context. You should report this as the sequence of line numbers of the corresponding call sites in your context. Separate entries in the sequence with colons, e.g. 10:5. The function and line number of the access simply report information about the load or the store where the error occurred. The <Size of buffer> should simply report the size (in bytes) of the buffer being accessed. <Possible range for access> should contain the lower and upper bounds, in bytes, of the possibly accessed offsets from the base of the memory buffer. The lower and upper bounds should be separated by a colon, e.g. -3:8.
You may not share your code with other students, but you should collaborate in other ways. You are free to discuss the problem, to share your test cases, to point out documentation in LLVM that may be useful, and more. The class mailing lists and CourSys discussion fora can provide an invaluable resource, but only if you use them.
A template for the project is available here.
The compilation process for the project should be the same as before.
To complete the project, you must integrate the analysis in tools/overflower/main.cpp.
You may add or change any other files as you see fit.
To use the tool, I should be able to run bin/overflower <input bitcode or llvm assembly> from the build directory.
You must turn in two things via CourSys. Make sure that your name is clearly specified in both.
The source files for your project. Compress your project directory into a .tar.gz file and submit. I will build and test all projects on a 64-bit Linux machine. Make sure that I can test your project by simply running the tool that gets built by the provided template.
A brief (1 page max) write up of your project that explains your basic design as well as what results you may have observed when running your tool on tests. Where does it produce false positives? False negatives? How might you change things to make it more useful (feel free to brainstorm)?
Both of these should be submitted to CourSys by the deadline.
Monday, October 16 at 11:59pm