Module 3: Static Analysis Fundamentals -- Student Guide

Welcome to Module 3! You will build control flow graphs, implement a generic dataflow framework with powerset lattices and fixpoint iteration, then apply it to two classic analyses (reaching definitions and live variables), and finally construct call graphs for interprocedural analysis -- all in OCaml.

Source Code  ->  AST  ->  CFG Construction  ->  Dataflow Analysis  ->  Results
                               |                       |
                  +------------+            +----------+-----------+
                  |                         |          |           |
             Basic Blocks              Lattice    Transfer Fn   Fixpoint
             + Edges                  (powerset)  (gen/kill)    Iteration
                                                                   |
                                                    +--------------+--------------+
                                                    |                             |
                                            Forward (may)                 Backward (may)
                                          Reaching Definitions           Live Variables
                                           "which defs reach             "which vars are
                                            this point?"                  used later?"

Exercises: 5 (72 tests) | Lab: Lab 3 -- Static Checker (5 tests) Estimated time: 2--3 hours for exercises, 2--3 hours for the lab

How Exercises Work
Background: Key Concepts
Exercise 1: CFG Construction (14 tests)
Exercise 2: Dataflow Framework (11 tests)
Exercise 3: Reaching Definitions (15 tests)
Exercise 4: Live Variables (14 tests)
Exercise 5: Interprocedural Analysis (18 tests)
Lab 3: Static Checker (5 tests)
Troubleshooting
Exercise Progression Cheat Sheet

1. How Exercises Work

Each exercise has this structure:

exercises/cfg-construction/
├── dune                          <- tells dune which dirs to build
├── starter/
│   ├── dune                      <- library definition
│   ├── cfg.ml                    <- YOUR FILE -- edit this
│   ├── cfg.mli                   <- interface (read-only)
│   └── exercises.ml              <- YOUR FILE -- edit this
└── tests/
    ├── dune                      <- test configuration
    └── test_cfg.ml               <- read-only test file

The workflow:

Read the starter file -- types and docstrings are provided; every function body is failwith "TODO: ..." with a comment explaining what to implement.
Implement one function at a time -- replace the failwith with real code.

Run tests -- see your progress:

dune runtest modules/module3-static-analysis/exercises/cfg-construction/

Interpret the output:

Early on (many TODOs remaining), you may see:
```
Fatal error: exception Failure("TODO: ...")
```
This means the tests hit a failwith "TODO" stub and crashed before they could run individually. Keep implementing functions from the top of the file.

Once enough functions are implemented, you will see per-test results:
```
..E.FEEEE
```
- . = test passed
- E = error (your code threw an exception -- likely a remaining TODO)
- F = failure (your code ran but returned the wrong answer -- check logic)
Repeat until all tests pass: .............. (all dots!)

Tips:

Implement functions in the order they appear in the file -- later functions often depend on earlier ones.
Read the test file (tests/test_*.ml) to understand exactly what is expected.
Tests are read-only -- do not modify them.
For OCaml/dune environment setup, see the setup instructions in the Module 4 student guide or the project-level README.

2. Background: Key Concepts

Before starting the exercises, review these ideas from the slides (slides/static-analysis.md).

Control Flow Graphs (CFGs)

A CFG represents a program as a directed graph. Each node is a basic block (a maximal sequence of statements with one entry and one exit). Each edge represents possible control flow between blocks.

    ENTRY              Every CFG has two distinguished blocks:
      |                  ENTRY -- where execution begins
      v                  EXIT  -- where execution ends
     B1
    / \               Basic blocks contain statements.
   v   v              Edges encode branches, loops, and
  B2   B3             fall-through control flow.
   \  /
    v
  B_join
    |
    v
   EXIT

Lattices

A lattice provides the mathematical structure for dataflow facts:

bottom -- the least element (no information / unreachable)
top -- the greatest element (all information / unknown)
join (least upper bound) -- combines information from merging paths
meet (greatest lower bound) -- intersects information

For this module, you will work with powerset lattices where values are sets of strings and join = union, meet = intersection.

Transfer Functions

A transfer function models how a single basic block transforms dataflow facts.

Forward analysis:   OUT[B] = transfer(IN[B])
                    e.g., OUT[B] = gen[B] U (IN[B] - kill[B])

Backward analysis:  IN[B] = transfer(OUT[B])
                    e.g., IN[B] = use[B] U (OUT[B] - def[B])

Fixpoint Iteration

The iterative algorithm repeats until no values change:

1. Initialize all IN[B] and OUT[B] to the initial value (e.g., empty set)
2. Repeat:
     For each block B:
       - Merge values from predecessor/successor blocks
       - Apply the transfer function
3. Stop when nothing changes (fixpoint reached)

Guaranteed to terminate when the lattice has finite height and the transfer functions are monotone.

Forward vs. Backward Analysis

+-------------------+----------------------+----------------------+
|                   | Forward              | Backward             |
+-------------------+----------------------+----------------------+
| Direction         | ENTRY -> EXIT        | EXIT -> ENTRY        |
| Merge over        | predecessors         | successors           |
| Transfer computes | OUT from IN          | IN from OUT          |
| Example           | Reaching definitions | Live variables       |
| Question answered | "Which defs MIGHT    | "Which vars MIGHT    |
|                   |  reach this point?"  |  be used later?"     |
+-------------------+----------------------+----------------------+

The AST types

Programs are represented as Shared_ast.Ast_types (defined in lib/shared_ast/ast_types.ml):

type expr =
  | IntLit of int              (* 42 *)
  | BoolLit of bool            (* true *)
  | Var of string              (* x *)
  | BinOp of op * expr * expr  (* x + y *)
  | UnaryOp of uop * expr      (* -x *)
  | Call of string * expr list (* f(x, y) *)

type stmt =
  | Assign of string * expr           (* x = e *)
  | If of expr * stmt list * stmt list (* if e { ... } else { ... } *)
  | While of expr * stmt list         (* while e { ... } *)
  | Return of expr option             (* return e *)
  | Print of expr list                (* print(e1, e2) *)
  | Block of stmt list                (* { s1; s2; ... } *)

type func_def = { name: string; params: string list; body: stmt list }
type program = func_def list

3. Exercise 1: CFG Construction (14 tests)

Goal: Build control flow graphs from AST statement lists. Implement the core CFG operations (add edges, query predecessors/successors) and three CFG construction patterns (sequential, if-else diamond, while loop with back edge).

Time: ~30 minutes

Files to edit:

exercises/cfg-construction/starter/cfg.ml -- core CFG operations
exercises/cfg-construction/starter/exercises.ml -- CFG construction patterns

Also provided (read-only):

exercises/cfg-construction/starter/cfg.mli -- interface specification

CFG patterns you will build

Sequential:          If-Else (Diamond):        While (Back Edge):

 ENTRY                  ENTRY                    ENTRY
   |                      |                        |
   v                      v                        v
  B1                   B_cond                    B_pre
   |                   /    \                      |
   v                  v      v                     v
  EXIT             B_then  B_else              B_cond  <---+
                      \    /                   /    \      |
                       v  v                 B_body   \     |
                      B_join                  |       \    |
                        |                     +--------+   |
                        v                              |
                       EXIT                          B_post
                                                       |
                                                       v
                                                      EXIT

What to implement (in order)

In cfg.ml:

#	Function	Hint
1	`add_edge cfg src dst`	Find both blocks in `cfg.blocks`, append `dst` to src's succs and `src` to dst's preds, update the map with `StringMap.add`
2	`predecessors cfg label`	`StringMap.find` the block, return its `preds`
3	`successors cfg label`	`StringMap.find` the block, return its `succs`
4	`to_string cfg`	`StringMap.fold` over blocks; for each, print label, stmt count, succs, preds

In exercises.ml:

#	Function	Hint
5	`build_cfg_sequential stmts`	Create ENTRY, B1 (all stmts), EXIT blocks; add edges ENTRY->B1->EXIT
6	`build_cfg_ifelse stmts`	Partition around the `If` node; create B_cond, B_then, B_else, B_join; wire the diamond
7	`build_cfg_while stmts`	Partition around the `While` node; create B_pre, B_cond, B_body, B_post; wire the back edge B_body->B_cond

Run tests

dune runtest modules/module3-static-analysis/exercises/cfg-construction/

Starter output: .EEEEE.EEEEEEE

Some tests already pass (the dots) because the provided create_block helper function works out of the box -- its tests do not depend on your TODO stubs. The E results are the tests that call your unimplemented functions. As you implement add_edge, predecessors, successors, and to_string, the middle tests will turn to dots. Then tackle exercises.ml for the remaining tests.

Hint for build_cfg_ifelse: You need to split the statement list into three parts: statements before the If, the If itself (from which you extract then_stmts and else_stmts), and statements after the If. A recursive helper or List.fold_left works well for this partitioning.

Hint for build_cfg_while: Same partitioning idea, but around the While node. Remember that B_cond is empty (the condition is implicit in the branching edges), and the key addition is the back edge from B_body to B_cond.

4. Exercise 2: Dataflow Framework (11 tests)

Goal: Implement a powerset lattice (sets of strings) and a generic iterative fixpoint solver that works for both forward and backward analyses.

Time: ~30 minutes

Files to edit:

exercises/dataflow-framework/starter/lattice.ml -- powerset lattice operations
exercises/dataflow-framework/starter/dataflow.ml -- iterative fixpoint solver

The powerset lattice

     {a, b, c}        <- top (the full universe)
      / | \
  {a,b} {a,c} {b,c}
    / \  / \  / \
   {a} {b}  {c}
     \  |  /
       {}              <- bottom (empty set)

What to implement (in order)

In lattice.ml:

#	Function	Hint
1	`bottom`	The empty set: `StringSet.empty`
2	`top`	The universe: `!universe` (dereference the ref)
3	`join a b`	Set union: `StringSet.union`
4	`meet a b`	Set intersection: `StringSet.inter`
5	`equal a b`	Set equality: `StringSet.equal`
6	`to_string s`	Format as `{a, b, c}` -- use `StringSet.elements` and `String.concat`

In dataflow.ml:

#	Function	Hint
7	`solve analysis cfg`	Implement the iterative worklist algorithm (see below)

The fixpoint algorithm

Forward case:
  1. Initialize IN[B] = OUT[B] = analysis.init for every block B
  2. Repeat until nothing changes:
       For each block B:
         IN[B]  = merge over all predecessors P: OUT[P]
         OUT[B] = transfer(B, IN[B])
  3. Return (label, IN[B], OUT[B]) for each block

Backward case:
  Swap the roles of IN/OUT and predecessors/successors:
    OUT[B] = merge over all successors S: IN[S]
    IN[B]  = transfer(B, OUT[B])

Hint: Use StringMap to store the IN and OUT values for each block. In each iteration, check whether any value changed using analysis.equal. When nothing changes, you have reached the fixpoint.

Run tests

dune runtest modules/module3-static-analysis/exercises/dataflow-framework/

Starter output:

Fatal error: exception Failure("TODO: return the least element of the powerset lattice")

This fatal error occurs because the bottom value is evaluated at module load time. The test runner crashes immediately before it can show per-test results. Once you implement bottom (and the other lattice functions), you will start seeing per-test output like ........EEE as the solver tests remain TODO.

5. Exercise 3: Reaching Definitions (15 tests)

Goal: Implement a forward may-analysis that tracks which variable definitions can reach each program point.

Time: ~25 minutes

Files to edit:

exercises/reaching-definitions/starter/reaching_definitions.ml -- gen/kill sets + fixpoint

Also provided (read-only):

exercises/reaching-definitions/starter/example_program.ml -- the test program
exercises/reaching-definitions/starter/worksheet.md -- guided walkthrough

The example program

    B1: a = 1  (d1)           CFG:   B1
        b = 2  (d2)                 / \
                                   v   v
    B2: a = 3  (d3)              B2    B3
        c = a  (d4)               \   /
                                   v v
    B3: b = 4  (d5)               B4
        c = b  (d6)

    B4: print(a, b, c)

Transfer function

OUT[B] = gen[B] U (IN[B] - kill[B])
IN[B]  = U{ OUT[P] | P is a predecessor of B }

gen[B]: definitions created in block B (last def of each variable in B)
kill[B]: definitions killed by B (other defs of variables that B defines)

What to implement (in order)

#	Function	Hint
1	`compute_gen defs block`	Filter `defs` by block, keep only the last definition per variable, return their `def_id`s as a `StringSet`
2	`compute_kill defs block`	For each variable defined in this block, find all OTHER definitions of that variable from any block; exclude gen[B]
3	`analyze cfg defs`	Iterative fixpoint: initialize OUT[B] = {} for all B, repeat merge + transfer until stable

Run tests

dune runtest modules/module3-static-analysis/exercises/reaching-definitions/

Starter output: EEEEEEEEEEEEEEE

All 15 tests error because every function is still a TODO stub. The tests are organized in groups: 4 gen tests, 4 kill tests, 7 analyze tests. Implement compute_gen first to get the first 4 passing, then compute_kill, then analyze.

Hint for compute_gen: Filter defs to those in the given block. If a variable has multiple definitions in the same block, only the last one counts. Use List.filter and List.rev or fold from right to left.

Hint for analyze: The CFG is given as (label, predecessor_labels) pairs (not full block objects). Use StringMap to store the OUT set for each block and iterate until no OUT set changes.

6. Exercise 4: Live Variables (14 tests)

Goal: Implement a backward may-analysis that tracks which variables may be used in the future (are "live") at each program point.

Time: ~25 minutes

Files to edit:

exercises/live-variables/starter/live_variables.ml -- use/def extraction + fixpoint

Also provided (read-only):

exercises/live-variables/starter/comparison.ml -- side-by-side comparison of reaching defs vs. live variables (reference reading)

The example program

         B1: def={a,b}, use={}
        /  \
       v    v
      B2    B3
 def={a,c}  def={b,c}
 use={a}    use={b}
       \   /
        v v
         B4: def={}, use={a,b,c}

Transfer function (backward!)

IN[B]  = use[B] U (OUT[B] - def[B])
OUT[B] = U{ IN[S] | S is a successor of B }

use[B]: variables read in B before being (re)defined
def[B]: variables assigned/written in B

What to implement (in order)

#	Function	Hint
1	`compute_use (label, defs, uses)`	Return `StringSet.of_list uses`
2	`compute_def (label, defs, uses)`	Return `StringSet.of_list defs`
3	`analyze blocks`	Backward iterative fixpoint: initialize IN[B] = {} for all B, merge over successors to get OUT, apply transfer to get IN, repeat until stable

Run tests

dune runtest modules/module3-static-analysis/exercises/live-variables/

Starter output: EEEEEEEEEEEEEE

All 14 tests error because every function is a TODO. The tests include 3 compute_use tests, 2 compute_def tests, 6 diamond-CFG analysis tests, and 3 loop analysis tests. Start with compute_use and compute_def (straightforward one-liners), then tackle analyze.

Key difference from Exercise 3: This is a backward analysis. You merge over successors (not predecessors) to compute OUT, then apply the transfer function to compute IN. The fixpoint check is on IN values (not OUT).

Hint for analyze: The blocks are given as (label, defined_vars, used_vars, successor_labels) tuples. Build a StringMap keyed by label. In each iteration, compute OUT[B] by unioning the IN values of B's successors, then compute IN[B] via the transfer function. Iterate until no IN set changes.

7. Exercise 5: Interprocedural Analysis (18 tests)

Goal: Build a call graph from a multi-function program and use it to answer questions about function reachability and recursion detection.

Time: ~30 minutes

Files to edit:

exercises/interprocedural-analysis/starter/call_graph.ml -- call graph construction + queries

Also provided (read-only):

exercises/interprocedural-analysis/starter/example_program.ml -- the test program
exercises/interprocedural-analysis/starter/context_analysis.md -- background reading

The example program

def helper(param):               Call graph:
    return param + 1
                                    main
def process_data(x, y):              |
    temp = x * 2                     v
    result1 = helper(temp)       process_data
    result2 = helper(y)              |
    return result1 + result2         v
                                   helper
def main():
    a = 5
    b = 10
    output = process_data(a, b)
    print(output)

What to implement (in order)

#	Function	Hint
1	`calls_in_expr expr`	Recursively walk the expression tree; collect function names from `Call(name, args)` nodes; do not forget to recurse into args
2	`calls_in_stmts stmts`	Walk each statement; recurse into `Assign`, `If` (cond + both branches), `While` (cond + body), `Return`, `Print`, `Block`. Use `let rec ... and ...` for mutual recursion
3	`build_call_graph program`	For each `func_def`, add its name to `nodes` and record `calls_in_stmts func.body` as its edge set in a `StringMap`
4	`reachable_from cg start`	BFS or DFS from `start` following call graph edges; return all reachable functions (not including `start` unless it is in a cycle)
5	`find_recursive cg`	A function `f` is recursive if `f` is in `reachable_from cg f`. Check every node and return a sorted list

Run tests

dune runtest modules/module3-static-analysis/exercises/interprocedural-analysis/

Starter output:

Fatal error: exception Failure("TODO: Build call graph from a program")

This fatal error occurs because the test file builds a call graph at the top level (module load time) using build_call_graph. Since that function is still a TODO, the test runner crashes immediately before showing per-test results. You need to implement calls_in_expr, calls_in_stmts, and build_call_graph first. Once those three are done, the test runner will be able to start and you will see per-test output for the remaining functions.

Hint for calls_in_expr: You need let rec because Call arguments are themselves expressions that may contain nested calls. Pattern match on each expression variant:

IntLit, BoolLit, Var -- return []
Call(name, args) -- name :: List.concat_map calls_in_expr args
BinOp(_, e1, e2) -- recurse into both sub-expressions
UnaryOp(_, e) -- recurse into the sub-expression

Hint for reachable_from: Use a StringSet as a "visited" set. Start with the callees of start, add them to the visited set, and continue BFS/DFS until the frontier is empty. Do not add start to the initial visited set -- it should only appear in the result if a cycle leads back to it.

8. Lab 3: Static Checker (5 tests)

After completing the exercises, tackle Lab 3. You will build a static analysis checker that detects common code quality issues by walking the AST.

Location: labs/lab3-static-checker/

Read the full spec: labs/lab3-static-checker/README.md

Files to edit

File	What you implement
`starter/rules.ml`	Three analysis rules: `check_unused_variables`, `check_unreachable_code`, `check_shadowed_variables`
`starter/checker.ml`	`check_program` -- aggregator that runs all rules on every function and collects issues

Already provided (read-only)

File	Contents
`starter/reporter.ml`	Issue type, severity, formatting, print_report
`starter/tests/test_checker.ml`	5 test cases

The three rules

Rule	What it detects	Approach
`check_unused_variables`	Variables assigned but never read	Walk stmts to collect assigned vars, walk exprs to collect used vars, report the difference
`check_unreachable_code`	Statements after a `Return` in the same list	Scan each statement list; once a `Return` is found, all subsequent statements are unreachable
`check_shadowed_variables`	Inner-scope assignment shadows outer-scope variable	Track defined vars at each scope level; when entering `If`/`While` bodies, check for re-assignment of outer vars

Build and test

# Build
dune build labs/lab3-static-checker/

# Run tests
dune runtest labs/lab3-static-checker/

Starter output: EEEEE

All 5 tests error because every function is a TODO. The tests check: unused variables (1), unreachable code (1), shadowed variables (1), clean program produces no issues (1), and aggregation across multiple rules (1).

Tips

Start with check_unused_variables -- it is the most straightforward rule.
For check_unreachable_code, remember to recurse into nested If/While bodies to find unreachable code inside them.
check_program should iterate over every func_def in the program and apply all three rules, concatenating the resulting issue lists.
Look at reporter.ml to see the issue record type. Each rule returns a list of Reporter.issue values with appropriate rule, message, severity, and location fields.

9. Troubleshooting

Build errors

Error	Fix
`ocaml: command not found`	Run `eval $(opam env)` or add it to your `~/.bashrc`/`~/.zshrc`
`dune: command not found`	`opam install dune && eval $(opam env)`
`Error: Unbound module OUnit2`	`opam install ounit2`
`Error: Unbound module Shared_ast`	Run `dune build` from the repo root first (not from inside an exercise)

Test errors

Symptom	Meaning
`Fatal error: exception Failure("TODO: ...")`	The first function called is still a TODO -- implement it first
`EEEEEEEEEE`	Every test errors -- functions still have `failwith "TODO"`
`.EEEEE.EEEEEEE`	Some tests pass (provided helpers work), rest are TODOs
`..EEEE`	First 2 tests pass, rest still TODO
`..F.EE`	3rd test fails (wrong answer) -- read error message for expected vs actual

Common OCaml mistakes

Mistake	Fix
`match` not exhaustive	Add all cases: `Assign`, `If`, `While`, `Return`, `Print`, `Block`
`StringSet` comparison with `=`	Use `StringSet.equal` -- polymorphic `=` compares AVL tree structure, not logical set equality
Infinite loop in fixpoint	Make sure your "changed" flag is correctly tracking whether any IN/OUT value actually changed
`Unbound module Cfg`	Your `exercises.ml` should `open` or qualify with `Cfg.` -- the library wraps modules
`Stack overflow` in reachable_from	Track visited nodes to avoid infinite recursion through cycles
`type t is abstract`	Use the module prefix: `Cfg.StringMap.find`, `StringSet.union`, etc.

Running tests from the right directory

Always run dune commands from the repository root, not from inside an exercise directory:

# CORRECT -- from repo root:
dune runtest modules/module3-static-analysis/exercises/cfg-construction/

# WRONG -- from inside the exercise (will fail to find shared libraries):
cd modules/module3-static-analysis/exercises/cfg-construction/
dune runtest   # ERROR: can't find shared_ast

10. Exercise Progression Cheat Sheet

Exercise 1: CFG Construction          <- build the graph structure that all analyses use
     |
Exercise 2: Dataflow Framework        <- generic lattice + fixpoint solver
     |
Exercise 3: Reaching Definitions      <- forward may-analysis (gen/kill sets)
     |
Exercise 4: Live Variables            <- backward may-analysis (use/def sets)
     |
Exercise 5: Interprocedural Analysis  <- call graphs, reachability, recursion detection
     |
Lab 3: Static Checker                 <- apply AST walking to detect real code quality issues

Recommended order: Exercises 1 through 5 are designed to be done in sequence. Each builds on concepts from the previous one:

Exercise 1 teaches CFG structure (nodes, edges, blocks)
Exercise 2 gives you the lattice and solver machinery
Exercise 3 applies the framework as a forward analysis
Exercise 4 applies the framework as a backward analysis (the dual)
Exercise 5 extends analysis to multiple functions (interprocedural)

Full path: All 5 exercises + Lab 3 (72 exercise tests + 5 lab tests = 77 total)

Good luck!

Static Analysis Fundamentals

Understanding Static Analysis

CFG Explorer

Exercises

CFG Construction

Dataflow Framework

Reaching Definitions

Live Variables

Interprocedural Analysis

Module 3: Static Analysis Fundamentals -- Student Guide

Table of Contents

1. How Exercises Work

2. Background: Key Concepts

Control Flow Graphs (CFGs)

Lattices

Transfer Functions

Fixpoint Iteration

Forward vs. Backward Analysis

The AST types

3. Exercise 1: CFG Construction (14 tests)

CFG patterns you will build

What to implement (in order)

Run tests

4. Exercise 2: Dataflow Framework (11 tests)

The powerset lattice

What to implement (in order)

The fixpoint algorithm

Run tests

5. Exercise 3: Reaching Definitions (15 tests)

The example program

Transfer function

What to implement (in order)

Run tests

6. Exercise 4: Live Variables (14 tests)

The example program

Transfer function (backward!)

What to implement (in order)

Run tests

7. Exercise 5: Interprocedural Analysis (18 tests)

The example program

What to implement (in order)

Run tests

8. Lab 3: Static Checker (5 tests)

Files to edit

Already provided (read-only)

The three rules

Build and test

Tips

9. Troubleshooting

Build errors

Test errors

Common OCaml mistakes

Running tests from the right directory

10. Exercise Progression Cheat Sheet