Before You Code

Understanding Tools Integration

Before building the capstone analyzer in OCaml, let's understand how real-world tools combine multiple analysis passes into a unified pipeline, aggregate findings into a common format, and produce actionable reports for developers and CI/CD systems.

Why Tools Integration?

You've built individual analyses — CFGs, dataflow, abstract interpretation, taint tracking. But real-world tools like ESLint, Semgrep, and Coverity don't run just one analysis. They combine multiple passes into a unified pipeline, aggregate findings, and produce actionable reports. This module is about building that integration layer.

Single Pass vs. Multi-Pass

Multi-Pass Pipeline

Chain multiple analyses together. Each pass produces findings. The pipeline orchestrates execution and collects results.

Unified Findings

A common finding format: severity, location, message, category. Different passes produce the same type of output.

Actionable Reports

Group by severity, by pass, or by location. Format for developers (inline) or security teams (summary). The same data, different views.

Analysis Findings

Every analysis pass produces findings — structured records describing potential issues. A unified finding type lets the pipeline collect results from different passes into one stream.

Finding Record Structure
type severity = Critical | High | Medium | Low | Info

type finding = {
  id       : string;       (* unique identifier *)
  pass     : string;       (* which analysis pass *)
  severity : severity;     (* Critical|High|Medium|Low|Info *)
  line     : int;          (* source line number *)
  message  : string;       (* human-readable description *)
  category : string;       (* e.g. "sql-injection", "div-by-zero" *)
}
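To make the record concrete, here is a minimal sketch of building and printing one finding. The `severity` variant is written out from the five levels this module lists, and `make_finding`, `severity_to_string`, and `finding_to_string` are hypothetical helper names, not part of any library. Deriving the id from pass, category, and line is only one possible scheme.

```ocaml
(* Assumed definition: the five levels named in this module. *)
type severity = Critical | High | Medium | Low | Info

type finding = {
  id       : string;
  pass     : string;
  severity : severity;
  line     : int;
  message  : string;
  category : string;
}

(* Hypothetical helper: derive the id from pass, category, and line. *)
let make_finding ~pass ~severity ~line ~category message =
  { id = Printf.sprintf "%s:%s:%d" pass category line;
    pass; severity; line; message; category }

let severity_to_string = function
  | Critical -> "Critical"
  | High -> "High"
  | Medium -> "Medium"
  | Low -> "Low"
  | Info -> "Info"

(* One-line rendering, suitable for inline developer output. *)
let finding_to_string f =
  Printf.sprintf "[%s] line %d: %s (%s, from %s)"
    (severity_to_string f.severity) f.line f.message f.category f.pass

let example =
  make_finding ~pass:"taint" ~severity:Critical ~line:5
    ~category:"sql-injection" "tainted query reaches db.exec()"

let () = print_endline (finding_to_string example)
```

Because every pass returns values of this one type, the rest of the pipeline never needs to know which analysis produced a finding.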

Severity Levels

Critical: Exploitable vulnerabilities (SQLi, RCE)
High: Likely crashes or security issues (div-by-zero)
Medium: Code quality issues (unreachable code, weak crypto)
Low: Style issues (unused variables)
Info: Informational notes (hardcoded values)
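Sorting and filtering by severity needs an explicit order. The sketch below assigns each level a rank, with 0 as the most severe. The names `severity_rank`, `compare_severity`, and `at_least` are illustrative assumptions.

```ocaml
type severity = Critical | High | Medium | Low | Info

(* Lower rank means more severe. *)
let severity_rank = function
  | Critical -> 0
  | High -> 1
  | Medium -> 2
  | Low -> 3
  | Info -> 4

let compare_severity a b = compare (severity_rank a) (severity_rank b)

(* True when [s] is at least as severe as [threshold]. *)
let at_least threshold s = severity_rank s <= severity_rank threshold

let sorted = List.sort compare_severity [Low; Critical; Info; High; Medium]
```

An explicit rank function keeps the ordering independent of the order in which the constructors happen to be declared.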

Categories

Categories group findings by what kind of issue they represent, regardless of which pass found them.

sql-injection, div-by-zero, xss, unused-variable, unreachable, path-traversal, weak-crypto, null-deref

Multiple passes can produce findings in the same category — e.g., both taint analysis and constant propagation might flag injection risks. Deduplication is part of the reporting layer.
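One way to deduplicate is sketched below. It assumes two findings are duplicates when they share a line and a category, and it keeps the more severe one. That duplicate rule and the function name `dedupe` are assumptions made for illustration; a real tool might compare more fields.

```ocaml
type severity = Critical | High | Medium | Low | Info

type finding = {
  id : string; pass : string; severity : severity;
  line : int; message : string; category : string;
}

let severity_rank = function
  | Critical -> 0 | High -> 1 | Medium -> 2 | Low -> 3 | Info -> 4

(* Assumption: same line and same category means the same issue. *)
let dedupe findings =
  let best = Hashtbl.create 16 in
  List.iter
    (fun f ->
      let key = (f.line, f.category) in
      match Hashtbl.find_opt best key with
      | Some g when severity_rank g.severity <= severity_rank f.severity -> ()
      | _ -> Hashtbl.replace best key f)
    findings;
  (* Sort by location so the result does not depend on hash order. *)
  Hashtbl.fold (fun _ f acc -> f :: acc) best []
  |> List.sort (fun a b -> compare (a.line, a.category) (b.line, b.category))

let taint_hit =
  { id = "t1"; pass = "taint"; severity = Critical; line = 5;
    message = "tainted query reaches db.exec()"; category = "sql-injection" }

let const_hit =
  { id = "c1"; pass = "constprop"; severity = High; line = 5;
    message = "query built from non-constant string"; category = "sql-injection" }

let unused_hit =
  { id = "d1"; pass = "dead-code"; severity = Low; line = 7;
    message = "Unused variable: 'unused'"; category = "unused-variable" }

let unique = dedupe [const_hit; taint_hit; unused_hit]
```

Here the two line-5 reports collapse into the Critical one from the taint pass, and the unrelated line-7 finding is kept.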

Dead Code Detection

Dead code detection combines multiple analyses into a single pass: live-variable analysis finds unused assignments, reachability analysis finds unreachable statements, and reaching definitions finds dead stores (definitions that reach no use). It's a good example of multi-analysis integration.

Dead Code Examples
x = 5
y = 10    // y is assigned but never read
result = x + 1
Finding: Variable 'y' is assigned on line 2 but never used (Low)
How it's detected: Live variable analysis shows y is dead after assignment. There is no path from line 2 to any read of y.
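The detection above can be sketched for straight-line code as a single backward scan that tracks the set of live variables. The `stmt` representation and the function `dead_assignments` are hypothetical, and the sketch assumes `result` is the program's output and therefore live at exit. A real analyzer would run this over a CFG with a fixpoint rather than a flat list.

```ocaml
module S = Set.Make (String)

(* Simplified statement: one assigned variable and the variables it reads. *)
type stmt = { line : int; def : string; uses : string list }

(* Walk the statements from last to first. An assignment is dead when its
   target is not live immediately after it. *)
let dead_assignments ~live_out stmts =
  let step stmt (live, dead) =
    let dead =
      if S.mem stmt.def live then dead else (stmt.line, stmt.def) :: dead
    in
    (* Classic liveness transfer: kill the definition, then add the uses. *)
    let live = S.union (S.remove stmt.def live) (S.of_list stmt.uses) in
    (live, dead)
  in
  let _, dead = List.fold_right step stmts (S.of_list live_out, []) in
  dead

let program = [
  { line = 1; def = "x"; uses = [] };          (* x = 5          *)
  { line = 2; def = "y"; uses = [] };          (* y = 10         *)
  { line = 3; def = "result"; uses = ["x"] };  (* result = x + 1 *)
]

(* Assumption: result is observed after the snippet ends. *)
let dead = dead_assignments ~live_out:["result"] program
```

With these inputs, `dead` contains only the assignment to `y` on line 2, matching the finding above.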

Why Dead Code Matters

Dead code isn't just messy — it can hide bugs. An unused variable might indicate a missing operation. Unreachable code might be a guard that was supposed to execute. Dead stores waste computation. Flagging these issues helps developers maintain cleaner, more correct codebases.

Configurable Pipelines

A pipeline defines which passes to run, in what order, with what options. Think of it like a build system for analysis: each pass is a task, and the pipeline orchestrates them.

Pipeline Architecture
┌──────────────┐
│  Source Code  │
└──────┬───────┘
       │
       ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Pass 1:     │    │  Pass 2:     │    │  Pass 3:     │
│  Taint       │───▶│  Sign        │───▶│  Dead Code   │
│  Analysis    │    │  Analysis    │    │  Detector    │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────┐
│              Finding Aggregator                      │
│  Collect, deduplicate, sort by severity              │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
              ┌──────────────┐
              │    Report    │
              └──────────────┘
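The architecture above can be sketched in a few lines, assuming each pass is a named function from source text to a list of findings. The `pass` record, `run_pipeline`, and the two stub passes are hypothetical stand-ins that return canned findings. Deduplication is left out to keep the sketch short.

```ocaml
type severity = Critical | High | Medium | Low | Info

type finding = {
  id : string; pass : string; severity : severity;
  line : int; message : string; category : string;
}

(* Assumption: a pass is a named function from source text to findings. *)
type pass = { name : string; run : string -> finding list }

let severity_rank = function
  | Critical -> 0 | High -> 1 | Medium -> 2 | Low -> 3 | Info -> 4

(* Run every pass in order, collect all findings, most severe first. *)
let run_pipeline passes source =
  passes
  |> List.concat_map (fun p -> p.run source)
  |> List.stable_sort (fun a b ->
         compare (severity_rank a.severity) (severity_rank b.severity))

(* Stub passes standing in for real analyses. *)
let dead_code_pass = {
  name = "dead-code";
  run = (fun _ ->
    [ { id = "d1"; pass = "dead-code"; severity = Low; line = 7;
        message = "Unused variable: 'unused'"; category = "unused-variable" } ]);
}

let taint_pass = {
  name = "taint";
  run = (fun _ ->
    [ { id = "t1"; pass = "taint"; severity = Critical; line = 5;
        message = "tainted query reaches db.exec()";
        category = "sql-injection" } ]);
}

let results = run_pipeline [dead_code_pass; taint_pass] "source text"
```

Even though the dead-code pass runs first, the Critical taint finding comes out first because the aggregator sorts by severity.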

Parallel Execution

Independent passes (for example, taint and sign analysis) can run in parallel for speed. Dependent passes (for example, inlining before interprocedural analysis) must run in sequence. The pipeline scheduler has to respect these constraints.
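One way to respect these constraints is to group passes into stages: every pass in a stage has all its dependencies satisfied by earlier stages, so passes within a stage are candidates for parallel execution. The sketch below only computes the stages and does not run anything in parallel. The `deps` field and the `stages` function are assumptions for illustration.

```ocaml
(* Hypothetical pass descriptor: a name plus the passes it depends on. *)
type pass = { name : string; deps : string list }

(* Repeatedly peel off the passes whose dependencies are all complete. *)
let stages passes =
  let rec go completed remaining acc =
    if remaining = [] then List.rev acc
    else
      let ready, blocked =
        List.partition
          (fun p -> List.for_all (fun d -> List.mem d completed) p.deps)
          remaining
      in
      if ready = [] then failwith "cyclic or missing dependency"
      else
        let names = List.map (fun p -> p.name) ready in
        go (names @ completed) blocked (names :: acc)
  in
  go [] passes []

let plan =
  stages [
    { name = "taint"; deps = [] };
    { name = "sign"; deps = [] };
    { name = "inlining"; deps = [] };
    { name = "interprocedural"; deps = ["inlining"] };
  ]
```

In this plan, taint, sign, and inlining form the first stage, and interprocedural waits for the second.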

Stop on Critical

In CI/CD, you might want to stop the pipeline immediately when a Critical finding is detected — no point running more passes if deployment is already blocked. This is a configurable option.
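A minimal sketch of this option, assuming passes run one after another and the check happens after each pass finishes. The `config` record with its `stop_on_critical` flag, the `pass` record, and the stub passes are hypothetical. The function also returns the names of the passes that actually ran, so a report can say what was skipped.

```ocaml
type severity = Critical | High | Medium | Low | Info

type finding = {
  id : string; pass : string; severity : severity;
  line : int; message : string; category : string;
}

type pass = { name : string; run : string -> finding list }

(* Hypothetical pipeline option. *)
type config = { stop_on_critical : bool }

let has_critical findings =
  List.exists (fun f -> f.severity = Critical) findings

(* Returns the collected findings and the names of the passes that ran. *)
let run_pipeline config passes source =
  let rec go findings executed = function
    | [] -> (findings, List.rev executed)
    | p :: rest ->
      let found = p.run source in
      let findings = findings @ found in
      let executed = p.name :: executed in
      if config.stop_on_critical && has_critical found
      then (findings, List.rev executed)
      else go findings executed rest
  in
  go [] [] passes

(* Stub passes standing in for real analyses. *)
let mk pass severity line category message =
  { id = Printf.sprintf "%s:%d" pass line;
    pass; severity; line; message; category }

let taint = {
  name = "taint";
  run = (fun _ ->
    [ mk "taint" Critical 5 "sql-injection" "tainted query reaches db.exec()" ]);
}

let sign = {
  name = "sign";
  run = (fun _ -> [ mk "sign" High 6 "div-by-zero" "divisor is Zero" ]);
}

let findings, executed =
  run_pipeline { stop_on_critical = true } [taint; sign] "source text"
```

With the flag set, the sign pass never runs because the taint pass already reported a Critical finding. With the flag cleared, both passes run and both findings are returned.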

Analysis Reporting

The report is the user-facing output of your tool. The same findings can be presented in multiple formats: grouped by severity for triage, by pass for understanding, or as structured data (JSON/SARIF) for integration with CI/CD tools.

Report Formats
Critical (1)
  L5  SQL Injection: tainted query reaches db.exec()
High (1)
  L6  Division by zero: divisor is Zero
Medium (1)
  L3  Weak hash: md5() is deprecated
Low (1)
  L7  Unused variable: 'unused'
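A severity-grouped report like the one above can be produced with a filter per level. This is a sketch under the assumption that the report is plain text with a count in each header. The function name `report` and the exact layout are illustrative.

```ocaml
type severity = Critical | High | Medium | Low | Info

type finding = {
  id : string; pass : string; severity : severity;
  line : int; message : string; category : string;
}

let severity_to_string = function
  | Critical -> "Critical" | High -> "High" | Medium -> "Medium"
  | Low -> "Low" | Info -> "Info"

(* One header per non-empty severity level, followed by its findings. *)
let report findings =
  [Critical; High; Medium; Low; Info]
  |> List.filter_map (fun sev ->
         match List.filter (fun f -> f.severity = sev) findings with
         | [] -> None
         | group ->
           let header =
             Printf.sprintf "%s (%d)" (severity_to_string sev)
               (List.length group)
           in
           let rows =
             List.map
               (fun f -> Printf.sprintf "  L%d  %s" f.line f.message)
               group
           in
           Some (String.concat "\n" (header :: rows)))
  |> String.concat "\n"

let sample = [
  { id = "d1"; pass = "dead-code"; severity = Low; line = 7;
    message = "Unused variable: 'unused'"; category = "unused-variable" };
  { id = "t1"; pass = "taint"; severity = Critical; line = 5;
    message = "SQL Injection: tainted query reaches db.exec()";
    category = "sql-injection" };
]

let text = report sample
let () = print_endline text
```

Grouping by pass or by location follows the same pattern with a different key, which is why the same findings can feed several views.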

Real-World Integration

Tools like GitHub Code Scanning use SARIF (Static Analysis Results Interchange Format) — a JSON standard for analysis results. Your reporter module converts internal findings into this kind of structured output, making your tool compatible with the broader ecosystem.
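As a rough sketch of what that conversion involves, the code below emits a minimal SARIF 2.1.0 log using only the standard library. Several things here are assumptions: the mapping from this module's severities to SARIF's `error`, `warning`, and `note` levels, the use of `category` as the `ruleId`, and the `uri` argument, since the finding record carries no file name. A production reporter would use a JSON library and fill in more of the format.

```ocaml
type severity = Critical | High | Medium | Low | Info

type finding = {
  id : string; pass : string; severity : severity;
  line : int; message : string; category : string;
}

(* Assumed mapping onto SARIF result levels. *)
let sarif_level = function
  | Critical | High -> "error"
  | Medium -> "warning"
  | Low | Info -> "note"

(* Minimal JSON string escaping. *)
let escape s =
  let b = Buffer.create (String.length s) in
  String.iter
    (function
      | '"' -> Buffer.add_string b "\\\""
      | '\\' -> Buffer.add_string b "\\\\"
      | '\n' -> Buffer.add_string b "\\n"
      | c when Char.code c < 0x20 ->
        Buffer.add_string b (Printf.sprintf "\\u%04x" (Char.code c))
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* One SARIF result object; [uri] names the analyzed file. *)
let sarif_result ~uri f =
  let location =
    Printf.sprintf
      {|{"physicalLocation":{"artifactLocation":{"uri":"%s"},"region":{"startLine":%d}}}|}
      (escape uri) f.line
  in
  Printf.sprintf
    {|{"ruleId":"%s","level":"%s","message":{"text":"%s"},"locations":[%s]}|}
    (escape f.category) (sarif_level f.severity) (escape f.message) location

(* A log with a single run for one tool. *)
let sarif_log ~tool ~uri findings =
  Printf.sprintf
    {|{"version":"2.1.0","runs":[{"tool":{"driver":{"name":"%s"}},"results":[%s]}]}|}
    (escape tool)
    (String.concat "," (List.map (sarif_result ~uri) findings))

let sample =
  { id = "t1"; pass = "taint"; severity = Critical; line = 5;
    message = "tainted query reaches db.exec()"; category = "sql-injection" }

let () = print_endline (sarif_log ~tool:"capstone-analyzer" ~uri:"app.py" [sample])
```

The tool name `capstone-analyzer` and the file `app.py` are placeholders. The key point is that the reporter is a pure function from internal findings to an external format, so supporting another format means adding another function, not changing the passes.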

Ready to Build Your Own?

You now understand how multi-pass pipelines work, how findings are structured and aggregated, how dead code detection combines analyses, and how configurable pipelines and reporters turn raw results into actionable output. Time to build the integration layer in OCaml.