Software analysis

  • Program point: any value of the program counter (any line of code)
  • Invariant: property that holds at a program point across all inputs/executions
  • Types of program analysis:
    • Static analysis: done at compile time, inspects code without execution
      • sound but often incomplete
    • Dynamic analysis: done at run-time, monitors executions
      • complete for observed runs but not sound
    • Hybrid: combination of static & dynamic analysis
  • Soundness: never reports something as correct when it is actually wrong (no false negatives)
  • Completeness: never reports something as wrong when it is actually correct (no false positives)
  • Control flow graph (CFG): a diagram that shows all possible paths a program can take when it runs
    • Node: basic block of code (a straight sequence of statements with no jumps)
    • Edge: a possible flow of control between blocks (created by if, while, goto statements)
  • Normalization: break complex expressions into single steps & use fresh variables
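    For instance, a compound assignment can be split into single-operation steps with fresh temporaries (a minimal Python sketch; names are illustrative):

      a, b, c = 2, 3, 4

      # before normalization: one compound statement
      x = (a + b) * c

      # after normalization: one operation per step, fresh temporaries t1, t2
      t1 = a + b
      t2 = t1 * c
      x = t2
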
  • False positive: test falsely claims a problem
  • False negative: test falsely misses a problem
  • Precision: how many of the reported issues are truly real, precision = TP / (TP + FP) (inversely related to the false positive rate)
  • Recall: how many of the real issues were successfully reported, recall = TP / (TP + FN) (inversely related to the false negative rate)
  • F measure: one number that represents both precision & recall, F = 2 · precision · recall / (precision + recall) (their harmonic mean); see the worked example below
    • higher number = better balance of precision & recall (not the same as soundness)
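    A quick worked example in Python (the TP/FP/FN counts are made up for illustration):

      tp, fp, fn = 8, 2, 4                 # true positives, false positives, false negatives

      precision = tp / (tp + fp)           # fraction of reports that are real -> 0.80
      recall = tp / (tp + fn)              # fraction of real issues reported  -> 0.67
      f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean

      print(round(precision, 2), round(recall, 2), round(f_measure, 2))   # 0.8 0.67 0.73
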
  • Static analysis workflow: build an intermediate representation (CFG/AST) → choose an abstract domain to approximate program states → apply transfer functions per statement → iterate to a fixed point to converge on facts
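    A minimal sketch of that workflow in Python, assuming a hand-built two-node CFG and a toy "sign" abstract domain (POS/TOP); the analyzed program is x = 1 followed by a loop doing x = x + 1:

      TOP = "TOP"          # abstract value meaning "sign unknown"

      def join(a, b):
          """Least upper bound of two abstract facts (None = unreached)."""
          if a is None: return b
          if b is None: return a
          return a if a == b else TOP

      # CFG: node -> successor nodes, for  x = 1; while ...: x = x + 1
      cfg = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}

      def transfer(node, fact):
          """Transfer function: how each node transforms the sign of x."""
          if node == "entry":
              return "POS"                                 # x = 1
          if node == "loop" and fact is not None:
              return "POS" if fact == "POS" else TOP       # x = x + 1
          return fact

      # Iterate to a fixed point: push facts along edges until stable
      facts = {n: None for n in cfg}
      facts["entry"] = "POS"
      changed = True
      while changed:
          changed = False
          for node, succs in cfg.items():
              out = transfer(node, facts[node])
              for s in succs:
                  merged = join(facts[s], out)
                  if merged != facts[s]:
                      facts[s], changed = merged, True

      print(facts)   # {'entry': 'POS', 'loop': 'POS', 'exit': 'POS'}
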
  • Undecidability limits: general program properties (e.g., termination) cannot be decided automatically, so no practical analyzer can be both sound and complete while always terminating
  • Trade-offs: tools usually pick sound but incomplete (no false negatives, some false positives) or complete but unsound (every reported issue is real, but some bugs may be missed), depending on risk tolerance
  • Consumers of analysis: compilers (optimization and warnings), QA/security tooling, and IDEs surface these results during development

Software testing

  • Software testing: the process used to assess the completeness, correctness, and quality of software; it checks the following:
    • Specification: does the system have all the features advertised/specified?
    • Functionality: do those features actually work?
    • Performance: how well does it run under different conditions?
  • “Good” test:
    • high probability of finding an error
    • not redundant
    • should be “best of breed”
    • should be neither too simple nor too complex
  • Black-box testing: based on requirements/functionality, no knowledge of code implementation
  • Types of black-box testing:
    • Requirements-based: pick at least one test from each requirement category
    • Equivalence partitioning:
      1. identify classes of inputs with same behaviour
      2. test at least one member
      3. assume the rest of the members will be the same
    • Boundary value analysis: test conditions at and near input bounds, a likely source of programmer error (< vs <=, etc.); see the sketch after this list
    • Positive/negative: checks both good/bad results
    • Decision tables
    • State-based: based on object state diagrams
    • Compatibility testing: run the software in different environments
      • e.g. open a website on different browsers/devices (firefox, mobile, …)
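    A sketch combining equivalence partitioning and boundary value analysis, assuming a hypothetical validator that accepts ages 0 to 120:

      def is_valid_age(n):                # hypothetical function under test
          return 0 <= n <= 120

      # Equivalence partitioning: one representative per input class
      assert is_valid_age(-5) is False    # class: below range
      assert is_valid_age(30) is True     # class: in range
      assert is_valid_age(200) is False   # class: above range

      # Boundary value analysis: probe at and next to each bound (< vs <=)
      for n, expected in [(-1, False), (0, True), (120, True), (121, False)]:
          assert is_valid_age(n) is expected
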
  • White-box testing: use knowledge of the code to write the tests
  • Types of white-box testing:
    • Statement testing: test single statements
    • Branch testing: make sure each possible outcome from a condition is tested at least once
    • Loop testing: check for infinite loops, performance, loop initialization, uninitialized variables
    • Path testing: make sure all paths are executed
    • Exhaustive testing: test every possible path
    • Selective testing: choose & test specific paths (the most likely for a user to encounter)
  • Code coverage: how much of the code is tested by a given test suite
    • Function coverage: checks if every function in the program was called at least once
    • Statement coverage: checks if every line of code was executed at least once
    • Branch coverage: checks if every possible true/false path in decisions was tested
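    A small Python illustration of why branch coverage is stricter than statement coverage (the function is hypothetical):

      def absolute(x):
          if x < 0:        # an if with no else
              x = -x
          return x

      assert absolute(-3) == 3   # executes every statement: 100% statement coverage
      assert absolute(5) == 5    # needed for branch coverage: the if's false branch
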
  • Mutation testing: introduce small changes (mutants) into the source code, then run the test suite to see if the changes are detected
    • if output differs → mutant is killed
    • if output same → mutant survives (weak test suite)
    • Arithmetic mutations: change math ops (+ → -, * → /)
    • Boolean mutations: change logical ops (&& → ||, == → !=)
    • Statement mutations: remove/replace statements
    • Variable mutations: rename/replace variables
    • Method call mutations: modify calls or parameters
  • Mutation score: killed mutants / total mutants; a higher score means a stronger test suite (see the sketch below)
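    A minimal mutation-testing sketch in Python (one hand-written arithmetic mutant; real tools generate mutants automatically):

      def add(a, b):
          return a + b

      def add_mutant(a, b):
          return a - b                     # arithmetic mutation: + -> -

      def suite_passes(f):
          # note: f(2, 2) == 4 alone would NOT kill a '*' mutant (2 * 2 == 4)
          return f(2, 2) == 4 and f(0, 3) == 3

      killed = suite_passes(add) and not suite_passes(add_mutant)
      score = (1 if killed else 0) / 1     # mutation score = killed / total mutants
      print(score)                         # 1.0: this mutant is detected
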
  • Types of software testing:
    • Unit testing: each module is tested individually (white-box, by developers)
    • Integration testing: focus on errors at module interfaces
    • System testing: entire system is tested (alpha: in-house team, beta: external users)
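    A minimal unit test sketch using Python's unittest (the module and cases are illustrative):

      import unittest

      def divide(a, b):                    # hypothetical module under test
          if b == 0:
              raise ValueError("division by zero")
          return a / b

      class DivideTest(unittest.TestCase): # unit test: one module in isolation
          def test_normal_division(self):
              self.assertEqual(divide(10, 4), 2.5)

          def test_zero_divisor_is_rejected(self):
              with self.assertRaises(ValueError):
                  divide(1, 0)

      if __name__ == "__main__":
          unittest.main()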

Random testing

  • Random testing/fuzzing: program is bombarded with random, unexpected, or malformed data
  • Fuzzing approaches:
    • Generation-based: create inputs from scratch, either purely at random or from a model of the input format (first-generation fuzzing was purely random)
    • Mutation-based: mutate real seeds to explore neighbours (second generation)
    • Coverage-guided: keep mutations that explore new paths or behaviours (third generation feedback loop)
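    A minimal mutation-based fuzzing loop in Python (the target here is json.loads; the seed and mutation strategy are illustrative):

      import json, random

      def mutate(seed: bytes) -> bytes:
          """Flip a few random bytes of the seed to get a nearby input."""
          data = bytearray(seed)
          for _ in range(random.randint(1, 4)):
              data[random.randrange(len(data))] = random.randrange(256)
          return bytes(data)

      seed = b'{"name": "alice", "age": 30}'
      for _ in range(10_000):
          candidate = mutate(seed)
          try:
              json.loads(candidate)        # target under test
          except ValueError:
              pass                         # malformed input properly rejected
          except Exception as exc:         # anything else is a potential finding
              print("potential bug:", exc, candidate)
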
  • Coverage-guided fuzzers:
    • AFL: external process that mutates queue entries, trims cases, and keeps inputs that hit new edges
    • LibFuzzer: in-process library that drives fuzz targets with compiler instrumentation to guide mutations
    • OSS-Fuzz / ClusterFuzz: Google-hosted service + infrastructure that runs fuzzers continuously for OSS projects
  • Testing concurrent programs: requires not only specific program inputs, but also specific thread schedules
    • add random delays (sleep(x)) to perturb thread schedules, hoping some schedule triggers a bug (see the sketch below)
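    A Python sketch of the idea: a racy read-modify-write counter where random sleeps widen the race window so lost updates show up reliably:

      import random, threading, time

      counter = 0

      def worker():
          global counter
          for _ in range(200):
              tmp = counter                        # read
              time.sleep(random.random() / 1000)   # random delay perturbs the schedule
              counter = tmp + 1                    # write back, may clobber a peer's update

      threads = [threading.Thread(target=worker) for _ in range(2)]
      for t in threads: t.start()
      for t in threads: t.join()
      print(counter)   # expected 400; usually far less once the schedule is shuffled
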
  • Bug types fuzzing exposes: input validation failures, race conditions, boundary/overflow issues, poor error handling, resource leaks, compatibility gaps, security bugs, and performance regressions
  • Pros: simple to start, format-agnostic, can uncover deep/rare failures
  • Cons: uneven coverage without guidance, hard-to-reproduce crashes, and potentially large time/resource costs
  • Depth of a concurrency bug: the number of ordering constraints (e.g. "event A must run before event B") a schedule has to satisfy to trigger the bug

Delta debugging

  • Delta debugging: a method that automatically minimizes a test case that produces a bug
  • Why simplify failing inputs: smaller cases are easier to explain, reproduce faster, reduce state space, and often subsume duplicate reports
  • Global minimum: the smallest test case overall that makes the test fail
  • Local minimum: a failing test case that cannot be shrunk any further by removing its own elements
    • removing any one or more items from it would make the test stop failing
  • n-minimal: if making between 1 and n changes from c causes the test to no longer fail
  • 1-minimal: if making any single change from c causes the test to no longer fail
  • 2-minimal: if making any 2 changes from c causes the test to no longer fail
  • k-minimal: if making any k changes from c causes the test to no longer fail
    • different from n-minimal, since n-minimal covers any 1 to n changes, and this is exactly k changes
  • Algorithm 1 (linear reduction): removes 1 element at a time and tests
    • slow but simple; trims pieces one at a time until nothing more can be removed
  • Algorithm 2 (binary search reduction): splits into 2 halves, tests each half, and recursively continues on the failing part
    • faster because it cuts the search space in half each time
  • Algorithm 3 (hierarchical/balanced reduction): if both halves pass, it means the failure needs elements from both sides
    • keep one half while reducing the other, then swap
    • combine the two reduced halves into a small failing input
  • Test function: treat the failing configuration as a set of changes and define test(c) ∈ {✓ pass, ✗ fail, ? unresolved} for each subset c; delta debugging repeatedly queries this oracle
  • ddmin loop: partition the failing input c into n disjoint subsets (initially n = 2), test each subset and its complement; if a smaller failing chunk is found, recurse on it, otherwise increase n (finer granularity) until the result is 1-minimal (see the sketch below)
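    A compact ddmin sketch in Python; test is the oracle and returns True when the given subset still reproduces the failure (the toy "bug" here needs both 3 and 7 present):

      def ddmin(items, test):
          assert test(items), "the full input must fail"
          n = 2
          while len(items) >= 2:
              chunk = len(items) // n
              subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
              reduced = False
              for i, subset in enumerate(subsets):
                  complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                  if test(subset):                  # a single chunk still fails
                      items, n, reduced = subset, 2, True
                      break
                  if n > 2 and test(complement):    # dropping one chunk still fails
                      items, n, reduced = complement, max(n - 1, 2), True
                      break
              if not reduced:
                  if n >= len(items):               # finest granularity reached: 1-minimal
                      break
                  n = min(n * 2, len(items))        # otherwise refine the partition
          return items

      failing = list(range(10))
      print(ddmin(failing, lambda s: 3 in s and 7 in s))   # -> [3, 7]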