Software analysis

  • Program point: any value of the program counter (any line of code)
  • Invariant: property that holds at a program point across all inputs/executions
  • Types of program analysis:
    • Static analysis: done at compile time, inspects code without execution
      • sound but often incomplete
    • Dynamic analysis: done at run-time, monitors executions
      • complete for observed runs but not sound
    • Hybrid: combination of static & dynamic analysis
  • Soundness: never reports something as correct when it is actually wrong (no false negatives)
  • Completeness: never reports something as wrong when it is actually correct (no false positives)
  • Control flow graph (CFG): a diagram that shows all possible paths a program can take when it runs
    • Node: basic block of code (a straight sequence of statements with no jumps)
    • Edge: a possible flow of control between blocks (created by if, while, goto statements)
  • Normalization: break complex expressions into single steps & use fresh variables
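    For instance, a compound assignment can be split into single-operation steps with fresh temporaries (a minimal Python sketch; names are illustrative):

      a, b, c = 2, 3, 4

      # before normalization: one compound statement
      x = (a + b) * c

      # after normalization: one operation per step, fresh temporaries t1, t2
      t1 = a + b
      t2 = t1 * c
      x = t2
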
  • False positive: test falsely claims a problem
  • False negative: test falsely misses a problem
  • Precision: how many of the reported issues are truly real, precision = TP / (TP + FP) (inversely related to the false positive rate)
  • Recall: how many of the real issues were successfully reported, recall = TP / (TP + FN) (inversely related to the false negative rate)
  • F measure: one number that represents both precision & recall, F = 2 · precision · recall / (precision + recall) (their harmonic mean); see the worked example below
    • higher number = better balance of precision & recall (not the same as soundness)
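    A quick worked example in Python (the TP/FP/FN counts are made up for illustration):

      tp, fp, fn = 8, 2, 4                 # true positives, false positives, false negatives

      precision = tp / (tp + fp)           # fraction of reports that are real -> 0.80
      recall = tp / (tp + fn)              # fraction of real issues reported  -> 0.67
      f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean

      print(round(precision, 2), round(recall, 2), round(f_measure, 2))   # 0.8 0.67 0.73
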
  • Static analysis workflow: build an intermediate representation (CFG/AST) → choose an abstract domain to approximate program states → apply transfer functions per statement → iterate to a fixed point to converge on facts
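    A minimal sketch of that workflow in Python, assuming a hand-built two-node CFG and a toy "sign" abstract domain (POS/TOP); the analyzed program is x = 1 followed by a loop doing x = x + 1:

      TOP = "TOP"          # abstract value meaning "sign unknown"

      def join(a, b):
          """Least upper bound of two abstract facts (None = unreached)."""
          if a is None: return b
          if b is None: return a
          return a if a == b else TOP

      # CFG: node -> successor nodes, for  x = 1; while ...: x = x + 1
      cfg = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}

      def transfer(node, fact):
          """Transfer function: how each node transforms the sign of x."""
          if node == "entry":
              return "POS"                                 # x = 1
          if node == "loop" and fact is not None:
              return "POS" if fact == "POS" else TOP       # x = x + 1
          return fact

      # Iterate to a fixed point: push facts along edges until stable
      facts = {n: None for n in cfg}
      facts["entry"] = "POS"
      changed = True
      while changed:
          changed = False
          for node, succs in cfg.items():
              out = transfer(node, facts[node])
              for s in succs:
                  merged = join(facts[s], out)
                  if merged != facts[s]:
                      facts[s], changed = merged, True

      print(facts)   # {'entry': 'POS', 'loop': 'POS', 'exit': 'POS'}
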
  • Undecidability limits: general program properties (e.g., termination) cannot be decided automatically, so no practical analyzer can be both sound and complete while always terminating
  • Trade-offs: tools usually pick sound but incomplete (no false negatives, some false positives) or complete but unsound (every reported issue is real, but some bugs may be missed), depending on risk tolerance
  • Consumers of analysis: compilers (optimization and warnings), QA/security tooling, and IDEs surface these results during development

Software testing

  • Software testing: the process used to assess the completeness, correctness, and quality of software; it checks the following:
    • Specification: does the system have all the features advertised/specified?
    • Functionality: do those features actually work?
    • Performance: how well does it run under different conditions?
  • “Good” test:
    • high probability of finding an error
    • not redundant
    • should be “best of breed”
    • should be neither too simple nor too complex
  • Black-box testing: based on requirements/functionality, no knowledge of code implementation
  • Types of black-box testing:
    • Requirements-based: pick at least one test from each requirement category
    • Equivalence partitioning:
      1. identify classes of inputs with same behaviour
      2. test at least one member
      3. assume the rest of the members will be the same
    • Boundary value analysis: test conditions at and near input bounds, a likely source of programmer error (< vs <=, etc.); see the sketch after this list
    • Positive/negative: checks both good/bad results
    • Decision tables
    • State-based: based on object state diagrams
    • Compatibility testing: run the software in different environments
      • e.g. open a website on different browsers/devices (firefox, mobile, …)
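    A sketch combining equivalence partitioning and boundary value analysis, assuming a hypothetical validator that accepts ages 0 to 120:

      def is_valid_age(n):                # hypothetical function under test
          return 0 <= n <= 120

      # Equivalence partitioning: one representative per input class
      assert is_valid_age(-5) is False    # class: below range
      assert is_valid_age(30) is True     # class: in range
      assert is_valid_age(200) is False   # class: above range

      # Boundary value analysis: probe at and next to each bound (< vs <=)
      for n, expected in [(-1, False), (0, True), (120, True), (121, False)]:
          assert is_valid_age(n) is expected
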
  • White-box testing: use knowledge of the code to write the tests
  • Types of white-box testing:
    • Statement testing: test single statements
    • Branch testing: make sure each possible outcome from a condition is tested at least once
    • Loop testing: check for infinite loops, performance, loop initialization, uninitialized variables
    • Path testing: make sure all paths are executed
    • Exhaustive testing: test every possible path
    • Selective testing: choose & test specific paths (the most likely for a user to encounter)
  • Code coverage: how much of the code is tested by a given test suite
    • Function coverage: checks if every function in the program was called at least once
    • Statement coverage: checks if every line of code was executed at least once
    • Branch coverage: checks if every possible true/false path in decisions was tested
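    A small Python illustration of why branch coverage is stricter than statement coverage (the function is hypothetical):

      def absolute(x):
          if x < 0:        # an if with no else
              x = -x
          return x

      assert absolute(-3) == 3   # executes every statement: 100% statement coverage
      assert absolute(5) == 5    # needed for branch coverage: the if's false branch
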
  • Mutation testing: introduce small changes (mutants) into the source code, then run the test suite to see if the changes are detected
    • if output differs → mutant is killed
    • if output same → mutant survives (weak test suite)
    • Arithmetic mutations: change math ops (+ → -, * → /)
    • Boolean mutations: change logical ops (&& → ||, == → !=)
    • Statement mutations: remove/replace statements
    • Variable mutations: rename/replace variables
    • Method call mutations: modify calls or parameters
  • Mutation score: killed mutants / total mutants; a higher score means a stronger test suite (see the sketch below)
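    A minimal mutation-testing sketch in Python (one hand-written arithmetic mutant; real tools generate mutants automatically):

      def add(a, b):
          return a + b

      def add_mutant(a, b):
          return a - b                     # arithmetic mutation: + -> -

      def suite_passes(f):
          # note: f(2, 2) == 4 alone would NOT kill a '*' mutant (2 * 2 == 4)
          return f(2, 2) == 4 and f(0, 3) == 3

      killed = suite_passes(add) and not suite_passes(add_mutant)
      score = (1 if killed else 0) / 1     # mutation score = killed / total mutants
      print(score)                         # 1.0: this mutant is detected
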
  • Types of software testing:
    • Unit testing: each module is tested individually (white-box, by developers)
    • Integration testing: focus on errors at module interfaces
    • System testing: entire system is tested (alpha: in-house team, beta: external users)
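    A minimal unit test sketch using Python's unittest (the module and cases are illustrative):

      import unittest

      def divide(a, b):                    # hypothetical module under test
          if b == 0:
              raise ValueError("division by zero")
          return a / b

      class DivideTest(unittest.TestCase): # unit test: one module in isolation
          def test_normal_division(self):
              self.assertEqual(divide(10, 4), 2.5)

          def test_zero_divisor_is_rejected(self):
              with self.assertRaises(ValueError):
                  divide(1, 0)

      if __name__ == "__main__":
          unittest.main()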

Random testing

  • Random testing/fuzzing: program is bombarded with random, unexpected, or malformed data
  • Fuzzing approaches:
    • Generation-based: create inputs from scratch, either purely at random or from a model of the input format (first-generation fuzzing was purely random)
    • Mutation-based: mutate real seeds to explore neighbours (second generation)
    • Coverage-guided: keep mutations that explore new paths or behaviours (third generation feedback loop)
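    A minimal mutation-based fuzzing loop in Python (the target here is json.loads; the seed and mutation strategy are illustrative):

      import json, random

      def mutate(seed: bytes) -> bytes:
          """Flip a few random bytes of the seed to get a nearby input."""
          data = bytearray(seed)
          for _ in range(random.randint(1, 4)):
              data[random.randrange(len(data))] = random.randrange(256)
          return bytes(data)

      seed = b'{"name": "alice", "age": 30}'
      for _ in range(10_000):
          candidate = mutate(seed)
          try:
              json.loads(candidate)        # target under test
          except ValueError:
              pass                         # malformed input properly rejected
          except Exception as exc:         # anything else is a potential finding
              print("potential bug:", exc, candidate)
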
  • Coverage-guided fuzzers:
    • AFL: external process that mutates queue entries, trims cases, and keeps inputs that hit new edges
    • LibFuzzer: in-process library that drives fuzz targets with compiler instrumentation to guide mutations
    • OSS-Fuzz / ClusterFuzz: Google-hosted service + infrastructure that runs fuzzers continuously for OSS projects
  • Testing concurrent programs: requires not only specific program inputs, but also specific thread schedules
    • add random delays (sleep(x)) to perturb thread schedules, hoping some schedule triggers a bug (see the sketch below)
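    A Python sketch of the idea: a racy read-modify-write counter where random sleeps widen the race window so lost updates show up reliably:

      import random, threading, time

      counter = 0

      def worker():
          global counter
          for _ in range(200):
              tmp = counter                        # read
              time.sleep(random.random() / 1000)   # random delay perturbs the schedule
              counter = tmp + 1                    # write back, may clobber a peer's update

      threads = [threading.Thread(target=worker) for _ in range(2)]
      for t in threads: t.start()
      for t in threads: t.join()
      print(counter)   # expected 400; usually far less once the schedule is shuffled
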
  • Bug types fuzzing exposes: input validation failures, race conditions, boundary/overflow issues, poor error handling, resource leaks, compatibility gaps, security bugs, and performance regressions
  • Pros: simple to start, format-agnostic, can uncover deep/rare failures
  • Cons: uneven coverage without guidance, hard-to-reproduce crashes, and potentially large time/resource costs
  • Depth of a concurrency bug: the number of ordering constraints (e.g. "event A must run before event B") a schedule has to satisfy to trigger the bug

Delta debugging

  • Delta debugging: a method that automatically minimizes a test case that produces a bug
  • Why simplify failing inputs: smaller cases are easier to explain, reproduce faster, reduce state space, and often subsume duplicate reports
  • Global minimum: the smallest test case overall that makes the test fail
  • Local minimum: a failing test case that cannot be shrunk any further by removing its own elements
    • removing any one or more items from it would make the test stop failing
  • n-minimal: if making between 1 and n changes from c causes the test to no longer fail
  • 1-minimal: if making any single change from c causes the test to no longer fail
  • 2-minimal: if making any 2 changes from c causes the test to no longer fail
  • k-minimal: if making any k changes from c causes the test to no longer fail
    • different from n-minimal, since n-minimal covers any 1 to n changes, and this is exactly k changes
  • Algorithm 1 (linear reduction): removes 1 element at a time and tests
    • slow but simple; trims pieces one at a time until nothing more can be removed
  • Algorithm 2 (binary search reduction): splits into 2 halves, tests each half, and recursively continues on the failing part
    • faster because it cuts the search space in half each time
  • Algorithm 3 (hierarchical/balanced reduction): if both halves pass, it means the failure needs elements from both sides
    • keep one half while reducing the other, then swap
    • combine the two reduced halves into a small failing input
  • Test function: treat the failing configuration as a set of changes and define test(c) ∈ {✓ pass, ✗ fail, ? unresolved} for each subset c; delta debugging repeatedly queries this oracle
  • ddmin loop: partition the failing input c into n disjoint subsets (initially n = 2), test each subset and its complement; if a smaller failing chunk is found, recurse on it, otherwise increase n (finer granularity) until the result is 1-minimal (see the sketch below)
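    A compact ddmin sketch in Python; test is the oracle and returns True when the given subset still reproduces the failure (the toy "bug" here needs both 3 and 7 present):

      def ddmin(items, test):
          assert test(items), "the full input must fail"
          n = 2
          while len(items) >= 2:
              chunk = len(items) // n
              subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
              reduced = False
              for i, subset in enumerate(subsets):
                  complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                  if test(subset):                  # a single chunk still fails
                      items, n, reduced = subset, 2, True
                      break
                  if n > 2 and test(complement):    # dropping one chunk still fails
                      items, n, reduced = complement, max(n - 1, 2), True
                      break
              if not reduced:
                  if n >= len(items):               # finest granularity reached: 1-minimal
                      break
                  n = min(n * 2, len(items))        # otherwise refine the partition
          return items

      failing = list(range(10))
      print(ddmin(failing, lambda s: 3 in s and 7 in s))   # -> [3, 7]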