How Modern Compilers Work: From Code to Machine Instructions

Every line of code you write — whether in C++, Rust, Go, or TypeScript — eventually becomes machine instructions that a processor executes. But the journey from human-readable source code to binary machine code is one of computer science's most sophisticated transformations. Modern compilers like GCC, LLVM, V8 (JavaScript), and Roslyn (C#) perform hundreds of complex analyses and optimizations to produce fast, secure, and efficient executables.

This post explores the internal architecture of modern compilers. We will trace the complete pipeline: lexical analysis (scanning), parsing (syntax analysis), semantic analysis, intermediate representation (IR) generation, optimization, and final code generation. We will also examine the difference between ahead-of-time (AOT) compilation used by C++ and just-in-time (JIT) compilation used by JavaScript and Java.

The Compilation Pipeline Overview

All traditional compilers follow a multi-stage pipeline. Each stage transforms the program representation, gradually moving from human-friendly text to machine-friendly bytes.

Source Code → Lexical Analysis → Parsing → Semantic Analysis → 
IR Generation → Optimization → Code Generation → Machine Code

Modern compilers often repeat optimization stages and maintain multiple intermediate representations. LLVM, for example, uses over 20 different optimization passes on its IR before generating assembly.

Lexical Analysis: Tokenization

The first stage converts a raw string of characters into a sequence of tokens. Tokens are meaningful chunks: keywords (if, while, return), identifiers (count, total), literals (42, "hello"), operators (+, *, =), and punctuation (;, {, }).

The lexer (also called scanner or tokenizer) reads characters one by one, applying pattern matching rules. Most modern lexers use regular expressions or finite automata generated by tools like Lex, Flex, or handwritten recursive descent.

// Example source code
int sum = a + b * 10;

// Corresponding tokens
[KEYWORD_INT]    "int"
[IDENTIFIER]     "sum"
[OPERATOR_ASSIGN] "="
[IDENTIFIER]     "a"
[OPERATOR_PLUS]  "+"
[IDENTIFIER]     "b"
[OPERATOR_MUL]   "*"
[LITERAL_INT]    "10"
[SEMICOLON]      ";"

A simplified lexer implementation:

class Lexer {
  constructor(source) {
    this.source = source;
    this.position = 0;
    this.tokens = [];
  }
  
  tokenize() {
    while (this.position < this.source.length) {
      const char = this.source[this.position];
      
      // Skip whitespace
      if (/\s/.test(char)) {
        this.position++;
        continue;
      }
      
      // Match keywords and identifiers
      if (/[a-zA-Z_]/.test(char)) {
        let value = '';
        while (this.position < this.source.length && /[a-zA-Z0-9_]/.test(this.source[this.position])) {
          value += this.source[this.position];
          this.position++;
        }
        
        const keywords = ['if', 'else', 'while', 'return', 'int', 'float'];
        const type = keywords.includes(value) ? 'KEYWORD' : 'IDENTIFIER';
        this.tokens.push({ type, value });
        continue;
      }
      
      // Match numbers
      if (/[0-9]/.test(char)) {
        let value = '';
        while (this.position < this.source.length && /[0-9]/.test(this.source[this.position])) {
          value += this.source[this.position];
          this.position++;
        }
        this.tokens.push({ type: 'NUMBER', value: parseInt(value) });
        continue;
      }
      
      // Match operators
      const operators = { '+': 'PLUS', '-': 'MINUS', '*': 'MULTIPLY', '/': 'DIVIDE', '=': 'ASSIGN' };
      if (operators[char]) {
        this.tokens.push({ type: operators[char], value: char });
        this.position++;
        continue;
      }
      
      // Match punctuation
      if (char === ';') {
        this.tokens.push({ type: 'SEMICOLON', value: ';' });
        this.position++;
        continue;
      }
      
      throw new Error(`Unexpected character: ${char}`);
    }
    return this.tokens;
  }
}

Parsing: Syntax Analysis

The parser consumes tokens from the lexer and builds an Abstract Syntax Tree (AST) — a tree representation of the program's grammatical structure. Parsers verify that the token sequence follows the language's grammar rules (e.g., parentheses match, expressions are valid).

Most parsers fall into two categories:

Top-down parsers (recursive descent) start from the root grammar rule and recursively match productions. They are intuitive to write by hand and provide excellent error messages. Most production compilers use recursive descent with lookahead (LL parsing).

Bottom-up parsers start from the tokens and reduce them to grammar rules. They are more powerful but harder to debug. Yacc and Bison generate bottom-up parsers.

class Parser {
  constructor(tokens) {
    this.tokens = tokens;
    this.position = 0;
  }
  
  peek() {
    return this.tokens[this.position];
  }
  
  consume(expectedType) {
    const token = this.peek();
    if (!token || token.type !== expectedType) {
      throw new Error(`Expected ${expectedType}, got ${token?.type}`);
    }
    this.position++;
    return token;
  }
  
  // Grammar: program = statement*
  parseProgram() {
    const statements = [];
    while (this.position < this.tokens.length) {
      statements.push(this.parseStatement());
    }
    return { type: 'Program', statements };
  }
  
  // Grammar: statement = variableDeclaration | expressionStatement
  parseStatement() {
    if (this.peek().type === 'KEYWORD' && this.peek().value === 'int') {
      return this.parseVariableDeclaration();
    }
    const expr = this.parseExpression();
    this.consume('SEMICOLON');
    return { type: 'ExpressionStatement', expression: expr };
  }
  
  // Grammar: variableDeclaration = 'int' IDENTIFIER '=' expression ';'
  parseVariableDeclaration() {
    this.consume('KEYWORD'); // 'int'
    const name = this.consume('IDENTIFIER').value;
    this.consume('ASSIGN');
    const value = this.parseExpression();
    this.consume('SEMICOLON');
    return { type: 'VariableDeclaration', name, value };
  }
  
  // Grammar: expression = addition
  parseExpression() {
    return this.parseAddition();
  }
  
  // Grammar: addition = multiplication ('+' multiplication)*
  parseAddition() {
    let left = this.parseMultiplication();
    while (this.peek()?.type === 'PLUS') {
      this.consume('PLUS');
      const right = this.parseMultiplication();
      left = { type: 'BinaryExpression', operator: '+', left, right };
    }
    return left;
  }
  
  // Grammar: multiplication = primary ('*' primary)*
  parseMultiplication() {
    let left = this.parsePrimary();
    while (this.peek()?.type === 'MULTIPLY') {
      this.consume('MULTIPLY');
      const right = this.parsePrimary();
      left = { type: 'BinaryExpression', operator: '*', left, right };
    }
    return left;
  }
  
  // Grammar: primary = NUMBER | IDENTIFIER
  parsePrimary() {
    const token = this.peek();
    if (token.type === 'NUMBER') {
      this.consume('NUMBER');
      return { type: 'NumberLiteral', value: token.value };
    }
    if (token.type === 'IDENTIFIER') {
      this.consume('IDENTIFIER');
      return { type: 'Identifier', name: token.value };
    }
    throw new Error(`Unexpected token: ${token.type}`);
  }
}

For int sum = a + b * 10;, the parser generates this AST:

Program
└── VariableDeclaration (name: "sum")
    └── BinaryExpression (operator: "+")
        ├── Identifier ("a")
        └── BinaryExpression (operator: "*")
            ├── Identifier ("b")
            └── NumberLiteral (10)

Semantic Analysis

Parsing verifies syntax but not meaning. Semantic analysis checks that the program makes sense: variables are defined before use, types are compatible, functions exist, and scopes are respected. This stage typically uses symbol tables and type checking.

class SemanticAnalyzer {
  constructor() {
    this.symbolTable = new Map(); // name → { type, kind }
    this.errors = [];
  }
  
  analyze(node) {
    if (node.type === 'Program') {
      for (const stmt of node.statements) {
        this.analyze(stmt);
      }
    }
    
    if (node.type === 'VariableDeclaration') {
      // Check if variable already declared in current scope
      if (this.symbolTable.has(node.name)) {
        this.errors.push(`Variable '${node.name}' already declared`);
      }
      
      // Analyze initializer expression
      const valueType = this.analyzeExpression(node.value);
      
      // Type checking (int = int)
      if (valueType !== 'int') {
        this.errors.push(`Cannot initialize int with ${valueType}`);
      }
      
      this.symbolTable.set(node.name, { type: 'int', kind: 'variable' });
      return 'int';
    }
    
    if (node.type === 'ExpressionStatement') {
      return this.analyzeExpression(node.expression);
    }
  }
  
  analyzeExpression(expr) {
    if (expr.type === 'NumberLiteral') {
      return 'int';
    }
    
    if (expr.type === 'Identifier') {
      if (!this.symbolTable.has(expr.name)) {
        this.errors.push(`Variable '${expr.name}' not declared`);
        return 'unknown';
      }
      return this.symbolTable.get(expr.name).type;
    }
    
    if (expr.type === 'BinaryExpression') {
      const leftType = this.analyzeExpression(expr.left);
      const rightType = this.analyzeExpression(expr.right);
      
      if (leftType !== 'int' || rightType !== 'int') {
        this.errors.push(`Binary operator ${expr.operator} requires integers`);
      }
      return 'int';
    }
  }
}

Intermediate Representation (IR)

After semantic analysis, compilers convert the AST into an Intermediate Representation (IR). IR is platform-agnostic, simpler than source code, and optimized for analysis and transformation. LLVM's IR is the industry gold standard.

LLVM IR example for sum = a + b * 10:

; LLVM IR
define i32 @compute(i32 %a, i32 %b) {
entry:
  %mul = mul i32 %b, 10
  %sum = add i32 %a, %mul
  ret i32 %sum
}

IR provides several advantages:

  • Single Static Assignment (SSA) form means each variable is assigned exactly once. This simplifies optimizations because data flow is explicit.

  • Infinite virtual registers free compilers from hardware constraints during optimization.

  • Explicit control flow using basic blocks and branch instructions.

Optimization

Optimization is where modern compilers distinguish themselves. Over 50-70% of a compiler's code handles optimization. Transformations run repeatedly, each improving the program's speed, size, or power consumption.

Constant Folding

Compute constant expressions at compile time instead of runtime:

// Before optimization
x = 5 * 3 + 2 * 4

// After constant folding
x = 23

Constant Propagation

Replace variable references with known constant values:

// Before
const int x = 42;
int y = x + 10;

// After
int y = 52;  // x eliminated

Dead Code Elimination

Remove code that never executes or whose results are never used:

// Before
int x = 10;
int y = 20;
if (false) {
  printf("Never runs");
}
return x;

// After
int x = 10;
return x;  // y and if-block removed

Loop Optimization

Loop unrolling duplicates loop bodies to reduce overhead:

// Before
for (int i = 0; i < 4; i++) {
  arr[i] *= 2;
}

// After (unrolled)
arr[0] *= 2;
arr[1] *= 2;
arr[2] *= 2;
arr[3] *= 2;

Loop invariant code motion hoists calculations that don't change:

// Before
for (int i = 0; i < n; i++) {
  arr[i] = arr[i] + sqrt(x) * 2;  // sqrt(x)*2 computed 1000x
}

// After
const int factor = sqrt(x) * 2;   // Computed once
for (int i = 0; i < n; i++) {
  arr[i] = arr[i] + factor;
}

Inlining

Replace function calls with the function body:

// Before
int square(int x) { return x * x; }
int result = square(5) + square(6);

// After (inlined)
int result = (5 * 5) + (6 * 6);

GCC and Clang have over 200 optimization flags. -O2 (standard optimizations) and -O3 (aggressive optimizations) enable different optimization passes.

Code Generation

The final stage converts optimized IR into machine code for a specific architecture (x86-64, ARM, RISC-V). Code generation involves:

Instruction selection maps IR operations to CPU instructions:

; LLVM IR
%sum = add i32 %a, %b

; x86-64 assembly
add eax, ebx

; ARM64 assembly
add w0, w0, w1

Register allocation assigns IR virtual registers to physical CPU registers (x86-64 has 16, ARM64 has 31). Spilling moves values to memory when registers are exhausted.

Instruction scheduling reorders instructions to avoid pipeline stalls:

; Suboptimal (stalls on multiply)
mov eax, [memory]
imul eax, 5
add eax, ebx
mov [result], eax

; Better (independent work during multiply)
mov eax, [memory]
mov ecx, ebx    ; Work while multiply runs
imul eax, 5
add eax, ecx
mov [result], eax

AOT vs JIT Compilation

Ahead-of-Time (AOT) compilation (C, C++, Rust, Go) translates all code to machine instructions before execution. Benefits: fast startup, no runtime overhead, predictable performance. Drawbacks: longer build times, platform-specific binaries, harder optimization (unknown runtime data).

Just-in-Time (JIT) compilation (JavaScript, Java, C#) compiles code at runtime, typically in stages:

// V8 JavaScript engine pipeline (simplified)
Source Code → Parser → AST → Ignition (Interpreter) → 
Bytecode → Execution → Profiling → Hot Code → TurboFan (Optimizing JIT) → 
Optimized Machine Code

// Cold code (executes once) stays interpreted
function rarelyUsed() { return 1 + 1; }

// Hot code (executes many times) gets optimized
for (let i = 0; i < 1000000; i++) {
  sum += Math.sqrt(i);  // sqrt becomes optimized machine code
}

V8's TurboFan compiler uses speculative optimization — it assumes observed types will continue:

// V8 sees: add is always integers
function add(a, b) {
  return a + b;
}

// Optimized version assumes integer addition
// If a string appears later, deoptimization occurs (slower)

JIT advantages: profile-guided optimization (optimize hot paths based on actual usage), cross-platform (same bytecode runs everywhere), runtime code generation. Disadvantages: warmup time (code gets faster over time), memory overhead, security risks (executable memory pages).

Tiered Compilation in Java (HotSpot JVM)

Java Source → javac → Bytecode (.class) → Interpreter → 
C1 Compiler (fast, low optimization) → Profile Collection → 
C2 Compiler (slow, aggressive optimization) → Native Code

Real-World Compiler Architecture: LLVM

LLVM is the most successful modern compiler infrastructure. Its three-phase design separates frontend, optimizer, and backend:

Clang (C++ Frontend) → LLVM IR → Optimizer (30+ passes) → 
LLVM IR → X86 Backend → Machine Code
                         → ARM Backend → Machine Code
                         → WebAssembly → .wasm

This design allows LLVM to support 40+ frontends (Clang for C/C++, rustc for Rust, flang for Fortran, etc.) and 20+ backends (x86, ARM, RISC-V, WebAssembly, SPARC, etc.).

How JavaScript Works: A Case Study

JavaScript is neither purely compiled nor purely interpreted. Modern engines use multiple execution tiers:

Parser generates AST and scopes.

Ignition (interpreter) generates bytecode and executes it. This fast startup gets code running immediately.

Profiler watches executing bytecode, counting operations and tracking types.

TurboFan (optimizing compiler) recompiles hot functions to optimized machine code using speculative assumptions.

Deoptimization rolls back to interpreter when assumptions break.

// Example: function becomes hot and gets optimized
function addVectors(a, b) {
  return { x: a.x + b.x, y: a.y + b.y };
}

// Called 10,000+ times with Vector objects
// TurboFan speculatively optimizes for Vector
for (let i = 0; i < 100000; i++) {
  result = addVectors(vec1, vec2);
}

// If a different type appears, deoptimization occurs
addVectors(1, 2);  // Triggers deoptimization back to interpreter

Final Thoughts

Modern compilers are among the most complex software systems ever built. GCC has over 15 million lines of code. LLVM exceeds 10 million. Each stage — lexical analysis, parsing, semantic analysis, IR generation, optimization, code generation — contains decades of research and engineering.

Understanding compilers makes you a better programmer. You write loop invariant code because you know compilers hoist it. You use local variables because you know register allocation prefers them. You understand that JavaScript's performance varies because of JIT compilation tiers and deoptimization.

The next time you run gcc main.c or node app.js, remember: you are witnessing one of computer science's greatest achievements — the automatic transformation of human logic into machine execution, billions of times per second, on billions of devices worldwide.

Related Posts

Understanding Event Loop in Node.js at a Deep Level

The event loop is the core mechanism that enables Node.js to handle asynchronous operations efficiently despite being single-threaded. It allows Node.js to perform non-blocking I/O operations — such

Read More

Scalable System Design: Building Millions-User Applications

Building an application for a hundred users is trivial. Building for a million users requires a fundamental shift in architecture, mindset, and tooling. Scalability is not just about adding more serv

Read More

Advanced Authentication Systems: JWT, OAuth2, and Session Security

Authentication is the gateway to every application. Get it wrong, and attackers walk through your front door. Yet despite being a foundational security control, authentication remains one of the most

Read More