Hone logo
Hone
Problems

Building a Simple Go Compiler: From Source to Machine Code

This challenge involves creating a rudimentary compiler for a simplified programming language written in Go. Compilers are fundamental to software development, translating human-readable code into machine-executable instructions. By building a compiler, you'll gain a deeper understanding of programming language design, abstract syntax trees, and the process of code generation.

Problem Description

Your task is to implement a compiler that takes a program written in a small, custom language (let's call it "MiniLang") and generates equivalent x86-64 assembly code. This compiler will consist of several stages: lexing, parsing, abstract syntax tree (AST) generation, and code generation.

Key Requirements:

  1. Lexing (Tokenization): Convert the MiniLang source code string into a stream of tokens. Tokens represent the smallest meaningful units of the language (e.g., keywords, identifiers, operators, literals).
  2. Parsing: Take the token stream and build an Abstract Syntax Tree (AST). The AST is a hierarchical representation of the program's structure.
  3. AST Traversal and Code Generation: Traverse the AST and generate corresponding x86-64 assembly instructions.
  4. Supported MiniLang Features:
    • Integer variables: let x: i32 = 10;
    • Basic arithmetic operations: +, -, *, /
    • Assignment: x = 5;
    • Simple function definitions and calls:
      fn add(a: i32, b: i32): i32 {
          return a + b;
      }
      
      let result: i32 = add(5, 3);
      
    • Return statements: return expression;
    • Integer literals: 10, 5
    • Identifiers: x, a, b, result
    • Basic type annotations: i32

Expected Behavior:

The compiler should accept a string containing valid MiniLang code and output a string containing valid x86-64 assembly code that, when assembled and linked, would execute the logic of the MiniLang program. For simplicity, assume the output assembly will be for a Linux environment and will focus on generating the code for the main program and any defined functions. You do not need to handle standard library functions or complex control flow like loops or conditionals (beyond function calls and returns).

Edge Cases:

  • Invalid syntax in the MiniLang code (e.g., missing semicolons, mismatched parentheses). The lexer or parser should report an error.
  • Undeclared variables (if you implement variable scoping).
  • Integer overflow (for this challenge, assume standard integer sizes and don't explicitly handle overflow detection unless it naturally occurs).

Examples

Example 1: Simple Assignment and Arithmetic

Input:
let x: i32 = 10 + 5 * 2;
x = x - 3;
; Generated assembly for MiniLang program

section .data
    ; No global data needed for this simple example

section .text
global _start

_start:
    ; let x: i32 = 10 + 5 * 2;
    mov rax, 5          ; Load immediate value 5 into RAX
    mov rbx, 2          ; Load immediate value 2 into RBX
    imul rax, rbx       ; Multiply RAX by RBX (result in RAX: 10)
    add rax, 10         ; Add immediate value 10 to RAX (result in RAX: 20)
    ; Assuming x is stored in memory (e.g., at some hypothetical address)
    ; For simplicity, we'll just leave the result in RAX and proceed

    ; x = x - 3;
    ; Assuming the value for x is still in RAX (from the previous step)
    sub rax, 3          ; Subtract immediate value 3 from RAX (result in RAX: 17)

    ; Exit the program (standard Linux syscall for exit)
    mov rdi, 0          ; Exit code 0 (success)
    mov rax, 60         ; syscall number for exit
    syscall

Explanation: The input MiniLang code declares an integer variable x and initializes it with the result of 10 + 5 * 2. Then, it subtracts 3 from x. The generated assembly performs these operations using mov, imul, add, and sub instructions. The result of the final expression (17) is left in the RAX register. The assembly then includes the standard Linux exit syscall.

Example 2: Function Definition and Call

Input:
fn add(a: i32, b: i32): i32 {
    return a + b;
}

let result: i32 = add(5, 3);
; Generated assembly for MiniLang program

section .data
    ; No global data needed for this simple example

section .text
global _start
extern add_func     ; Declare external function if needed, or define it here

; Define the add function
add_func:
    ; Function prologue
    push rbp
    mov rbp, rsp

    ; Arguments a and b are typically passed in registers (e.g., RDI, RSI)
    ; For this example, assume 'a' is in RDI, 'b' is in RSI
    mov rax, rdi        ; Move 'a' (in RDI) to RAX
    add rax, rsi        ; Add 'b' (in RSI) to RAX

    ; Function epilogue
    mov rsp, rbp
    pop rbp
    ret                 ; Return from function (result in RAX)

_start:
    ; let result: i32 = add(5, 3);
    mov rdi, 5          ; Load immediate value 5 into RDI (first argument for add)
    mov rsi, 3          ; Load immediate value 3 into RSI (second argument for add)
    call add_func       ; Call the add_func. The return value will be in RAX.
    ; The result is now in RAX. We don't assign it to a specific variable in assembly here for simplicity.

    ; Exit the program
    mov rdi, 0          ; Exit code 0 (success)
    mov rax, 60         ; syscall number for exit
    syscall

Explanation: This example defines a function add that takes two integer arguments and returns their sum. The _start section then calls this function with 5 and 3, and the result is returned in RAX. The assembly code includes a separate definition for the add_func and demonstrates how to set up arguments and call the function.

Constraints

  • The MiniLang source code will be provided as a single string.
  • The generated assembly code must be compatible with x86-64 Linux executables.
  • The generated assembly should use standard AT&T or Intel syntax (choose one and stick to it for consistency). For this example, Intel syntax is assumed.
  • Your compiler should aim to be robust against basic syntax errors, reporting them clearly.
  • Performance is not a primary concern for this challenge; correctness and clarity of the generated code are paramount.

Notes

  • This is a simplified compiler. You don't need to implement a full language or handle complex optimizations.
  • Consider using a Go package for lexing (e.g., github.com/antonmedv/expr/lexer) and parsing, or build them from scratch to better understand the process.
  • For code generation, you'll need to understand x86-64 calling conventions (how arguments are passed to functions and how return values are handled). For function calls, the standard Linux x86-64 calling convention passes the first six integer or pointer arguments in registers RDI, RSI, RDX, RCX, R8, and R9. Return values are typically placed in RAX.
  • You might want to represent your AST using Go structs.
  • The goal is to generate valid assembly, not necessarily the most efficient assembly.
  • Think about how you'll manage symbol tables if you decide to implement more complex variable scoping or type checking. For this challenge, you can simplify by assuming variables are global or their scope is limited to the immediate context.
Loading editor...
go