
Implementing a Simplified Transformer Model in Python

This challenge asks you to implement a simplified version of the Transformer model, a cornerstone of modern natural language processing. Transformers have revolutionized tasks like machine translation, text generation, and question answering by leveraging self-attention mechanisms to understand relationships between words in a sequence. This exercise will focus on building the core components of a Transformer, excluding training and optimization for brevity.

Problem Description

You are tasked with implementing the key components of a Transformer model: the Self-Attention mechanism and a simplified Encoder block. The goal is to create a functional, albeit simplified, Transformer architecture capable of processing an input sequence and producing a contextualized representation. You will not be implementing the Decoder or positional encoding in this challenge, focusing solely on the Encoder's Self-Attention and a single Encoder layer.

What needs to be achieved:

  1. Self-Attention Mechanism: Implement the scaled dot-product attention mechanism. This involves calculating attention weights based on queries, keys, and values, and then applying these weights to the values to produce a weighted sum.
  2. Encoder Block: Create a simplified Encoder block that consists of:
    • A Self-Attention layer (as implemented in step 1).
    • A Feed-Forward Network (a simple two-layer fully connected network with a ReLU activation in between).
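The scaled dot-product step in point 1 can be sketched as follows. This is a single-head sketch, not a prescribed implementation; the function name and signature are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len)
    # Numerically stable softmax over each row of scores.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # (seq_len, d_k)
```

Each row of `weights` sums to 1, so every output position is a convex combination of the value vectors.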

Key Requirements:

  • The Self-Attention mechanism should handle multiple attention heads (multi-head attention).
  • The Feed-Forward Network should be a simple, fully connected network.
  • The code should be well-structured and documented.
  • The implementation should be efficient and avoid unnecessary computations.
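One common way to satisfy the multi-head requirement is to slice the model dimension into `num_heads` chunks of size `d_k = d_model // num_heads` and run attention on each chunk in parallel. A hypothetical helper pair (names are illustrative):

```python
import numpy as np

def split_heads(X, num_heads):
    """Reshape (seq_len, d_model) -> (num_heads, seq_len, d_k)."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    return X.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

def merge_heads(H):
    """Inverse of split_heads: (num_heads, seq_len, d_k) -> (seq_len, d_model)."""
    num_heads, seq_len, d_k = H.shape
    return H.transpose(1, 0, 2).reshape(seq_len, num_heads * d_k)
```

With the heads stacked along the leading axis, NumPy broadcasting lets you compute all heads' attention scores in one batched matrix product instead of a Python loop.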

Expected Behavior:

Given an input sequence (represented as a matrix where each row is a word embedding), the Encoder block should produce a contextualized representation of the sequence. This representation should reflect the relationships between words in the sequence, as captured by the Self-Attention mechanism.
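The behavior described above can be sketched end to end as follows. This is a minimal sketch with randomly initialised weights and no biases, residual connections, or layer normalisation (none of which the challenge requires); the class name and attribute names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class EncoderBlock:
    """Multi-head self-attention followed by a two-layer feed-forward network."""

    def __init__(self, d_model, num_heads, d_ff):
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Projection matrices; randomly initialised since training is out of scope.
        self.W_q = rng.standard_normal((d_model, d_model))
        self.W_k = rng.standard_normal((d_model, d_model))
        self.W_v = rng.standard_normal((d_model, d_model))
        self.W_o = rng.standard_normal((d_model, d_model))
        self.W_1 = rng.standard_normal((d_model, d_ff))
        self.W_2 = rng.standard_normal((d_ff, d_model))

    def __call__(self, x):
        seq_len, d_model = x.shape

        # Project, then split into heads: (num_heads, seq_len, d_k).
        def heads(W):
            return (x @ W).reshape(seq_len, self.num_heads, self.d_k).transpose(1, 0, 2)

        q, k, v = heads(self.W_q), heads(self.W_k), heads(self.W_v)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(self.d_k)   # (h, s, s)
        attended = softmax(scores) @ v                          # (h, s, d_k)
        # Merge heads back to (seq_len, d_model) and apply the output projection.
        merged = attended.transpose(1, 0, 2).reshape(seq_len, d_model)
        attn_out = merged @ self.W_o
        # Two-layer feed-forward network with ReLU in between.
        return np.maximum(0.0, attn_out @ self.W_1) @ self.W_2
```

Note that the output shape matches the input shape, as the constraints require.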

Edge Cases to Consider:

  • Empty input sequence.
  • Input sequence with a single element.
  • Different numbers of attention heads.
  • Different dimensions for queries, keys, and values.
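The single-element case is worth reasoning through: the softmax of a single score is always 1, so attention over one position returns its value vector unchanged. A quick check under that assumption:

```python
import numpy as np

# For a length-1 sequence, the (1, 1) score matrix softmaxes to [[1.0]],
# so the attention output equals the value row exactly.
Q = K = V = np.array([[0.1, 0.2, 0.3]])
d_k = K.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)        # shape (1, 1)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = weights @ V
```

The empty-sequence case is mostly a matter of not crashing: a (0, d_model) input should pass through the matrix products and yield a (0, d_model) output.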

Examples

Example 1:

Input:
sequence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]  # Shape: (3, 3) - 3 words, each with 3-dimensional embedding
num_heads = 3
d_model = 3
d_ff = 6

Output:
encoded_sequence = [[...], [...], [...]] # Shape: (3, 3) - Contextualized representation of the input sequence

Explanation: The input sequence is passed through the Encoder block. The Self-Attention mechanism calculates attention weights and produces a weighted sum of the input embeddings. The Feed-Forward Network then transforms this weighted sum into the final contextualized representation.

Example 2:

Input:
sequence = [[0.1, 0.2], [0.3, 0.4]]  # Shape: (2, 2)
num_heads = 1
d_model = 2
d_ff = 4

Output:
encoded_sequence = [[...], [...]] # Shape: (2, 2)

Explanation: A smaller input sequence with a single attention head. The Encoder block still applies the Self-Attention and Feed-Forward Network to produce a contextualized representation.

Constraints

  • Input Sequence Shape: The input sequence will be a 2D NumPy array of shape (sequence_length, d_model), where sequence_length is the number of words in the sequence and d_model is the embedding dimension.
  • Number of Heads: The number of attention heads (num_heads) will be between 1 and d_model, and d_model will always be evenly divisible by num_heads.
  • Feed-Forward Network Dimension: The hidden dimension of the Feed-Forward Network (d_ff) will be greater than d_model.
  • Performance: While optimization is not the primary focus, avoid excessively inefficient implementations. Reasonable performance for sequences of length up to 100 is expected.
  • Libraries: You are allowed to use NumPy for numerical operations. No other external libraries are permitted.

Notes

  • This is a simplified implementation, so positional encoding and the Decoder are not required.
  • Focus on the core Self-Attention and Encoder block logic.
  • Consider using NumPy's broadcasting capabilities to simplify calculations.
  • The scaling factor in the scaled dot-product attention should be the square root of the key dimension (d_k). In this case, d_k is equal to d_model // num_heads.
  • The Feed-Forward Network should be a simple two-layer fully connected network with a ReLU activation function in between.
  • The output shape of the Encoder block should be the same as the input shape (sequence_length, d_model).
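The feed-forward note above corresponds to something like the following (parameter names and the bias terms are illustrative, not mandated by the challenge):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise two-layer network: ReLU(x @ W1 + b1) @ W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Shape check with Example 2's dimensions: d_model=2, d_ff=4, 3 positions.
rng = np.random.default_rng(1)
x = rng.standard_normal((3, 2))
y = feed_forward(x,
                 rng.standard_normal((2, 4)), np.zeros(4),
                 rng.standard_normal((4, 2)), np.zeros(2))
```

Because the same weights are applied to every row of `x`, a single matrix product covers the whole sequence; no per-position loop is needed.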