Implementing a Simplified Transformer Model in Python
This challenge asks you to implement a simplified version of the Transformer model, a cornerstone of modern natural language processing. Transformers have revolutionized tasks like machine translation, text generation, and question answering by leveraging self-attention mechanisms to understand relationships between words in a sequence. This exercise will focus on building the core components of a Transformer, excluding training and optimization for brevity.
Problem Description
You are tasked with implementing the key components of a Transformer model: the Self-Attention mechanism and a simplified Encoder block. The goal is to create a functional, albeit simplified, Transformer architecture capable of processing an input sequence and producing a contextualized representation. You will not be implementing the Decoder or positional encoding in this challenge, focusing solely on the Encoder's Self-Attention and a single Encoder layer.
What needs to be achieved:
- Self-Attention Mechanism: Implement the scaled dot-product attention mechanism. This involves calculating attention weights based on queries, keys, and values, and then applying these weights to the values to produce a weighted sum.
- Encoder Block: Create a simplified Encoder block that consists of:
- A Self-Attention layer (as implemented in step 1).
- A Feed-Forward Network (a simple two-layer fully connected network with a ReLU activation in between).
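The scaled dot-product attention described in step 1 can be sketched in NumPy as follows. This is a minimal, single-head, unbatched sketch; the function name and shapes are illustrative assumptions, not a prescribed signature:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    Returns an array of shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len)
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Each row of `weights` sums to 1, so the output rows are convex combinations of the value vectors.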
Key Requirements:
- The Self-Attention mechanism should handle multiple attention heads (multi-head attention).
- The Feed-Forward Network should be a simple, fully connected network.
- The code should be well-structured and documented.
- The implementation should be efficient and avoid unnecessary computations.
Expected Behavior:
Given an input sequence (represented as a matrix where each row is a word embedding), the Encoder block should produce a contextualized representation of the sequence. This representation should reflect the relationships between words in the sequence, as captured by the Self-Attention mechanism.
Edge Cases to Consider:
- Empty input sequence.
- Input sequence with a single element.
- Different numbers of attention heads.
- Different dimensions for queries, keys, and values.
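The challenge does not prescribe specific behavior for these edge cases; one reasonable approach (an assumption, including the helper name) is to validate inputs up front and pass an empty sequence through unchanged:

```python
import numpy as np

def check_inputs(sequence, num_heads):
    """Validate the edge cases listed above before running the encoder."""
    x = np.asarray(sequence, dtype=float)
    if x.ndim != 2:
        raise ValueError("expected a 2D (sequence_length, d_model) array")
    seq_len, d_model = x.shape
    if seq_len == 0:
        return x  # empty sequence: nothing to attend to
    if not 1 <= num_heads <= d_model or d_model % num_heads:
        raise ValueError("num_heads must lie in [1, d_model] and divide d_model")
    return x
```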
Examples
Example 1:
Input:
sequence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]] # Shape: (3, 3) - 3 words, each with 3-dimensional embedding
num_heads = 3
d_model = 3
d_ff = 6
Output:
encoded_sequence = [[...], [...], [...]] # Shape: (3, 3) - Contextualized representation of the input sequence
Explanation: The input sequence is passed through the Encoder block. The Self-Attention mechanism calculates attention weights and produces a weighted sum of the input embeddings. The Feed-Forward Network then transforms this weighted sum into the final contextualized representation.
Example 2:
Input:
sequence = [[0.1, 0.2], [0.3, 0.4]] # Shape: (2, 2)
num_heads = 1
d_model = 2
d_ff = 4
Output:
encoded_sequence = [[...], [...]] # Shape: (2, 2)
Explanation: A smaller input sequence with a single attention head. The Encoder block still applies the Self-Attention and Feed-Forward Network to produce a contextualized representation.
Constraints
- Input Sequence Shape: The input sequence will be a 2D NumPy array of shape (sequence_length, d_model), where sequence_length is the number of words in the sequence and d_model is the embedding dimension.
- Number of Heads: The number of attention heads (num_heads) will be between 1 and d_model, and will evenly divide d_model.
- Feed-Forward Network Dimension: The hidden dimension of the Feed-Forward Network (d_ff) will be greater than d_model.
- Performance: While optimization is not the primary focus, avoid excessively inefficient implementations. Reasonable performance for sequences of length up to 100 is expected.
- Libraries: You are allowed to use NumPy for numerical operations. No other external libraries are permitted.
Notes
- This is a simplified implementation, so positional encoding and the Decoder are not required.
- Focus on the core Self-Attention and Encoder block logic.
- Consider using NumPy's broadcasting capabilities to simplify calculations.
- The scaling factor in the scaled dot-product attention should be the square root of the key dimension (d_k). In this case, d_k is equal to d_model // num_heads.
- The Feed-Forward Network should be a simple two-layer fully connected network with a ReLU activation function in between.
- The output shape of the Encoder block should be the same as the input shape (sequence_length, d_model).
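Putting the notes together, a simplified Encoder block might look like the sketch below. The random weight initialization stands in for learned parameters, and the function-based structure is an assumption; a full solution would typically wrap the weights in a class:

```python
import numpy as np

def encoder_block(x, num_heads, d_ff, seed=0):
    """Simplified Encoder: multi-head self-attention followed by a two-layer
    feed-forward network with ReLU. Input and output shape: (seq_len, d_model).
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    d_k = d_model // num_heads  # per-head query/key/value dimension

    # Learned projections (randomly initialized here for illustration).
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    # Split into heads: (num_heads, seq_len, d_k).
    def heads(m):
        return m.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = heads(Q), heads(K), heads(V)

    # Scaled dot-product attention for all heads at once, via broadcasting.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    attended = w @ Vh                                   # (num_heads, seq_len, d_k)

    # Merge heads and apply the output projection.
    merged = attended.transpose(1, 0, 2).reshape(seq_len, d_model)
    attn_out = merged @ Wo

    # Two-layer feed-forward network with ReLU in between.
    W1 = rng.standard_normal((d_model, d_ff))
    W2 = rng.standard_normal((d_ff, d_model))
    return np.maximum(attn_out @ W1, 0) @ W2            # (seq_len, d_model)
```

Run with Example 1's parameters, this produces an output of shape (3, 3), matching the requirement that the Encoder preserves the input shape.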