
Implement a Simplified Transformer Model in Python

This challenge asks you to implement a fundamental building block of modern Natural Language Processing (NLP) models: the Transformer. You will focus on creating the core components of the Transformer architecture, specifically the self-attention mechanism and the feed-forward network, which are crucial for tasks like machine translation, text summarization, and question answering.

Problem Description

Your task is to implement a simplified Transformer encoder layer. This layer will take a sequence of input embeddings and process them using self-attention and a position-wise feed-forward network. You will need to implement the following components:

  1. Scaled Dot-Product Attention: This is the core attention mechanism. It calculates attention scores from queries, keys, and values, and then uses those scores to produce a weighted sum of the values (a minimal NumPy sketch follows this list).
  2. Multi-Head Attention: This mechanism allows the model to jointly attend to information from different representation subspaces at different positions. It consists of multiple "heads" of scaled dot-product attention, whose outputs are concatenated and linearly transformed.
  3. Position-wise Feed-Forward Network: This is a simple fully connected feed-forward network applied independently to each position in the sequence (also sketched after this list).
  4. Transformer Encoder Layer: This layer combines multi-head attention and the feed-forward network, along with residual connections and layer normalization.
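
As an illustration of the first component, here is a minimal NumPy sketch of scaled dot-product attention for a single unbatched sequence, with no masking. The function name matches the one suggested in the Notes; the assumption that queries, keys, and values arrive as 2-D arrays of shape (sequence_length, d_k) is illustrative, not part of the specification.

import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    # queries, keys: (seq_len, d_k); values: (seq_len, d_v).
    d_k = queries.shape[-1]

    # Similarity of every query with every key, scaled by sqrt(d_k)
    # so the softmax does not saturate for large d_k.
    scores = queries @ keys.T / np.sqrt(d_k)          # (seq_len, seq_len)

    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Attention output: weighted sum of the values.
    return weights @ values                           # (seq_len, d_v)

Multi-head attention can reuse this function once per head: the input is linearly projected into num_heads separate query/key/value triples of width d_model / num_heads, each triple is passed through scaled dot-product attention, and the per-head outputs are concatenated and linearly transformed back to d_model.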
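
The feed-forward sub-layer is the simplest component. The sketch below uses the function name suggested in the Notes; the explicit weight/bias arguments and the ReLU activation are assumptions (ReLU is one common choice), and you are free to organize parameters differently.

import numpy as np

def position_wise_feed_forward_network(x, W1, b1, W2, b2):
    # x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    # The same two linear transformations are applied at every position;
    # a single matrix multiplication handles all positions at once.
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU activation, (seq_len, d_ff)
    return hidden @ W2 + b2               # back to (seq_len, d_model)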

Key Requirements:

  • Implement each component as a distinct Python class or function.
  • Ensure that your implementation uses a numerical computation library like NumPy for efficiency.
  • The output of the Transformer encoder layer should have the same shape as the input sequence.

Expected Behavior:

Given an input sequence of embeddings, your Transformer encoder layer should produce a transformed sequence where each embedding has been updated based on its relationship with other embeddings in the sequence (via self-attention) and its own content (via the feed-forward network).
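
In outline, the forward pass chains the two sub-layers with residual connections and layer normalization. The skeleton below is a structural sketch only: it takes the sub-layer functions as arguments so that it stands alone, and the names self_attention, feed_forward, and layer_norm are placeholders for the components you implement.

def transformer_encoder_layer(x, self_attention, feed_forward, layer_norm):
    # x: (sequence_length, d_model). Each sub-layer's output is added back
    # onto its input (residual connection) and then layer-normalized,
    # so the shape (sequence_length, d_model) is preserved throughout.
    x = layer_norm(x + self_attention(x))   # multi-head self-attention sub-layer
    x = layer_norm(x + feed_forward(x))     # position-wise feed-forward sub-layer
    return x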

Edge Cases:

  • Input sequences of length 1.
  • Batching of sequences (though for this challenge, a single sequence input is sufficient for demonstration).

Examples

Example 1:

import numpy as np

# Assume input_embeddings has shape (sequence_length, embedding_dim)
input_embeddings = np.random.rand(10, 64) # 10 tokens, 64 dimensions per token

# Example parameters for a simplified transformer layer
d_model = 64 # embedding dimension
num_heads = 8 # number of attention heads
d_ff = 128 # dimension of the feed-forward network

# Expected output shape: (10, 64)
# The actual output values will depend on the random initialization of weights
# and the random input data.
# For demonstration, we'll just show the expected shape.
print(f"Input shape: {input_embeddings.shape}")
# output_embeddings = YourTransformerEncoderLayer(input_embeddings, d_model, num_heads, d_ff)
# print(f"Output shape: {output_embeddings.shape}")

Output:

Input shape: (10, 64)
# Expected Output shape: (10, 64)

Explanation: The input is a sequence of 10 embeddings, each with 64 dimensions. The Transformer encoder layer processes this sequence and outputs a transformed sequence of the same shape.

Example 2:

import numpy as np

input_embeddings_short = np.random.rand(1, 32) # Sequence of length 1

# Example parameters
d_model_short = 32
num_heads_short = 4
d_ff_short = 64

print(f"Input shape (short sequence): {input_embeddings_short.shape}")
# output_embeddings_short = YourTransformerEncoderLayer(input_embeddings_short, d_model_short, num_heads_short, d_ff_short)
# print(f"Output shape (short sequence): {output_embeddings_short.shape}")

Output:

Input shape (short sequence): (1, 32)
# Expected Output shape (short sequence): (1, 32)

Explanation: Even with a single token in the sequence, the Transformer encoder layer should still process it and produce an output of the same shape. With only one element, the softmax in self-attention assigns that element an attention weight of 1, so the token attends entirely to itself.

Constraints

  • The input input_embeddings will be a NumPy array of shape (sequence_length, embedding_dim).
  • sequence_length will be between 1 and 512.
  • embedding_dim will be between 32 and 512, and must be divisible by num_heads.
  • num_heads will be an integer between 2 and 16.
  • d_ff (dimension of the feed-forward network inner layer) will be between embedding_dim and 4 * embedding_dim.
  • Your implementation should be reasonably efficient for the given constraints: rely on vectorized NumPy operations rather than explicit Python loops over positions, which would be too slow for the larger inputs.

Notes

  • You will need to implement linear transformations (matrix multiplications followed by bias addition).
  • The softmax function will be required for calculating attention weights.
  • Consider the use of np.dot or np.matmul for matrix operations.
  • Residual connections involve adding the input of a sub-layer to its output.
  • Layer normalization helps stabilize training; you will need to implement it yourself (a helper sketch for softmax and layer normalization follows these notes).
  • For this challenge, you can assume the input input_embeddings is already preprocessed; in a full Transformer it would include positional encodings, but that is not required here.
  • Focus on the architecture and mathematical operations. For a full implementation, you would also need to consider embedding layers, positional encoding, and the decoder.
  • Start by implementing the scaled_dot_product_attention function, then multi_head_attention, then the position_wise_feed_forward_network, and finally combine them into the TransformerEncoderLayer.
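
For reference, the softmax and layer-normalization helpers mentioned above could be sketched in NumPy as follows. The epsilon value and the optional learnable scale (gamma) and shift (beta) parameters are assumptions; adapt them to your own design.

import numpy as np

def softmax(x, axis=-1):
    # Shift by the maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma=None, beta=None, eps=1e-6):
    # Normalize each position's embedding to zero mean and unit variance,
    # then optionally apply a learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    out = (x - mean) / np.sqrt(var + eps)
    if gamma is not None:
        out = out * gamma
    if beta is not None:
        out = out + beta
    return out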