Implementing a Simplified Transformer Model in Python
This challenge asks you to implement a simplified version of the Transformer model, a cornerstone of modern natural language processing. Transformers have revolutionized tasks like machine translation, text generation, and question answering by leveraging self-attention mechanisms to understand relationships between words in a sequence. This exercise will focus on building the core components of a Transformer, excluding training and optimization for brevity.
Problem Description
You are tasked with implementing the key components of a Transformer model: the Self-Attention mechanism and a simplified Encoder block. The goal is to create a functional, albeit simplified, Transformer architecture capable of processing an input sequence and producing a contextualized representation. You will not be implementing the Decoder or positional encoding in this challenge, focusing solely on the Encoder's Self-Attention and a single Encoder layer.
What needs to be achieved:
- Self-Attention Mechanism: Implement the scaled dot-product attention mechanism. This involves calculating attention weights based on queries, keys, and values, and then applying these weights to the values to produce a weighted sum.
- Encoder Block: Create a simplified Encoder block that consists of:
- A Self-Attention layer (as implemented in step 1).
- A Feed-Forward Network (a simple two-layer fully connected network with a ReLU activation in between).
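The scaled dot-product attention described in step 1 can be sketched in NumPy as follows. This is a minimal, single-head, unbatched sketch; the function name and shapes are illustrative assumptions, not a prescribed signature:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    Returns an array of shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len)
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Each row of `weights` sums to 1, so the output rows are convex combinations of the value vectors.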
Key Requirements:
- The Self-Attention mechanism should handle multiple attention heads (multi-head attention).
- The Feed-Forward Network should be a simple, fully connected network.
- The code should be well-structured and documented.
- The implementation should be efficient and avoid unnecessary computations.
Expected Behavior:
Given an input sequence (represented as a matrix where each row is a word embedding), the Encoder block should produce a contextualized representation of the sequence. This representation should reflect the relationships between words in the sequence, as captured by the Self-Attention mechanism.
Edge Cases to Consider:
- Empty input sequence.
- Input sequence with a single element.
- Different numbers of attention heads.
- Different dimensions for queries, keys, and values.
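The challenge does not prescribe specific behavior for these edge cases; one reasonable approach (an assumption, including the helper name) is to validate inputs up front and pass an empty sequence through unchanged:

```python
import numpy as np

def check_inputs(sequence, num_heads):
    """Validate the edge cases listed above before running the encoder."""
    x = np.asarray(sequence, dtype=float)
    if x.ndim != 2:
        raise ValueError("expected a 2D (sequence_length, d_model) array")
    seq_len, d_model = x.shape
    if seq_len == 0:
        return x  # empty sequence: nothing to attend to
    if not 1 <= num_heads <= d_model or d_model % num_heads:
        raise ValueError("num_heads must lie in [1, d_model] and divide d_model")
    return x
```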
Examples
Example 1:
Input:
sequence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]] # Shape: (3, 3) - 3 words, each with 3-dimensional embedding
num_heads = 3
d_model = 3
d_ff = 6
Output:
encoded_sequence = [[...], [...], [...]] # Shape: (3, 3) - Contextualized representation of the input sequence
Explanation: The input sequence is passed through the Encoder block. The Self-Attention mechanism calculates attention weights and produces a weighted sum of the input embeddings. The Feed-Forward Network then transforms this weighted sum into the final contextualized representation.
Example 2:
Input:
sequence = [[0.1, 0.2], [0.3, 0.4]] # Shape: (2, 2)
num_heads = 1
d_model = 2
d_ff = 4
Output:
encoded_sequence = [[...], [...]] # Shape: (2, 2)
Explanation: A smaller input sequence with a single attention head. The Encoder block still applies the Self-Attention and Feed-Forward Network to produce a contextualized representation.
Constraints
- Input Sequence Shape: The input sequence will be a 2D NumPy array of shape (sequence_length, d_model), where sequence_length is the number of words in the sequence and d_model is the embedding dimension.
- Number of Heads: The number of attention heads (num_heads) will be between 1 and d_model, and will evenly divide d_model.
- Feed-Forward Network Dimension: The hidden dimension of the Feed-Forward Network (d_ff) will be greater than d_model.
- Performance: While optimization is not the primary focus, avoid excessively inefficient implementations. Reasonable performance for sequences of length up to 100 is expected.
- Libraries: You are allowed to use NumPy for numerical operations. No other external libraries are permitted.
Notes
- This is a simplified implementation, so positional encoding and the Decoder are not required.
- Focus on the core Self-Attention and Encoder block logic.
- Consider using NumPy's broadcasting capabilities to simplify calculations.
- The scaling factor in the scaled dot-product attention should be the square root of the key dimension (d_k). In this case, d_k is equal to d_model // num_heads.
- The Feed-Forward Network should be a simple two-layer fully connected network with a ReLU activation function in between.
- The output shape of the Encoder block should be the same as the input shape (sequence_length, d_model).
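Putting the notes together, a simplified Encoder block might look like the sketch below. The random weight initialization stands in for learned parameters, and the function-based structure is an assumption; a full solution would typically wrap the weights in a class:

```python
import numpy as np

def encoder_block(x, num_heads, d_ff, seed=0):
    """Simplified Encoder: multi-head self-attention followed by a two-layer
    feed-forward network with ReLU. Input and output shape: (seq_len, d_model).
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    d_k = d_model // num_heads  # per-head query/key/value dimension

    # Learned projections (randomly initialized here for illustration).
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    # Split into heads: (num_heads, seq_len, d_k).
    def heads(m):
        return m.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = heads(Q), heads(K), heads(V)

    # Scaled dot-product attention for all heads at once, via broadcasting.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    attended = w @ Vh                                   # (num_heads, seq_len, d_k)

    # Merge heads and apply the output projection.
    merged = attended.transpose(1, 0, 2).reshape(seq_len, d_model)
    attn_out = merged @ Wo

    # Two-layer feed-forward network with ReLU in between.
    W1 = rng.standard_normal((d_model, d_ff))
    W2 = rng.standard_normal((d_ff, d_model))
    return np.maximum(attn_out @ W1, 0) @ W2            # (seq_len, d_model)
```

Run with Example 1's parameters, this produces an output of shape (3, 3), matching the requirement that the Encoder preserves the input shape.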