Implementing a Basic Additive Attention Mechanism in Python

Attention mechanisms are a fundamental component in modern deep learning, particularly in sequence-to-sequence models like those used for machine translation or text summarization. They allow a model to dynamically focus on specific parts of the input sequence when processing the output. This challenge focuses on implementing a simplified additive (Bahdanau-style) attention mechanism from scratch in Python.

Problem Description

Your task is to implement a class that represents a basic additive attention mechanism. This mechanism will take a query vector and a set of key-value pairs (where keys and values are typically derived from an encoder's hidden states) and produce an attention-weighted context vector.

Key Requirements:

  1. Class Structure: Create a Python class named AdditiveAttention.
  2. Initialization (__init__):
    • The constructor should accept two integer arguments: query_dim (the dimension of the query vector) and key_dim (the dimension of the key vectors).
    • It should initialize the necessary learnable parameters:
      • W_q: A weight matrix for the query (shape (query_dim, query_dim) in the examples below).
      • W_k: A weight matrix for the keys (shape (key_dim, query_dim) in the examples below).
      • v: A weight vector that maps the combined projection to a scalar score (shape (query_dim, 1) in the examples below).
    • In a real model these parameters would be initialized with a standard scheme (e.g., Xavier initialization); for this challenge, dummy NumPy arrays of the appropriate shapes (e.g., from np.random.randn) are sufficient for testing.
  3. Forward Pass (forward):
    • The forward method should accept three arguments:
      • query: A NumPy array representing the query vector (shape: (batch_size, query_dim)).
      • keys: A NumPy array representing the key vectors (shape: (batch_size, seq_len, key_dim)).
      • values: A NumPy array representing the value vectors (shape: (batch_size, seq_len, value_dim)), where value_dim can be different from key_dim.
    • The method should perform the following steps (a minimal sketch follows this requirements list):
      • Score Calculation: Compute the attention scores using the additive attention formula: $e_{ij} = v^T \tanh(W_q q_i + W_k k_j)$ where $q_i$ is the $i$-th query, $k_j$ is the $j$-th key, $W_q$ and $W_k$ are weight matrices, and $v$ is a weight vector. You'll need to broadcast the query and keys appropriately.
      • Softmax: Apply a softmax function to the calculated scores along the seq_len dimension to get attention weights ($\alpha_{ij}$). $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$
      • Context Vector Calculation: Compute the weighted sum of the value vectors using the attention weights to produce the context vector: $c_i = \sum_j \alpha_{ij} v_j$, where $v_j$ denotes the $j$-th value vector (not the scoring weight vector $v$).
    • The forward method should return two NumPy arrays:
      • context_vector: The attention-weighted context vector (shape: (batch_size, value_dim)).
      • attention_weights: The computed attention weights (shape: (batch_size, seq_len)).
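
A minimal sketch of one possible implementation is shown below. It assumes the hidden dimension of the scoring layer equals query_dim (matching the dummy weight shapes in the examples) and initializes the parameters with np.random.randn purely for testing; in a trained model these would be learned.

import numpy as np

class AdditiveAttention:
    """Basic additive (Bahdanau-style) attention in NumPy."""

    def __init__(self, query_dim, key_dim):
        # The hidden dimension of the scoring MLP is taken to be query_dim here.
        self.W_q = np.random.randn(query_dim, query_dim)   # projects the query
        self.W_k = np.random.randn(key_dim, query_dim)     # projects the keys
        self.v = np.random.randn(query_dim, 1)             # maps the tanh output to a scalar score

    def forward(self, query, keys, values):
        # query:  (batch_size, query_dim)
        # keys:   (batch_size, seq_len, key_dim)
        # values: (batch_size, seq_len, value_dim)
        proj_query = query @ self.W_q    # (batch_size, query_dim)
        proj_keys = keys @ self.W_k      # (batch_size, seq_len, query_dim)

        # Broadcast the projected query across the sequence dimension and score each position.
        combined = np.tanh(proj_query[:, None, :] + proj_keys)   # (batch_size, seq_len, query_dim)
        scores = (combined @ self.v).squeeze(-1)                  # (batch_size, seq_len)

        # Softmax along seq_len; subtracting the row max keeps the exponentials stable.
        exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        attention_weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

        # Weighted sum of the values: (batch_size, 1, seq_len) @ (batch_size, seq_len, value_dim).
        context_vector = (attention_weights[:, None, :] @ values).squeeze(1)

        return context_vector, attention_weights

Subtracting the per-row maximum before exponentiating does not change the softmax result but prevents overflow for large scores.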

Expected Behavior:

  • The forward method should correctly compute the attention scores, apply softmax to get weights, and then produce a context vector that is a weighted average of the input values (a quick sanity check is sketched after this list).
  • The dimensions of the output arrays should match the specifications.
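
A correct implementation can be sanity-checked on random inputs: the softmax weights should be nonnegative and sum to 1 along seq_len, and the output shapes should follow the batch and value dimensions. The snippet below assumes the AdditiveAttention interface described above (for instance, the sketch given earlier).

import numpy as np

query_check = np.random.randn(4, 10)       # batch_size=4, query_dim=10
keys_check = np.random.randn(4, 6, 12)     # seq_len=6, key_dim=12
values_check = np.random.randn(4, 6, 7)    # value_dim=7

attention = AdditiveAttention(query_dim=10, key_dim=12)
context_vector, attention_weights = attention.forward(query_check, keys_check, values_check)

assert np.all(attention_weights >= 0)
assert np.allclose(attention_weights.sum(axis=1), 1.0)   # softmax rows sum to 1
assert context_vector.shape == (4, 7)
assert attention_weights.shape == (4, 6)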

Edge Cases:

  • Consider what happens if seq_len is 1: the softmax should still operate correctly, and the single attention weight will be exactly 1.0 (see the check after this list).
  • Ensure that operations are performed element-wise or via broadcasting where appropriate to handle batches.
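
With seq_len equal to 1, the softmax normalizes a single score, so the only attention weight is exactly 1.0 and the context vector equals the lone value vector. A brief check, again assuming the interface specified above:

import numpy as np

attention = AdditiveAttention(query_dim=4, key_dim=4)
query_one = np.random.randn(2, 4)        # batch_size=2, query_dim=4
keys_one = np.random.randn(2, 1, 4)      # seq_len=1
values_one = np.random.randn(2, 1, 6)    # value_dim=6

context_one, weights_one = attention.forward(query_one, keys_one, values_one)

assert np.allclose(weights_one, 1.0)                   # softmax of a single element is 1
assert np.allclose(context_one, values_one[:, 0, :])   # context equals that single value vector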

Examples

Example 1:

import numpy as np

# Dummy parameters (replace with actual initialization for a real model)
query_dim = 10
key_dim = 10
value_dim = 5
batch_size = 2
seq_len = 3

# Initialize dummy weights for demonstration
W_q = np.random.randn(query_dim, query_dim)
W_k = np.random.randn(key_dim, query_dim)
v = np.random.randn(query_dim, 1)

# Dummy input
query = np.random.randn(batch_size, query_dim)
keys = np.random.randn(batch_size, seq_len, key_dim)
values = np.random.randn(batch_size, seq_len, value_dim)

# Expected Output Structure:
# context_vector: (batch_size, value_dim)
# attention_weights: (batch_size, seq_len)

Explanation: This example sets up the dimensions and dummy input for a typical scenario. The AdditiveAttention class will be instantiated with query_dim and key_dim, and the forward method will process the query, keys, and values to produce the context_vector and attention_weights.
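
Assuming the class follows the interface in the requirements (e.g., the sketch above), the instantiation and call for this setup would be:

attention = AdditiveAttention(query_dim, key_dim)
context_vector, attention_weights = attention.forward(query, keys, values)

print(context_vector.shape)      # (2, 5)  -> (batch_size, value_dim)
print(attention_weights.shape)   # (2, 3)  -> (batch_size, seq_len)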

Example 2:

# Assume AdditiveAttention class is defined as per requirements

# Input for a single example in the batch
query_single = np.random.randn(1, query_dim)
keys_single = np.random.randn(1, seq_len, key_dim)
values_single = np.random.randn(1, seq_len, value_dim)

# Instantiate the mechanism and call forward for this single example
# attention = AdditiveAttention(query_dim, key_dim)
# context_single, weights_single = attention.forward(query_single, keys_single, values_single)
# Expected: context_single.shape == (1, value_dim), weights_single.shape == (1, seq_len)

Explanation: This example highlights processing a batch of size 1, ensuring the mechanism works correctly for individual data points.

Example 3: (Handling different value_dim)

# Input with value_dim different from key_dim
query_dim_ex3 = 8
key_dim_ex3 = 12
value_dim_ex3 = 20
batch_size_ex3 = 1
seq_len_ex3 = 5

# Dummy parameters for this example
W_q_ex3 = np.random.randn(query_dim_ex3, query_dim_ex3)
W_k_ex3 = np.random.randn(key_dim_ex3, query_dim_ex3)
v_ex3 = np.random.randn(query_dim_ex3, 1)

# Dummy input
query_ex3 = np.random.randn(batch_size_ex3, query_dim_ex3)
keys_ex3 = np.random.randn(batch_size_ex3, seq_len_ex3, key_dim_ex3)
values_ex3 = np.random.randn(batch_size_ex3, seq_len_ex3, value_dim_ex3)

# Instantiate with the Example 3 dimensions and call forward
# attention_ex3 = AdditiveAttention(query_dim_ex3, key_dim_ex3)
# context_ex3, weights_ex3 = attention_ex3.forward(query_ex3, keys_ex3, values_ex3)
# Expected: context_ex3.shape == (batch_size_ex3, value_dim_ex3) which is (1, 20)
# Expected: weights_ex3.shape == (batch_size_ex3, seq_len_ex3) which is (1, 5)

Explanation: This example demonstrates that the value_dim does not need to match the key_dim, as the context vector is a weighted sum of the values. The attention scores and weights are still computed based on query_dim and key_dim.

Constraints

  • Library Usage: You are allowed to use numpy for all numerical computations. No deep learning frameworks (like TensorFlow or PyTorch) are allowed for implementing the core attention logic.
  • Input Format: query will be a NumPy array of shape (batch_size, query_dim). keys will be a NumPy array of shape (batch_size, seq_len, key_dim). values will be a NumPy array of shape (batch_size, seq_len, value_dim).
  • Output Format: The forward method must return two NumPy arrays: context_vector of shape (batch_size, value_dim) and attention_weights of shape (batch_size, seq_len).
  • Parameter Initialization: For testing, you can initialize weights using np.random.randn or similar. In a real application, these would be trainable parameters.
  • Batch Processing: The implementation must handle batched inputs correctly.

Notes

  • The additive attention mechanism is also known as Bahdanau attention.
  • Pay close attention to the dimensions during matrix multiplications and broadcasting operations. NumPy's np.tanh and np.exp functions will be useful.
  • The softmax needs to be applied along the seq_len dimension for each example in the batch. np.sum with axis parameter and keepdims=True will be helpful for normalization.
  • Think about how to efficiently compute $W_q q_i + W_k k_j$ for all $i$ and $j$ across the batch. Broadcasting will be your friend here.
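
As an illustration of these hints, using the variables and shapes from Example 1 and keeping a trailing singleton axis so that keepdims-style broadcasting handles the bookkeeping, the score, softmax, and weighted-sum steps can be written as:

proj_query = (query @ W_q)[:, None, :]        # (batch_size, 1, query_dim)
proj_keys = keys @ W_k                        # (batch_size, seq_len, query_dim)
scores = np.tanh(proj_query + proj_keys) @ v  # (batch_size, seq_len, 1)

# Softmax over seq_len; keepdims=True keeps the shapes broadcastable.
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)   # (batch_size, seq_len, 1)

context = (weights * values).sum(axis=1)      # (batch_size, value_dim)

weights.squeeze(-1) then yields the (batch_size, seq_len) array that forward is required to return.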