
Harnessing CPU Architectures: Vectorized Summation in Rust

This challenge focuses on optimizing a common computational task, summing the elements of a large array, by leveraging CPU-specific instruction sets. You will explore how to write code that adapts to the underlying hardware, yielding significant performance gains for numerical computations.

Problem Description

Your task is to implement a function that efficiently calculates the sum of elements in a large f32 (single-precision floating-point) slice. The primary goal is to achieve significant speedups by utilizing Single Instruction, Multiple Data (SIMD) instructions, specifically targeting common architectures like SSE2, AVX, and AVX2. You should aim to create a version of the summation that can dynamically select the most appropriate SIMD instruction set available on the target CPU at runtime, or fall back to a scalar implementation if SIMD is not available or beneficial.

Key Requirements:

  1. Scalar Implementation: Create a baseline scalar implementation for summing the f32 slice. This will serve as a point of comparison.
  2. SIMD Implementation: Implement a version of the summation that utilizes SIMD instructions. This implementation should ideally be able to detect and utilize different SIMD instruction sets (e.g., SSE2, AVX, AVX2) if available.
  3. Runtime Dispatch: Implement a mechanism for runtime dispatch, where the program checks the CPU's capabilities and chooses the most optimized summation method available.
  4. Correctness: The SIMD implementation must produce results that match the scalar implementation to within floating-point tolerance for all valid inputs. Note that SIMD accumulation reorders the additions, so bit-exact equality with the scalar sum cannot generally be expected for f32.
  5. Performance: The SIMD implementation should demonstrate a measurable performance improvement over the scalar implementation for large input slices.
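The dispatch structure described in requirements 1–3 can be sketched as follows. This is a minimal skeleton only: `sum_simd_stub` is a hypothetical placeholder (plain scalar code here) standing in for a real intrinsics-based path, and the function names are illustrative, not prescribed.

```rust
// Requirement 1: the scalar baseline used for comparison and as a fallback.
fn sum_scalar(data: &[f32]) -> f32 {
    data.iter().sum()
}

// Placeholder for the SIMD path (requirement 2). In a real solution this
// would use std::arch intrinsics behind #[target_feature(enable = "avx2")].
fn sum_simd_stub(data: &[f32]) -> f32 {
    data.iter().sum()
}

// Requirement 3: runtime dispatch. On x86-64, probe the CPU once and pick
// the best available path; everywhere else, fall back to scalar.
fn sum(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return sum_simd_stub(data);
        }
    }
    sum_scalar(data)
}
```

In production code the feature check is often hoisted out of the hot path (e.g. resolved once into a function pointer), since `is_x86_feature_detected!` caches its result but still costs a load and branch per call.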

Expected Behavior:

The function should take a slice of f32 values and return their sum. For small slices, the overhead of SIMD might make the scalar version faster. The optimization should be most apparent for large slices.

Edge Cases:

  • Empty Slice: The sum of an empty slice should be 0.0.
  • Slice Length Not Divisible by SIMD Vector Width: The implementation must correctly handle slices whose lengths are not perfectly divisible by the number of elements that can be processed by a single SIMD instruction. Padding or careful handling of the remainder is crucial.
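One portable way to handle the remainder case is `slice::chunks_exact`, which splits the slice into full-width chunks plus a leftover tail. The sketch below is scalar throughout; the 8-element width is chosen to mirror an AVX register (8 × f32), and the same split applies unchanged when the chunk body is replaced with intrinsics.

```rust
// Sums in 8-wide chunks and handles the leftover elements separately,
// so lengths not divisible by 8 are covered without padding.
fn sum_with_tail(data: &[f32]) -> f32 {
    let chunks = data.chunks_exact(8);
    // `remainder()` yields the final 0..=7 elements that don't fill a chunk.
    let tail: f32 = chunks.remainder().iter().sum();
    let body: f32 = chunks.map(|c| c.iter().sum::<f32>()).sum();
    body + tail
}
```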

Examples

Example 1:

Input: vec![1.0, 2.0, 3.0, 4.0, 5.0]
Output: 15.0
Explanation: Standard summation of the elements.

Example 2:

Input: vec![0.1; 1000000] // One million elements, each 0.1
Output: ≈ 100000.0
Explanation: A large slice where SIMD optimization is expected to be most impactful. Since 0.1 is not exactly representable in f32 and rounding error accumulates with each addition, the result only approximates 100000.0, and the scalar and SIMD versions may differ slightly because they add in different orders.

Example 3:

Input: vec![]
Output: 0.0
Explanation: Handling of an empty input slice.

Constraints

  • Input slice will contain f32 values.
  • The maximum size of the input slice can be up to 1,000,000 elements.
  • The target platform should ideally be x86-64 or AArch64.
  • Performance improvement is a key metric. For a slice of 1,000,000 elements, expect a speedup of at least 2x (ideally more) compared to a naive scalar loop on a modern CPU that supports AVX2.
  • The solution should be implemented using safe Rust where possible; leveraging std::arch for SIMD intrinsics (which are unsafe to call) is expected and allowed.
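To check the speedup constraint, you eventually want criterion (see the notes below), but a quick-and-dirty timing helper using only the standard library is enough for a first sanity check. The helper name here is illustrative.

```rust
use std::time::{Duration, Instant};

// Times one summation pass over `data` with a naive scalar loop and
// returns both the sum and the elapsed wall-clock time. Swap the body
// for your SIMD path to compare the two.
fn timed_sum(data: &[f32]) -> (f32, Duration) {
    let start = Instant::now();
    let total: f32 = data.iter().sum();
    (total, start.elapsed())
}
```

Note that a single wall-clock measurement is noisy; criterion's repeated sampling and outlier analysis is what you should trust for the 2x claim.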

Notes

  • Consider using the std::arch module for accessing CPU-specific SIMD intrinsics. These intrinsics are unsafe to call: soundness depends on verifying that the CPU actually supports the required feature (e.g., via is_x86_feature_detected!) before invoking them.
  • You might need to compile with specific flags (e.g., -C target-cpu=native or -C target-feature=+avx2) to enable certain SIMD instructions during compilation, or rely on runtime feature detection.
  • Think about how to handle the tail end of the array (elements that don't form a full SIMD vector) efficiently.
  • Benchmarking is essential to prove the effectiveness of your optimizations. Use crates like criterion for robust performance measurements.
  • Research the SIMD vector widths for f32 on common architectures (e.g., SSE uses 128-bit registers, i.e., 4 f32 lanes; AVX/AVX2 use 256-bit registers, i.e., 8 f32 lanes). This determines how many f32 elements can be processed per instruction.
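Putting the notes above together, here is a hedged sketch of an AVX2 accumulation loop for x86-64. Function and variable names are illustrative; the explicit `unsafe` blocks mark the intrinsic calls, which are sound only because the caller checks for AVX2 first. Because the lanes are added in a different order than a scalar loop, expect agreement only to floating-point tolerance.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let mut acc = unsafe { _mm256_setzero_ps() }; // 8 running partial sums
    let chunks = data.chunks_exact(8);
    let tail: f32 = chunks.remainder().iter().sum(); // leftover 0..=7 elements
    for chunk in chunks {
        // Unaligned load: safe for any slice-backed data.
        let v = unsafe { _mm256_loadu_ps(chunk.as_ptr()) };
        acc = unsafe { _mm256_add_ps(acc, v) };
    }
    // Horizontal reduction: spill the 8 lanes and finish in scalar code.
    let mut lanes = [0.0f32; 8];
    unsafe { _mm256_storeu_ps(lanes.as_mut_ptr(), acc) };
    lanes.iter().sum::<f32>() + tail
}

#[cfg(target_arch = "x86_64")]
fn sum(data: &[f32]) -> f32 {
    if is_x86_feature_detected!("avx2") {
        unsafe { sum_avx2(data) } // sound: the feature was verified at runtime
    } else {
        data.iter().sum()
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn sum(data: &[f32]) -> f32 {
    data.iter().sum()
}
```

A natural refinement is to keep several independent accumulator registers to break the dependency chain between additions, which usually improves throughput beyond this single-accumulator version.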