HTML Sanitizer in JavaScript

Creating an HTML sanitizer is crucial for web applications to prevent Cross-Site Scripting (XSS) vulnerabilities. This challenge asks you to build a JavaScript function that takes an HTML string as input and returns a sanitized version, removing potentially harmful elements and attributes while preserving safe content. This is a fundamental security practice for any application accepting user-provided HTML.

Problem Description

You need to implement a JavaScript function called sanitizeHTML that takes an HTML string as input and returns a sanitized version of that string. The sanitizer should remove potentially dangerous HTML elements and attributes, while allowing a predefined set of safe elements and attributes.

What needs to be achieved:

The function should parse the input HTML string.
It should identify and remove any HTML elements and attributes that are not explicitly allowed.
It should return a new HTML string containing only the allowed elements and attributes, with their content preserved.

Key Requirements:

Allowed Elements: p, b, i, u, em, strong, a, img, br, ul, ol, li, span, div, h1, h2, h3, h4, h5, h6.
Allowed Attributes:
- For a elements: href (must be a valid URL - see edge cases)
- For img elements: src (must be a valid URL - see edge cases), alt
- For all other allowed elements: No attributes are allowed.
URL Validation: The href attribute of <a> tags and the src attribute of <img> tags must be validated to ensure they are valid URLs. A simple check is to ensure the URL starts with http:// or https://. Invalid URLs should result in the attribute being removed.
Attribute Value Sanitization: Attribute values should be escaped to prevent injection attacks. For simplicity, replace < with &lt; and > with &gt; in attribute values.
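The requirements above can be written down as data plus a small attribute filter. The following is a minimal sketch, not a reference solution: the names `ALLOWED_ELEMENTS`, `ALLOWED_ATTRIBUTES`, `URL_ATTRIBUTES`, `isAllowedUrl`, `escapeAttributeValue`, and `filterAttributes` are illustrative, and matching the URL scheme case-insensitively is an assumption that goes slightly beyond the stated "starts with http:// or https://" check.

```javascript
// Allowlist taken from the requirements; all names here are illustrative.
const ALLOWED_ELEMENTS = new Set([
  'p', 'b', 'i', 'u', 'em', 'strong', 'a', 'img', 'br', 'ul', 'ol', 'li',
  'span', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
]);

// Attributes permitted per element; elements not listed keep no attributes.
const ALLOWED_ATTRIBUTES = new Map([
  ['a', ['href']],
  ['img', ['src', 'alt']],
]);

// Attributes whose values must pass the URL check.
const URL_ATTRIBUTES = new Set(['href', 'src']);

// The "simple check": the value must start with http:// or https://.
// Assumption: the scheme is matched case-insensitively.
function isAllowedUrl(value) {
  return /^https?:\/\//i.test(value);
}

// Escape angle brackets as entities, as the requirements ask.
function escapeAttributeValue(value) {
  return value.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Return only the attributes that survive the allowlist and the URL check.
// `attributes` is a plain object mapping attribute names to string values.
function filterAttributes(tagName, attributes) {
  const allowed = ALLOWED_ATTRIBUTES.get(tagName) ?? [];
  const kept = {};
  for (const [name, value] of Object.entries(attributes)) {
    const key = name.toLowerCase();
    if (!allowed.includes(key)) continue;
    if (URL_ATTRIBUTES.has(key) && !isAllowedUrl(value)) continue;
    kept[key] = escapeAttributeValue(value);
  }
  return kept;
}
```

With these definitions, `filterAttributes('img', { src: 'invalid-url', alt: 'An image' })` keeps only `alt`, and any attribute passed for `p` is dropped.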

Expected Behavior:

The function should return a string containing only the allowed HTML elements and attributes, with any potentially harmful elements and attributes removed. The content within the allowed elements should be preserved.

Edge Cases to Consider:

Empty input string.
Input string containing only whitespace.
Input string containing only disallowed HTML elements.
Input string containing nested HTML elements.
Invalid URLs in href and src attributes.
HTML entities already present in the input string.
Self-closing tags (e.g., <br />).
Comments within the HTML. Comments should be removed.
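Several of these cases (comments, self-closing tags, nesting, entities already in the text) are easier to reason about once the input is split into tokens. Below is a rough sketch of such a tokenizer built on a regular expression rather than a DOM parser. `TOKEN_PATTERN` and `tokenize` are illustrative names, and the pattern is an assumption that will not classify every malformed input the way a browser would.

```javascript
// Matches, in order: a comment, an opening or closing tag,
// a run of text without "<", or a stray "<".
const TOKEN_PATTERN =
  /<!--[\s\S]*?-->|<(\/?)([a-zA-Z][a-zA-Z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*)>|[^<]+|</g;

function tokenize(html) {
  const tokens = [];
  for (const [raw, slash, name, rest] of html.matchAll(TOKEN_PATTERN)) {
    if (raw.startsWith('<!--')) {
      // Comments become their own token so a sanitizer can drop them.
      tokens.push({ type: 'comment', raw });
    } else if (name !== undefined) {
      tokens.push({
        type: slash ? 'close' : 'open',
        name: name.toLowerCase(),
        // True for forms such as <br /> or <img ... />.
        selfClosing: /\/\s*$/.test(rest),
        raw,
      });
    } else {
      // Text, including entities such as &amp;, is passed through untouched.
      tokens.push({ type: 'text', raw });
    }
  }
  return tokens;
}
```

An empty string yields no tokens, and a whitespace-only string yields a single text token, so neither case needs special handling at this stage.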

Examples

Example 1:

Input: "<p>This is <b>bold</b> text with a <a href='https://www.example.com'>link</a> and an <img src='https://www.example.com/image.jpg' alt='An image'/>.</p>"
Output: "<p>This is <b>bold</b> text with a <a href='https://www.example.com'>link</a> and an <img src='https://www.example.com/image.jpg' alt='An image'/>.</p>"
Explanation: All elements and attributes are allowed and valid.

Example 2:

Input: "<p>This is <b>bold</b> text with a <a href='javascript:alert(\"XSS\")'>link</a> and an <img src='invalid-url' alt='An image'/>.</p><script>alert('XSS')</script>"
Output: "<p>This is <b>bold</b> text with a <a>link</a> and an <img alt='An image'/>.</p>"
Explanation: The `href` holding the `javascript:` URL and the `src` holding the invalid URL both fail validation, so those attributes are removed while the `<a>` and `<img>` elements and the link text are kept. The `<script>` element is removed along with its content.

Example 3:

Input: "<div><p>Hello</p><script>alert('XSS')</script></div>"
Output: "<div><p>Hello</p></div>"
Explanation: The `<script>` element is removed along with its content.

Constraints

Input Size: The input HTML string can be up to 10,000 characters long.
Performance: The function should complete within 500 milliseconds for typical input strings.
Input Format: The input will always be a string.
Output Format: The output must be a valid HTML string.

Notes

You can use regular expressions or a DOM parser to parse the HTML string. Using a DOM parser (like DOMParser in browsers or a library like jsdom in Node.js) is generally recommended for more robust and accurate parsing.
Consider using a whitelist approach, explicitly allowing only the desired elements and attributes.
Be mindful of HTML entities and ensure they are handled correctly.
Thoroughly test your sanitizer with various inputs, including edge cases, to ensure its effectiveness.
This is a simplified sanitizer. Real-world sanitizers are significantly more complex and may involve more sophisticated techniques to prevent XSS vulnerabilities.
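To illustrate how these notes might come together, here is a dependency-free sketch that takes the regular-expression route so it runs in plain Node.js. A DOM-based version would likely be more robust. It is one possible approach, not a reference solution, and it rests on several assumptions: a disallowed element is dropped together with everything inside it (as the `<script>` examples suggest), attributes are written back with single quotes to match the examples, and single quotes inside values are escaped as `&#39;` in addition to the required `<` and `>` escaping. All helper names (`ALLOWED`, `VOID`, `TOKEN`, `ATTRIBUTE`, `escapeValue`, `renderOpenTag`) are illustrative.

```javascript
// Element -> allowed attribute names (from the requirements).
const ALLOWED = new Map([
  ['a', ['href']],
  ['img', ['src', 'alt']],
  ...['p', 'b', 'i', 'u', 'em', 'strong', 'br', 'ul', 'ol', 'li', 'span',
    'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'].map((tag) => [tag, []]),
]);

// Elements without a closing tag; a disallowed one must not open a skipped subtree.
const VOID = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr']);

// Comment | tag | text run | stray "<".
const TOKEN =
  /<!--[\s\S]*?-->|<(\/?)([a-zA-Z][a-zA-Z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*)>|[^<]+|</g;
// name, then optionally = "value" | 'value' | bareValue.
const ATTRIBUTE =
  /([^\s"'<>\/=]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?/g;

// Required < and > escaping, plus ' because values are emitted in single quotes.
const escapeValue = (value) =>
  value.replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/'/g, '&#39;');

// Rebuild an allowed opening tag from scratch, keeping only vetted attributes.
function renderOpenTag(name, rest) {
  let out = `<${name}`;
  const seen = new Set();
  for (const [, rawKey, dq, sq, bare] of rest.matchAll(ATTRIBUTE)) {
    const key = rawKey.toLowerCase();
    if (!ALLOWED.get(name).includes(key) || seen.has(key)) continue;
    seen.add(key); // only the first occurrence of an attribute counts
    const value = dq ?? sq ?? bare ?? '';
    const isUrl = key === 'href' || key === 'src';
    if (isUrl && !/^https?:\/\//i.test(value)) continue;
    out += ` ${key}='${escapeValue(value)}'`;
  }
  // Keep the self-closing form only for void elements that used it.
  return out + (VOID.has(name) && /\/\s*$/.test(rest) ? '/>' : '>');
}

function sanitizeHTML(html) {
  let out = '';
  let skipping = null; // disallowed element whose subtree is being dropped
  let depth = 0;
  for (const [raw, slash, rawName, rest] of html.matchAll(TOKEN)) {
    if (raw.startsWith('<!--')) continue; // comments are removed
    if (rawName === undefined) { // text or a stray "<"
      if (!skipping) out += raw === '<' ? '&lt;' : raw;
      continue;
    }
    const name = rawName.toLowerCase();
    if (skipping) {
      if (name === skipping) depth += slash ? -1 : 1;
      if (depth === 0) skipping = null;
      continue;
    }
    if (!ALLOWED.has(name)) {
      // Assumption: drop the whole subtree of a disallowed, non-void element.
      if (!slash && !VOID.has(name)) { skipping = name; depth = 1; }
      continue;
    }
    if (slash) {
      if (!VOID.has(name)) out += `</${name}>`;
    } else {
      out += renderOpenTag(name, rest);
    }
  }
  return out;
}
```

Because every tag in the output is rebuilt from the allowlist instead of being copied from the input, unexpected markup tends to fail closed: an unterminated disallowed element, for example, causes the rest of the input to be dropped.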