JavaScript HTML Sanitizer Challenge
Creating an HTML sanitizer is crucial for web security, preventing malicious code injection (like cross-site scripting or XSS attacks) when users submit HTML content. This challenge asks you to build a robust JavaScript function that cleanses user-provided HTML, allowing only a safe subset of tags and attributes.
Problem Description
Your task is to implement a JavaScript function named sanitizeHTML that takes a string containing HTML as input and returns a sanitized string. The sanitizer should remove any potentially harmful HTML elements and attributes while preserving a predefined set of safe tags and their attributes.
Key Requirements:
- Allowlist Approach: Define a strict allowlist of HTML tags that are permitted. Any tag not on this list should be removed.
- Attribute Filtering: For allowed tags, only specific attributes should be permitted. All other attributes must be removed.
- Attribute Value Sanitization: The values of allowed attributes may also need basic sanitization (e.g., rejecting `javascript:` URLs in `href` or `src`).
- Content Preservation: The text content within allowed tags should be preserved.
- Handle Nesting: The sanitizer should correctly handle nested HTML structures.
- Basic HTML Entities: Preserve common HTML entities like `&amp;`, `&lt;`, `&gt;`, `&quot;`, and `&#39;`.
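The allowlist requirements above can be captured in a small lookup table. The following is a minimal sketch, not a prescribed design: the names `ALLOWED_ATTRIBUTES`, `isAllowedTag`, and `isAllowedAttribute` are hypothetical, and the tag set simply follows the starting point suggested in the Notes.

```javascript
// Hypothetical allowlist: tag name -> set of attribute names permitted on it.
// A Map is used so that names like "constructor" cannot match inherited
// object properties.
const ALLOWED_ATTRIBUTES = new Map([
  ["p", new Set()],
  ["strong", new Set()],
  ["em", new Set()],
  ["br", new Set()],
  ["ul", new Set()],
  ["ol", new Set()],
  ["li", new Set()],
  ["a", new Set(["href"])],
  ["img", new Set(["src"])],
]);
// h1 to h6 carry no attributes in this sketch.
for (let level = 1; level <= 6; level++) {
  ALLOWED_ATTRIBUTES.set(`h${level}`, new Set());
}

// Tag and attribute names are compared case-insensitively.
function isAllowedTag(tagName) {
  return ALLOWED_ATTRIBUTES.has(tagName.toLowerCase());
}

function isAllowedAttribute(tagName, attrName) {
  const attrs = ALLOWED_ATTRIBUTES.get(tagName.toLowerCase());
  return attrs !== undefined && attrs.has(attrName.toLowerCase());
}
```

With this table, `isAllowedAttribute("A", "HREF")` is true while `isAllowedAttribute("a", "onclick")` is false, so event handlers are dropped simply because they never appear in the allowlist.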
Expected Behavior:
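- Tags not on the allowlist should be stripped. Plain text inside them should be kept, except for elements such as `<script>`, whose content is removed together with the tag (see Example 1).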
- Allowed tags should be kept, but any attribute on them that is not on the allowlist should be removed.
- Attributes with potentially dangerous values (like `javascript:` URLs in `href` or `src`) should be handled safely.
- Valid HTML structure should be maintained as much as possible.
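One way to handle dangerous URL values safely is to accept only known schemes rather than trying to list bad ones. The sketch below is an illustration under stated assumptions: `isSafeUrl` is a hypothetical helper, the accepted schemes (`http`, `https`, `mailto`) are a choice made here rather than part of the challenge, and the function receives the raw attribute text without entity decoding.

```javascript
// Hypothetical helper: decides whether an href/src value may be kept.
function isSafeUrl(value) {
  // Browsers strip leading whitespace/control characters and ignore tabs and
  // newlines inside a URL, so remove them before inspecting the value.
  const compact = value.replace(/[\u0000-\u0020]/g, "");
  // A scheme can only appear before the first "/" or "?".
  const prefix = compact.split(/[\/?]/, 1)[0];
  // No ":" and no "&" (a possibly entity-encoded ":") means a relative URL.
  if (!/[:&]/.test(prefix)) return true;
  // Otherwise the value must start with an explicitly accepted scheme.
  // Assumption: http, https, and mailto are the schemes worth keeping.
  return /^(?:https?|mailto):/i.test(compact);
}
```

Because the raw value is not decoded, anything with an entity in the scheme position (for example `javascript&#58;alert(1)`) is rejected outright. That is deliberately conservative: it may refuse some harmless URLs, but it avoids having to implement entity decoding correctly.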
Edge Cases to Consider:
- Empty input string.
- Input with only text.
- Input with malformed HTML (e.g., unclosed tags).
- Attributes containing special characters or malformed values.
- Case sensitivity of tags and attributes.
Examples
Example 1:
Input: "<p>This is a <strong>safe</strong> paragraph with a <a href='https://www.example.com'>link</a>.</p><script>alert('XSS!');</script>"
Output: "<p>This is a <strong>safe</strong> paragraph with a <a href='https://www.example.com'>link</a>.</p>"
Explanation: The `<script>` tag and its content are removed because it's not in the allowlist. The `<strong>` tag, `<a>` tag, `href` attribute, and plain text content are preserved.
Example 2:
Input: "<div id='container' style='color: blue;' onclick='console.log(1)'>Hello <img src='valid.jpg' onerror='alert(2)'> World</div>"
Output: "<div>Hello <img src='valid.jpg'> World</div>"
Explanation: The `id` and `style` attributes on `div` are removed as they are not in the allowlist. The `onclick` attribute is removed as it's a potentially unsafe event handler. The `onerror` attribute on `img` is removed. The `src` attribute on `img` is kept.
Example 3:
Input: "<p>Text with &lt;b&gt;bold&lt;/b&gt; entities.</p> <img src='path/to/image.png'>"
Output: "<p>Text with &lt;b&gt;bold&lt;/b&gt; entities.</p> <img src='path/to/image.png'>"
Explanation: HTML entities like `&lt;` and `&gt;` are preserved. The `<img>` tag is allowed.
Constraints
- The sanitizer function must be implemented in plain JavaScript. No external libraries (e.g., DOMPurify) are allowed.
- The input HTML string can be up to 5000 characters long.
- The function should return a string.
- The solution should aim for reasonable performance, avoiding excessively complex or inefficient algorithms that would significantly slow down processing for typical input sizes.
Notes
- Consider how to parse the HTML. You might need to use regular expressions or a more robust parsing strategy if you want to handle malformed HTML perfectly, but for this challenge, focus on cleaning well-formed or mostly well-formed HTML.
- Define your allowlist of tags and attributes. A good starting point might be common tags like `p`, `strong`, `em`, `br`, `a`, `img`, `ul`, `ol`, `li`, `h1` to `h6`, and attributes like `href` for `a` and `src` for `img`.
- Think about how to handle attributes that might contain malicious code, especially `href` attributes starting with `javascript:`.
- Remember to consider case-insensitivity for HTML tags and attribute names.
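To make the notes above concrete, here is one possible shape for a regex-based solution. It is a sketch for mostly well-formed input, not a reference answer or a full HTML parser. Everything except the name `sanitizeHTML` is an assumption: the contents of `ALLOWED` (which adds `div` and `b` so the examples pass), the decision to drop `script` and `style` together with their content, the helper names, and the choice to emit single-quoted attributes to match the example outputs.

```javascript
// Assumed allowlist for this sketch: tag name -> permitted attribute names.
const ALLOWED = new Map([
  ["p", []], ["strong", []], ["em", []], ["b", []], ["br", []], ["div", []],
  ["ul", []], ["ol", []], ["li", []], ["a", ["href"]], ["img", ["src"]],
]);
// Assumption: these elements are removed together with their content.
const DROP_CONTENT = new Set(["script", "style"]);
const URL_ATTRS = new Set(["href", "src"]);

// Stray angle brackets are escaped; "&" is left alone so entities survive.
const escapeText = (s) => s.replace(/</g, "&lt;").replace(/>/g, "&gt;");
const escapeAttr = (s) => escapeText(s).replace(/'/g, "&#39;");

// Accept relative URLs and a few known schemes; reject everything else.
function isSafeUrl(value) {
  const compact = value.replace(/[\u0000-\u0020]/g, "");
  const prefix = compact.split(/[\/?]/, 1)[0];
  if (!/[:&]/.test(prefix)) return true;
  return /^(?:https?|mailto):/i.test(compact);
}

// Rebuild the attribute string from allowlisted, validated attributes only.
function cleanAttributes(tag, rawAttrs) {
  const ATTR = /([^\s"'<>\/=]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'<>]+)))?/g;
  const seen = new Set();
  let out = "";
  for (const m of rawAttrs.matchAll(ATTR)) {
    const name = m[1].toLowerCase();
    const value = m[2] ?? m[3] ?? m[4] ?? "";
    if (!ALLOWED.get(tag).includes(name) || seen.has(name)) continue;
    if (URL_ATTRS.has(name) && !isSafeUrl(value)) continue;
    seen.add(name);
    out += ` ${name}='${escapeAttr(value)}'`;
  }
  return out;
}

function sanitizeHTML(input) {
  // Matches a comment, or an opening/closing tag with its raw attribute text.
  const TAG = /<!--[\s\S]*?-->|<(\/?)([a-zA-Z][a-zA-Z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*)>/g;
  let out = "";
  let last = 0;
  let dropping = null; // name of the script/style element being skipped
  for (const m of input.matchAll(TAG)) {
    if (dropping === null) out += escapeText(input.slice(last, m.index));
    last = m.index + m[0].length;
    if (m[2] === undefined) continue; // comments are removed
    const tag = m[2].toLowerCase();
    const closing = m[1] === "/";
    if (dropping !== null) {
      if (closing && tag === dropping) dropping = null;
    } else if (DROP_CONTENT.has(tag)) {
      if (!closing) dropping = tag;
    } else if (ALLOWED.has(tag)) {
      out += closing ? `</${tag}>` : `<${tag}${cleanAttributes(tag, m[3])}>`;
    }
    // Any other tag is dropped, but the text around it is kept.
  }
  // Text after an unclosed script/style element is discarded (fails closed).
  if (dropping === null) out += escapeText(input.slice(last));
  return out;
}
```

The design choice worth noting is that the output is rebuilt from scratch: tags and attribute names are written from the allowlist, and values and text are re-escaped, rather than deleting bad pieces from the original string. Under these assumptions the sketch reproduces the outputs of Example 1 and Example 2, returns an empty string for empty input, and turns `<a href='javascript:alert(1)'>x</a>` into `<a>x</a>`. It does not attempt to repair unbalanced tags.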