JavaScript HTML Sanitizer Challenge

An HTML sanitizer is crucial for web security: it prevents malicious code injection, such as cross-site scripting (XSS), when users submit HTML content. This challenge asks you to build a robust JavaScript function that cleans user-provided HTML, allowing only a safe subset of tags and attributes.

Problem Description

Your task is to implement a JavaScript function named sanitizeHTML that takes a string containing HTML as input and returns a sanitized string. The sanitizer should remove any potentially harmful HTML elements and attributes while preserving a predefined set of safe tags and their attributes.

Key Requirements:

  1. Allowlist Approach: Define a strict allowlist of HTML tags that are permitted. Any tag not on this list should be removed.
  2. Attribute Filtering: For allowed tags, only specific attributes should be permitted. All other attributes must be removed.
  3. Attribute Value Sanitization: The values of allowed attributes may also need basic sanitization (e.g., rejecting javascript: URLs in href or src).
  4. Content Preservation: The text content within allowed tags should be preserved.
  5. Handle Nesting: The sanitizer should correctly handle nested HTML structures.
  6. Basic HTML Entities: Preserve common HTML entities like &amp;, &lt;, &gt;, &quot;, and &#39;.
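
One way to represent requirements 1 and 2 is a single lookup table from tag name to permitted attribute names. The sketch below is only an illustration: the names `ALLOWED_TAGS`, `isTagAllowed`, and `isAttrAllowed` are hypothetical, and the table contents are an assumption based on the tags used in this challenge's notes and examples.

```javascript
// Hypothetical allowlist: tag name -> set of permitted attribute names.
// The exact contents are an assumption; the challenge leaves them to you.
const ALLOWED_TAGS = new Map(
  Object.entries({
    p: [], strong: [], em: [], br: [], div: [],
    ul: [], ol: [], li: [],
    h1: [], h2: [], h3: [], h4: [], h5: [], h6: [],
    a: ['href'],
    img: ['src'],
  }).map(([tag, attrs]) => [tag, new Set(attrs)])
);

// Tag names are compared case-insensitively.
function isTagAllowed(tag) {
  return ALLOWED_TAGS.has(tag.toLowerCase());
}

// An attribute is allowed only if its tag is allowed and lists that attribute.
function isAttrAllowed(tag, attr) {
  const attrs = ALLOWED_TAGS.get(tag.toLowerCase());
  return attrs !== undefined && attrs.has(attr.toLowerCase());
}
```

Using a `Map` rather than a plain object avoids accidental hits on inherited properties, so a tag named `constructor` is not treated as allowed.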

Expected Behavior:

  • Tags not on the allowlist should be stripped. Their plain-text content is kept, except for elements such as `<script>`, whose content is removed along with the tag (see Example 1).
  • Disallowed attributes on allowed tags should be removed, leaving the tag itself in place.
  • Attributes with potentially dangerous values (like javascript: URLs in href or src) should be handled safely.
  • Valid HTML structure should be maintained as much as possible.
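
As a rough illustration of handling dangerous URL values, the sketch below accepts relative URLs and a small set of schemes and rejects everything else. The function name `isSafeUrl` and the scheme list are assumptions, not part of the challenge, and the function expects the raw attribute value exactly as it appears in the input.

```javascript
// Assumed set of acceptable schemes; extend it if your use case needs more.
const SAFE_SCHEMES = new Set(['http', 'https']);

function isSafeUrl(rawValue) {
  // URL parsers drop tabs and newlines and trim leading spaces, so
  // "java\nscript:" behaves like "javascript:". Strip them before checking.
  const cleaned = rawValue.replace(/[\u0000-\u0020]/g, '');
  // Only the part before the first "/", "?" or "#" can contain a scheme.
  const head = cleaned.split(/[\/?#]/, 1)[0];
  // An entity here could hide a scheme ("&#106;avascript:"), so be strict.
  if (head.includes('&')) return false;
  const colon = head.indexOf(':');
  if (colon === -1) return true; // no scheme: a relative URL such as "valid.jpg"
  return SAFE_SCHEMES.has(head.slice(0, colon).toLowerCase());
}
```

The check is deliberately conservative: it may reject some harmless values (for example a relative path whose first segment contains a colon), which is usually the safer trade-off for a sanitizer.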

Edge Cases to Consider:

  • Empty input string.
  • Input with only text.
  • Input with malformed HTML (e.g., unclosed tags).
  • Attributes containing special characters or malformed values.
  • Case sensitivity of tags and attributes.
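
Several of these edge cases show up when reading attributes out of a tag. The following sketch, with the hypothetical names `ATTRIBUTE_RE` and `parseAttributes`, lowercases attribute names and accepts double-quoted, single-quoted, unquoted, and valueless attributes. It is a simplification of what a real HTML parser does, not a replacement for one.

```javascript
// name, optionally followed by "=" and a "double", 'single', or bare value.
const ATTRIBUTE_RE =
  /([^\s"'<>\/=]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))?/g;

// `source` is the text between the tag name and the closing ">".
function parseAttributes(source) {
  const attrs = [];
  for (const m of source.matchAll(ATTRIBUTE_RE)) {
    // Exactly one of the three value groups matches; valueless attributes get "".
    const value = m[2] ?? m[3] ?? m[4] ?? '';
    attrs.push({ name: m[1].toLowerCase(), value });
  }
  return attrs;
}
```

For example, `parseAttributes(" ID='container' disabled")` yields an `id` entry with value `container` and a `disabled` entry with an empty value. Each parsed name can then be checked against the allowlist.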

Examples

Example 1:

Input: "<p>This is a <strong>safe</strong> paragraph with a <a href='https://www.example.com'>link</a>.</p><script>alert('XSS!');</script>"
Output: "<p>This is a <strong>safe</strong> paragraph with a <a href='https://www.example.com'>link</a>.</p>"
Explanation: The `<script>` tag and its content are removed because it's not in the allowlist. The `<p>`, `<strong>`, and `<a>` tags, the `href` attribute, and the plain text content are preserved.

Example 2:

Input: "<div id='container' style='color: blue;' onclick='console.log(1)'>Hello <img src='valid.jpg' onerror='alert(2)'> World</div>"
Output: "<div>Hello <img src='valid.jpg'> World</div>"
Explanation: The `id` and `style` attributes on `div` are removed as they are not in the allowlist. The `onclick` attribute is removed as it's a potentially unsafe event handler. The `onerror` attribute on `img` is removed. The `src` attribute on `img` is kept.

Example 3:

Input: "<p>Text with &lt;b&gt;bold&lt;/b&gt; entities.</p> <img src='path/to/image.png'>"
Output: "<p>Text with &lt;b&gt;bold&lt;/b&gt; entities.</p> <img src='path/to/image.png'>"
Explanation: HTML entities like `&lt;` and `&gt;` are preserved. The `<img>` tag is allowed.
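
The three examples above can be turned into a small self-check. The sketch below is one possible harness; `EXAMPLES` and `failingExamples` are hypothetical names, and the sanitizer under test is passed in as an argument so the harness does not depend on any particular implementation.

```javascript
// Inputs and expected outputs copied from Examples 1 to 3.
const EXAMPLES = [
  {
    name: 'Example 1',
    input: "<p>This is a <strong>safe</strong> paragraph with a <a href='https://www.example.com'>link</a>.</p><script>alert('XSS!');</script>",
    expected: "<p>This is a <strong>safe</strong> paragraph with a <a href='https://www.example.com'>link</a>.</p>",
  },
  {
    name: 'Example 2',
    input: "<div id='container' style='color: blue;' onclick='console.log(1)'>Hello <img src='valid.jpg' onerror='alert(2)'> World</div>",
    expected: "<div>Hello <img src='valid.jpg'> World</div>",
  },
  {
    name: 'Example 3',
    input: "<p>Text with &lt;b&gt;bold&lt;/b&gt; entities.</p> <img src='path/to/image.png'>",
    expected: "<p>Text with &lt;b&gt;bold&lt;/b&gt; entities.</p> <img src='path/to/image.png'>",
  },
];

// Returns the names of the examples for which `sanitize` gives the wrong output.
function failingExamples(sanitize) {
  return EXAMPLES
    .filter(({ input, expected }) => sanitize(input) !== expected)
    .map(({ name }) => name);
}
```

Calling `failingExamples(sanitizeHTML)` with your own implementation should return an empty array once all three examples pass. Note that the expected strings keep the single quotes of the input, so a sanitizer that re-serializes attributes with double quotes would need the comparison adjusted.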

Constraints

  • The sanitizer function must be implemented in plain JavaScript. No external libraries (e.g., DOMPurify) are allowed.
  • The input HTML string can be up to 5000 characters long.
  • The function should return a string.
  • The solution should aim for reasonable performance, avoiding excessively complex or inefficient algorithms that would significantly slow down processing for typical input sizes.

Notes

  • Consider how to parse the HTML. Regular expressions can work for well-formed or mostly well-formed input, while handling malformed HTML perfectly requires a more robust parsing strategy. For this challenge, focus on cleaning well-formed or mostly well-formed HTML.
  • Define your allowlist of tags and attributes. A good starting point might be common tags like p, strong, em, br, a, img, ul, ol, li, h1 to h6, and attributes like href for a and src for img.
  • Think about how to handle attributes that might contain malicious code, especially href attributes starting with javascript:.
  • Remember to consider case-insensitivity for HTML tags and attribute names.
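
As a starting point for the parsing step, the input can be split into text and tag tokens with a single regular expression. This is a minimal sketch under the assumption of mostly well-formed HTML; `TAG_RE` and `tokenize` are hypothetical names, and comments, doctypes, and unclosed tags are simply left in the text tokens.

```javascript
// "<", optional "/", a tag name, then attribute text in which quoted
// values may contain ">", then the closing ">".
const TAG_RE =
  /<(\/?)([a-zA-Z][a-zA-Z0-9]*)(?=[\s\/>])((?:"[^"]*"|'[^']*'|[^'">])*)>/g;

function tokenize(html) {
  const tokens = [];
  let last = 0; // index just past the previous tag
  for (const m of html.matchAll(TAG_RE)) {
    if (m.index > last) {
      tokens.push({ type: 'text', value: html.slice(last, m.index) });
    }
    tokens.push({
      type: m[1] ? 'close' : 'open',
      name: m[2].toLowerCase(), // tag names are case-insensitive
      attrs: m[3],              // raw attribute text, still unfiltered
    });
    last = m.index + m[0].length;
  }
  if (last < html.length) {
    tokens.push({ type: 'text', value: html.slice(last) });
  }
  return tokens;
}
```

A sanitizer built on this would walk the tokens, drop tags that are not on the allowlist, filter the attributes of the rest, and rebuild the output string. Text tokens can still contain a stray `<` (for instance from an unclosed tag), so escaping it as `&lt;` before output is a sensible precaution.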