Malicious code injection: Is it safe enough to remove script tags by regex?

I've set up a page where people can submit tutorials. The tutorials are written in a TinyMCE editor.

However, someone could abuse this by POSTing their own unescaped text and inserting a malicious <script>.

So my question is: would it be safe enough to remove <script> tags with a regular expression? I would run this regex on my backend, before storing the content.

I've found this expression, for example:

<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>

Answers


No. An attacker may be able to use multi-byte characters to bypass your regexp, or sneak past it with mismatched opening and closing tags, fake closing script tags, tags quoted inside attributes, and so on. Don't attempt to parse potentially noisy or malformed HTML with a regex; use an HTML parsing engine designed to deal with such input. See the famous answer on parsing HTML with regex here: RegEx match open tags except XHTML self-contained tags

If you're looking for one, I swear by this PHP library: http://simplehtmldom.sourceforge.net/ It first cleans the document by converting noise to entities, taking into account "script", "style", and "textarea" elements, where anything between the opening and closing tags is meant to be text, not HTML. It then parses the result into a DOM structure that you can traverse much as you would with the DOM methods in JavaScript. It also comes with a "save" method (which returns the string), so once you're done stripping tags you have your modified, well-formed document.

I have also tested the library with large inputs: a regexp I used previously failed on large documents because it hit PHP's memory limits, whereas this library parsed the same documents without memory issues. I've tested it quite thoroughly and used it on large projects, and it has never let me down the way the built-in PHP functions/classes have with malformed data.

Edit: Since I got a downvote, I suppose I should give an example of how to break it:

<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>

Just because the regex is used by jQuery doesn't make it safe for the server.

Even with the "gi" flags, the result is the same:

var str="<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
str=str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi,'');
// The "g" flag doesn't help here: after a match, the search continues past it,
// so the <script> tag formed by joining the leftover pieces is never rescanned.
alert(str); // shows: <script>alert('XSS!')</script>

If you apply the replacement in a loop until the string stops changing, rather than relying on the "g" flag alone, this particular case is handled.
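
The loop mentioned above could look roughly like the sketch below. It reuses the regex from the question; the helper name stripScriptsUntilStable is made up for illustration. It only shows that repeated passes catch the nested case, not that the regex becomes a safe sanitizer.

```javascript
// Regex from the question, with the "gi" flags.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper (not from the original answer): re-apply the
// replacement until a pass no longer changes the string.
function stripScriptsUntilStable(html) {
  let previous;
  let current = html;
  do {
    previous = current;
    current = previous.replace(SCRIPT_RE, '');
  } while (current !== previous);
  return current;
}

const nested = "<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";

// A single pass leaves a freshly assembled script tag behind.
const singlePass = nested.replace(SCRIPT_RE, '');
console.log(singlePass); // <script>alert('XSS!')</script>

// Repeated passes remove that leftover tag as well.
const looped = stripScriptsUntilStable(nested);
console.log(JSON.stringify(looped)); // ""
```

Even with the loop, this only addresses the one bypass shown here; other vectors are untouched.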

Edit 2: If the purpose is to sanitize user input against all JavaScript concerns, such as "onload" and "onclick" attributes, why reinvent the wheel? There's http://htmlpurifier.org/ (see the demo)
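
To make the point about event-handler attributes concrete, here is a small sketch. The two payloads are illustrative assumptions, not taken from the original post; the regex is the one from the question. Neither payload contains a script tag, so the regex leaves both untouched.

```javascript
// Regex from the question, with the "gi" flags.
const scriptTagPattern = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Illustrative payloads (assumed for this example): JavaScript that
// runs without any <script> element.
const payloads = [
  '<img src="x" onerror="alert(1)">',
  '<a href="javascript:alert(1)">click</a>',
];

// Keep the payloads that come out of the regex pass completely unchanged.
const survivors = payloads.filter(
  (payload) => payload.replace(scriptTagPattern, '') === payload
);

console.log(survivors.length); // 2
```

This is why a sanitizer with an allow-list of tags and attributes is a better fit than a pattern that targets one tag name.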


Instead of regex, why don't you use DOM for that?

$content = "<h1>title</h1><p> test <span>1<!-- regular comment --><script> my script</script></span><script> my script</script></p><script> my script</script> <!--[if IE]><script>alert('XSS');</script><![endif]-->";

// creates a DOMDocument from your string (without doctype, html, or other extra tags) and wraps it in a div
$dom = new DOMDocument();
$dom->loadHTML("<div>{$content}</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

//Removing any comments or conditional comments
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

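// function to remove any unwanted tag, recursively
function verifyNodes(DOMNode $node) {
    $removedTags = ['script', 'iframe']; // tags to remove

    // Copy the live child list into an array first: removing a node while
    // iterating $node->childNodes directly can cause a sibling to be skipped.
    foreach (iterator_to_array($node->childNodes) as $childNode)
    {
        if (in_array(strtolower($childNode->nodeName), $removedTags, true)) {
            $node->removeChild($childNode);
        } elseif ($childNode->hasChildNodes()) {
            verifyNodes($childNode);
        }
    }
}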

// calling verifyNodes
verifyNodes($dom);

// get all the content of my first div, and print it
$newContent = $dom->getElementsByTagName('div')->item(0);
foreach ($newContent->childNodes as $childNode) {
    var_dump($dom->saveHTML($childNode));
}

And just as I use nodeName to check the tag's name, we can also use nodeType if we want to remove other kinds of nodes (see PHP's list of XML node type constants).


If you can use an engine that supports atomic groups, this will probably work. It matches script tags in a way that comes closest to how a browser would parse them.

Find: (?><script(?:(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)|/)>)(?<=/>)|(?><script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?>)(?<!/>)[\S\s]*?</script\s*>

Replace: empty string


Formatted:

    # If script tags can be <script .... />
    (?>
         <
         script 
         (?:
              (?:
                   \s+ 
                   (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
              )
           |  / 
         )
         > 
    )
    (?<= /> )
 |  
    # Or, if script tags with content can be <script .... > ... </script>
    (?>
         <
         script 
         (?:
              \s+ 
              (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
         )?
         > 
    )
    (?<! /> )
    [\S\s]*? 
    </script \s* >
