04 Oct

You are probably misusing DOM text methods

This is a public service announcement.

element.textContent and document.createTextNode are dangerous DOM methods. They can easily lead to XSS vulnerabilities in your application.

We’ll find out how by looking at two common – but flawed – applications of these methods: stripping and escaping HTML.

Stripping tags with element.textContent

What does element.textContent do (element.innerText in old IE)? From MDN:

Gets or sets the text content of a node and its descendants.

Here’s an example:

var div = document.createElement('div');
div.innerHTML = 'Hello <a href="http://bob.com">Bob</a>!';
console.log(div.textContent);
// Hello Bob!

Many developers use this method to strip HTML from a string, as I’ve just done here.

But what most people don’t realize is that this will also convert HTML entities into regular Unicode characters:

var div = document.createElement('div');
div.innerHTML = 
  'Hello <a>&lt;script&gt;alert(&quot;!&quot;)&lt;/script&gt;</a>!';
console.log(div.textContent);
// Hello <script>alert("!")</script>!

My purpose here was to strip the <a> tag from this string. But in the process, I created wholly new tags. Possibly malicious ones.

If you ever use element.textContent to strip previously escaped user content, and then insert that seemingly HTML-safe content somewhere else, you have just introduced an XSS vulnerability (example).

Now if you’re smart, you probably recognize that this should be escaped a second time. But I suspect many people are unaware of this vulnerability and never add that second escape. I’ll be the first to admit it – I’ve made this mistake before. I suspect you or members of your team have too.

Here’s a modified helper function from Stack Overflow which escapes the final “stripped” output:

function strip(html) {
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return escapeHtml(tmp.textContent||tmp.innerText);
}

Note that this function is not guaranteed to preserve the input. By placing the content in a DOM element – even temporarily – the browser will attempt to be helpful by “fixing” incomplete entities. If you want this to be perfect, you’ll need a pure HTML parser.

I’ve left the escapeHtml implementation blank for now, but you can imagine what it does: convert dangerous characters used in script injection into harmless HTML entities.
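The round trip is easy to check outside a browser. Here’s a DOM-free sketch of it. Note that decodeBasicEntities and naiveStrip are hypothetical stand-ins for what the browser does when you read textContent (they only handle the few entities in this example, and a regex is no substitute for a real parser), and the escapeHtml here is a minimal placeholder, not a vetted implementation:

```javascript
// Hypothetical stand-in for the browser's entity decoding.
// Only covers the four named entities used in this example.
function decodeBasicEntities(text) {
  var map = { '&lt;': '<', '&gt;': '>', '&quot;': '"', '&amp;': '&' };
  return text.replace(/&(?:lt|gt|quot|amp);/g, function (entity) {
    return map[entity];
  });
}

// Hypothetical stand-in for reading textContent: drop tags, decode entities.
// A regex is not an HTML parser; this is for illustration only.
function naiveStrip(html) {
  return decodeBasicEntities(html.replace(/<[^>]*>/g, ''));
}

// Minimal placeholder escape (an assumption, not the final implementation).
function escapeHtml(str) {
  return String(str)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#039;');
}

var stored = 'Hello <a>&lt;script&gt;alert(&quot;!&quot;)&lt;/script&gt;</a>!';

var stripped = naiveStrip(stored);
console.log(stripped);
// Hello <script>alert("!")</script>!   (live markup again)

var safe = escapeHtml(stripped);
console.log(safe);
// Hello &lt;script&gt;alert(&quot;!&quot;)&lt;/script&gt;!
```

The stripped string is only safe to put back into HTML after that second escape.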

But wait a minute. Did you know DOM text nodes can also be used to escape HTML? Let’s find out more.

Escaping HTML with text nodes

We just looked at how accessing element.textContent returns the text content – without tags – of an element and its children.

But what if we do the opposite, and assign a string – with tags – to element.textContent?

var div = document.createElement('div');
div.textContent = '<span>Foo & bar</span>';
console.log(div.innerHTML);
// &lt;span&gt;Foo &amp; bar&lt;/span&gt;

Cool! By assigning HTML to element.textContent, it escaped the < > and & characters.

The same can be done by using document.createTextNode:

var div = document.createElement('div');
div.appendChild(document.createTextNode('<span>Foo & bar</span>'));
console.log(div.innerHTML);
// &lt;span&gt;Foo &amp; bar&lt;/span&gt;

It almost seems that the browser knows which characters are safe and which aren’t, which is why many developers have taken to using this technique to escape their user strings.

Here’s an example implementation pulled from the web:

function escapeHtml(str) {
    var div = document.createElement('div');
    div.appendChild(document.createTextNode(str));
    return div.innerHTML;
}

And here’s that function in action:

var username = '<img src="herp:/" onerror=alert("derp")>';
var profileLink = '<a href="/profile">' + escapeHtml(username) + '</a>';
var div = document.getElementById('target');
div.innerHTML = profileLink;
// <a href="/profile">&lt;img src="herp:/" onerror=alert("derp")&gt;</a>

In this situation, escapeHtml is effective. It prevents the malicious img tag from being inserted.

But if you pay close attention, you’ll notice that it didn’t escape every character. Both double quotes and single quotes were left untouched. And not escaping these characters can lead to XSS vulnerabilities when assigning a value to an attribute:

var userWebsite = '" onmouseover="alert(\'derp\')" "';
var profileLink = '<a href="' + escapeHtml(userWebsite) + '">Bob</a>';
var div = document.getElementById('target');
div.innerHTML = profileLink;
// <a href="" onmouseover="alert('derp')" "">Bob</a>

Congratulations – you’re fucked.
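You can reproduce this without a browser. In the sketch below, textNodeEscape is a hypothetical stand-in for the text-node trick; it assumes the browser’s serialization of a text node escapes only &, < and > (plus non-breaking spaces), which is the behaviour the examples above show:

```javascript
// Hypothetical stand-in for createTextNode + innerHTML.
// Assumption: text-node serialization escapes &, <, > and U+00A0, never quotes.
function textNodeEscape(str) {
  return String(str)
    .replace(/&/g, '&amp;')
    .replace(/\u00a0/g, '&nbsp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Between tags: the angle brackets are neutralized, so no <img> is created.
var username = '<img src="herp:/" onerror=alert("derp")>';
var asText = '<a href="/profile">' + textNodeEscape(username) + '</a>';
console.log(asText);
// <a href="/profile">&lt;img src="herp:/" onerror=alert("derp")&gt;</a>

// Inside an attribute: the quotes pass straight through and close href early.
var userWebsite = '" onmouseover="alert(\'derp\')" "';
var asAttribute = '<a href="' + textNodeEscape(userWebsite) + '">Bob</a>';
console.log(asAttribute);
// <a href="" onmouseover="alert('derp')" "">Bob</a>
```

Same function, same kind of input, two different contexts. Only one of them is safe.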

I took that implementation from an article called Foolproof HTML escaping in Javascript. In that article, the author refers to this technique as “the best way to escape HTML in JavaScript”. Unfortunately, it’s probably the worst.

He’s not alone, of course. Here’s a bunch of articles and Stack Overflow questions in which this technique is suggested, all by different authors.

Choice quote:

If you are using createTextNode, you are safe.

Use jQuery? The same vulnerable technique is also possible using .text(), and is endorsed in these jQuery solutions. A few Stack Overflow users mention the vulnerability, but their comments are buried.

There are even several JSPerf entries that compare this same technique against other, more tried-and-true methods of escaping HTML.

Now – there’s a counterargument here: if you know the context in which you are injecting the value (i.e. not attributes), then this method is safe. My argument is that developers don’t understand those contexts well enough, and sooner or later they will get it wrong.

If you want to escape HTML, use the tried and true method: string replacement. It’s boring, but it works (provided you always use quoted attributes and specify charsets, that is).

function escapeHtml(str) {
    return String(str)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#039;")
        .replace(/\//g, "&#x2F;");
}

This is a simple implementation. There are better ones out there.

NOTE: This also escapes forward slashes, which is recommended practice by OWASP. My understanding is that it is to handle cases where you are embedding user values inside a <script> tag.
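To see the difference, here’s a quick sketch that runs the earlier attack string through the string-replacement escapeHtml. It also shows why the “quoted attributes” caveat matters. The unquotedValue input is a hypothetical example, not taken from any real exploit:

```javascript
// Same string-replacement implementation as above, repeated so this runs on its own.
function escapeHtml(str) {
  return String(str)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#039;')
    .replace(/\//g, '&#x2F;');
}

// Quoted attribute: the quotes are escaped, so the value cannot break out of href.
var userWebsite = '" onmouseover="alert(\'derp\')" "';
var quoted = '<a href="' + escapeHtml(userWebsite) + '">Bob</a>';
console.log(quoted);
// <a href="&quot; onmouseover=&quot;alert(&#039;derp&#039;)&quot; &quot;">Bob</a>

// Unquoted attribute: a space ends the value, and no amount of
// entity escaping helps because nothing here needs escaping.
var unquotedValue = 'x onmouseover=alert(1)'; // hypothetical input
var unquoted = '<a href=' + escapeHtml(unquotedValue) + '>Bob</a>';
console.log(unquoted);
// <a href=x onmouseover=alert(1)>Bob</a>
```

The first link is harmless. The second still carries an onmouseover handler, which is why escaping only works alongside consistently quoted attributes.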

Fight the good fight

In summary, HTML is scary, and DOM text methods present some very easy ways to get bitten. Worst of all, there is plenty of literature on the web which endorses their use in potentially vulnerable situations.

So if you’ve made it this far, I’ve got two things to ask of you.

First: it probably goes without saying, but audit your codebase to see if you’re messing around with element.textContent or document.createTextNode. They might be indicative of an XSS vulnerability. It can’t hurt to read OWASP’s XSS cheat sheet either.

Second: if you come across these anti-patterns – on Twitter, on Stack Overflow, or on the web in general – please do your part and shout/vote them down. You’ll be doing the web a favour.

Good night, and good luck.