Have you ever tried to parse, process or preg_replace some HTML? Ever tried to do it when the HTML is UTF-8 encoded? Getting rid of white space can be tricky, so here are a few tricks I’ve learned.
I was playing around with debugging some HTML output from Symfony the other day. The HTML was dynamic content added through TinyMCE, and I needed to clean out superfluous formatting so that the output looked right. I hate regular expressions and this is the first time I’ve really needed to delve into them properly, so I hit a stumbling block when trying to get rid of &nbsp; and other bits of whitespace.
I tried various methods of decoding the content first, to no avail. It turns out the answer is to look for the encoded characters. I found it quite tricky to discover what the encoded values for the white space were and also to detect exactly what white space characters were present, but hopefully these two tips will help you.
UTF-8 representations of white space characters
The following are common white space characters. If you’re missing a few, try out File Format for finding character codes.
\x20 – The standard space (one of the characters matched by ‘\s’)
\xC2\xA0 – The non-breaking space ‘&nbsp;’
\x0D – Carriage Return or ‘\r’
\x0A – New Line or ‘\n’
\x09 – The tab or ‘\t’
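To show how the codes above fit together, here is a minimal sketch that collapses any run of them into a single normal space. The function name normalizeWhitespace is my own invention for illustration, and it assumes the input really is UTF-8, so treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical helper (not part of the original post): collapses every run
// of the white space characters listed above into one plain space.
function normalizeWhitespace(string $text): string
{
    // \xC2\xA0 is the two-byte UTF-8 non-breaking space; the character
    // class covers space, carriage return, new line and tab.
    $result = preg_replace('/(?:\xC2\xA0|[\x20\x0D\x0A\x09])+/', ' ', $text);

    // preg_replace returns null on failure, so fall back to the input.
    return trim($result ?? $text);
}

echo normalizeWhitespace("Hello\xC2\xA0\xC2\xA0world\r\n\tagain"); // Hello world again
```

The non-breaking space has to be matched as a two-byte sequence, which is why it sits outside the character class.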
Discovering which characters are present
Figuring out exactly which white space characters are present in your encoded HTML can be tricky. I used XVI32, which is a hex editor. If you view the source of the HTML you are trying to clean and copy & paste an offending section into the right-hand window, it will show the encoded characters in the left-hand window.
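If you would rather not leave PHP, a rough equivalent of the hex editor trick is to dump the bytes of the offending string yourself. This is only a sketch, and hexDump is a made-up name, but bin2hex and str_split are standard PHP functions.

```php
<?php
// Hypothetical helper: returns the bytes of a string as space-separated,
// two-digit, uppercase hex values so hidden characters become visible.
function hexDump(string $text): string
{
    return strtoupper(implode(' ', str_split(bin2hex($text), 2)));
}

// "a", a UTF-8 non-breaking space, "b", a tab, "c"
echo hexDump("a\xC2\xA0b\tc"); // 61 C2 A0 62 09 63
```

Any C2 A0 pair in the output is a non-breaking space, and 09, 0A and 0D are the tab, new line and carriage return from the list above.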

Hopefully this post might save you some time. If you have a similar problem that isn’t covered here, I’d be interested to know so drop me a comment!
Update: I thought it would be helpful to add an example of how I used these character codes. This simple function replaces all encoded non-breaking spaces with a normal space. It’s only a small snippet, but it might be of use ;).
function cleanNonBreakingSpaces($text)
{
    $text = preg_replace('/\xC2\xA0/', ' ', $text);
    return $text;
}
You could replace the space with an encoded “normal” space character (\x20), although I’m not sure how well that will work.
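Here is one way that idea might look. This is a guess at a workable version rather than a tested recipe from the original post, and the function name is hypothetical. The catch is the quoting: PHP only turns \x20 into a real space inside double quotes, while preg_replace leaves a single-quoted '\x20' replacement as four literal characters.

```php
<?php
// Hypothetical variant: the replacement is written as an encoded space.
function cleanWithEncodedSpace(string $text): string
{
    // Double quotes make PHP itself convert \x20 into a space byte
    // before preg_replace ever sees the replacement string.
    $result = preg_replace('/\xC2\xA0/', "\x20", $text);

    return $result ?? $text;
}

echo cleanWithEncodedSpace("a\xC2\xA0b"); // a b

// With single quotes the replacement is inserted literally: a\x20b
echo preg_replace('/\xC2\xA0/', '\x20', "a\xC2\xA0b");
```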
Or you could check for both encoded and non-encoded non-breaking spaces.
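A sketch of what that could look like, assuming “non-encoded” means the HTML entity forms &nbsp; and &#160; (that reading is my assumption, and the function name is made up for the example):

```php
<?php
// Hypothetical helper: replaces the UTF-8 non-breaking space bytes and the
// two common HTML entity spellings with a normal space.
function cleanAllNonBreakingSpaces(string $text): string
{
    $result = preg_replace('/\xC2\xA0|&nbsp;|&#160;/', ' ', $text);

    // preg_replace returns null on failure, so fall back to the input.
    return $result ?? $text;
}

echo cleanAllNonBreakingSpaces("a\xC2\xA0b&nbsp;c&#160;d"); // a b c d
```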
Resources
- File Format – An excellent site for looking up character codes.
- XVI Freeware Hex Editor – A great tool for discovering hidden characters.


14th Jul
Stanton says:
Thanks for taking the time to write this up, I tend to come across this problem every now and then and will be bookmarking it for future reference :)
15th Jul
Rob Mason says:
An article on preg_replace would be a useful one.
R ;)
19th Jul
ErisDS says:
@Rob I don’t think I’m exactly the authority to be giving tutorials on preg_replace. But I’ve added an example above of how to use it to clean out encoded non breaking spaces!
27th Jul
Daniel says:
Thanks for this info. This helped me get rid of annoying non-breaking spaces in WordPress output.
27th Jul
ErisDS says:
Awesome! Great to be able to help and always lovely to hear some feedback :D
31st Jul
Will says:
Hi,
Sometimes I see codes like %2F in the url. The / and : are changed.
Do you know if there is any tool that can convert
%2F to /
%3A to :
Thanks
26th Aug
ErisDS says:
Those are hex codes, so the tool in my post will convert these (if you remove the %).
Different languages have different functions for converting these as well if that is what you are asking?