Have you ever tried to parse, process or preg_replace some HTML? Ever tried to do it when the HTML is UTF-8 encoded? Getting rid of white space can be tricky, so here are a few tricks I’ve learned.
I was debugging some HTML output from Symfony the other day. The HTML was dynamic content added through TinyMCE, and I needed to clean out superfluous formatting so that the output looked right. I hate regular expressions and this is the first time I’ve really needed to delve into them properly, so I hit a stumbling block when trying to get rid of &nbsp; and other bits of white space.
I tried various methods of decoding the content first, to no avail. It turns out the answer is to look for the encoded characters. I found it quite tricky to discover what the encoded values for the white space were and also to detect exactly what white space characters were present, but hopefully these two tips will help you.
UTF-8 representations of white space characters
The following are common white space characters. If you’re missing a few, try out File Format for finding character codes.
\x20 – The standard space (one of the characters that ‘\s’ matches)
\xC2\xA0 – The non-breaking space ‘&nbsp;’
\x0D – Carriage Return or ‘\r’
\x0A – New Line or ‘\n’
\x09 – The tab or ‘\t’
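As a rough sketch of how the byte sequences listed above could be used together, they can all go into one pattern. The function name `normalise_whitespace` is made up for this example, and collapsing each run of white space to a single space is only an assumption about what "clean" means for your content.

```php
<?php
// Hypothetical helper: collapse runs of the white space listed above into one space.
function normalise_whitespace(string $text): string
{
    // No /u modifier, so PCRE matches raw bytes and \xC2\xA0 is the
    // two-byte UTF-8 form of the non-breaking space.
    $result = preg_replace('/(?:\xC2\xA0|[\x20\x0D\x0A\x09])+/', ' ', $text);

    // preg_replace returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

$dirty = "Hello\xC2\xA0\xC2\xA0world\r\n\tagain";
echo normalise_whitespace($dirty); // Hello world again
```

The pattern is deliberately left in single quotes: PHP passes the backslashes through untouched and the regex engine interprets the \x escapes itself.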
Discovering which characters are present
Figuring out exactly which white space characters are present in your encoded HTML can be tricky. I used XVI32, which is a hex editor. If you view the source of the HTML you are trying to clean, then copy and paste an offending section into the right-hand window, it will show the encoded characters in the left-hand window.
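If a hex editor isn’t to hand, PHP itself can show the bytes. This is only a sketch: `dump_bytes` is a name I’ve made up, and it assumes the offending fragment is already sitting in a string.

```php
<?php
// Hypothetical helper: show each byte of a string as two uppercase hex digits.
function dump_bytes(string $text): string
{
    // bin2hex gives one long lowercase hex string; split it into byte-sized pairs.
    return strtoupper(implode(' ', str_split(bin2hex($text), 2)));
}

echo dump_bytes("a\xC2\xA0b"); // 61 C2 A0 62
```

A C2 A0 pair in the output is the UTF-8 encoded non-breaking space, while a lone 20 is an ordinary space.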
Hopefully this post might save you some time. If you have a similar problem that isn’t covered here, I’d be interested to know so drop me a comment!
To replace the encoded non-breaking space with a normal space:
$text = preg_replace('/\xC2\xA0/', ' ', $text);
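You could also write the replacement as an encoded “normal” space character, but it has to go in double quotes. In single quotes PHP does not interpret the escape, so the literal text \x20 would be inserted instead of a space:
$text = preg_replace('/\xC2\xA0/', "\x20", $text);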
Or you could check for both encoded and non-encoded non-breaking spaces:
$text = preg_replace('/(\xC2\xA0|&nbsp;)/', ' ', $text);
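Here is a small sketch that exercises these variations on a made-up sample string, so you can see what each one does. The sample text and variable names are my own. It also checks how single and double quotes behave when the replacement is written as \x20.

```php
<?php
$text = "one\xC2\xA0two&nbsp;three";

// Match either the raw UTF-8 bytes or the literal entity text.
$clean = preg_replace('/\xC2\xA0|&nbsp;/', ' ', $text);
echo $clean, "\n"; // one two three

// Single quotes leave \x20 as four literal characters in the replacement;
// double quotes let PHP turn it into a real space first.
var_dump(preg_replace('/\xC2\xA0/', '\x20', "a\xC2\xA0b") === 'a\x20b'); // bool(true)
var_dump(preg_replace('/\xC2\xA0/', "\x20", "a\xC2\xA0b") === 'a b');    // bool(true)
```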