Musings of ErisDS

Have you ever tried to parse, process or preg_replace some HTML? Ever tried to do it when the HTML is UTF-8 encoded? Getting rid of white space can be tricky, so here are a few tricks I’ve learned.

I was playing around with debugging some HTML output from Symfony the other day. The HTML was dynamic content added through TinyMCE and I needed to clean out superfluous formatting so that the output looked right. I hate regular expressions and this is the first time I’ve really needed to delve into them properly, so I hit a stumbling block when trying to get rid of &nbsp; and other bits of white space.

I tried various methods of decoding the content first, to no avail. It turns out the answer is to look for the encoded characters. I found it quite tricky to discover what the encoded values for the white space were and also to detect exactly what white space characters were present, but hopefully these two tips will help you.
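To illustrate why decoding alone does not solve the problem, here is a minimal sketch. The sample string and variable names are made up for illustration and are not the code from the original debugging session. Decoding &nbsp; as UTF-8 produces the two bytes C2 A0 rather than an ordinary space, so a search for a plain space still finds nothing.

```php
<?php
// Hypothetical sample: markup containing an entity-encoded non-breaking space.
$html = 'Hello&nbsp;world';

// Decode HTML entities into UTF-8 characters.
$decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

// The entity becomes the two bytes C2 A0, not the single byte 20.
$hex = bin2hex($decoded);

// A search for an ordinary space finds nothing in the decoded string.
$hasPlainSpace = strpos($decoded, ' ') !== false;

echo $hex, "\n";
var_dump($hasPlainSpace);
```

The decoded text still needs a second step that targets the C2 A0 byte pair directly.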

UTF-8 representations of white space characters

The following are common white space characters. If you’re missing a few, try out File Format for finding character codes.

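\x20 – The standard space (one of the characters matched by ‘\s’ in a regular expression)
\xC2\xA0 – The non-breaking space ‘&nbsp;’ (U+00A0, which takes two bytes in UTF-8)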
\x0D – Carriage Return or ‘\r’
\x0A – New Line or ‘\n’
\x09 – The tab or ‘\t’
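The codes in the list above can be combined into a single pattern. The following is a sketch only: the helper name collapseWhiteSpace is hypothetical, and it assumes the input is UTF-8 and that collapsing every run of white space to one space is what you want.

```php
<?php
// Hypothetical helper: collapse any run of the characters listed above into one space.
function collapseWhiteSpace($text)
{
    // \xC2\xA0 is a two-byte sequence, so it sits outside the character class;
    // the single-byte characters (space, CR, LF, tab) go inside it.
    return preg_replace('/(?:\xC2\xA0|[\x20\x0D\x0A\x09])+/', ' ', $text);
}

echo collapseWhiteSpace("one\xC2\xA0\xC2\xA0two\r\n\tthree"), "\n"; // prints "one two three"
```

The pattern is in single quotes, so PHP passes the \x escapes through untouched and the regular expression engine interprets them as raw bytes.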

Discovering which characters are present

Figuring out exactly what white space characters are present in your encoded HTML can be tricky. I used XVI32, which is a hex editor. If you view the source of the HTML you are trying to clean and copy and paste an offending section into the right-hand window, it will show the encoded characters in the left-hand window.
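If a hex editor is not to hand, PHP itself can show the bytes. This is a minimal sketch using the built-in bin2hex() and str_split() functions; the helper name hexBytes is hypothetical.

```php
<?php
// Hypothetical helper: show each byte of a string as two hex digits.
function hexBytes($text)
{
    // bin2hex() returns one long hex string; split it into byte-sized pairs.
    return strtoupper(implode(' ', str_split(bin2hex($text), 2)));
}

echo hexBytes("a\xC2\xA0b\tc"), "\n"; // prints "61 C2 A0 62 09 63"
```

In the output, C2 A0 is the non-breaking space and 09 is the tab, matching the codes listed earlier.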

Hopefully this post might save you some time. If you have a similar problem that isn’t covered here, I’d be interested to know so drop me a comment!

Update: I thought it would be helpful to add an example of how I used these character codes. This simple function replaces all encoded non-breaking spaces with a normal space. It’s only a small snippet, but it might be of use ;).

function cleanNonBreakingSpaces($text)
{
  // \xC2\xA0 is the UTF-8 byte sequence for a non-breaking space (U+00A0)
  $text = preg_replace('/\xC2\xA0/', ' ', $text);

  return $text;
}

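You could also write the replacement as an encoded “normal” space character. Note that the replacement has to be in double quotes so that PHP turns \x20 into a real space; in single quotes the literal text \x20 would be inserted instead:

$text = preg_replace('/\xC2\xA0/', "\x20", $text);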

Or you could check for both the encoded bytes and the &nbsp; HTML entity:

$text = preg_replace('/(\xC2\xA0|&nbsp;)/', ' ', $text);
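As an alternative to matching raw bytes, PCRE's UTF-8 mode lets the pattern name the code point directly. This is a sketch of a different technique from the one used in the post. The function name cleanNonBreakingSpacesUnicode is hypothetical, and falling back to the untouched input on error is an assumption you may want to change.

```php
<?php
// Hypothetical variant of cleanNonBreakingSpaces() using PCRE's UTF-8 mode.
function cleanNonBreakingSpacesUnicode($text)
{
    // With the /u modifier, \x{00A0} matches the code point U+00A0 itself.
    $result = preg_replace('/\x{00A0}|&nbsp;/u', ' ', $text);

    // preg_replace() returns null on error, for example when $text is not
    // valid UTF-8; fall back to the untouched input in that case.
    return $result === null ? $text : $result;
}

echo cleanNonBreakingSpacesUnicode("a\xC2\xA0b&nbsp;c"), "\n"; // prints "a b c"
```

The trade-off is that the /u modifier requires the whole input to be valid UTF-8, whereas the byte-based pattern works on any string.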


Comments

12 Comments to "Getting Rid of Non Breaking Spaces (&nbsp)"
  1. 14th Jul

    Stanton says:

    Thanks for taking the time to write this up, I tend to come across this problem every now and then and will be bookmarking it for future reference :)

  2. 15th Jul

    Rob Mason says:

    An article on preg_replace would be a useful one.

    R ;)

  3. 19th Jul

    ErisDS says:

    @Rob I don’t think I’m exactly the authority to be giving tutorials on preg_replace. But I’ve added an example above of how to use it to clean out encoded non breaking spaces!

  4. 27th Jul

    Daniel says:

    Thanks for this info. This helped me get rid of annoying non-breaking spaces in WordPress output.

  5. 27th Jul

    ErisDS says:

    Awesome! Great to be able to help and always lovely to hear some feedback :D

  6. 31st Jul

    Will says:

    Hi,

    Sometimes I see codes like %2F in the url. The / and : are changed.

    Do you know if there is any tool that can convert
    %2F to /
    %3A to :

    Thanks

  7. 26th Aug

    ErisDS says:

    Those are hex codes, so the tool in my post will convert these (if you remove the %).
    Different languages have different functions for converting these as well if that is what you are asking?

  8. 8th Apr

    Shailesh says:

    Thank you for this very well written article. It helped me find the solution immediately. BTW, if you just remove \xA0 in UTF-8 content, and leave behind the prefix of \xC2, then when you try to insert this data into MySQL using a PHP script, the data will be truncated silently in the table. This happens when the table/database/connection is set up to receive UTF-8 data.

  9. 7th Mar

    Ciki says:

    Man, you helped me a lot! I spent about an hour to find this solution. Thank you very much!

  10. 21st Apr

    Nikolay says:

    Thanks a lot!
    This \xC2\xA0 – The non-breaking space ‘&nbsp;’ saved my day.
    Will remember to try in future the Hex editor in my Ubuntu :-)

  11. 18th Aug

    Pat says:

    I’m glad to know that someone is trying to solve this problem. Unfortunately for this poor user, everything you wrote is like an alien language to me. :(

  12. 28th Nov

    Kevin says:

    Your function totally sorted out my problem. Thanks so much!
