DiggingIntoWordPress

by Chris Coyier & Jeff Starr

Clean Up Weird Characters in Database

Posted by on

It’s been a crazy month, with lots of drama all over the place. Here at DigWP.com, we had an episode where the site was all screwed up and not loading or only partially loading, blank white pages, and the whole bit. During the process of keeping it together and trying to restore full functionality, numerous database imports and exports were performed under a variety of circumstance. During the rush, apparently the most recent database backup file was somehow uncompressed outside of MySQL before final import. Several days later, that decompression/unzipping basically converted every quotation mark, em dash, en dash, ellipses and other special characters into some really ugly-looking codes.

[ Weird Characters ]

What are they?

I think what happened is that the restoration database that we ended up using had been opened in a file/text editor. It’s just a guess, and sort of irrelevant, but the text editor converted our UTF-8 characters into some other character set, like ISO-8859-1. So after restoration, we ended up with hundreds of these weird characters in the database – quotes, hyphens, dashes, and ellipses were all converted to Klingon:

“ = left quote = “
” = right quote = ”

‘ = left single quote = ‘
’ = right single quote = ’

— = en dash = –
– = em dash = —

• = hyphen = -
… = ellipsis = …

Identifying most of these characters was relatively painless, but the en-dash and em-dash characters may be reversed (i.e., – = em dash, and — = en dash). Testing the other character replacements in the database was easy, but discerning between instances of em & en dashes proved futile. So do your own testing and make good backups before making any mass changes. Hopefully someone can help us out with more of the specifics.

Clean ‘em up

Before making any changes to your database, make sure you have a good backup (or three). Then to clean up these weird characters from the WordPress database, use a program like phpMyAdmin to execute the following queries.

Clean up post_content

UPDATE wp_posts SET post_content = REPLACE(post_content, '“', '“');
UPDATE wp_posts SET post_content = REPLACE(post_content, '”', '”');
UPDATE wp_posts SET post_content = REPLACE(post_content, '’', '’');
UPDATE wp_posts SET post_content = REPLACE(post_content, '‘', '‘');
UPDATE wp_posts SET post_content = REPLACE(post_content, '—', '–');
UPDATE wp_posts SET post_content = REPLACE(post_content, '–', '—');
UPDATE wp_posts SET post_content = REPLACE(post_content, '•', '-');
UPDATE wp_posts SET post_content = REPLACE(post_content, '…', '…');

Clean up comment_content

UPDATE wp_comments SET comment_content = REPLACE(comment_content, '“', '“');
UPDATE wp_comments SET comment_content = REPLACE(comment_content, '”', '”');
UPDATE wp_comments SET comment_content = REPLACE(comment_content, '’', '’');
UPDATE wp_comments SET comment_content = REPLACE(comment_content, '‘', '‘');
UPDATE wp_comments SET comment_content = REPLACE(comment_content, '—', '–');
UPDATE wp_comments SET comment_content = REPLACE(comment_content, '–', '—');
UPDATE wp_comments SET comment_content = REPLACE(comment_content, '•', '-');
UPDATE wp_comments SET comment_content = REPLACE(comment_content, '…', '…');

Other tables

While cleaning up the DigWP database, several other weird characters also showed up in various places, but they were very few in number. I also noticed several instances of converted quotes, dashes, and hyphens scattered around in some other tables, mostly in the options table, buried deep within temporary rss_ data. So I didn’t bother with anything beyond the post_content and comment_content tables, but easily could have done so by modifying the previous queries like so:

UPDATE [table_name] SET [col_name] = REPLACE([col_name], '…', '…');

Just replace [table_name] with whatever table you want to clean up, col_name with the column name, and then replicate or edit the query with the proper character replacements.

Lesson learned

Take-home message: don’t open your database in a text editor. But if you do, execute these SQL queries for easy clean-up.

20 Responses

  1. See the plugin or standalone program Adminer for this job, its very easy and inside the backend of WP, and also very fast and light. Also you can use a plugin liek Search & Replace for this jop – maybe its solutions a little bit easier for many users.

  2. This happened to me fairly recently when I moved my website. I wish I would have had this information (or realized what had happened) then, but I’m glad I found it now for future reference. Thanks for putting it together!

  3. This is awesome… I have 2-3 sites where the characters seem to get mixed up on a regular basis. This will be an easy fix & hopefully it will be a one-timer for mine.

  4. Would it not be better to just chuck entities in there?

    eg: REPLACE(post_content, '…', '…')

    • Yes, this is an excellent idea. I actually did replace the em/en dashes with HTML equivalents during the procedure. Not everyone uses them, so I went with UTF-8 entities for the lot of ‘em.

  5. Dang, my entity echo’d out.

    eg: REPLACE(post_content, '…', '& hellip ;')

  6. if you’re running mysqldump from the command line, make sure you use the -r switch rather than redirecting output to a file.

  7. That would make a great little plugin… justsayin

  8. There’s a great plugin called Search Replace that can accomplish the same task for those who are hesitant to mess with the database directly.

  9. lawless July 25, 2011

    I ran into the same problem with a migration from mySQL 4 to mySQL 5.
    For some reason all the special characters got messed up in the migration, likely for the same character encoding issue.

    This is a useful SQL snippet to keep around for future reference though.

  10. A small little typo at the last bit there :-

    UPDATE [table_name] SET [table_name] = REPLACE([table_name], '…', '…');

    Should be

    UPDATE [table_name] SET [col_name] = REPLACE([col_name], '…', '…');

  11. Stephanie July 28, 2011

    What an odd situation to have. I’ve tried to Search Replace plugin and that does a lot of good. I used it to replace a bunch of links that were either updated or removed. If you know a way to prevent weird characters from messing up the database, I look forward to it. I’m sure it’ll be something with html entities and converting them beforehand.

  12. Hi
    I (actually helping a friend with this problem) have a similar issue but it is bad character encoding with a plugin/text widget. This can be seen online @exploreinternetmarketingonline

    How it should look like can be seen atonlinewpthemetest

    I am grateful for any advice on how to fix this please! Thanks

  13. Fantastic stuff, very useful.

    Just two small tips/etc:

    1) This code WILL NOT work if pasted into a MySQL terminal session via PuTTY etc – it works perfectly in phpMyAdmin though.
    2) I found a lot in the titles of posts too – so I did the same set of queries for post_title as well.

  14. This is extremely handy. I’ve been procrastinating fixing all the old posts that were messed up when our old hosting provider rev’d MySQL without warning.

    Thanks for this – great time saver!

  15. Mukesh Kumar September 8, 2011

    Â replace with ?????????

Comments are closed. Contact us with any critical information. Thank you!

Code is poetry