HOWTO: Removing Duplicate Posts in WordPress

We've been using a spare WordPress installation (via Postie) to trac job comments that aren't in Trac, and a Lilina installation to correlate and aggregate all the RSS feeds together. The mashed tools together work really, really well - Mummy, is this Web2.0? - but a misconfigured mail filter resulted in about 400 duplicate posts on the WordPress install, which didn't help.

So I needed to remove the duplicate posts from that WordPress install - they all had identical content and titles, but different timestamps and IDs (of course). Here's a quick HOWTO for anyone else who has aggregated or automated themselves into having lots of duplicate WordPress posts.

MySQL allows you to create a table from a SELECT query, and is polite enough to maintain the table structure. So, assuming your duplicate information is like mine, a query that will produce a table with a single instance of each unique post_title and post_content will look something like this:

CREATE TABLE wp_posts_NODUPLICATES
SELECT DISTINCT
MIN(ID) as ID, post_author, min(post_date) as post_date,
MIN(post_date_gmt) as post_date_gmt,
post_content, post_title, post_category, post_excerpt,
post_status, comment_status, ping_status,
post_password, post_name, to_ping, pinged,
MIN(post_modified) as post_modified, MIN(post_modified_gmt) as post_modified_gmt,
post_content_filtered, post_parent, guid, menu_order
FROM wp_posts
GROUP BY post_content, post_title ;

Once that's done, you should have a table with all of your posts, but only one of each. If your duplication differs, you'll need to tweak the above to suit your case ...
From there, you need to swap your new cleaned table in:

ALTER TABLE wp_posts RENAME wp_posts_DUPLICATES ;
ALTER TABLE wp_posts_NODUPLICATES RENAME wp_posts ;

Bingo!

Comments

Thanks!

Man, you saved me a TON of time, and really really appreciate it!

I apparently had 20k duplicate out of 52k posts.

additional columns

In my install, Wordpress 2.9.2 the wp_posts table has additional 3 columns:

| post_type | varchar(20) | NO | MUL | post | |
| post_mime_type | varchar(100) | NO | | | |
| comment_count | bigint(20) | NO | | 0 | |

How would you setup the CREATE statement now?

That depends.

If you want to retain those columns, you'd include them in the query. If you want to drop them because they are the only columns which differ, you'd exclude them or use a MIN() or MAX() or SUM() to correct them. So, I'd probably include those in their original position in the table in the query, unless they were the only differing columns (which is unlikely - in the example I had to work with, the differing columns were only auto increments and timestamps) - but if you somehow had comments on your dupe posts, one of those columns might be significant and you'd probably want to keep the MAX(comment_count)?

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <pre> <code> <br> <h2> <h3> <h4> <h5> <h6>
  • Lines and paragraphs break automatically.

More information about formatting options

CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.