A number of the sites I manage run a custom CMS written from the ground up. A problem I ran into recently for one of my larger sites with over 5,500 content items is how to check if a new item might be a duplicate before its added. The solution I came up with uses Mysql full text search to do a MATCH() on both the title and content body and provide me with a score so I can show only the top items it might be a duplicate with.
SELECT * FROM (SELECT content_id, title, content_text, STATUS , update_date, MATCH (title, content_text)AGAINST ('a string match will be checking for in the db' IN NATURAL LANGUAGE MODE) AS score FROM content) AS t ORDER BY t.score DESC LIMIT 5
The double select was needed because of the way match works. The main thing to understand is doing it this way allows you do this in a single sql statement.
The score changes every time, so you can't say where score is > 10. The best solution I found was to order the results in descending order by score and only show the top 5 results.
This is probably a bit of a hack but so far its working for my needs. The other option would have been pretty complex and created a lot of server load. Essentially, I would have had to loop through every content item using php and do a statistical analysis of each item to the item being compared to determine how different they are. With some research I found that it may be possible to do this in a cleaner way using postgresql but this site was created in mysql over 10 years ago so switching databases really was not an option right now.