 A number of the sites I manage run a custom CMS written from the ground up.  A problem I ran into recently for one of my larger sites with over 5,500 content items is how to check if a new item might be a duplicate before its added. The solution I came up with uses Mysql full text search to do a MATCH() on both the title and content body and provide me with a score so I can show only the top items it might be a duplicate with.
A number of the sites I manage run a custom CMS written from the ground up.  A problem I ran into recently for one of my larger sites with over 5,500 content items is how to check if a new item might be a duplicate before its added. The solution I came up with uses Mysql full text search to do a MATCH() on both the title and content body and provide me with a score so I can show only the top items it might be a duplicate with.
SELECT * FROM (SELECT content_id, title, content_text, STATUS , update_date, MATCH (title, content_text)AGAINST ('a string match will be checking for in the db' IN NATURAL LANGUAGE MODE) AS score FROM content) AS t ORDER BY t.score DESC LIMIT 5The double select was needed because of the way match works. The main thing to understand is doing it this way allows you do this in a single sql statement.
The score changes every time, so you can't say where score is > 10. The best solution I found was to order the results in descending order by score and only show the top 5 results.
This is probably a bit of a hack but so far its working for my needs. The other option would have been pretty complex and created a lot of server load. Essentially, I would have had to loop through every content item using php and do a statistical analysis of each item to the item being compared to determine how different they are. With some research I found that it may be possible to do this in a cleaner way using postgresql but this site was created in mysql over 10 years ago so switching databases really was not an option right now.