This is a common problem I run into when scraping websites for content, or even when dealing with user-submitted text. I ran into problems with strip_tags() not removing everything, along with general formatting errors. As a result, I came up with this function to assist in the process.
function clean_text($str) {
    $search = array(
        //'/[^a-z0-9s-/:#=.?!,;&()]/i', // Remove special characters
        '/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/', // Remove extra blank lines
        '/(?: |rsquo| )+/', // All references
        '@<script[^>]*?>.*?</script>@si', // Strip out javascript
        '@
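To illustrate the strip_tags() issue mentioned at the top, here is a minimal sketch. The sample input is made up, and the script-stripping pattern is a commonly used one that I am assuming for illustration, not necessarily the exact pattern in the function above. It shows that strip_tags() removes the tags but leaves the JavaScript source behind as text, unless script blocks are removed first.

```php
<?php
// Hypothetical sample input, not taken from a real page.
$html = "<p>Hello</p><script>alert('x');</script><p>World</p>";

// strip_tags() drops the tags themselves but keeps the text
// that sat between <script> and </script>.
$naive = strip_tags($html);

// Remove whole script blocks first, then strip the remaining tags.
// The pattern is an assumption for this sketch.
$withoutScripts = preg_replace('@<script[^>]*?>.*?</script>@si', '', $html);
$cleaned = strip_tags($withoutScripts);

echo $naive, "\n";   // still contains the alert(...) call as plain text
echo $cleaned, "\n"; // HelloWorld
```

Running the regex pass before strip_tags() matters here, since once the tags are gone there is nothing left to mark where the script began and ended.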
You will notice I am not using the remove-special-characters line; that's because it has caused me problems in the past. I also convert special characters common in many languages to their Latin equivalents. This prevents parsing errors later, and since my sites are English only, it was the best solution for my needs.
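The conversion to Latin characters could look something like the sketch below. The helper name to_latin() and the character map are hypothetical and far from complete; they only show the idea, and they assume the source file and the input are both UTF-8.

```php
<?php
// Hypothetical helper: the name and the map are illustrative only.
// Assumes this file and the input string are UTF-8 encoded.
function to_latin($str) {
    $map = array(
        'é' => 'e', 'è' => 'e', 'û' => 'u', 'ü' => 'u',
        'ñ' => 'n', 'ç' => 'c', 'ö' => 'o', 'ß' => 'ss',
    );
    // strtr() with an array replaces each key with its value in one pass.
    return strtr($str, $map);
}

echo to_latin('Crème brûlée für señor'), "\n"; // Creme brulee fur senor
```

A lookup table like this is easy to extend when a new character shows up in scraped content. Characters that are not in the map pass through unchanged.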