How To Remove “Bad” Html Tags From Text Using C#
September 4th, 2007 | by programming
The following method removes unwanted HTML tags from text. It is useful for stripping most malicious markup from form input.
using System.Text.RegularExpressions;

// removeHTMLTags removes *most* malicious code, leaving allowed
// HTML in place
public static string removeHTMLTags(string input)
{
    string output = "";

    // break the comments so someone cannot add an open comment
    input = input.Replace("<!--", "");

    // strip out the doctype declaration
    // ([^>]* consumes everything up to the closing bracket)
    Regex docType = new Regex("<!DOCTYPE[^>]*>", RegexOptions.IgnoreCase);
    output = docType.Replace(input, "");

    // add target="_blank" to hrefs and remove parts that are
    // not supported
    // NOTE: "(.*)" defines only one capture group, so "$5" does not
    // refer to a capture and is emitted literally; the anchor pattern
    // appears incomplete as shown here
    output = Regex.Replace(output, "(.*)", @"$5");

    // strip out most known tags except (a|b|br|blockquote|em|h1|h2|
    // h3|h4|h5|h6|hr|i|li|ol|p|u|ul|strong|sub|sup)
    // \b stops short names such as "s" from matching allowed tags
    // such as <strong>
    Regex badTags = new Regex(
        "</?(abbr|acronym|address|applet|area|base|basefont|bdo|big|body"
        + "|button|caption|center|cite|code|col|colgroup|dd|del|dir|div"
        + "|dfn|dl|dt|embed|fieldset|font|form|frame|frameset|head|html"
        + "|iframe|img|input|ins|isindex|kbd|label|legend|link|map|menu"
        + "|meta|noframes|noscript|object|optgroup|option|param|pre|q|s"
        + "|samp|script|select|small|span|strike|style|table|tbody|td"
        + "|textarea|tfoot|th|thead|title|tr|tt|var|xmp"
        + @")\b[^>]*>",
        RegexOptions.IgnoreCase);
    return badTags.Replace(output, "");
}
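The anchor step in the listing is meant to add target="_blank" to links and drop unsupported attributes. The following is a minimal sketch of one way such a step might be written; it is not the original pattern. The class name AnchorSanitizer, the method RewriteAnchors, both regular expressions, and the choice to keep only http/https links are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class AnchorSanitizer
{
    // Matches an opening <a ...> tag and captures its raw attribute text.
    private static readonly Regex OpenAnchor =
        new Regex(@"<a\b([^>]*)>", RegexOptions.IgnoreCase);

    // Finds a quoted href value inside the captured attribute text.
    private static readonly Regex Href =
        new Regex(@"\bhref\s*=\s*([""'])(.*?)\1",
                  RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string RewriteAnchors(string input)
    {
        return OpenAnchor.Replace(input, tag =>
        {
            Match href = Href.Match(tag.Groups[1].Value);
            string url = href.Success ? href.Groups[2].Value.Trim() : "";

            // Assumption: only absolute http/https links are kept, so
            // schemes such as javascript: lose their href entirely.
            bool allowed =
                url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
            if (!allowed)
            {
                return "<a>";
            }

            // Rebuild the tag from scratch so every other attribute
            // (onclick, style, ...) is discarded.
            return "<a href=\"" + url.Replace("\"", "&quot;")
                 + "\" target=\"_blank\">";
        });
    }
}
```

For example, AnchorSanitizer.RewriteAnchors("<a href='http://example.com' onclick='evil()'>link</a>") returns <a href="http://example.com" target="_blank">link</a>. Rebuilding the tag from an allowed URL, rather than trying to delete known-bad attributes, means attributes the sketch has never heard of are dropped by default.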
By Scott Elkin on Aug 4, 2008
This code has a pretty severe bug in it. This line is pretty gross and makes no sense:
output = Regex.Replace(output, "(.*)", @"$5");