
2017/7/30 17:32:15, Author: 黄兵

Thoroughly Repairing Malformed HTML to Keep Page Layouts from Breaking

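In an earlier post, "A simple fix for page layouts broken by malformed user-submitted HTML" (对于用户上传不规划Html而导致页面布局错乱的一简单解决方法), I described a way to fix broken layouts with regular expressions. That method only covers some cases and cannot cope with more unusual HTML. I had long wanted to improve it, and today I finally sat down and rethought the solution.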

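Broken HTML layouts come down to incomplete tags. In real applications this usually happens when users paste content from other web pages into an online editor or edit the HTML by hand, and occasionally someone does it maliciously. So this time I repair the malformed HTML with something closer to syntax analysis. Enough talk, here is the code: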

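using System.Collections.Generic;
using System.Text;

public static class FormatHelper
{
    // Tags that never take an end tag. The original post does not show this
    // field, so the list below is an assumption. Names are upper-case because
    // _getTagName returns upper-case names.
    private static readonly HashSet<string> noEndTags = new HashSet<string>
    {
        "AREA", "BASE", "BR", "COL", "EMBED", "HR", "IMG", "INPUT", "LINK", "META", "PARAM"
    };

    public static string RepairHTML(string htmlStr)
    {
        StringBuilder sbReturn = new StringBuilder();
        int subStart = 0;               // index where the open element's inner HTML starts
        string tagName = string.Empty;  // upper-case name of the element waiting for its end tag
        bool isRunning = false;         // true while inside a tag or inside an open element
        bool isSearchEndTag = false;    // true while reading a tag nested in the open element
        int subElementCount = 0;        // depth of nested elements that share the open element's name
        StringBuilder tag = new StringBuilder();

        for (int i = 0; i < htmlStr.Length; i++)
        {
            if (isRunning || htmlStr[i].Equals('<'))
            {
                if (!isRunning)
                {
                    // Start of a tag.
                    isRunning = true;
                    continue;
                }

                if (!string.IsNullOrEmpty(tagName))
                {
                    // An element is open: scan ahead for its matching end tag.
                    if (isSearchEndTag && !htmlStr[i].Equals('>'))
                    {
                        tag.Append(htmlStr[i]);
                        continue;
                    }

                    if (!isSearchEndTag && htmlStr[i].Equals('<'))
                    {
                        isSearchEndTag = true;
                        continue;
                    }

                    if (_getTagName(tag) == tagName)
                    {
                        // The open element contains a child element with the same name.
                        subElementCount++;
                    }

                    if (tag.ToString().StartsWith("/") && tagName == _getEndTagName(tag))
                    {
                        subElementCount--;
                        if (subElementCount < 0)
                        {
                            // Matching end tag found: repair the inner HTML recursively,
                            // then emit the end tag as written.
                            string innerHtml = htmlStr.Substring(subStart, i - (tag.Length + 1) - subStart);
                            if (!string.IsNullOrEmpty(innerHtml))
                                sbReturn.Append(FormatHelper.RepairHTML(innerHtml));

                            sbReturn.Append("<" + tag.ToString() + ">");
                            tagName = string.Empty;
                            isRunning = false;
                            subElementCount = 0;
                        }
                    }

                    // Clear the tag buffer on every path so the next tag starts clean.
                    tag = new StringBuilder();
                    isSearchEndTag = false;
                    continue;
                }

                if (!htmlStr[i].Equals('>'))
                {
                    tag.Append(htmlStr[i]);
                    continue;
                }

                if (string.IsNullOrEmpty(tagName))
                {
                    // Reached the '>' of a tag while no element is open.
                    subStart = i + 1;
                    tagName = _getTagName(tag);
                    if (tagName.StartsWith("/"))
                    {
                        // Drop an end tag that has no start tag.
                        tagName = string.Empty;
                        isRunning = false;
                    }
                    else if (noEndTags.Contains(tagName.TrimStart('/')))
                    {
                        // Emit tags without an end tag in self-closing form.
                        sbReturn.Append("<" + tag.ToString().TrimEnd('/') + "/>");
                        tagName = string.Empty;
                        isRunning = false;
                    }
                    else
                    {
                        sbReturn.Append("<" + tag.ToString() + ">");
                    }

                    tag = new StringBuilder();
                }
            }
            else
                sbReturn.Append(htmlStr[i]);
        }

        if (!string.IsNullOrEmpty(tagName))
        {
            // No end tag was found before the end of the text: add it automatically.
            sbReturn.Append(FormatHelper.RepairHTML(htmlStr.Substring(subStart)));
            sbReturn.Append("</" + tagName + ">");
        }

        return sbReturn.ToString();
    }

    /// <summary>
    /// (Internal) Gets the tag name of a start tag.
    /// </summary>
    /// <param name="tag">Tag text without the angle brackets.</param>
    /// <returns>The tag name in upper case.</returns>
    private static string _getTagName(StringBuilder tag)
    {
        return tag.ToString().Split(' ')[0].TrimEnd('/').ToUpper();
    }

    /// <summary>
    /// (Internal) Gets the tag name of an end tag.
    /// </summary>
    /// <param name="tag">Tag text without the angle brackets.</param>
    /// <returns>The tag name in upper case.</returns>
    private static string _getEndTagName(StringBuilder tag)
    {
        return tag.ToString().Split(' ')[0].TrimStart('/').ToUpper();
    }
}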
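
The two helper methods reduce the raw text of a tag to an upper-case name, which is also why end tags added automatically come out in upper case. The following is a minimal standalone sketch of that normalization. The class name TagNameDemo and the method names StartName and EndName are hypothetical; the two expressions are copied from _getTagName and _getEndTagName.

```csharp
using System;

public static class TagNameDemo
{
    // Same expression as _getTagName: first token, trailing '/' removed, upper-cased.
    public static string StartName(string tag)
    {
        return tag.Split(' ')[0].TrimEnd('/').ToUpper();
    }

    // Same expression as _getEndTagName: first token, leading '/' removed, upper-cased.
    public static string EndName(string tag)
    {
        return tag.Split(' ')[0].TrimStart('/').ToUpper();
    }

    public static void Main()
    {
        Console.WriteLine(StartName("div class=\"box\"")); // DIV
        Console.WriteLine(StartName("br/"));               // BR
        Console.WriteLine(StartName("/p"));                // /P (an end tag keeps its slash)
        Console.WriteLine(EndName("/p"));                  // P
    }
}
```

Because StartName leaves the leading slash in place, comparing its result with the open element's name only succeeds for start tags, while EndName is the one to use when looking for the matching end tag.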

Usage example:

string badHtml = "<div><p>hi</p></div><div><p>这是测试<table><tr><td>这里少了<p>table/tr/td</p>的结束标签<hr>少了div的结束标签";
string repairedHtml = FormatHelper.RepairHTML(badHtml);

The resulting repairedHtml is:

<div><p>hi</p></div><div><p>这是测试<table><tr><td>这里少了<p>table/tr/td</P></TD></TR></TABLE></p><hr/>少了div的结束标签</DIV>
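
A result like this can also be checked mechanically. The sketch below is not part of the original post: HtmlBalanceChecker and IsBalanced are hypothetical names, and the check uses a regular expression plus a stack rather than the RepairHTML logic. It assumes tags without an end tag are written in self-closing form (such as <hr/>) and compares tag names case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlBalanceChecker
{
    // Captures: 1 = "/" for an end tag, 2 = tag name, 3 = "/" for a self-closing tag.
    private static readonly Regex TagPattern =
        new Regex(@"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    // Returns true when every start tag is closed by a matching end tag in the right order.
    public static bool IsBalanced(string html)
    {
        Stack<string> open = new Stack<string>();
        foreach (Match m in TagPattern.Matches(html))
        {
            string name = m.Groups[2].Value;
            bool isEnd = m.Groups[1].Value == "/";
            bool isSelfClosing = m.Groups[3].Value == "/";

            if (isSelfClosing) continue;               // <hr/> opens nothing
            if (!isEnd) { open.Push(name); continue; } // start tag

            // An end tag must match the most recently opened element.
            if (open.Count == 0) return false;
            if (!string.Equals(open.Pop(), name, StringComparison.OrdinalIgnoreCase)) return false;
        }
        return open.Count == 0;
    }

    public static void Main()
    {
        string badHtml = "<div><p>hi</p></div><div><p>这是测试<table><tr><td>这里少了<p>table/tr/td</p>的结束标签<hr>少了div的结束标签";
        string reported = "<div><p>hi</p></div><div><p>这是测试<table><tr><td>这里少了<p>table/tr/td</P></TD></TR></TABLE></p><hr/>少了div的结束标签</DIV>";

        Console.WriteLine(IsBalanced(badHtml));  // False
        Console.WriteLine(IsBalanced(reported)); // True
    }
}
```

The checker only verifies that tags are balanced; it does not compare text content, so it complements reading the output by eye rather than replacing it.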

Performance analysis:

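RepairHTML analyzes the input string recursively. The number of recursive calls depends on the number of HTML tags in the input (including custom tags such as <mytag></mytag>). Assuming the input string has length n, the loop runs at most 1+2+3+…+n times, that is, n(n+1)/2, so the time complexity can be taken as O(n^2).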
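
Written out as a worked step, and assuming as the text does that the input has length n and that each recursion level rescans at most the characters remaining to it, the bound is the arithmetic series:

```latex
\[
\sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; \frac{n^{2}}{2} + \frac{n}{2} \;=\; O(n^{2})
\]
```

The quadratic term dominates, so the lower-order term does not affect the O(n^2) conclusion.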

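Note: I have only run a few simple tests and have not found any bugs so far, but that does not mean the code is guaranteed to work correctly. If you find a problem, please let me know; I will keep improving this small utility.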

(If you repost or use this, please credit the author XiaoG and link to the original: http://www.cnblogs.com/XiaoG/archive/2009/08/26/1554448.html)
