Beautiful Soup Selectors: CSS Selectors - 黄兵 (Huang Bing)'s Personal Blog


2017/9/4 10:08:44, Author: 黄兵 (Huang Bing)

Beautiful Soup Selectors: CSS Selectors

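BeautifulSoup supports most CSS selectors. Pass the selector as a string to the .select() method of a tag or of the soup object; the matching tags are returned as a list (an empty list if nothing matches).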

    tag.select("string")

    soup.select("string")
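
The call pattern can be sketched as follows. This is a minimal illustration, assuming the bs4 package is installed; it uses Python's built-in "html.parser" rather than lxml so that it needs no extra dependency, and the HTML snippet is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the sample document used in this article.
snippet = "<div><p class='a'>one</p><p>two</p></div>"
soup = BeautifulSoup(snippet, "html.parser")

# select() on the soup object searches the whole document
# and returns a list of matching Tag objects.
paragraphs = soup.select("p")
print(len(paragraphs))

# select() on a Tag searches only inside that tag.
div = soup.select("div")[0]
first = div.select("p.a")
print([p.get_text() for p in first])

# A selector that matches nothing gives an empty list, not None.
nothing = soup.select("span")
print(nothing)
```

Because the result is always a list, indexing with [0] on an empty result raises IndexError, so it is worth checking the length first when a match is not guaranteed.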

Sample source code:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title" name="dromouse">
            <b>The Dormouse's story</b>
        </p>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="mysis" href="http://example.com/elsie" id="link1">
                <b>the first b tag</b>
                Elsie
            </a>,
            <a class="mysis" href="http://example.com/lacie" id="link2" myname="kong">
                Lacie
            </a>and
            <a class="mysis" href="http://example.com/tillie" id="link3">
                Tillie
            </a>;and they lived at the bottom of a well.
        </p>
        <p class="story">
            myStory
            <a>the end a tag</a>
        </p>
        <a>the p tag sibling</a>
    </body>
</html>
"""

from bs4 import BeautifulSoup

# The 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(html, 'lxml')

1. Select by tag

        
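    # Select all title tags
    soup.select("title")

    # Select every p tag that is the third p among its siblings
    soup.select("p:nth-of-type(3)")
    # In this document the single match is the same tag as soup.select("p")[2]

    # Select all a tags anywhere under body
    soup.select("body a")

    # Select only the a tags that are direct children of body
    soup.select("body > a")

    # Select all later siblings of id=link1 that have class mysis
    soup.select("#link1 ~ .mysis")

    # Select the sibling immediately after id=link1, if it has class mysis
    soup.select("#link1 + .mysis")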
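
The difference between the combinators is easiest to see side by side. The following is a small sketch, assuming bs4 is installed and using the built-in "html.parser"; the document and the ids helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document: three links inside a p, one link directly under body.
doc = """
<body>
  <p id="p1">
    <a id="link1" class="mysis">Elsie</a>,
    <a id="link2" class="mysis">Lacie</a> and
    <a id="link3" class="mysis">Tillie</a>
  </p>
  <p id="p2">second paragraph</p>
  <a id="outer">direct child of body</a>
</body>
"""
soup = BeautifulSoup(doc, "html.parser")


def ids(tags):
    """Illustrative helper: list the id attribute of each tag."""
    return [t.get("id") for t in tags]


descendants = ids(soup.select("body a"))        # any depth under body
children = ids(soup.select("body > a"))         # parent must be body
later = ids(soup.select("#link1 ~ .mysis"))     # all following siblings
next_one = ids(soup.select("#link1 + .mysis"))  # only the adjacent sibling
second_p = ids(soup.select("p:nth-of-type(2)")) # second p among its siblings

print(descendants)  # ['link1', 'link2', 'link3', 'outer']
print(children)     # ['outer']
print(later)        # ['link2', 'link3']
print(next_one)     # ['link2']
print(second_p)     # ['p2']
```

Note that the text between the links (the comma and "and") does not count as a sibling for + and ~; only element nodes are considered.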

2. Select by class name

        
    # Select a tags whose class attribute includes mysis
    soup.select("a.mysis")

3. Select by id

        
    # Select the a tag whose id attribute is link1
    soup.select("a#link1")
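
Class and id selectors can be combined with a tag name or used on their own. A short sketch, assuming bs4 is installed and using the built-in "html.parser"; the snippet, including the extra class "big", is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the class "big" is not part of the article's sample.
doc = """
<p class="story">
  <a class="mysis big" id="link1">Elsie</a>
  <a class="mysis" id="link2">Lacie</a>
  <span class="mysis" id="note">not a link</span>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

# Tag name plus class: only a tags that carry the class mysis.
by_class = [t["id"] for t in soup.select("a.mysis")]

# Class alone: any tag with that class, whatever its name.
any_tag = [t["id"] for t in soup.select(".mysis")]

# Chained classes: the tag must carry both classes.
both = [t["id"] for t in soup.select("a.mysis.big")]

# An id selector still returns a list, here with a single element.
by_id = soup.select("a#link1")

print(by_class)             # ['link1', 'link2']
print(any_tag)              # ['link1', 'link2', 'note']
print(both)                 # ['link1']
print(by_id[0].get_text())  # Elsie
```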

4. Select by attribute (this also works for class)

        
    # Select all a tags that have a myname attribute
    soup.select("a[myname]")

    # Select all a tags whose href is exactly http://example.com/lacie
    soup.select("a[href='http://example.com/lacie']")

    # Select a tags whose href starts with http
    soup.select('a[href^="http"]')

    # Select a tags whose href ends with lacie
    soup.select('a[href$="lacie"]')

    # Select a tags whose href contains .com
    soup.select('a[href*=".com"]')
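
The five attribute forms can be compared on one small document. This is a sketch, assuming bs4 is installed and using the built-in "html.parser"; the ftp link and the link without href are made up so that each selector has something to exclude.

```python
from bs4 import BeautifulSoup

# Hypothetical links; link3 and link4 are not in the article's sample.
doc = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2" myname="kong">Lacie</a>
<a href="ftp://example.org/tillie" id="link3">Tillie</a>
<a id="link4">no href</a>
"""
soup = BeautifulSoup(doc, "html.parser")


def ids(selector):
    """Illustrative helper: ids of all tags matching the selector."""
    return [t["id"] for t in soup.select(selector)]


has_myname = ids("a[myname]")                        # attribute exists
exact = ids("a[href='http://example.com/lacie']")    # exact value
starts = ids('a[href^="http"]')                      # value prefix
ends = ids('a[href$="lacie"]')                       # value suffix
contains = ids('a[href*=".com"]')                    # value substring

print(has_myname)  # ['link2']
print(exact)       # ['link2']
print(starts)      # ['link1', 'link2']
print(ends)        # ['link2']
print(contains)    # ['link1', 'link2']
```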
                        
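    # Remove a tag from the HTML; afterwards soup no longer contains any script tags
    [s.extract() for s in soup('script')]

    # To remove several kinds of tag at once, pass a list of tag names
    [s.extract() for s in soup(['script', 'frame'])]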
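
The removal step can be checked on a tiny document. A minimal sketch, assuming bs4 is installed and using the built-in "html.parser"; the snippet and the choice of "style" as the second tag name are illustrative only. It uses a plain loop instead of a list comprehension, since the list of extracted tags is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two kinds of tag we want to drop.
doc = "<div><script>var x = 1;</script><p>text</p><style>p {}</style></div>"
soup = BeautifulSoup(doc, "html.parser")

# Calling soup(...) is shorthand for soup.find_all(...).
# extract() detaches each tag (and its contents) from the tree.
for tag in soup(["script", "style"]):
    tag.extract()

# find_all(True) matches every remaining tag.
remaining = [t.name for t in soup.find_all(True)]
print(remaining)        # ['div', 'p']
print(soup.get_text())  # text
```

extract() returns the removed tag, which is useful if it is needed later; when the tag is simply discarded, decompose() is an alternative that also destroys it.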

　　

5. tag.select

        
    # Get the text of the b tag inside the first a tag
    first_a = soup.select('a')[0]
    b_text = first_a.select('b')[0].get_text()
    print(b_text)
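
Indexing with [0] raises IndexError when the inner tag is missing. The sketch below shows the two-step lookup next to a single descendant selector, plus a guarded lookup with select_one(), which returns None instead of a list when nothing matches. It assumes bs4 is installed and uses the built-in "html.parser"; the snippet is a cut-down, hypothetical version of the sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: only the first link contains a b tag.
doc = """
<p>
  <a id="link1"><b>the first b tag</b> Elsie</a>
  <a id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

# Two-step form: pick a tag, then search inside it.
first_a = soup.select("a")[0]
chained = first_a.select("b")[0].get_text()

# One-step form: a descendant selector. It gives the same result here
# because the first b inside any a happens to be inside the first a.
combined = soup.select("a b")[0].get_text()

# Guarded form: the second link has no b tag, so select_one gives None.
second_a = soup.select("a")[1]
missing = second_a.select_one("b")
fallback = missing.get_text() if missing is not None else ""

print(chained)          # the first b tag
print(combined)         # the first b tag
print(missing)          # None
print(repr(fallback))   # ''
```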
