Python third-party library BeautifulSoup4: documentation study notes (4)

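bs4: searching the document tree

The document tree can be searched in many ways; the most commonly used methods are find() and find_all(). We usually pass these methods specific arguments that describe what we are looking for, and such arguments are called filters.

We again use the test HTML document from the official documentation.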

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>

"""

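Filters

1. Strings

The simplest filter is a string: pass it to a search method and Beautiful Soup returns every tag whose name equals that string. The drawback is that the match is coarse, since a tag name alone cannot narrow the results by attribute or content.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
a_tags = soup.find_all("a")
print(a_tags)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]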

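2. Regular expressions

This requires the re module. After importing it, use re.compile() to compile an ordinary string into a regular expression object. find_all() then filters tag names with the pattern's search() method, so a pattern matches anywhere in the name unless it is anchored with ^.

import re

# Match tags whose name starts with "b"
recode = re.compile("^b")
tag_b_start = soup.find_all(recode)
for tag in tag_b_start:
    print(tag.name)

# body
# b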
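To see why the ^ anchor matters, here is a minimal sketch comparing an unanchored and an anchored pattern. The small html string is made up for illustration and is not the test document used above.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the article's html_doc)
html = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches every tag whose name contains the letter "t"
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored: matches only tag names that start with "b"
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(contains_t)     # ['html', 'title']
print(starts_with_b)  # ['body', 'b']
```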

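3. Lists

If a list is passed to a search method, the method returns every tag that matches any one of the items in the list.

tags_a_b = soup.find_all(['a','b'])
for tag in tags_a_b:
    print(tag)

# <b>The Dormouse's story</b>
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

The list here contains only strings, but the regular expressions described above can be placed in the list as well.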
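As a rough illustration of mixing filter types in one list, the sketch below combines a plain string with a compiled pattern. The markup is a shortened stand-in invented for this example.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the article's html_doc)
html = '<body><p class="title"><b>Story</b></p><a href="#" id="link1">Elsie</a></body>'
soup = BeautifulSoup(html, "html.parser")

# A tag is returned if it matches ANY item: the name "a" or a name starting with "b"
mixed = soup.find_all(["a", re.compile("^b")])
names = [tag.name for tag in mixed]

print(names)  # ['body', 'b', 'a']
```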

4. True

True matches every tag in the document, but none of the text strings.

tags = soup.find_all(True)
for tag in tags:
    print(tag.name)
    
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
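True can also be used as an attribute filter, where it means "the attribute is present, whatever its value". A small sketch, using invented markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption): only the first link has an id
html = '<p class="story">Text <a href="/x" id="link1">Elsie</a> <a href="/y">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# True as the name filter: every tag, no text nodes
all_tags = [tag.name for tag in soup.find_all(True)]

# True as an attribute filter: every tag that has an id attribute
with_id = [tag.get_text() for tag in soup.find_all(id=True)]

print(all_tags)  # ['p', 'a', 'a']
print(with_id)   # ['Elsie']
```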

Custom functions

If none of the filters above yields the result you want, you can write your own function. It takes a tag as its only argument and returns True when the tag should be included.

def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")
tags = soup.find_all(has_class_but_no_id)
print(tags)

# [<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>, <p class="story">...</p>]

A custom function can be made as elaborate as your needs require.
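As a sketch of a slightly more elaborate filter, the example below combines several conditions in one function and also shows a function used as an attribute filter, in which case it receives the attribute value rather than the tag. The function names and the markup are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, loosely based on the test document)
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.org/other" id="link9">Other</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

def is_sister_link(tag):
    """Hypothetical filter: <a> tags with class 'sister' that point at example.com."""
    return (
        tag.name == "a"
        and "sister" in tag.get("class", [])
        and "example.com" in tag.get("href", "")
    )

def ends_with_lacie(href):
    """Hypothetical attribute filter: guard against None for a missing attribute."""
    return href is not None and href.endswith("lacie")

sisters = [tag["id"] for tag in soup.find_all(is_sister_link)]
lacie = [tag["id"] for tag in soup.find_all(href=ends_with_lacie)]

print(sisters)  # ['link1', 'link2']
print(lacie)    # ['link2']
```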

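The find_all() method

find_all() searches the descendants of the current tag and returns all of those that match the given filters. The following sections focus on some of its parameters.

name

The name parameter finds all tags whose name matches the given value; text strings are ignored automatically.

a_tags = soup.find_all("a")
print(a_tags)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Note: name accepts any kind of filter, that is, a string, a regular expression, a list, True, or a function.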

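keyword

Any keyword argument that find_all() does not recognize as one of its own parameters is treated as a filter on the tag attribute with that name. find_all() then returns every tag whose attribute matches the given value.

For example, to find tags whose id attribute contains link:

tags = soup.find_all(id=re.compile(r'link'))
print(tags)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Some attributes cannot be used as keyword arguments. class is a reserved word in Python, and attribute names such as the HTML5 data-* attributes are not valid Python identifiers. These can still be searched by passing a dictionary to the attrs parameter of find_all():

tags = soup.find_all(attrs={"class":"sister"})
print(tags)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]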
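The sketch below illustrates two points about attribute filters: several keyword arguments are combined, so a tag must satisfy all of them, and an attribute such as data-foo has to go through attrs. The markup and the data-foo attribute are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the article's html_doc)
html = (
    '<div data-foo="value" id="box">foo!</div>'
    '<a class="sister" href="/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="/lacie" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Both filters must match: href AND id
link = soup.find_all(href="/elsie", id="link1")

# data-foo is not a valid Python identifier, so pass it via attrs
boxes = soup.find_all(attrs={"data-foo": "value"})

print([tag["id"] for tag in link])   # ['link1']
print([tag["id"] for tag in boxes])  # ['box']
```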

Searching by CSS class

Searching by CSS here means searching for tags by their class attribute. Because class is a reserved word in Python, Beautiful Soup uses the keyword argument class_ instead. Some examples:

import requests
from bs4 import BeautifulSoup
import re

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
}
# Use the Baidu home page as an example
html = requests.get("http://www.baidu.com",headers=headers)
html.encoding = "UTF-8"
# Build the soup document
soup = BeautifulSoup(html.text,"html.parser")
# Find div tags that have a class ending in aging-entry
div_soup_list = soup.find_all("div",class_=re.compile("aging-entry$"))
# class is a multi-valued attribute: a tag can carry several class names, and each one can be searched for individually
div_soup = BeautifulSoup(str(div_soup_list),"html.parser")
# tag = div_soup.find_all("div",class_="c-color-text")
tag = div_soup.find_all("div",class_="toast")
# Both searches above print the same result: [<div class="c-color-text toast">辅助模式</div>]
print(tag)
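Since the live page may change or be unreachable, here is a network-free sketch of the same idea. The class names are borrowed from the example above, but the markup itself is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup (an assumption): one div with two classes, one with a single class
html = (
    '<div class="c-color-text toast" id="d1">x</div>'
    '<div class="toast-wrapper" id="d2">y</div>'
)
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Helper for this example only: collect the id of each tag."""
    return [tag["id"] for tag in tags]

# Each class name of a multi-valued class attribute can be matched on its own
by_toast = ids(soup.find_all("div", class_="toast"))
by_color = ids(soup.find_all("div", class_="c-color-text"))

# The exact string value of the attribute also matches
by_exact = ids(soup.find_all("div", class_="c-color-text toast"))

# A regular expression is tried against the individual class names
by_regex = ids(soup.find_all("div", class_=re.compile("^toast")))

print(by_toast, by_color, by_exact, by_regex)
```

When the order of class names is unknown, a CSS selector such as div.toast.c-color-text passed to select() is usually the more robust choice.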

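The string parameter

Like name, the string parameter accepts several kinds of filter: a string, True, a list, or a regular expression. It filters on the text inside tags rather than on tag names, and is otherwise used in the same way. For example:

a_tag = soup.find_all(string = "Lacie")
a_tag = soup.find_all(string = True)
a_tag = soup.find_all(string = ["Elsie","Lacie","Tillie"])
# string can also be combined with other parameters, for example with name:
a_tag = soup.find_all("a",string = re.compile("^E.+?e"))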
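One detail worth seeing in action: with string alone, the results are text nodes (NavigableString objects), whereas combining string with a tag name returns the tags themselves. A minimal sketch with invented markup:

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Illustrative markup (an assumption, not the article's html_doc)
html = '<p>Sisters: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# string only: returns the matching text nodes
texts = soup.find_all(string=re.compile("ie$"))

# name + string: returns the <a> tags whose text matches
tags = soup.find_all("a", string=re.compile("^E"))

print(texts)                 # ['Elsie', 'Lacie']
print(texts[0].parent["id"])  # link1 (a text node still knows its enclosing tag)
print([t["id"] for t in tags])  # ['link1']
```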

The limit parameter

Like the LIMIT clause in SQL, limit caps the number of results: the search stops and returns as soon as that many matches have been found.

a_tag = soup.find_all("a",string = True,limit=2)
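A short sketch of limit on invented markup, which also shows that find() gives the same element as find_all() with limit=1, except that it returns the tag directly instead of a one-item list:

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the article's html_doc)
html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Stop after the first two matches
first_two = [tag["id"] for tag in soup.find_all("a", limit=2)]

# find() returns the first match itself
first = soup.find("a")
same_tag = first is soup.find_all("a", limit=1)[0]

print(first_two)  # ['link1', 'link2']
print(same_tag)   # True
```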

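The recursive parameter

By default, find_all() examines every descendant of the tag it is called on. To consider only the direct children of that tag, pass recursive=False.

# Compare the two calls: body is a direct child of html and a grandchild of the soup object.
# Without recursive=False it is found; with recursive=False only the direct children of soup
# (just the html tag) are examined, so the result is an empty list.
# tag = soup.find_all("body")
tag = soup.find_all("body",recursive=False)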
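The difference is easier to see side by side. The following sketch uses a tiny invented document and calls find_all() both on the soup object and on the html tag:

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the article's html_doc)
html = "<html><head><title>T</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default: all descendants are searched, so body is found
deep = [t.name for t in soup.find_all("body")]

# recursive=False on soup: only its direct children (html) are examined
shallow_body = soup.find_all("body", recursive=False)
shallow_html = [t.name for t in soup.find_all("html", recursive=False)]

# Starting from the html tag, body IS a direct child, but title is not
from_html = [t.name for t in soup.html.find_all("body", recursive=False)]
title_from_html = soup.html.find_all("title", recursive=False)

print(deep, shallow_body, shallow_html, from_html, title_from_html)
# ['body'] [] ['html'] ['body'] []
```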

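Special note

As a shortcut for find_all(), a BeautifulSoup object or a Tag can be called like a function; the call behaves exactly like find_all() on that object. For example:

# These two lines are equivalent; any argument accepted by find_all() can be passed in the call, which is more concise
# result = soup.find_all("a")
result = soup("a")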
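A brief sketch, on invented markup, showing that the call shortcut accepts the same arguments as find_all() and works on an individual tag as well as on the soup object:

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the article's html_doc)
html = '<p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is the same as soup.find_all(...)
via_call = [t["id"] for t in soup("a")]
via_find_all = [t["id"] for t in soup.find_all("a")]

# The shortcut also works on a Tag and takes keyword filters
from_p = [t["id"] for t in soup.p("a", id="link2")]

print(via_call == via_find_all)  # True
print(from_p)                    # ['link2']
```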

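find_all() is the most frequently used method for searching the document tree. The other search methods, such as find(), find_parent() and find_parents(), and find_next_sibling() and find_next_siblings(), are used in much the same way and differ only in what they focus on and which part of the tree they cover. They are not described one by one here; see the official bs4 documentation.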
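For a quick feel of how these related methods differ in direction and scope, here is a minimal sketch on invented markup. It is only a sampler, not a full tour of the API.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the article's html_doc)
html = '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")        # first match, or None when nothing matches
missing = soup.find("table")  # no <table> in this markup

parent = first.find_parent("p")                               # search upward
next_link = first.find_next_sibling("a")                      # next matching sibling
later = [t["id"] for t in first.find_next_siblings("a")]      # all later matching siblings

print(first["id"], missing, parent.name, next_link["id"], later)
# link1 None p link2 ['link2', 'link3']
```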

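Searching the document tree with CSS selectors

bs4 supports most CSS selectors (for the CSS syntax, see w3school). Call the select() method of the soup object or of a tag and pass a string written in CSS selector syntax.
Some examples:

# Returns a result set of all tags named p
# tag_set = soup.select("p")
# Returns all p tags located anywhere inside a div
# tag_set = soup.select("div p")
# Returns all a tags that are direct children of a p tag
# tag_set = soup.select("p > a")
# Returns the p tags inside a div that are the last child of their parent
# tag_set = soup.select("div p:nth-last-child(1)")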
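The selectors above can be tried on a small document. The sketch below wraps two paragraphs in a div, which the official test document does not have, so the markup is an assumption made purely for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption): a div wrapper added so "div p" has something to match
html = (
    '<div>'
    '<p class="title"><b>Story</b></p>'
    '<p class="story"><a class="sister" id="link1">Elsie</a> '
    '<a class="sister" id="link2">Lacie</a></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

all_p = soup.select("p")                          # every p tag
p_in_div = soup.select("div p")                   # p tags anywhere inside a div
child_links = soup.select("p > a")                # a tags that are direct children of p
last_p = soup.select("div p:nth-last-child(1)")   # p that is the last child of its parent

print(len(all_p), len(p_in_div))           # 2 2
print([a["id"] for a in child_links])      # ['link1', 'link2']
print([p["class"] for p in last_p])        # [['story']]
```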
