A small crawler case study: scraping Xiaohuar (xiaohuar.com)


Scraping images from Xiaohuar

# Page routing pattern
# href = "http://www.xiaohuar.com/list-1-0.html"  page 1
# href = "http://www.xiaohuar.com/list-1-1.html"  page 2
# href = "http://www.xiaohuar.com/list-1-2.html"  page 3
# href = "http://www.xiaohuar.com/list-1-3.html"  page 4


# Generate the URL of every listing page
def get_page_url():
    for i in range(4):   # pages 0-3, matching the routing pattern above
        yield 'http://www.xiaohuar.com/list-1-{}.html'.format(i)
# for url in get_page_url():
#     print(url)


from requests_html import HTMLSession
import os
session = HTMLSession()

# Parsing test against the first page
# url = 'http://www.xiaohuar.com/list-1-0.html'
# r = session.request(method='get', url=url)
# # print(r.text)
# img_element_list = r.html.find('[class="img"] img')
# # print(img_element_list)
# for img_element in img_element_list:
#     print(img_element.attrs.get('alt'))
#     print(r.html.base_url[:-1] + img_element.attrs.get('src'))


# Parse a listing page and extract each image's name and URL
def parse_page(url):
    r = session.request(method='get', url=url)

    img_element_list = r.html.find('[class="img"] img')

    for img_element in img_element_list:
        file_name = img_element.attrs.get('alt').replace('/', '').replace('\\', '') + '.png'   # strip path separators so the name is a valid filename
        print(file_name)
        file_url = img_element.attrs.get('src')
        file_url = r.html.base_url[:-1] + file_url if not file_url.startswith('http') else file_url   # handle both relative and absolute src paths
        save_file(file_name, file_url)
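
# Note: the base_url[:-1] + src concatenation above only works because
# base_url happens to end with '/'. A minimal, sturdier sketch using only
# the standard library (absolutize is a hypothetical helper, not part of
# the original script):
from urllib.parse import urljoin

def absolutize(page_url, src):
    # urljoin returns src unchanged when it is already absolute and
    # resolves it against the page URL when it is relative, so the
    # startswith('http') check becomes unnecessary.
    return urljoin(page_url, src)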


def save_file(name, url):
    base_path = '校花图片'
    os.makedirs(base_path, exist_ok=True)   # make sure the target directory exists before writing
    file_path = os.path.join(base_path, name)
    r = session.get(url=url)
    with open(file_path, 'wb') as f:
        f.write(r.content)
        print('%s downloaded successfully' % name)
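
# A minimal sketch of a streaming variant of save_file: stream=True plus
# iter_content (both standard requests features) avoid holding a large
# file in memory all at once; the 8192-byte chunk size is an arbitrary choice.
def save_file_streaming(name, url, base_path='校花图片'):
    os.makedirs(base_path, exist_ok=True)
    r = session.get(url=url, stream=True)
    with open(os.path.join(base_path, name), 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)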


if __name__ == '__main__':
    for page_url in get_page_url():
        parse_page(page_url)
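
Some sites reject the default python-requests User-Agent. A minimal tweak worth knowing (the header value below is an arbitrary browser-like string, not something this site is confirmed to require):

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = session.get(url='http://www.xiaohuar.com/list-1-0.html', headers=headers)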

Scraping videos from Xiaohuar

# Page routing pattern
# http://www.xiaohuar.com/list-3-0.html  page 1
# http://www.xiaohuar.com/list-3-1.html  page 2
# http://www.xiaohuar.com/list-3-2.html  page 3
# http://www.xiaohuar.com/list-3-3.html  page 4
# http://www.xiaohuar.com/list-3-4.html  page 5
# http://www.xiaohuar.com/list-3-5.html  page 6


from requests_html import HTMLSession
import os
session = HTMLSession()


# Generate the index-page URLs
def get_index_page():
    for i in range(6):
        url = 'http://www.xiaohuar.com/list-3-%s.html' % i
        yield url


# Index-page parsing test
# url = 'http://www.xiaohuar.com/list-3-5.html'
# r = session.get(url=url)
# # print(r.html.find('#images a[class="imglink"]'))
# for element in r.html.find('#images a[class="imglink"]'):
#     print(element.attrs.get('href'))


# Parse an index page to get the detail-page URLs
def get_detail_page(url):
    r = session.get(url=url)

    for element in r.html.find('#images a[class="imglink"]'):
        print(element.attrs.get('href'))
        yield element.attrs.get('href')


# Test: parse a detail page for the video URL and name
# url = 'http://www.xiaohuar.com/p-3-13.html'
# # url = 'http://www.xiaohuar.com/p-3-5.html'
# r = session.get(url=url)
# r.html.encoding = 'gbk'
# file_name = r.html.find('title', first=True).text.replace('\\', '')
#
# print(file_name)
#
# element = r.html.find('#media source', first=True)
# if element:
#     video_url = element.attrs.get('src')
#     print(video_url)
# else:
#     video_url = r.html.search('var vHLSurl    = "{}";')[0]
#     print(video_url)


# Parse a detail page for the video URL, name, and type
def get_url_name(url):
    r = session.get(url=url)
    r.html.encoding = 'gbk'
    file_name = r.html.find('title', first=True).text.replace('\\', '')   # drop backslashes so the title works as a filename
    print(file_name)

    element = r.html.find('#media source', first=True)
    if element:
        video_url = element.attrs.get('src')
        video_type = 'mp4'
    else:
        video_url = r.html.search('var vHLSurl    = "{}";')[0]
        video_type = 'm3u8'
    return file_name, video_url, video_type
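
# The search() template above must match the page's whitespace exactly.
# A minimal, whitespace-tolerant fallback sketch using a regular expression
# (assumption: the page assigns the playlist as `var vHLSurl = "...";` with
# arbitrary spacing; find_hls_url is a hypothetical helper):
import re

def find_hls_url(html_text):
    m = re.search(r'var\s+vHLSurl\s*=\s*"([^"]+)"', html_text)
    return m.group(1) if m else None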


# Save the video file
def save(file_name, video_url, video_type):
    if video_type == 'mp4':
        file_name += '.mp4'
        r = session.get(url=video_url)
        with open(file_name, 'wb') as f:
            f.write(r.content)
    elif video_type == 'm3u8':
        save_m3u8(file_name, video_url)


# Handle an m3u8 (HLS) playlist: save it, then fetch every .ts segment
def save_m3u8(file_name, video_url):
    if not os.path.exists(file_name):
        os.mkdir(file_name)
    r = session.get(url=video_url)
    m3u8_path = os.path.join(file_name, 'playlist.m3u8')
    with open(m3u8_path, 'wb') as f:
        f.write(r.content)
    # print(r.text)
    for line in r.text.splitlines():   # iterate over lines, not characters
        if line.endswith('.ts'):
            ts_url = video_url.replace('playlist.m3u8', line)
            ts_path = os.path.join(file_name, line)
            r1 = session.get(url=ts_url)
            with open(ts_path, 'wb') as f:
                f.write(r1.content)
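
# A minimal sketch for stitching the downloaded segments back together
# (assumption: the .ts files of a plain MPEG-TS playlist can simply be
# concatenated in playlist order; merge_segments is a hypothetical helper):
def merge_segments(dir_name, out_name='merged.ts'):
    with open(os.path.join(dir_name, 'playlist.m3u8'), encoding='utf-8') as f:
        segments = [ln.strip() for ln in f if ln.strip().endswith('.ts')]
    with open(os.path.join(dir_name, out_name), 'wb') as out:
        for seg in segments:
            with open(os.path.join(dir_name, seg), 'rb') as ts:
                out.write(ts.read())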


if __name__ == '__main__':
    for index_page in get_index_page():
        for detail_url in get_detail_page(index_page):
            file_name, video_url, video_type = get_url_name(detail_url)
            save(file_name, video_url, video_type)
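
Both scripts fire their requests back-to-back. A polite crawl usually spaces them out; a minimal sketch of the same main loop with a delay (the one-second pause is an arbitrary choice):

import time

for index_page in get_index_page():
    for detail_url in get_detail_page(index_page):
        file_name, video_url, video_type = get_url_name(detail_url)
        save(file_name, video_url, video_type)
        time.sleep(1)  # arbitrary pause between detail pages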
