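Web Scraping Study Notes: Scraping Comics from 动漫之家 (dmzj)

Scrapers go stale quickly, so the code here may stop working over time, but the approach stays the same.

Getting the chapter names

First, I picked a fairly short comic from the rankings, 《雪屋》 (I haven't read it, so this is not a recommendation).

Inspecting the elements shows that the chapters are all <a> tags under a <ul> tag whose class attribute is list_con_li, but they are listed in reverse order.

So we find the <ul> with that class, iterate over the <a> tags inside it, and insert each item at the front of a list, which turns the reverse order back into reading order.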

import requests
from bs4 import BeautifulSoup

target_url = "https://www.dmzj.com/info/xuewu.html"
r = requests.get(url=target_url)
# Name the parser explicitly so the result does not depend on which parsers are installed
bs = BeautifulSoup(r.text, 'html.parser')
list_con_li = bs.find('ul', class_="list_con_li")
comic_list = list_con_li.find_all('a')
chapter_names = []
chapter_urls = []
for comic in comic_list:
    href = comic.get('href')
    name = comic.text
    # The page lists the newest chapter first; inserting at index 0 reverses that
    chapter_names.insert(0, name)
    chapter_urls.insert(0, href)

print(chapter_names)
print(chapter_urls)

And just like that, the results come out almost instantly.
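
As a side note, insert(0, ...) shifts every existing element on each call. A minimal sketch of an equivalent approach is to append normally and walk the links in reverse once. The function name to_reading_order and the sample data are made up for illustration, and the sketch assumes the links have already been collected as (name, href) pairs:

```python
def to_reading_order(links):
    """Return (names, urls) in reading order, given newest-first (name, href) pairs."""
    names = []
    urls = []
    # reversed() walks the list back to front without shifting any elements
    for name, href in reversed(links):
        names.append(name)
        urls.append(href)
    return names, urls


# Hypothetical sample data in the newest-first order the page uses
sample = [
    ("Chapter 2", "https://example.com/2.html"),
    ("Chapter 1", "https://example.com/1.html"),
    ("Preview", "https://example.com/0.html"),
]
names, urls = to_reading_order(sample)
print(names)
```

For a comic with a few dozen chapters the difference is negligible, so this is a matter of taste rather than a required change.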

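Getting the page content

Next, let's try scraping the content of a single chapter. Open the preview chapter.

Preview chapter address

After it opens, #@page=1 is automatically appended to the URL, and it changes to #@page=2 when you turn the page, but that is obviously not the address of the image shown on this page.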
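
The part after # is a URL fragment. Browsers keep the fragment to themselves and do not send it to the server, which is why changing the page number cannot by itself select a different image file. A small sketch with the standard library, using the preview chapter address with a fragment attached:

```python
from urllib.parse import urldefrag

page_url = "https://www.dmzj.com/view/xuewu/109248.html#@page=2"
# urldefrag splits off everything after '#'
base_url, fragment = urldefrag(page_url)
print(base_url)
print(fragment)
```

Whatever page number is shown in the address bar, the request that reaches the server is for base_url.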

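Inspecting the element shows where the image address is.

Of course, hunting for it in the HTML is still a bit tedious.

There is a simpler way.

Reload the page with the Network panel open and look under the Img filter; that makes it much easier to find the image in question.

I have to complain: 动漫之家 really has far too many ads.

Then take that address and press Ctrl+F in the HTML to jump to the matching position.

Then inspect the element and try to grab this image's address.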

import requests
from bs4 import BeautifulSoup

target_url = "https://www.dmzj.com/view/xuewu/109248.html"
r = requests.get(url=target_url)
bs = BeautifulSoup(r.text, features="html.parser")

comic_wraCon = bs.find('div', class_="comic_wraCon autoHeight")
imgs = comic_wraCon.find_all('img')
for img in imgs:
    src = img.get('src')
    print(src)

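The result: nothing was captured.

Here we use view-source to check the original page.

view-source shows the page source as the server sent it, without any dynamically loaded content.

To use it, prepend view-source: to the URL.

Searching the source for the corresponding class shows that the element is empty. No wonder the image addresses could not be captured: the images are loaded dynamically.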
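
The same check can be done in code by counting <img> tags in the HTML the server returns. Below is a rough sketch using only the standard library. The class ImgCounter, the helper count_images, and both HTML snippets are made up for illustration; the snippets only imitate what the served source and the rendered page look like.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Count <img> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.count += 1


def count_images(html):
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical snippets: what the server sends vs. what the browser shows after scripts run
served = '<div class="comic_wraCon autoHeight"></div>'
rendered = '<div class="comic_wraCon autoHeight"><img src="a.jpg"><img src="b.jpg"></div>'
print(count_images(served), count_images(rendered))
```

If the count for the downloaded HTML is zero while the browser clearly shows images, the content is most likely filled in by JavaScript after the page loads.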

Take a look at the image address:

https://images.dmzj.com/img/chapterpic/33872/127406/16007394818532.jpg

We search the source for 16007394818532

and find the following code.

The image links are hidden inside it.
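
The Ctrl+F step can also be done on the downloaded text. This is a small sketch, assuming the page source is already available as a string; find_with_context is a hypothetical helper, and the source string below is a made-up stand-in rather than the real page content.

```python
def find_with_context(text, needle, width=20):
    """Return the text around the first occurrence of needle, or None if absent."""
    pos = text.find(needle)
    if pos == -1:
        return None
    start = max(0, pos - width)
    end = pos + len(needle) + width
    return text[start:end]


# Made-up stand-in for the page source; the real string would come from r.text
source = "<script>eval('...|16007394818532|...')</script>"
context = find_with_context(source, "16007394818532", width=5)
print(context)
```

Seeing a few characters on either side of the match is usually enough to tell which delimiters surround the value, which helps when writing a pattern for it.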

We match them with regular expressions.

import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.dmzj.com/view/xuewu/109248.html'
r = requests.get(url=url)
html = BeautifulSoup(r.text, features="html.parser")
# The first <script> tag on the page holds the image data
script_info = html.script
# Raw strings keep the regex backslashes from being read as string escapes
pics = re.findall(r'\d{14}', str(script_info))
chapter1 = re.findall(r'\|(\d{5})\|', str(script_info))[0]
chapter2 = re.findall(r'\|(\d{6})\|', str(script_info))[0]
for pic in pics:
    pic_url = 'https://images.dmzj.com/img/chapterpic/' + chapter1 + '/' + chapter2 + '/' + pic + '.jpg'
    print(pic_url)

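In a regular expression, \d{x} matches a run of exactly x consecutive digits in the string.

chapter1 and chapter2 take the first element of the match list because printing the results showed that the later entries are wrong. For example, the match list for chapter2 is ['127406', '109248']; the second number is the number at the end of the page URL and is not part of the image address.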
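
To see the three patterns at work without hitting the site, here is a small sketch on a made-up string. The variable script_text is not the real script content; it only imitates the shape of the numbers (14-digit image ids, a 5-digit number, and two 6-digit numbers separated by |).

```python
import re

# Made-up stand-in for the script text; only the number shapes mirror the real page
script_text = "img|16007394818532|chapterpic|33872|jpg|127406|view|109248|16007394818611"

pics = re.findall(r"\d{14}", script_text)
five_digit = re.findall(r"\|(\d{5})\|", script_text)
six_digit = re.findall(r"\|(\d{6})\|", script_text)

print(pics)
print(five_digit)
print(six_digit)

# Only the first six-digit match belongs to the image path
chapter1, chapter2 = five_digit[0], six_digit[0]
image_urls = [
    f"https://images.dmzj.com/img/chapterpic/{chapter1}/{chapter2}/{pic}.jpg"
    for pic in pics
]
print(image_urls[0])
```

One thing to keep in mind: each match of \|(\d{6})\| consumes its closing |, so two numbers that share a single | between them would not both be found. Whether that matters depends on the actual layout of the script, which may change over time.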

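I compared the output against the order of the images on the web page and found no problems.

So now we can pick any image and try downloading it.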

from urllib.request import urlretrieve

dn_url = 'https://images.dmzj.com/img/chapterpic/33872/127406/16007394818532.jpg'
urlretrieve(dn_url,'1.jpg')

And what we get is

urllib.error.HTTPError: HTTP Error 403: Forbidden

403 means the server understood the request but refuses to serve it: access is forbidden.
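
For reference, the standard reason phrases for status codes can be looked up with http.HTTPStatus from the standard library. A tiny sketch for the two codes that come up in this post:

```python
from http import HTTPStatus

for code in (200, 403):
    status = HTTPStatus(code)
    # phrase is the standard reason phrase for the code
    print(code, status.phrase)
```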

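We open the image link in a browser to check.

It is also a 403 there, yet the same address opened from within the site shows the image; after a refresh, that one turns into a 403 as well.

This points to a Referer-based anti-scraping measure that only serves requests coming from the site's own pages.

So we need to add a header containing Referer to the request we send with requests.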

import requests
from contextlib import closing

# The image host checks the Referer, so send the chapter page address with the request
download_header = {
    'Referer': 'https://www.dmzj.com/view/xuewu/109248.html'
}

dn_url = 'https://images.dmzj.com/img/chapterpic/33872/127406/16007394818532.jpg'
with closing(requests.get(dn_url, headers=download_header, stream=True)) as response:
    if response.status_code == 200:
        with open('1.jpg', "wb") as file:
            # Write the body to disk chunk by chunk
            for data in response:
                file.write(data)
        print('Download complete!')
    else:
        print('Bad link')

status_code == 200 means the request succeeded, and the requested headers and body are returned with the response.

The image downloads successfully.
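
For comparison, the earlier urlretrieve attempt failed because it sent no Referer. The same idea can be sketched with urllib alone by building a Request that carries the header. The helper names build_image_request and download_image are made up for illustration, and whether the site still accepts this Referer is an assumption.

```python
from urllib.request import Request, urlopen


def build_image_request(image_url, referer):
    """Create a request that carries the Referer header the image host checks."""
    return Request(image_url, headers={"Referer": referer})


def download_image(image_url, referer, save_path):
    """Fetch image_url with the given Referer and write the bytes to save_path."""
    req = build_image_request(image_url, referer)
    # urlopen raises urllib.error.HTTPError for responses such as 403
    with urlopen(req) as response, open(save_path, "wb") as file:
        file.write(response.read())


req = build_image_request(
    "https://images.dmzj.com/img/chapterpic/33872/127406/16007394818532.jpg",
    "https://www.dmzj.com/view/xuewu/109248.html",
)
print(req.get_header("Referer"))
```

Calling download_image(req.full_url, req.get_header("Referer"), "1.jpg") would perform the actual download; it is left out of the sketch so that running it does not touch the network.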

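Full code

I added a sort on pics here, because the first time I scraped everything, the page order was wrong in every chapter except the preview chapter. What a headache.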

import requests
import os
import re
from bs4 import BeautifulSoup
from contextlib import closing
from tqdm import tqdm

# Create the save directory
save_dir = '雪屋'
os.makedirs(save_dir, exist_ok=True)

target_url = "https://www.dmzj.com/info/xuewu.html"

# Get the chapter links and chapter names
r = requests.get(url=target_url)
bs = BeautifulSoup(r.text, 'html.parser')
list_con_li = bs.find('ul', class_="list_con_li")
cartoon_list = list_con_li.find_all('a')
chapter_names = []
chapter_urls = []
for cartoon in cartoon_list:
    href = cartoon.get('href')
    name = cartoon.text
    # The page lists the newest chapter first; inserting at index 0 reverses that
    chapter_names.insert(0, name)
    chapter_urls.insert(0, href)

# Download the comic
for i, url in enumerate(tqdm(chapter_urls)):
    # Use the chapter page itself as the Referer for its images
    download_header = {
        'Referer': url
    }
    name = chapter_names[i]
    chapter_save_dir = os.path.join(save_dir, name)
    # Skip chapters whose directory already exists from an earlier run
    if not os.path.exists(chapter_save_dir):
        os.mkdir(chapter_save_dir)
        r = requests.get(url=url)
        html = BeautifulSoup(r.text, 'html.parser')
        script_info = html.script
        pics = re.findall(r'\d{14}', str(script_info))
        # The ids do not come out in page order, so sort them numerically
        pics = sorted(pics, key=int)
        chapter1 = re.findall(r'\|(\d{5})\|', str(script_info))[0]
        chapter2 = re.findall(r'\|(\d{6})\|', str(script_info))[0]
        for idx, pic in enumerate(pics):
            pic_url = 'https://images.dmzj.com/img/chapterpic/' + chapter1 + '/' + chapter2 + '/' + pic + '.jpg'
            pic_name = '%03d.jpg' % (idx + 1)
            pic_save_path = os.path.join(chapter_save_dir, pic_name)
            with closing(requests.get(pic_url, headers=download_header, stream=True)) as response:
                if response.status_code == 200:
                    with open(pic_save_path, "wb") as file:
                        for data in response:
                            file.write(data)
                else:
                    print('Bad link: ' + pic_url)

print('Download finished')
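
The sorting and file-naming step can be looked at on its own. The sketch below assumes, as the script above does, that ascending image id equals page order; that held for the chapters checked here but is not guaranteed by the site. The helper plan_page_files and the sample ids are made up for illustration.

```python
def plan_page_files(pics):
    """Map image ids to zero-padded file names in ascending numeric order."""
    ordered = sorted(pics, key=int)
    return [(pic, "%03d.jpg" % (idx + 1)) for idx, pic in enumerate(ordered)]


# Made-up ids in the scrambled order a regex scan might return
sample_pics = ["16007394818611", "16007394818532", "16007394818590"]
plan = plan_page_files(sample_pics)
for pic, file_name in plan:
    print(pic, file_name)
```

Zero-padding the file names keeps 010.jpg after 009.jpg when a file manager sorts them as text.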