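I'm a complete beginner, and the road with Python is still long. I'm starting here today and keeping this as a note to myself. 2018-04-05
I spent an afternoon learning to write a crawler for novels. Since I have no real foundation, I filled in whatever I didn't know as I went. It was a bumpy ride, but the script finally runs. I'm posting the code here for now and will ask more experienced people to help fix it later.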
# -*- coding: utf-8 -*-
# @Time : 2018/4/5 13:46
# @Author : ELEVEN
# @File : crawerl--小说网.py
# @Software: PyCharm
import requests
import re
import time
import os

# NOTE: the regex patterns below lost their HTML-tag parts when this post was
# published. Only the surviving fragments are shown, so the patterns must be
# restored before the script will run.

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0'
}


def get_type_list(i):
    """Return the novels listed on the first page of category i."""
    url = 'http://www.quanshuwang.com/list/{}_1.html'.format(i)
    html = requests.get(url, headers=header)
    html.encoding = 'gbk'
    html = html.text
    # print(html)
    # lis = re.findall(r'.*?', html, re.S)
    novel_list = re.findall(r'', html, re.S)
    return novel_list


def get_chapter_list(type_url):
    """Follow a novel's page to its chapter index and return the chapters."""
    html = requests.get(type_url, headers=header)
    html.encoding = 'gbk'
    html = html.text
    novel_chapter_html = re.findall(r'', html, re.S)[0]
    # Send the same headers here as on every other request.
    html = requests.get(novel_chapter_html, headers=header)
    html.encoding = 'gbk'
    html = html.text
    novel_chapter = re.findall(r'(.*?)', html, re.S)
    # print(novel_chapter)
    # exit()
    return novel_chapter


def get_chapter_info(chapter_url):
    """Return the text matched on a single chapter page."""
    html = requests.get(chapter_url, headers=header)
    html.encoding = 'gbk'
    html = html.text
    # print(html)
    # exit()
    chapter_info = re.findall(r'', html, re.S)
    # lis = re.findall(r'(.*?)', html, re.S)[0]
    # print(chapter_info)
    # exit()
    return chapter_info


if __name__ == '__main__':
    # Category id -> category name as used on the site (also the folder name).
    sort_dict = {
        1: '玄幻魔法',   # fantasy and magic
        2: '武侠修真',   # martial arts and cultivation
        3: '纯爱耽美',   # romance (danmei)
        4: '都市言情',   # urban romance
        5: '职场校园',   # workplace and campus
        6: '穿越重生',   # time travel and rebirth
        7: '历史军事',   # history and military
        8: '网游动漫',   # online games and anime
        9: '恐怖灵异',   # horror and supernatural
        10: '科幻小说',  # science fiction
        11: '美文名著',  # essays and classics
    }
    try:
        if not os.path.exists('全书网'):
            os.mkdir('全书网')
        for sort_id, sort_name in sort_dict.items():
            if not os.path.exists('%s/%s' % ('全书网', sort_name)):
                os.mkdir('%s/%s' % ('全书网', sort_name))
            # print('Category:', sort_name)
            for type_name, type_url in get_type_list(sort_id):
                # print(type_name, type_url)
                # if not os.path.exists('%s/%s/%s.txt' % ('全书网', sort_name, type_name)):
                #     os.mkdir('%s/%s/%s.txt' % ('全书网', sort_name, type_name))
                # Appending [::-1] to the list would iterate it in reverse.
                for chapter_url, chapter_name in get_chapter_list(type_url):
                    # print(chapter_url, chapter_name, chapter_time)
                    # print(get_chapter_info(chapter_url))
                    with open('%s/%s/%s.txt' % ('全书网', sort_name, type_name), 'a') as f:
                        print('Saving...', chapter_name)
                        f.write('\n' + chapter_name + '\n')
                        # re.findall returns a list; join it into one string before writing.
                        f.write(''.join(get_chapter_info(chapter_url)))
    except OSError as reason:
        print('wrong')
        print('The cause of the problem is %s' % str(reason))

Unresolved problems:
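1. Reported cause: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
My analysis: probably because I hit the server over and over, it decided I was a bot and blocked me. I did change the request header as well, but the script still fails after crawling a few novels.
Result: not solved.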
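
One common way to cope with a dropped connection like the one reported above is to pause and retry instead of letting the whole run stop. The block below is only a minimal sketch of that idea, not a confirmed fix for this site. The helper `fetch_with_retry` and its parameters (`fetch`, `retries`, `base_delay`, `sleep`) are hypothetical names introduced here for illustration and do not appear in the original script.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with exponential backoff.

    `fetch` is any callable that takes a URL and returns a response.
    This helper and its parameter names are hypothetical, not from the post.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            # Wait about 1x, 2x, 4x ... base_delay, plus a little random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

In the script, each `requests.get(url, headers=header)` call could be wrapped as `fetch_with_retry(lambda u: requests.get(u, headers=header, timeout=10), url)`. The exceptions raised by requests derive from `OSError`, so the helper would catch them. Adding a short `time.sleep` between chapters may also make the traffic look less like a bot, though whether either change is enough for this server is untested.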