Three steps to learn web scraping: crawling Cui Qingcai's scraping articles and saving them as a PDF for study
To learn web scraping, I crawled all the Python blog posts of Cui Qingcai, a well-known figure in the web-scraping community, and converted them into a PDF for later study.
1. Code Outline
- Collect the URLs of all blog posts
- Fetch each post's HTML and convert it to a PDF file
- Merge the PDF files
2. Collecting All Blog URLs
First, browsing Cui's blog shows that the Python category currently spans 7 pages of post listings.
From these category pages it is easy to collect the URL of every post. The code is as follows:
```python
# Collect the URLs of all posts
def get_url():
    url_list = []
    for i in range(7, 0, -1):  # walk pages 7 down to 1, so posts end up oldest-first
        if i == 1:
            url = "https://cuiqingcai.com/categories/Python"
        else:
            url = "https://cuiqingcai.com/categories/Python/page/{number}/".format(number=str(i))
        driver.get(url)
        time.sleep(5)  # wait for the page to finish rendering
        soup = BeautifulSoup(driver.page_source, 'lxml')
        articles = soup.find('div', class_='posts-collapse').find_all('article')
        for a in articles[::-1]:
            url_list.append('https://cuiqingcai.com' + a.find('a', class_='post-title-link')['href'])
    return url_list
```
3. Saving Each Post as a PDF
The conversion relies on Python's pdfkit package, which makes it very easy to render a web page, an HTML file, or an HTML string into a PDF.
pdfkit itself is installed with a simple `pip install pdfkit`.
pdfkit is a wrapper around the wkhtmltopdf program, so wkhtmltopdf must be downloaded and installed first. After installing it, either add it to your PATH environment variable or pass its location to pdfkit as a parameter.
Stable wkhtmltopdf releases are available at: https://wkhtmltopdf.org/downloads.html
The code that saves each post as a PDF:
```python
def html_to_pdf(url_list, html_template, options, config):
    for article_url in url_list:
        driver.get(article_url)
        time.sleep(5)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        title = soup.find('h1', class_='post-title')
        body = soup.find('div', class_='post-body')
        body.find('div', class_='advertisements').clear()  # strip the CSS-styled ad block
        html = html_template.format(title=title, content=body)
        print(title, ':', article_url)
        try:
            pdfkit.from_string(html, r"D:\toPDF\{name}.pdf".format(name=title.getText()),
                               configuration=config, options=options)
        except OSError:
            pass  # e.g. a title containing characters that are invalid in a filename
        print('Post:', title, 'converted to PDF')
```
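One caveat: the PDF filename above is taken straight from the post title, and on Windows a title containing characters such as `:` or `?` makes the write fail with OSError, which the `except OSError: pass` silently swallows. A small stdlib helper (the name `safe_filename` is my own) could sanitize titles instead of skipping those posts:

```python
import re

def safe_filename(title: str) -> str:
    """Replace characters that are invalid in Windows filenames with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('Python3 tutorial: how to?'))  # → 'Python3 tutorial_ how to_'
```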
4. Merging the PDF Files
In Python, the PyPDF2 package makes merging PDF files straightforward; install it with `pip install PyPDF2`.
```python
def merge_pdf(file_path):
    merger = PdfFileMerger()
    inputs = []
    for root, dirs, files in os.walk(file_path):
        for file in files:
            print(file)
            input_pdf = open(os.path.join(root, file), "rb")
            inputs.append(input_pdf)
            merger.append(input_pdf, import_bookmarks=False)
    with open(r"D:\toPDF\allPdfMerge.pdf", "wb") as output:
        merger.write(output)
    # keep the input streams open until merger.write() has finished
    for f in inputs:
        f.close()
```
5. Complete Code
The steps above complete the crawl and produce the final PDF. The full script:
```python
from selenium import webdriver
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileMerger
import pdfkit
import time
import os

driver = webdriver.Firefox()  # module level, so the helper functions below can use it


# Collect the URLs of all posts
def get_url():
    url_list = []
    for i in range(7, 0, -1):
        if i == 1:
            url = "https://cuiqingcai.com/categories/Python"
        else:
            url = "https://cuiqingcai.com/categories/Python/page/{number}/".format(number=str(i))
        driver.get(url)
        time.sleep(5)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        articles = soup.find('div', class_='posts-collapse').find_all('article')
        for a in articles[::-1]:
            url_list.append('https://cuiqingcai.com' + a.find('a', class_='post-title-link')['href'])
    return url_list


def html_to_pdf(url_list, html_template, options, config):
    for article_url in url_list:
        driver.get(article_url)
        time.sleep(5)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        title = soup.find('h1', class_='post-title')
        body = soup.find('div', class_='post-body')
        body.find('div', class_='advertisements').clear()  # strip the CSS-styled ad block
        html = html_template.format(title=title, content=body)
        print(title, ':', article_url)
        try:
            pdfkit.from_string(html, r"D:\toPDF\{name}.pdf".format(name=title.getText()),
                               configuration=config, options=options)
        except OSError:
            pass
        print('Post:', title, 'converted to PDF')


def merge_pdf(file_path):
    merger = PdfFileMerger()
    inputs = []
    for root, dirs, files in os.walk(file_path):
        for file in files:
            print(file)
            input_pdf = open(os.path.join(root, file), "rb")
            inputs.append(input_pdf)
            merger.append(input_pdf, import_bookmarks=False)
    with open(r"D:\toPDF\allPdfMerge.pdf", "wb") as output:
        merger.write(output)
    for f in inputs:
        f.close()


def main():
    url_list = get_url()
    html_template = """
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
    <p>{title}</p>
    {content}
    </body>
    </html>
    """
    # Options passed through to wkhtmltopdf
    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [('Accept-Encoding', 'gzip')],
        'cookie': [
            ('cookie-name1', 'cookie-value1'),
            ('cookie-name2', 'cookie-value2'),
        ],
        'outline-depth': 10,
    }
    path_wk = r"D:\wkhtmltopdf\bin\wkhtmltopdf.exe"  # wkhtmltopdf install location
    config = pdfkit.configuration(wkhtmltopdf=path_wk)
    html_to_pdf(url_list, html_template, options, config)
    file_path = r"D:\toPDF"
    merge_pdf(file_path)


if __name__ == "__main__":
    main()
```
P.S. This post fetches page HTML with a Selenium webdriver; requests.get() would also work for retrieving the pages.
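As a sketch of that requests-based variant (the sample HTML below is a made-up stand-in for the real category page, and `extract_post_urls` is my own helper name):

```python
from bs4 import BeautifulSoup

def extract_post_urls(html):
    """Parse one category page and return absolute post URLs, oldest first."""
    soup = BeautifulSoup(html, 'html.parser')  # the article uses 'lxml'; either works
    articles = soup.find('div', class_='posts-collapse').find_all('article')
    return ['https://cuiqingcai.com' + a.find('a', class_='post-title-link')['href']
            for a in articles[::-1]]

# Live fetch would be:
#   import requests
#   html = requests.get('https://cuiqingcai.com/categories/Python', timeout=10).text

# Made-up sample mimicking the page structure, just to show the parsing:
sample = """<div class="posts-collapse">
  <article><a class="post-title-link" href="/p1/">First post</a></article>
  <article><a class="post-title-link" href="/p2/">Second post</a></article>
</div>"""
print(extract_post_urls(sample))  # → ['https://cuiqingcai.com/p2/', 'https://cuiqingcai.com/p1/']
```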
Writing this took effort, so if you liked it, please give it a thumbs-up!
See you next time, if there is one. If you want to join the discussion group, add me on WeChat with the note: data analysis group.
Let's learn and improve together!