Batch-converting HTML to PDF with Python | Medivh's castle
wkhtmltopdf
Introduction
wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.
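In short, wkhtmltopdf and wkhtmltoimage are open-source command-line tools that convert HTML to PDF and to various image formats.
Installation
Download: https://wkhtmltopdf.org/downloads.html
On a Mac you can install it directly with Homebrew; on other systems, pick the matching package from the download page.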
brew install Caskroom/cask/wkhtmltopdf
Usage
- Download a precompiled binary or build from source
- Create your HTML document that you want to turn into a PDF (or image)
- Run your HTML document through the tool.
- For example, if I really like the treatment Google has done to their logo today and want to capture it forever as a PDF:
wkhtmltopdf http://google.com google.pdf
In short: download and install -> create the HTML file -> run the command from the command line.
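The same command can also be driven from a Python script. The following is a minimal sketch that assumes the wkhtmltopdf binary is on PATH; the helper names build_command and html_to_pdf are hypothetical and chosen only for illustration.

```python
import shutil
import subprocess


def build_command(source, output, binary="wkhtmltopdf"):
    """Return the argument list for one wkhtmltopdf invocation."""
    # source may be a URL or a path to a local HTML file.
    return [binary, source, output]


def html_to_pdf(source, output):
    """Run wkhtmltopdf and raise if it is missing or exits with an error."""
    binary = shutil.which("wkhtmltopdf")
    if binary is None:
        raise FileNotFoundError("wkhtmltopdf was not found on PATH")
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    subprocess.run(build_command(source, output, binary), check=True)
```

Calling html_to_pdf("http://google.com", "google.pdf") then does the same work as the command shown above.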
Pdfkit
Introduction
pdfkit, as used in this post, is the Python wrapper package for wkhtmltopdf.
It should not be confused with PDFKit, a JavaScript PDF generation library for Node and the browser. That is a separate project with the same name, which describes itself as follows:
PDFKit is a PDF document generation library for Node and the browser that makes creating complex, multi-page, printable documents easy. It’s written in CoffeeScript, but you can choose to use the API in plain 'ol JavaScript if you like. The API embraces chainability, and includes both low level functions as well as abstractions for higher level functionality. The PDFKit API is designed to be simple, so generating complex documents is often as simple as a few function calls.
Installation
npm install pdfkit    # the JavaScript PDFKit library
or
pip install pdfkit    # the Python wrapper used in this post
Supported input sources
pdfkit can convert from the following sources:
- URL
- File
- String
pdfkit.from_url('https://www.google.com.hk', 'out1.pdf')
pdfkit.from_file('123.html', 'out2.pdf')
pdfkit.from_string('Hello!', 'out3.pdf')
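When the kind of input is not known in advance, a small dispatcher can choose between the three calls. This is only a sketch: classify_source and to_pdf are hypothetical helpers, and the rule used here (an http:// or https:// prefix means URL, an existing path means file, anything else is treated as an HTML string) is an assumption that may not fit every input.

```python
import os


def classify_source(source):
    """Guess which pdfkit entry point fits: 'url', 'file', or 'string'."""
    if source.startswith(("http://", "https://")):
        return "url"
    if os.path.isfile(source):
        return "file"
    return "string"


def to_pdf(source, output_path):
    """Convert a URL, an HTML file path, or an HTML string to a PDF."""
    import pdfkit  # imported lazily so classify_source works without it

    kind = classify_source(source)
    if kind == "url":
        return pdfkit.from_url(source, output_path)
    if kind == "file":
        return pdfkit.from_file(source, output_path)
    return pdfkit.from_string(source, output_path)
```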
Code example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@author: medivh
@file: html_to_pdf.py
@time: 2018/12/20
"""
import os
import threading

import pdfkit

# Source directory with the HTML files, and destination directory for the PDFs.
src = '/Users/medivh/Downloads/tmp/new/'
desc = '/Users/medivh/Downloads/tmp/new-pdf/'
sem = threading.Semaphore(10)  # at most 10 conversions run at the same time

os.makedirs(desc, exist_ok=True)  # create the destination if it is missing


def ToPdf(filename):
    with sem:
        try:
            with open(src + filename, encoding="utf-8") as f:
                # Build the PDF name: drop the last 6 characters of the
                # file name, then add '.pdf'.
                pdf_name = desc + filename[:-6] + '.pdf'
                pdfkit.from_file(f, pdf_name)
        except Exception:
            print(filename)  # print the name of any file that failed


if __name__ == '__main__':
    threads = list()
    for i in os.listdir(src):
        t = threading.Thread(target=ToPdf, args=(i,))
        threads.append(t)
    for t in threads:
        t.daemon = True
        t.start()
    for t in threads:
        t.join()
    # Compare the number of source files with the number of PDFs produced.
    start = len(os.listdir(src))
    end = len(os.listdir(desc))
    print(start, end)
    if start == end:
        print('ok')
    else:
        print('no')
Problems encountered
'ascii' codec can't decode byte 0xb4 in position 11: ordinal not in range(128)
- Fix: pass an explicit encoding when opening the file:
with open(src + filename, encoding="utf-8")
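- Mind the number of files: with a large number of files and no limit on concurrent threads, the machine can freeze.
- Fix: limit concurrency with a semaphore:
threading.Semaphore(10)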
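As an alternative to the semaphore, the thread count itself can be capped with a fixed-size pool. A semaphore limits how many conversions run at once, but the script still creates one thread per file. The sketch below uses concurrent.futures.ThreadPoolExecutor from the standard library. The names convert_one and convert_all, the .html filter, and the returned list of failed file names are illustrative assumptions, not part of the original script.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def convert_one(src_path, pdf_path):
    """Convert one HTML file; pdfkit is imported lazily (needs wkhtmltopdf)."""
    import pdfkit

    with open(src_path, encoding="utf-8") as f:
        pdfkit.from_file(f, pdf_path)


def convert_all(src_dir, dest_dir, convert=convert_one, max_workers=10):
    """Convert every .html file in src_dir; return the names that failed."""
    os.makedirs(dest_dir, exist_ok=True)
    names = sorted(n for n in os.listdir(src_dir) if n.endswith(".html"))

    def job(name):
        pdf_name = os.path.splitext(name)[0] + ".pdf"
        try:
            convert(os.path.join(src_dir, name), os.path.join(dest_dir, pdf_name))
            return None
        except Exception:
            return name  # report the failure instead of hiding it

    # The pool never runs more than max_workers threads at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(job, names))
    return [name for name in results if name is not None]
```

Passing the converter as a parameter keeps the pooling logic separate from pdfkit, so it can be tried with a stand-in function before running a real batch.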