Batch-converting HTML to PDF with Python | Medivh's castle

wkhtmltopdf

Introduction

wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.

In short, wkhtmltopdf and wkhtmltoimage are open-source command-line tools that convert HTML into PDF and various image formats.

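Installation

Download page: https://wkhtmltopdf.org/downloads.html
On macOS it can be installed directly with Homebrew; on other systems, pick the matching package from the download page.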

brew install Caskroom/cask/wkhtmltopdf

Usage

  • Download a precompiled binary or build from source
  • Create your HTML document that you want to turn into a PDF (or image)
  • Run your HTML document through the tool.
  • For example, if I really like the treatment Google has done to their logo today and want to capture it forever as a PDF:
    wkhtmltopdf http://google.com google.pdf

In short: download and install -> create the HTML file -> run the tool from the command line.
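The same command-line step can also be driven from Python. What follows is only a minimal sketch, assuming the wkhtmltopdf binary is on PATH; the helper names build_command and render are made up for this illustration and are not part of any library.

```python
import shutil
import subprocess


def build_command(source, output, binary="wkhtmltopdf"):
    """Return the argument list for: wkhtmltopdf <source> <output>."""
    return [binary, source, output]


def render(source, output):
    """Run wkhtmltopdf once; raises if the binary is missing or exits non-zero."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf was not found on PATH")
    # check=True turns a non-zero exit status into CalledProcessError
    subprocess.run(build_command(source, output), check=True)
```

Calling render("http://google.com", "google.pdf") would repeat the conversion from the example above.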

Pdfkit

A JavaScript PDF generation library for Node and the browser.

Introduction

PDFKit is a PDF document generation library for Node and the browser that makes creating complex, multi-page, printable documents easy. It’s written in CoffeeScript, but you can choose to use the API in plain 'ol JavaScript if you like. The API embraces chainability, and includes both low level functions as well as abstractions for higher level functionality. The PDFKit API is designed to be simple, so generating complex documents is often as simple as a few function calls.

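Note that the Python package pdfkit, which is what the rest of this post uses, is a wrapper around wkhtmltopdf. It is a separate project from the JavaScript PDFKit library described above; the two only share a name.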

Installation

npm install pdfkit
or
pip install pdfkit

Supported inputs

The Python package can convert from the following sources:

  • URL
  • File
  • String
pdfkit.from_url('https://www.google.com.hk', 'out1.pdf')
pdfkit.from_file('123.html', 'out2.pdf')
pdfkit.from_string('Hello!', 'out3.pdf')
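The three entry points differ only in what the first argument means, so a small dispatcher can choose between them. This is a hedged sketch: pick_converter and to_pdf are hypothetical helper names, and the rule "anything that is neither a URL nor an existing file is treated as an HTML string" is an assumption made for the example. The options dictionary is passed through to wkhtmltopdf as command-line flags.

```python
import os


def pick_converter(source):
    """Return the name of the pdfkit function that fits this input."""
    if source.startswith(("http://", "https://")):
        return "from_url"
    if os.path.isfile(source):
        return "from_file"
    return "from_string"  # assumption: everything else is raw HTML


def to_pdf(source, output):
    """Convert a URL, a file path, or an HTML string to a PDF file."""
    import pdfkit  # imported lazily so pick_converter works without it

    convert = getattr(pdfkit, pick_converter(source))
    # 'encoding' becomes the --encoding flag of wkhtmltopdf
    return convert(source, output, options={"encoding": "UTF-8"})
```

With this in place, to_pdf('123.html', 'out2.pdf') and to_pdf('Hello!', 'out3.pdf') would go through from_file and from_string respectively, provided 123.html exists.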

Code example

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@author: medivh
@file: html_to_pdf.py
@time: 2018/12/20
"""

import os
import threading

import pdfkit

# Source directory (HTML files) and destination directory (PDF files)
src = '/Users/medivh/Downloads/tmp/new/'
desc = '/Users/medivh/Downloads/tmp/new-pdf/'

# Limit the number of conversions running at the same time
sem = threading.Semaphore(10)

# Create the destination directory if it does not exist yet
os.makedirs(desc, exist_ok=True)


def ToPdf(filename):
    with sem:
        try:
            with open(src + filename, encoding="utf-8") as f:
                # Build the output name by swapping the extension for .pdf
                # (the original sliced filename[:-6], which only fits
                # six-character suffixes)
                pdf_name = desc + os.path.splitext(filename)[0] + '.pdf'
                pdfkit.from_file(f, pdf_name)
        except Exception:
            # Print the name of any file that failed to convert
            print(filename)


if __name__ == '__main__':
    # One thread per file; the semaphore caps how many convert at once
    threads = []
    for i in os.listdir(src):
        t = threading.Thread(target=ToPdf, args=(i,))
        threads.append(t)
    for t in threads:
        t.daemon = True
        t.start()
    for t in threads:
        t.join()
    # Compare the number of input files with the number of PDFs produced
    start = len(os.listdir(src))
    end = len(os.listdir(desc))
    print(start, end)
    if start == end:
        print('ok')
    else:
        print('no')

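Problems encountered

  • 'ascii' codec can't decode byte 0xb4 in position 11: ordinal not in range(128)

    • Fix: pass an explicit encoding when opening the file: with open(src + filename, encoding="utf-8")
  • Watch the number of files: with a large batch and no cap on the number of threads, the machine can freeze.
    • Fix: cap concurrency with threading.Semaphore(10)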
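The decoding error in the first item appears when a file's bytes are read with a codec that does not match them. As a rough illustration only, the helper below tries a list of candidate encodings in order. The name decode_html and the gb18030 fallback are assumptions made for this sketch; the original fix simply uses utf-8.

```python
def decode_html(raw, encodings=("utf-8", "gb18030")):
    """Decode raw bytes with the first candidate encoding that fits."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings fit")
```

A file can then be read in binary mode ("rb"), decoded with decode_html, and the resulting text handed to pdfkit.from_string.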

References

http://pdfkit.org/
https://wkhtmltopdf.org/index.html
