Reading PDF files and their content with Python


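Introduction to pdfplumber

pdfplumber is a library for working with the contents of PDF files. It can look up detailed information about every text character, rectangle, and line on a page, and it can also extract tables and help you debug the extraction visually.

Documentation: https://github.com/jsvine/pdfplumber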

Installing pdfplumber

Install it with pip. At the command line, run:

pip install pdfplumber
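
To confirm that the install worked, a quick check along these lines can help. It is a minimal sketch that uses only the standard library; the helper name `is_installed` is made up for illustration and is not part of pdfplumber.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # True when pdfplumber is available in the current environment.
    print(is_installed("pdfplumber"))
```

If this prints False, pip probably installed the package into a different Python environment than the one running the script.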

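In the pdfplumber version used here, visual debugging also requires ImageMagick.
pdfplumber GitHub: https://github.com/jsvine/pdfplumber
ImageMagick install guide:
http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-windows
(The official site does not list 6.x; the 6.x downloads are at https://imagemagick.org/download/binaries/)

(Note: after I installed ImageMagick, using it raised errors. From a reference I found online, I learned that the 6.x version should be installed and that 7.x fails, so I tracked down the 6.x address given above.)

If to_image raises a DelegateException when rendering a page to an image, install the 32-bit build of GhostScript. (Note: be sure to download the 32-bit version, even if both Windows and Python are 64-bit.)
GhostScript: https://www.ghostscript.com/download/gsdnld.html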
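
When troubleshooting these errors, it can help to see which of the external tools are actually visible on the PATH. The sketch below uses only the standard library. The executable names are the usual ones but are an assumption (an installation may use different names), and `find_tools` is a hypothetical helper, not part of pdfplumber.

```python
import shutil

# Executable names to look for. Assumed, not guaranteed: on Windows,
# "convert" may also resolve to an unrelated system utility.
CANDIDATES = {
    "ImageMagick": ["magick", "convert"],
    "Ghostscript": ["gs", "gswin32c", "gswin64c"],
}


def find_tools(candidates=CANDIDATES):
    """Map each tool name to the first matching executable path, or None."""
    found = {}
    for tool, names in candidates.items():
        paths = (shutil.which(name) for name in names)
        found[tool] = next((p for p in paths if p), None)
    return found


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A tool reported as not found is a likely cause of rendering errors, although a tool that is found can still be the wrong version or bitness.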

Basic usage

import pdfplumber

with pdfplumber.open("path/file.pdf") as pdf:
    first_page = pdf.pages[0]  # get the first page
    print(first_page.chars[0])  # dict describing the first character on that page

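The pdfplumber.PDF class has two main attributes, .metadata and .pages.
metadata is a dictionary of information about the PDF.
pages is a list with one entry per page.

Each pdfplumber.Page instance has several main attributes.
page_number: the page number
width: the page width
height: the page height
.objects / .chars / .lines / .rects: the objects found on the page. .chars, .lines, and .rects are each a list of dictionaries, one per object, describing that object (a character, line, rectangle, and so on), including its position. .objects gathers the same dictionaries grouped by object type.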
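
The attributes above can be gathered into a per-page summary. This is a minimal sketch: `summarize_page` and `summarize_pdf` are hypothetical helper names, and the stand-in page at the bottom uses invented values so the example runs without a real PDF.

```python
from types import SimpleNamespace


def summarize_page(page):
    """Collect basic facts about a page-like object into a dict."""
    return {
        "page_number": page.page_number,
        "size": (page.width, page.height),
        "n_chars": len(page.chars),
        "n_lines": len(page.lines),
        "n_rects": len(page.rects),
    }


def summarize_pdf(path):
    """Open a PDF with pdfplumber and summarize every page."""
    import pdfplumber  # imported here so summarize_page has no dependency

    with pdfplumber.open(path) as pdf:
        return pdf.metadata, [summarize_page(p) for p in pdf.pages]


if __name__ == "__main__":
    # Stand-in object with the same attribute names as a page.
    fake = SimpleNamespace(page_number=1, width=595, height=842,
                           chars=[{"text": "A"}], lines=[], rects=[])
    print(summarize_page(fake))
```

With a real file, `summarize_pdf("path/file.pdf")` would return the metadata dictionary together with one summary per page.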

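Common methods

extract_text(): extracts the text on the page, collating all of the page's character objects into a single string
extract_words(): returns all of the words on the page together with information about each one
extract_tables(): extracts the tables on the page
to_image(): used for visual debugging; returns an instance of the PageImage class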
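
As an illustration of what the word information is good for, the sketch below groups words into lines of text by vertical position. It assumes each word is a dictionary with "text", "x0", and "top" keys, as returned by extract_words(); the helper name `group_words_into_lines` and the default tolerance of 3 are illustrative choices, not part of pdfplumber.

```python
def group_words_into_lines(words, y_tolerance=3):
    """Group word dicts into text lines by their "top" coordinate."""
    lines = []  # each entry: (top of the line's first word, list of words)
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= y_tolerance:
            # Close enough vertically: same line as the previous word.
            lines[-1][1].append(word)
        else:
            lines.append((word["top"], [word]))
    # Within each line, order words left to right before joining.
    return [
        " ".join(w["text"] for w in sorted(ws, key=lambda w: w["x0"]))
        for _, ws in lines
    ]


if __name__ == "__main__":
    sample = [
        {"text": "world", "x0": 50, "top": 10.5},
        {"text": "Hello", "x0": 10, "top": 10},
        {"text": "Next", "x0": 10, "top": 30},
    ]
    print(group_words_into_lines(sample))
```

With a real page, the input would be `page.extract_words()`.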

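Common parameters

table_settings

Table extraction settings

By default, extract_tables uses the page's vertical and horizontal lines (or the edges of rectangles) as cell separators. The method can, however, be customized extensively through the table_settings parameter. The possible settings and their default values are: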

{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}

Table extraction strategies

Options for vertical_strategy and horizontal_strategy

`"lines"`	Use the page's graphical lines, including the sides of rectangle objects, as the borders of potential table-cells.
`"lines_strict"`	Use the page's graphical lines, but not the sides of rectangle objects, as the borders of potential table-cells.
`"text"`	For `vertical_strategy`: deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`: the same, but using the tops of words.
`"explicit"`	Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`.
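
To make the "explicit" strategy concrete, the sketch below builds a settings dictionary that fixes the column boundaries by hand and leaves row detection to another strategy. The helper name `column_settings` and the x-coordinates are invented for illustration; real coordinates depend on the page being processed.

```python
def column_settings(x_positions, horizontal_strategy="lines"):
    """Build table settings with hand-picked vertical column boundaries."""
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": sorted(x_positions),
        "horizontal_strategy": horizontal_strategy,
    }


if __name__ == "__main__":
    # Hypothetical boundaries for a three-column table.
    settings = column_settings([300, 40, 180, 520])
    print(settings)
```

The result would then be passed as `page.extract_tables(table_settings=settings)`. Settings left out of the dictionary keep their defaults.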

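Examples

Reading text

import pdfplumber

# Raw string: in a normal string literal, "\600" would be read as an octal escape.
with pdfplumber.open(r"E:\600aaa_2.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        # All of the text on the current page, including text inside tables
        print(page.extract_text())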
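
Scanned or blank pages often carry no extractable text, so it can be useful to skip them when collecting results. The following is a minimal sketch: `collect_text` is a hypothetical helper, and the stand-in pages at the bottom only mimic the two members it uses (`page_number` and `extract_text()`).

```python
from types import SimpleNamespace


def collect_text(pages):
    """Return (page_number, text) pairs, skipping pages with no text."""
    results = []
    for page in pages:
        text = page.extract_text()
        # Depending on the pdfplumber version, a page without text may
        # yield None or an empty string; treat both as "no text".
        if text and text.strip():
            results.append((page.page_number, text))
    return results


if __name__ == "__main__":
    fake_pages = [
        SimpleNamespace(page_number=1, extract_text=lambda: "first page"),
        SimpleNamespace(page_number=2, extract_text=lambda: None),
    ]
    print(collect_text(fake_pages))
```

With a real file, the argument would be `pdf.pages` inside the `with pdfplumber.open(...)` block.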

Reading tables

import re

import pdfplumber

# Raw string: in a normal string literal, "\600" would be read as an octal escape.
with pdfplumber.open(r"E:\600aaa_1.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        table_settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "lines",
            "intersection_tolerance": 20,  # how close edges must be to count as intersecting
        }
        for pdf_table in page.extract_tables(table_settings=table_settings):
            # print(pdf_table)
            for row in pdf_table:
                # Strip whitespace, including line breaks, from each cell
                print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
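
Once a table has been extracted as a list of rows, it is usually turned into records or written out. The sketch below uses only the standard library and assumes that the first row of the table is a header row, which will not hold for every PDF. The helper names are made up for illustration.

```python
import csv
import io
import re


def clean_cell(cell):
    """Strip all whitespace from a cell; keep None for empty cells."""
    return re.sub(r"\s+", "", cell) if cell is not None else None


def table_to_records(table):
    """Treat the first row as the header; return one dict per data row."""
    rows = [[clean_cell(c) for c in row] for row in table]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def table_to_csv(table):
    """Render the cleaned table as CSV text, with None written as empty."""
    rows = [[clean_cell(c) or "" for c in row] for row in table]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [["Name", "Val ue"], ["a\nb", "1"], [None, "2"]]
    print(table_to_records(sample))
    print(table_to_csv(sample))
```

Each `pdf_table` produced by `page.extract_tables(...)` has the shape these helpers expect: a list of rows, where each cell is a string or None.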

Partly based on: https://blog.csdn.net/Elaine_jm/article/details/84841233

