Extract Text and Images from Word Document Comments in Python

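Extract Text from Word Comments in Python

You can use the Comment.Format.Author and Comment.Body.Paragraphs[index].Text properties provided by Spire.Doc for Python to get the author and text of a Word comment. The detailed steps are as follows: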

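Create an object of the Document class.

Load a Word document using the Document.LoadFromFile() method.

Create a list to store the extracted comment data.

Iterate over the comments in the document.

Iterate over the paragraphs in each comment.

Get the text of each paragraph using the Comment.Body.Paragraphs[index].Text property.

Get the author of the comment using the Comment.Format.Author property.

Add the comment text and author to the list.

Save the contents of the list to a text file.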

Python

from spire.doc import *
from spire.doc.common import *

# Create an object of the Document class
document = Document()
# Load a Word document that contains comments
document.LoadFromFile("Comments.docx")

# Create a list to store the extracted comment data
comments = []

# Iterate over the comments in the document
for i in range(document.Comments.Count):
    comment = document.Comments.get_Item(i)
    comment_text = ""

    # Iterate over the paragraphs in the comment body
    for j in range(comment.Body.Paragraphs.Count):
        paragraph = comment.Body.Paragraphs.get_Item(j)
        comment_text += paragraph.Text + "\n"

    # Get the comment author
    comment_author = comment.Format.Author

    # Add the comment data to the list
    comments.append({
        "author": comment_author,
        "text": comment_text
    })

# Write the comment data to a text file
with open("Comments.txt", "w", encoding="utf-8") as file:
    for i, comment in enumerate(comments, start=1):
        file.write(f"Comment {i}:\n  Author: {comment['author']}\n  Text: {comment['text']}\n")

document.Close()
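
The report-writing step can also be factored into a small helper, so the output format can be checked without opening a Word document. The following is a minimal sketch that uses only the standard library. The function name `format_comment_report` and the record keys `author` and `text` are hypothetical names chosen for illustration; they are not part of Spire.Doc.

```python
def format_comment_report(records):
    """Build a plain-text report from a list of comment records.

    Each record is assumed to be a dict with the hypothetical keys
    "author" and "text".
    """
    lines = []
    for number, record in enumerate(records, start=1):
        # Drop the trailing newline left over from joining paragraphs
        text = record["text"].rstrip("\n")
        lines.append(f"Comment {number}:")
        lines.append(f"  Author: {record['author']}")
        lines.append(f"  Text: {text}")
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


# Sample records standing in for data extracted from a document
sample = [
    {"author": "Alice", "text": "Please check this figure.\n"},
    {"author": "Bob", "text": "Line one\nLine two\n"},
]
report = format_comment_report(sample)
print(report)
```

Keeping the formatting separate from the extraction loop means the same records could later be written in another format without touching the code that reads the document.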

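Extract Images from Word Comments in Python

To extract images from Word comments, iterate over the child objects of each comment paragraph, find the DocPicture objects, get the image data through the DocPicture.ImageBytes property, and save that data as image files.

The detailed steps are as follows: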

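Create an object of the Document class.

Load a Word document using the Document.LoadFromFile() method.

Create a list to store the extracted image data.

Iterate over the comments in the document.

Iterate over the paragraphs in each comment.

Iterate over the child objects of each paragraph.

Check whether the object is a DocPicture object.

If the object is a DocPicture, get the image data using the DocPicture.ImageBytes property and add it to the list.

Save the image data in the list as separate image files.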

Python

import os

from spire.doc import *
from spire.doc.common import *

# Create an object of the Document class
document = Document()
# Load a Word document that contains comments
document.LoadFromFile("ImageComments.docx")

# Create a list to store the extracted image data
images = []

# Iterate over the comments in the document
for i in range(document.Comments.Count):
    comment = document.Comments.get_Item(i)
    # Iterate over the paragraphs in the comment body
    for j in range(comment.Body.Paragraphs.Count):
        paragraph = comment.Body.Paragraphs.get_Item(j)
        # Iterate over the child objects of the paragraph
        for o in range(paragraph.ChildObjects.Count):
            obj = paragraph.ChildObjects[o]
            # Look for pictures
            if isinstance(obj, DocPicture):
                picture = obj
                # Get the image data and add it to the list
                data_bytes = picture.ImageBytes
                images.append(data_bytes)

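# Save the image data as image files
output_dir = "CommentImages"
# Create the output folder if it does not exist yet
os.makedirs(output_dir, exist_ok=True)
for i, image_data in enumerate(images):
    file_name = f"CommentImage-{i}.png"
    with open(os.path.join(output_dir, file_name), 'wb') as image_file:
        image_file.write(image_data)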

document.Close()
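
The example above saves every image with a .png extension, although a picture embedded in a comment may be stored in another format. The sketch below shows one possible way to pick the extension from the leading bytes of the image data, using only the standard library. It assumes the extracted data is bytes-like; `guess_image_extension` and `build_image_name` are hypothetical helper names, not Spire.Doc APIs, and only a few common formats are recognized.

```python
def guess_image_extension(data):
    """Guess a file extension from the leading (magic) bytes of image data."""
    header = bytes(data[:8])
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"\xff\xd8\xff", ".jpg"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
    ]
    for magic, extension in signatures:
        if header.startswith(magic):
            return extension
    # Unknown format: fall back to a neutral extension
    return ".bin"


def build_image_name(index, data, prefix="CommentImage"):
    """Build a file name such as 'CommentImage-0.png' for one image."""
    return f"{prefix}-{index}{guess_image_extension(data)}"


# Tiny fake payloads that carry only the format signatures
png_data = b"\x89PNG\r\n\x1a\n" + b"\x00" * 4
jpg_data = b"\xff\xd8\xff\xe0" + b"\x00" * 4
print(build_image_name(0, png_data))
print(build_image_name(1, jpg_data))
```

In the saving loop, the result of `build_image_name(i, image_data)` could replace the hard-coded file name.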

Apply for a Temporary License