Is there a way to read a .docx file that contains auto-numbering using python-docx?

22 followers

Problem statement: extract sections from a .docx file, including the auto-numbering.

I tried to extract the text from a .docx file with python-docx, but the output leaves out the automatic numbering.

from docx import Document

document = Document("wadali.docx")

# Style-name prefixes of the paragraphs to keep.
STYLE_PREFIXES = ('Agt', 'TOC', 'Heading', 'Title', 'Table Normal', 'List')

def iter_items(paragraphs):
    # Yield each paragraph whose style name starts with one of the prefixes.
    for paragraph in paragraphs:
        if paragraph.style.name.startswith(STYLE_PREFIXES):
            yield paragraph

for item in iter_items(document.paragraphs):
    print(item.text)


5 comments


           
Could you provide a minimal working example so that we can reproduce your problem and work on it?


           
user8682794: You can't do this. There is no API support, and I'm not even sure you can extract this from the XML source.


           
@Sharku I edited the question to add my work with docx.


           
@PearlySpencer Is there any other library or source that can help extract text with auto-numbering?


           
Section 17.9.16 of ISO/IEC 29500-1:2012(E) @PearlySpencer


         python


         docx


         python-docx


2 answers


          
           
           
Laurent LAPORTE, posted on 2020-04-25

Accepted answer

0 upvotes


          
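It appears that python-docx v0.8 currently does not fully support numbering. You need to do some hacking.

First, for the demo, to iterate over the document's paragraphs, you need to write your own iterator. Here is something functional: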
          
import docx.document
import docx.oxml.table
import docx.oxml.text.paragraph
import docx.table
import docx.text.paragraph

def iter_paragraphs(parent, recursive=True):
    """
    Yield each paragraph within *parent*, in document order, including the
    paragraphs found in table cells when *recursive* is true.
    Each returned value is an instance of Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, docx.document.Document):
        parent_elm = parent.element.body
    elif isinstance(parent, docx.table._Cell):
        parent_elm = parent._tc
    else:
        raise TypeError(repr(type(parent)))
    for child in parent_elm.iterchildren():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield docx.text.paragraph.Paragraph(child, parent)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            if recursive:
                # Descend into every cell of the table.
                table = docx.table.Table(child, parent)
                for row in table.rows:
                    for cell in row.cells:
                        for child_paragraph in iter_paragraphs(cell):
                            yield child_paragraph

You can use it to find all the paragraphs of the document, including those in table cells:

import docx
document = docx.Document("sample.docx")
for paragraph in iter_paragraphs(document):
    print(paragraph.text)

To access the numbering property, you need to go through the "protected" members, paragraph._p.pPr.numPr, which is a docx.oxml.numbering.CT_NumPr object:

for paragraph in iter_paragraphs(document):
    p_pr = paragraph._p.pPr  # None when the paragraph has no <w:pPr> element
    num_pr = p_pr.numPr if p_pr is not None else None
    if num_pr is not None:
        print(num_pr)  # type: docx.oxml.numbering.CT_NumPr
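
As a cross-check that does not rely on python-docx internals, the same information can be read straight from the XML. The following is only a minimal sketch using the standard library; it assumes the main part is stored at word/document.xml (the usual location, but not guaranteed), and the helper name iter_numbered_paragraphs is hypothetical. It looks only at direct paragraph formatting, so numbering inherited from a paragraph style is not detected.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = "{%s}val" % W


def iter_numbered_paragraphs(docx_path):
    """Yield (num_id, ilvl, text) for each body paragraph that has a w:numPr.

    Hypothetical helper: assumes the main part lives at word/document.xml and
    ignores paragraphs nested in tables as well as style-based numbering.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for p in root.iterfind("w:body/w:p", NS):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # paragraph is not part of a numbered or bulleted list
        num_id = num_pr.find("w:numId", NS)
        ilvl = num_pr.find("w:ilvl", NS)
        # Concatenate the text of all runs in the paragraph.
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W))
        yield (
            int(num_id.get(VAL)) if num_id is not None else None,
            int(ilvl.get(VAL)) if ilvl is not None else 0,
            text,
        )
```

Calling list(iter_numbered_paragraphs("sample.docx")) gives, for each list paragraph, the numbering instance it belongs to and its indentation level; it does not give the rendered label such as "1.2.".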
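
Note that this object only refers (through its numId and ilvl values) to the numbering definitions, which are stored in the numbering.xml file (inside the docx), if it exists.
To access that file, you need to read your docx file like a package. For instance:
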
import docx.package
import docx.parts.document
import docx.parts.numbering
package = docx.package.Package.open("sample.docx")
main_document_part = package.main_document_part
assert isinstance(main_document_part, docx.parts.document.DocumentPart)
numbering_part = main_document_part.numbering_part
assert isinstance(numbering_part, docx.parts.numbering.NumberingPart)
ct_numbering = numbering_part._element
print(ct_numbering)  # CT_Numbering
for num in ct_numbering.num_lst:
    print(num)  # CT_Num
    print(num.abstractNumId)  # CT_DecimalNumber

More information is available in the Office Open XML documentation.
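
To give an idea of how the definitions in numbering.xml could be turned into visible labels, here is a rough sketch using only the standard library. It is not how Word or python-docx does it: the function names load_numbering and render_labels are hypothetical, the part is assumed to live at word/numbering.xml, every format is rendered as decimal, counters are kept per numId, and level overrides, restarts and style-based numbering are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = "{%s}val" % W


def load_numbering(docx_path):
    """Return {numId: {ilvl: (numFmt, lvlText, start)}} from word/numbering.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        if "word/numbering.xml" not in archive.namelist():
            return {}  # the document defines no numbering at all
        root = ET.fromstring(archive.read("word/numbering.xml"))
    abstract = {}
    for a in root.iterfind("w:abstractNum", NS):
        levels = {}
        for lvl in a.iterfind("w:lvl", NS):
            fmt = lvl.find("w:numFmt", NS)
            txt = lvl.find("w:lvlText", NS)
            start = lvl.find("w:start", NS)
            levels[int(lvl.get("{%s}ilvl" % W))] = (
                fmt.get(VAL) if fmt is not None else "decimal",
                txt.get(VAL) if txt is not None else "",
                int(start.get(VAL)) if start is not None else 1,
            )
        abstract[a.get("{%s}abstractNumId" % W)] = levels
    # Each w:num (referenced by paragraphs) points to one w:abstractNum.
    result = {}
    for num in root.iterfind("w:num", NS):
        ref = num.find("w:abstractNumId", NS)
        result[int(num.get("{%s}numId" % W))] = abstract.get(ref.get(VAL), {})
    return result


def render_labels(items, numbering):
    """Yield a label for each (num_id, ilvl) pair, in document order.

    Simplification: all counters are printed as decimal numbers.
    """
    counters = {}  # num_id -> {ilvl: current counter value}
    for num_id, ilvl in items:
        levels = numbering.get(num_id, {})
        state = counters.setdefault(num_id, {})
        _fmt, text, start = levels.get(ilvl, ("decimal", "", 1))
        state[ilvl] = state[ilvl] + 1 if ilvl in state else start
        # A new item at this level resets all deeper levels.
        for deeper in [k for k in state if k > ilvl]:
            del state[deeper]
        label = text
        for level in range(ilvl + 1):
            default = levels.get(level, ("decimal", "", 1))[2]
            label = label.replace("%%%d" % (level + 1), str(state.get(level, default)))
        yield label
```

The (num_id, ilvl) pairs would come from the numPr of each paragraph, in document order; for bullet levels the label is simply the bullet character stored in lvlText.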


           
            
             
              
               
Daniel Luevano Alonso:
                It's printing to me: <CT_DecimalNumber '<w:abstractNumId>' at 0x10d0eef40> AND <CT_Num '<w:num>' at 0x10d0eecc0> not the decimal value, any ideas?
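
Regarding the output quoted in this comment: the objects printed are XML element wrappers, not numbers. A possible way to get the plain values, sketched under the assumption that python-docx's CT_Num exposes numId as an integer attribute and that CT_DecimalNumber stores its value in a val attribute (the helper name describe_num is hypothetical):

```python
def describe_num(num):
    """Return (numId, abstractNumId) as plain values for a CT_Num-like object.

    Assumption: `num.numId` is already an int, and `num.abstractNumId` is a
    child element whose `val` attribute holds the referenced abstractNumId.
    """
    return num.numId, num.abstractNumId.val
```

With the loop from the answer, this would be called as describe_num(num) for each num in ct_numbering.num_lst.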


          
           
            
             
              
There is a package, docx2python, which does this in a much simpler way: pypi.org/project/docx2python/
              
              
               The following code:
              
              from docx2python import docx2python