Python dotx Conversion to docx for Automated Documents

While not exactly security related, I've had to do some Python dotx conversion to docx files recently.

Python dotx Conversion - Introduction

I've been working on a tool (coming soon?) for automating my pentest engagement organization.

Part of this tool required me to copy over a .dotx template and save it as a .docx file. The reason for this is that our report templates are .dotx by default, but I wanted to start with a blank .docx for each engagement.

Installing python-docx

First, I installed python-docx.

Rays-MBP:tools doyler$ pip install python-docx
Collecting python-docx
  Downloading python-docx-0.8.6.tar.gz (5.3MB)
    100% |████████████████████████████████| 5.3MB 26kB/s
Requirement already satisfied: lxml>=2.3.2 in /usr/local/lib/python2.7/site-packages (from python-docx)
Building wheels for collected packages: python-docx
  Running setup.py bdist_wheel for python-docx ... done
  Stored in directory: /Users/doyler/Library/Caches/pip/wheels/cc/74/10/42b00d7d6a64cf21f194bfef9b94150009ada880f13c5b2ad3
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.6

With this installed, I figured I'd be able to go about implementing it in my script. Unfortunately, python-docx does not yet support dotx files out of the box.
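
To see what python-docx is objecting to, it can help to look at the content type a package declares for its main part. The following is a rough sketch using only the standard library, not part of my tool. The function name is made up for illustration, and it assumes the main part lives at /word/document.xml, which is the usual location but is not guaranteed by the format.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by [Content_Types].xml in OOXML packages.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def main_document_content_type(package, part_name="/word/document.xml"):
    """Return the content type declared for part_name, or None if absent.

    `package` may be a filesystem path or a binary file-like object.
    Assumption: the main part is /word/document.xml (hypothetical default).
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    # Each <Override> maps one part name to one content type.
    for override in root.iter(CT_NS + "Override"):
        if override.get("PartName") == part_name:
            return override.get("ContentType")
    return None
```

For a .dotx file this should report a content type ending in template.main+xml, while a .docx should report one ending in document.main+xml.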

Adding support for dotx files

Based on the above GitHub issue, I needed to make a few simple changes to support dotx files.

First, I added the proper content types to api.py. I added the macro-enabled document content type as well, just in case.

Rays-MBP:__ENGAGEMENTS doyler$ vi /usr/local/lib/python2.7/site-packages/docx/api.py

...

def Document(docx=None):
    """
    Return a |Document| object loaded from *docx*, where *docx* can be
    either a path to a ``.docx`` file (a string) or a file-like object. If
    *docx* is missing or ``None``, the built-in default document "template"
    is loaded.
    """
    docx = _default_docx_path() if docx is None else docx
    document_part = Package.open(docx).main_document_part
    # The original check was: if document_part.content_type != CT.WML_DOCUMENT_MAIN:
    # Modified check: also accept the template and macro-enabled content types.
    if document_part.content_type not in [CT.WML_DOCUMENT_MAIN, 'application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml', 'application/vnd.ms-word.document.macroEnabled.main+xml']:
        tmpl = "file '%s' is not a Word file, content type is '%s'"
        raise ValueError(tmpl % (docx, document_part.content_type))
    return document_part.document

Next, I added the DocumentPart for these content types to the PartFactory in __init__.py.

Rays-MBP:__ENGAGEMENTS doyler$ vi /usr/local/lib/python2.7/site-packages/docx/__init__.py

...

PartFactory.part_type_for[CT.WML_DOCUMENT_MAIN] = DocumentPart
PartFactory.part_type_for['application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml'] = DocumentPart
PartFactory.part_type_for['application/vnd.ms-word.document.macroEnabled.main+xml'] = DocumentPart
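
The reason this second change matters can be shown with a toy model of a factory keyed by content type. This is a simplified illustration and not python-docx's actual code: the classes, the `part_type_for` dictionary, and `load_part` below are all hypothetical stand-ins. The idea is that a content type with no registered class falls back to a generic part, which has no document API.

```python
DOCX_MAIN = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
DOTX_MAIN = "application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml"


class Part:
    """Generic part: knows its content type and nothing else."""

    def __init__(self, content_type):
        self.content_type = content_type


class DocumentPart(Part):
    """Specialised part that exposes a document object."""

    @property
    def document(self):
        return "Document for " + self.content_type


# Registry keyed by content type (toy equivalent of the PartFactory mapping).
part_type_for = {DOCX_MAIN: DocumentPart}


def load_part(content_type):
    """Pick the registered class, falling back to the generic Part."""
    part_cls = part_type_for.get(content_type, Part)
    return part_cls(content_type)


# Before registration, a template's main part is only a generic Part.
before = load_part(DOTX_MAIN)

# After registration, the same content type yields a DocumentPart.
part_type_for[DOTX_MAIN] = DocumentPart
after = load_part(DOTX_MAIN)
```

In this model, `before` has no `document` attribute while `after` does, which is the behavior the extra registrations are meant to produce for template files.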

Document Creation

With these patches in place, it was time to write the script to create my document.

This is a very simple excerpt, but it opens my template, sets the content type, and saves the file with the new extension.

from docx import Document
from docx.opc.constants import CONTENT_TYPE as CT
document = Document('appsec/web_application_assessment_report.dotx')
document_part = document.part
document_part._content_type = CT.WML_DOCUMENT_MAIN
document.save('/Users/doyler/Documents/__ENGAGEMENTS/__DEMO.docx')

Rays-MBP:__ENGAGEMENTS doyler$ file __DEMO.docx
__DEMO.docx: Microsoft OOXML
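
If editing the installed library is not an option, a different approach is to skip python-docx for the conversion step and rewrite the content type inside the package directly. The sketch below is an alternative technique, not what my script does: it uses only the standard library, the function name is hypothetical, and it assumes `[Content_Types].xml` is stored in an ASCII-compatible encoding such as UTF-8. I have not verified how Word handles every template converted this way.

```python
import zipfile

TEMPLATE_CT = "application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml"
DOCUMENT_CT = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"


def convert_dotx_to_docx(src, dst, old=TEMPLATE_CT, new=DOCUMENT_CT):
    """Copy the package at src to dst, swapping the main content type.

    `src` and `dst` may be paths or binary file-like objects.
    Every other entry in the archive is copied unchanged.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "[Content_Types].xml":
                # Byte-level replace; assumes an ASCII-compatible encoding.
                data = data.replace(old.encode("ascii"), new.encode("ascii"))
            zout.writestr(item, data)
```

Swapping the `old` and `new` arguments gives the reverse direction (docx to dotx), under the same assumptions.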

Python dotx Conversion - Conclusion

While it isn't a huge deal to convert from dotx to docx, that code snippet is making my life easier for now.

It is still not quite ready for release, but here is a screenshot of the output for my newEngagement script.

[Screenshot: Python Dotx Conversion - Demo Script]

Let me know if you have any ideas or suggestions before I release it! Note that it is still geared towards my specific uses, but is easily modifiable.

doyler
Ray Doyle is an avid pentester/security enthusiast/beer connoisseur who has worked in IT for almost 16 years now. From building machines and the software on them, to breaking into them and tearing it all down; he's done it all. To show for it, he has obtained an OSCP, eCPPT, eWPT, eWPTX, eMAPT, Security+, ICAgile CP, ITIL v3 Foundation, and even a sabermetrics certification!

He currently serves as a Senior Penetration Testing Consultant for Secureworks. His previous position was a Senior Penetration Tester for a major financial institution.

When he's not figuring out what cert to get next (currently GXPN) or side project to work on, he enjoys playing video games, traveling, and watching sports.

11 Responses to Python dotx Conversion to docx for Automated Documents

  1. Hunter

    I tried this except I tried to convert a .docx to a .dotx. Not working for me (corrupts file) – do you know if there’s a way I can get it to work?

    • Hi Hunter,

      As far as the other direction, you’d have to reverse the content type from main to application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml

  2. Hunter

    Has this update been released? Trying to convert .dotx and getting the error

    ValueError: file ‘test-outpufft.dotx’ is not a Word file, content type is ‘application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml’

    • No, there have been no updates released for the library itself.

      You’ll have to manually update the two files yourself like I mentioned in the post. Let me know if that makes sense or if you have any questions/issues.

  3. Hunter

    I’m getting an error in api.py:

    def _default_docx_path():
    ^
    IndentationError: expected an indented block

    I’m thinking it’s because of the if statement at the bottom of your first block of code. Here’s what I have in my api.py file:

    def Document(docx=None):
        """
        Return a |Document| object loaded from *docx*, where *docx* can be
        either a path to a ``.docx`` file (a string) or a file-like object. If
        *docx* is missing or ``None``, the built-in default document "template"
        is loaded.
        """
        docx = _default_docx_path() if docx is None else docx
        document_part = Package.open(docx).main_document_part
        if document_part.content_type != CT.WML_DOCUMENT_MAIN:
            tmpl = "file '%s' is not a Word file, content type is '%s'"
            raise ValueError(tmpl % (docx, document_part.content_type))
        return document_part.document
        if document_part.content_type not in [CT.WML_DOCUMENT_MAIN, 'application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml', 'application/vnd.ms-word.document.macroEnabled.main+xml']:

    def _default_docx_path():
        """
        Return the path to the built-in default .docx package.
        """
        _thisdir = os.path.split(__file__)[0]
        return os.path.join(_thisdir, 'templates', 'default.docx')

    What am I doing wrong here? Thanks

    • If you are copying and pasting directly from my post, then make sure the spacing/lines are ending up correct.

      That line that begins with if should encompass EVERYTHING until the colon. It looks like you have a spacing/indentation issue somewhere in your code.

  4. Hunter

    It does. The whole if statement up until and including the colon is on one line. The issue is that there’s nothing inside the if statement, so when it gets to def _default_docx_path():, it sees that it’s not indented (it expects an indentation since we just did if […]:). Is there something that’s supposed to be in the if statement?

    • The if statement should be what was modified, and the body of that statement should stay the same in the original file.

      I’ve only posted my modifications, not the file in its entirety.
