Download Images from HTML - Including WordPress Posts

I recently needed to download every image referenced in some HTML, and I wanted to share the script that I wrote to do it.

Download Images from HTML – Introduction

I had some WordPress posts that I wanted to back up, and I wanted to make sure that I had the images as well as the text.

This was a simple script, but it solved a problem that I had run into recently. That said, I could see myself using this or an updated version for some OSINT or phishing in the future.

This will be another short script like my last post, but I’ve been automating some small problems as of late!

Parsing the HTML

First, I tried to just use regex to parse out any img src attributes from my flat text file.

As I should have expected, and as this Stack Overflow answer warns, this proved to be annoying at best.

Instead, I ended up using Beautiful Soup to parse the file and extract the links that I wanted. Then, I downloaded the images and saved them locally.

Note that I was using a flat text file with multiple WordPress posts, but this process should work fine for any HTML document that you are interacting with.
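As a rough illustration of why the regex route gets annoying, the sketch below runs a naive pattern and a real HTML parser over the same markup. The sample markup, the example.com URLs, and the ImgSrcCollector class are made up for this example. It also uses the standard library's HTMLParser as a stand-in for Beautiful Soup so that it runs without third-party packages; the script in this post uses Beautiful Soup itself.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup; quoting and letter case vary on purpose.
SAMPLE = """
<p>First post</p>
<img class="wp-image-1" src="https://example.com/uploads/IMG_0001-300x400.jpg">
<img src='https://example.com/uploads/IMG_0002.jpg' alt='single quotes'>
<IMG ALT="upper case" SRC="https://example.com/uploads/IMG_0003.jpg">
"""

# A naive pattern that only expects lowercase tags and double-quoted values.
naive = re.findall(r'<img[^>]*src="([^"]+)"', SAMPLE)


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


collector = ImgSrcCollector()
collector.feed(SAMPLE)

print(len(naive), "image(s) found by the naive regex")
print(len(collector.sources), "image(s) found by the parser")
```

The regex could be patched to cope with single quotes and upper case, but each patch tends to uncover another edge case, which is why handing the job to a parser is usually less painful.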

The Code

You can find the code for this process below. Note that there is also a small regex step in the middle of the script. It rewrites each link so that the script downloads the original image file from WordPress, as opposed to a resized version. Links that do not end in what looks like a resolution (e.g., -300x400.jpg) pass through unchanged, so you should not see any adverse effects from this section.

Also, I’m sure that RecViking is proud of me for finally using some regex again.
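Before the full script, here is a minimal sketch of just that regex step in isolation. The URLs are hypothetical, the original_url helper is a name invented for this example, and the pattern assumes that a resized file always has at least one digit on each side of the "x".

```python
import re

# Matches the "-WIDTHxHEIGHT" suffix that appears on resized JPEG filenames.
RESIZED_SUFFIX = re.compile(r"-\d+x\d+\.jpg")


def original_url(url):
    """Return url with any resolution suffix removed (hypothetical helper)."""
    return RESIZED_SUFFIX.sub(".jpg", url)


# Hypothetical URLs for illustration only.
samples = [
    "https://example.com/wp-content/uploads/IMG_1119-300x400.jpg",
    "https://example.com/wp-content/uploads/IMG_1120.jpg",
    "https://example.com/wp-content/uploads/IMG_2151-e1468181469199.jpg",
]

for url in samples:
    print(original_url(url))
```

Only the first URL changes. The third one keeps its "-e1468181469199" part because that suffix does not look like a resolution.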

import re
import shutil

import bs4
import requests

# Resized WordPress images end in a resolution, e.g. IMG_1119-300x400.jpg
resized = re.compile(r"-\d+x\d+\.jpg")

# Read the saved posts and collect the src attribute of every img tag
# https://stackoverflow.com/questions/18042661/using-bs4-to-extract-text-in-html-files
with open('posts.txt', encoding='utf-8') as postsFile:
    soup = bs4.BeautifulSoup(postsFile.read(), "html.parser")
links = [node['src'] for node in soup.find_all('img', src=True)]

# Strip the resolution so that we request the original image instead;
# links without a resolution are left unchanged
# https://docs.python.org/3.4/library/re.html
origLinks = [resized.sub(".jpg", link) for link in links]

# https://www.dev2qa.com/how-to-download-image-file-from-url-use-python-requests-or-wget-module/
for link in origLinks:
    # Name the local file after the last segment of the URL
    filename = link.rsplit('/', 1)[-1]
    if not filename:
        continue

    resp = requests.get(link, stream=True)
    if not resp.ok:
        # Skip failed requests rather than saving an error page as an image
        continue

    # Decode any gzip/deflate transfer encoding while streaming to disk
    resp.raw.decode_content = True
    with open(filename, 'wb') as localFile:
        shutil.copyfileobj(resp.raw, localFile)

When I executed this script, it downloaded all my images as expected!

root@kali:~/Documents/imgExtract# ls
imgExtract.py  posts.txt
root@kali:~/Documents/imgExtract# python imgExtract.py 
root@kali:~/Documents/imgExtract# ls *.jpg
13177978_10206776407554475_5913650144102677662_n.jpg  IMG_1951.jpg
IMG_1119.jpg                                          IMG_1952.jpg
IMG_1120.jpg                                          IMG_1953.jpg
IMG_1136.jpg                                          IMG_1954.jpg
IMG_1195.jpg                                          IMG_2151-e1468181469199.jpg
IMG_1237.jpg                                          IMG_2160.jpg
IMG_1238.jpg                                          IMG_2161-e1468181539638.jpg
IMG_1245.jpg                                          IMG_2163.jpg
IMG_1247.jpg                                          IMG_2210-e1468181566583.jpg
IMG_1270.jpg                                          IMG_2418-e1468183375745.jpg
IMG_1271.jpg                                          IMG_2426.jpg
IMG_1934.jpg                                          IMG_2427.jpg
IMG_1936.jpg                                          IMG_2429-e1468183409990.jpg
IMG_1948.jpg
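The script names each local file after whatever follows the last slash in the link, which worked for every image above. If a link ever carried a query string, that text would end up in the filename. The sketch below is one hedged way to handle that case with the standard library; the filename_from_url helper, its default value, and the example.com URLs are all made up for illustration and are not part of the script.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, ignoring query and fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to a placeholder when the path ends in a slash.
    return name or default


# Hypothetical URLs for illustration only.
print(filename_from_url("https://example.com/uploads/IMG_1951.jpg?w=300"))
print(filename_from_url("https://example.com/uploads/"))
```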

As usual, you can find the code and any updates in my GitHub repository.

Please feel free to submit pull requests if you use this for anything else, especially anything offensive-related.

Download Images from HTML – Conclusion

This was a simpler script, but it solved a problem that I was having.

I’m not sure if it qualifies as a “security” tool, but I could see a few uses for it here and there.

I also found some older EverSec challenges, so I’m hoping to go through them soon and see if any are worth blogging about.

Ray Doyle

Ray Doyle is an avid pentester/security enthusiast/beer connoisseur who has worked in IT for almost 16 years now. From building machines and the software on them, to breaking into them and tearing it all down, he’s done it all. To show for it, he has obtained an OSCE, OSCP, eCPPT, GXPN, eWPT, eWPTX, SLAE, eMAPT, Security+, ICAgile CP, ITIL v3 Foundation, and even a sabermetrics certification!

He currently serves as a Senior Staff Adversarial Engineer for Avalara, and his previous position was a Principal Penetration Testing Consultant for Secureworks.

