Scraping pdf, doc, and docx with Scrapy

INGREDIENTS

For this project you will need Python 2.7 or greater. Beyond that you'll need Scrapy, Textract, poppler-utils, and antiword.

Since Scrapy and Textract are written in Python, I installed them with pip. The other two I installed with sudo apt install poppler-utils and sudo apt install antiword, respectively. In case you were wondering, Textract uses "poppler-utils" for scraping pdf documents and "antiword" for doc files. It uses a package called "docx2txt" for docx files, but installing Textract will pull this in automatically.
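Before going further, it can help to confirm that the two command-line tools are actually on your PATH. The following is a minimal sketch and not part of the original project: the helper name find_missing_tools is made up for illustration, it assumes Python 3.3 or later (for shutil.which), and it assumes the executables to look for are pdftotext (shipped in poppler-utils) and antiword.

```python
import shutil

# Executables Textract is assumed to shell out to:
# "pdftotext" comes from poppler-utils, "antiword" from the antiword package.
REQUIRED_TOOLS = ("pdftotext", "antiword")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All external tools found.")
```

If anything is reported missing, install it with apt as described above before running the spider.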

For the purposes of this demo, we'll also need something to scrape, so I've created a page with links to three binary documents - one for each of our desired document types:

https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html

We'll use the URL of that page below.

LAYING THE GROUNDWORK

Once you've got Scrapy installed, make a nice empty directory to work in, cd into it, and invoke scrapy startproject followed by the name of your project. Let's call it "scrapy_demo":


(my-venv)$ scrapy startproject scrapy_demo

This will create a directory called "scrapy_demo", which will contain a config file ("scrapy.cfg") and another "scrapy_demo" directory. You shouldn't have to do anything with the config file, so let's go into the inner "scrapy_demo" directory:


(my-venv)$ cd scrapy_demo/scrapy_demo

GETTING MORE INTERESTING

Our scraper is going to need a custom "link extractor". As the Scrapy documentation succinctly puts it: "Link Extractors are objects whose only purpose is to extract links from web pages..." Basically, when the scraper encounters a link to another document (an <a> tag), it passes it to a LinkExtractor object, which will decide whether to follow that link based on the document's extension. There are many different extensions that Scrapy knows about, and it is not interested in most of them. We can see this by instantiating a default LinkExtractor and examining its deny_extensions attribute:


(my-venv)$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor()
>>> le.deny_extensions
set(['.asf', '.odg', '.wma', '.rm', '.ra', '.rar', '.odt', '.jpg', '.odp', '.ods', '.pct', '.docx', '.tif', '.jpeg', '.ps', '.zip', '.gif', '.asx', '.mpg', '.pps', '.bin', '.doc', '.svg', '.mid', '.ppt', '.mov', '.eps', '.qt', '.aac', '.avi', '.bmp', '.3gp', '.psp', '.xls', '.drw', '.png', '.rss', '.m4a', '.pst', '.pdf', '.ai', '.css', '.wav', '.m4v', '.mng', '.pptx', '.dxf', '.exe', '.swf', '.au', '.mp3', '.aiff', '.ogg', '.wmv', '.xlsx', '.tiff', '.mp4'])
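To make the filtering idea concrete, here is a rough sketch of an extension check in plain Python 3. It is not Scrapy's actual implementation; the function name is_denied and the sample DENIED set (a small subset of the extensions shown above) are made up for illustration.

```python
import posixpath
from urllib.parse import urlparse

# A small, hand-picked subset of the extensions shown above (illustrative only).
DENIED = {".pdf", ".doc", ".docx", ".zip", ".jpg"}


def is_denied(url, denied=DENIED):
    """Return True if the path part of `url` ends in a denied extension."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in denied


print(is_denied("https://example.com/files/report.pdf"))   # True
print(is_denied("https://example.com/about.html"))         # False
```

A link whose extension is in the deny set is simply never followed, which is why the default extractor never hands our binary documents to the spider.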

Most of those extensions don't interest us either, but the list above includes ".pdf", ".doc", and ".docx", and we want to allow those. Let's start creating our custom spider now. From the "scrapy_demo" directory that we navigated to above, open a file in the "spiders" directory (which was also created by scrapy startproject). Let's call it, oh I don't know, "itsy_bitsy.py".


(my-venv)$ vim spiders/itsy_bitsy.py

Put some soon-to-be-needed imports and constants at the top of the file, and then we'll define our custom LinkExtractor, which will handle documents with the extensions we want to deal with.


import re
import textract
from itertools import chain
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tempfile import NamedTemporaryFile

control_chars = ''.join(map(chr, chain(range(0, 9), range(11, 32), range(127, 160))))
CONTROL_CHAR_RE = re.compile('[%s]' % re.escape(control_chars))
TEXTRACT_EXTENSIONS = [".pdf", ".doc", ".docx", ""]

class CustomLinkExtractor(LinkExtractor):

    def __init__(self, *args, **kwargs):
        super(CustomLinkExtractor, self).__init__(*args, **kwargs)
        # Keep the default values in "deny_extensions" *except* for those types we want.
        self.deny_extensions = [ext for ext in self.deny_extensions if ext not in TEXTRACT_EXTENSIONS]

Below that, let's define our custom spider. We'll set the start_urls to the test page mentioned above. We also override the class's __init__() method to make use of our CustomLinkExtractor.


class ItsyBitsySpider(CrawlSpider):
    name = "itsy_bitsy"
    start_urls = [
        'https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html'
    ]

    def __init__(self, *args, **kwargs):
        self.rules = (Rule(CustomLinkExtractor(), follow=True, callback="parse_item"),)
        super(ItsyBitsySpider, self).__init__(*args, **kwargs)

Notice that our __init__() method mentions a callback method called parse_item(), which we will now have to define. The body of this method defines how content retrieved by Scrapy will be processed. For example, under normal circumstances you might send the scraped content to an Elasticsearch index, as we do for our own search app. For the purposes of this demonstration, however, we're just going to write any content scraped from one of our binary documents to a file, and ignore everything else.

Add the parse_item() method right under the __init__() method.


    def parse_item(self, response):
        if hasattr(response, "text"):
            # The response is text - we assume html. Normally we'd do something
            # with this, but this demo is just about binary content, so...
            pass
        else:
            # We assume the response is binary data
            # One-liner for testing if "response.url" ends with any of TEXTRACT_EXTENSIONS
            extension = list(filter(lambda x: response.url.lower().endswith(x), TEXTRACT_EXTENSIONS))[0]
            if extension:
                # This is a pdf or something else that Textract can process
                # Create a temporary file with the correct extension.
                tempfile = NamedTemporaryFile(suffix=extension)
                tempfile.write(response.body)
                tempfile.flush()
                extracted_data = textract.process(tempfile.name)
                extracted_data = extracted_data.decode('utf-8')
                extracted_data = CONTROL_CHAR_RE.sub('', extracted_data)
                tempfile.close()

                with open("scraped_content.txt", "a") as f:
                    f.write(response.url.upper())
                    f.write("\n")
                    f.write(extracted_data)
                    f.write("\n\n")

Here's what's happening: If the response object has a "text" attribute, we continue without dealing with it. Of course, during a typical scrape of a website, most responses will be HTML pages and they will have a "text" attribute. In that case, you probably wouldn't just throw those responses away! But that's what we're doing here because we're only interested in the binary types.

So moving on, we check the extension of this binary document. The one-liner puts the matching extension into the variable extension if the URL ends with one of the entries in our TEXTRACT_EXTENSIONS list. The empty string at the end of that list matches every URL, so for any other file type extension ends up as "" and the if block is skipped. Note that we are using the filename extension as the sole determiner of the file type. If we run across a pdf that for some reason has been given a ".doc" extension, the process won't work. Making the determination in this way is naive, but as we will see, Textract makes its decisions in the same way.
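If the one-liner feels cryptic, the same check can be written as a small helper. This is only a sketch in Python 3: the name match_extension is made up for illustration, the list here omits the empty-string entry because next() supplies the default, and looking only at the URL path (so that a query string does not hide the extension) is an addition of mine rather than something the spider above does.

```python
from urllib.parse import urlparse

TEXTRACT_EXTENSIONS = [".pdf", ".doc", ".docx"]


def match_extension(url, extensions=TEXTRACT_EXTENSIONS):
    """Return the first entry of `extensions` that the URL path ends with, or ""."""
    # Look only at the path so that a query string such as "?download=1"
    # does not hide the extension.
    path = urlparse(url).path.lower()
    return next((ext for ext in extensions if path.endswith(ext)), "")


print(match_extension("https://example.com/docs/Report.PDF"))      # .pdf
print(match_extension("https://example.com/letter.doc?download=1"))  # .doc
```

For a URL with no supported extension the helper returns the empty string, which is falsy, so an `if extension:` test behaves the same way as in parse_item().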

The next part feels slightly inefficient, as there is - at the time of writing - no way to stream data to Textract other than by creating a physical file in the file system. Furthermore, this file must have the correct extension if Textract is going to handle it properly. We leverage Python's NamedTemporaryFile class to create this file with minimum fuss, and write the response body - the binary content of the document - into it.
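The temporary-file step can be tried on its own, without Scrapy or Textract. The sketch below uses made-up names (with_temp_copy, describe) and a stand-in handler in place of textract.process(); it also assumes a POSIX-like system, since on Windows a NamedTemporaryFile generally cannot be reopened by name while it is still open.

```python
import os
from tempfile import NamedTemporaryFile


def with_temp_copy(body, extension, handler):
    """Write `body` to a temp file ending in `extension` and return handler(path)."""
    with NamedTemporaryFile(suffix=extension) as tmp:
        tmp.write(body)
        tmp.flush()  # make sure the bytes are on disk before anyone reads them
        return handler(tmp.name)
    # Leaving the "with" block closes the file, which also deletes it.


def describe(path):
    """Stand-in for textract.process(): report the file name and its size."""
    return path, os.path.getsize(path)


path, size = with_temp_copy(b"not really a pdf", ".pdf", describe)
print(path.endswith(".pdf"), size)   # True 16
print(os.path.exists(path))          # False
```

Using the file as a context manager has the side benefit that the temporary file is removed even if the handler raises.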

We then point Textract at our temporary file and let it do its thing. Data extracted from binary documents will contain control characters that determine formatting and display, but that do not contain human-readable content. We don't want these characters in our scrape, so we strip them out with our CONTROL_CHAR_RE regular expression. Afterwards, we call close() on the temporary file, which causes it to be deleted. And that's pretty much it!
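To see what the regular expression actually removes, here is a small standalone demonstration in Python 3. The pattern is the same one defined at the top of the spider; the sample string is made up.

```python
import re
from itertools import chain

# Same definition as in the spider: ASCII and C1 control characters,
# except tab (9) and newline (10), which are left alone.
control_chars = ''.join(map(chr, chain(range(0, 9), range(11, 32), range(127, 160))))
CONTROL_CHAR_RE = re.compile('[%s]' % re.escape(control_chars))

sample = "Hello\x00 wor\x0cld\n\tok\x7f"   # made-up input with stray control characters
cleaned = CONTROL_CHAR_RE.sub('', sample)
print(repr(cleaned))   # 'Hello world\n\tok'
```

Note that carriage returns (13) fall inside the stripped range, so Windows-style line endings are reduced to plain newlines.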

Now, still within the inner scrapy_demo directory, you can try to run your code with:


(my-venv)$ scrapy crawl itsy_bitsy

Afterwards, the scraped content of the binary documents should be written to "scraped_content.txt".

Some final notes: in a real project you would of course want to abstract some of this code away for unit testing, add a try/except block around the call to textract.process(), and do something with the output other than writing it to a file, but by and large this is how we handled the problem.
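As one possible shape for that try/except, here is a hedged sketch. The helper name safe_extract is made up, the extractor is passed in as a parameter so the function can be exercised without Textract installed, and catching Exception is deliberately broad; in a real project you would probably narrow it to the specific errors you expect. Decoding with errors="replace" is also my choice here, not what the spider above does.

```python
import logging

logger = logging.getLogger(__name__)


def safe_extract(path, extractor):
    """Call extractor(path) and return the decoded text, or None if it fails."""
    try:
        raw = extractor(path)
    except Exception:  # deliberately broad for this sketch; narrow it in real code
        logger.exception("Could not extract text from %s", path)
        return None
    # The extractor is assumed to return bytes, as textract.process() does above.
    return raw.decode("utf-8", errors="replace")
```

In the spider, this would be called as safe_extract(tempfile.name, textract.process), and parse_item() would skip the write to "scraped_content.txt" whenever the result is None.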

I hope this explanation has been helpful. Happy coding!
