Scraping pdf, doc, and docx with Scrapy
This was a big project, and all of our developers worked on it in some capacity. One aspect of it that I worked on was the problem of scraping the contents of pdf, doc, and docx files that were linked from a client's website. If a user searched for "wakalixes", it was important that any pdfs/docs/docxs containing that term be reported in the search results, alongside any relevant HTML pages.
Our scraper was based on the excellent open-source Scrapy web crawler. Scrapy does a lot, but it does not natively support scraping the content of these binary document types. This blog entry describes how to extend Scrapy to do this.
For this project you will need Python 2.7 or greater. Beyond that you'll need Scrapy, Textract, poppler-utils, and antiword.
Since Scrapy and Textract are written in Python, I installed them with pip. The other two I installed with "sudo apt install poppler-utils" and "sudo apt install antiword", respectively. In case you were wondering, Textract uses "poppler-utils" for scraping pdf documents and "antiword" for doc files. It uses a package called "docx2txt" for docx files, but installing Textract will pull this in automatically.
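Before going further, it can help to confirm that the two command-line tools are actually on your PATH. The following is a minimal sketch, assuming Python 3.3 or later (for shutil.which) and assuming the executables are named "pdftotext" (shipped with poppler-utils) and "antiword". The helper name missing_tools is hypothetical and not part of Scrapy or Textract.

```python
import shutil

# Command-line programs Textract shells out to
# (pdftotext is one of the programs in the poppler-utils package).
REQUIRED_TOOLS = ["pdftotext", "antiword"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Prints the list of missing programs, or "none" if everything was found.
print("Missing tools:", missing_tools() or "none")
```

If anything is reported missing, install the corresponding package before running the spider.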
For the purposes of this demo, we'll also need something to scrape, so I've created a page with links to three binary documents - one for each of our desired document types: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html
We'll use the URL of that page below.
LAYING THE GROUNDWORK
Once you've got Scrapy installed, make a nice empty directory to work in, cd into it, and invoke "scrapy startproject" followed by the name of your project. Let's call it "scrapy_demo":
(my-venv)$ scrapy startproject scrapy_demo
This will create a directory called "scrapy_demo", which will contain a config file ("scrapy.cfg") and another "scrapy_demo" directory. You shouldn't have to do anything with the config file, so let's go into the inner "scrapy_demo" directory:
(my-venv)$ cd scrapy_demo/scrapy_demo
GETTING MORE INTERESTING
Our scraper is going to need a custom "link extractor". As Scrapy succinctly puts it in their own documentation: "Link Extractors are objects whose only purpose is to extract links from web pages..." Basically, when the scraper encounters a link to another document (an <a> tag), it passes it to a LinkExtractor object, which will decide whether to follow that link based on the document's extension. There are many different extensions that Scrapy knows about, and it is not interested in most of them. We can see this by instantiating a default LinkExtractor and examining its deny_extensions attribute:
(my-venv)$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor()
>>> le.deny_extensions
set(['.asf', '.odg', '.wma', '.rm', '.ra', '.rar', '.odt', '.jpg', '.odp',
     '.ods', '.pct', '.docx', '.tif', '.jpeg', '.ps', '.zip', '.gif', '.asx',
     '.mpg', '.pps', '.bin', '.doc', '.svg', '.mid', '.ppt', '.mov', '.eps',
     '.qt', '.aac', '.avi', '.bmp', '.3gp', '.psp', '.xls', '.drw', '.png',
     '.rss', '.m4a', '.pst', '.pdf', '.ai', '.css', '.wav', '.m4v', '.mng',
     '.pptx', '.dxf', '.exe', '.swf', '.au', '.mp3', '.aiff', '.ogg', '.wmv',
     '.xlsx', '.tiff', '.mp4'])
Most of those extensions don't interest us either, but the list above includes ".pdf", ".doc", and ".docx", and we want to allow those. Let's start creating our custom spider now. From the "scrapy_demo" directory that we navigated to above, open a file in the "spiders" directory (which was also created by "scrapy startproject"). Let's call it, oh I don't know, "itsy_bitsy.py":
(my-venv)$ vim spiders/itsy_bitsy.py
Put some soon-to-be-needed imports and constants at the top of the file, and then we'll define our custom LinkExtractor, which will handle documents with the extensions we want to deal with.
import re
import textract
from itertools import chain
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tempfile import NamedTemporaryFile

# Control characters to strip from extracted text (tab and newline are kept).
control_chars = ''.join(map(chr, chain(range(0, 9), range(11, 32), range(127, 160))))
CONTROL_CHAR_RE = re.compile('[%s]' % re.escape(control_chars))

# Extensions of the document types that Textract will handle for us.
TEXTRACT_EXTENSIONS = [".pdf", ".doc", ".docx"]


class CustomLinkExtractor(LinkExtractor):
    def __init__(self, *args, **kwargs):
        super(CustomLinkExtractor, self).__init__(*args, **kwargs)
        # Keep the default values in "deny_extensions" *except* for those types we want.
        self.deny_extensions = [ext for ext in self.deny_extensions if ext not in TEXTRACT_EXTENSIONS]
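To see what the filtering in __init__() does in isolation, here is a small sketch that applies the same kind of comprehension to a hand-picked sample of extensions. The sample set and the variable names are assumptions made for illustration. Scrapy's real default list is much longer, as the interpreter session above shows.

```python
# A hand-picked sample of denied extensions (assumption: abbreviated
# for illustration; Scrapy's real default set is much longer).
sample_deny_extensions = {".jpg", ".zip", ".pdf", ".doc", ".docx", ".mp3"}

# The document types we want the crawler to follow.
wanted_extensions = [".pdf", ".doc", ".docx"]

# Keep every denied extension except the ones we want.
remaining = {ext for ext in sample_deny_extensions if ext not in wanted_extensions}

print(sorted(remaining))
```

Only the extensions we did not ask for remain denied, so links to pdf, doc, and docx files are no longer filtered out.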
Below the link extractor, let's define our custom spider. We'll set the start_urls to the test page mentioned above. We also override the class's __init__() method to make use of our CustomLinkExtractor:
class ItsyBitsySpider(CrawlSpider):
    name = "itsy_bitsy"
    start_urls = [
        'https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html'
    ]

    def __init__(self, *args, **kwargs):
        self.rules = (Rule(CustomLinkExtractor(), follow=True, callback="parse_item"),)
        super(ItsyBitsySpider, self).__init__(*args, **kwargs)
Notice that our __init__() method mentions a callback method called parse_item(), which we will now have to define. The body of this method defines how content retrieved by Scrapy will be processed. For example, under normal circumstances you might send the scraped content to an Elasticsearch index, as we do for our own search app. For the purposes of this demonstration, however, we're just going to write any content scraped from one of our binary documents to a file, and ignore everything else.
Let's add the parse_item() method right under the __init__() method:
    def parse_item(self, response):
        if hasattr(response, "text"):
            # The response is text - we assume html. Normally we'd do something
            # with this, but this demo is just about binary content, so...
            pass
        else:
            # We assume the response is binary data
            # One-liner for testing if "response.url" ends with any of TEXTRACT_EXTENSIONS
            # (gives the matching extension as a string, or None if nothing matches)
            extension = next((ext for ext in TEXTRACT_EXTENSIONS if ext and response.url.lower().endswith(ext)), None)
            if extension:
                # This is a pdf or something else that Textract can process
                # Create a temporary file with the correct extension.
                tempfile = NamedTemporaryFile(suffix=extension)
                tempfile.write(response.body)
                tempfile.flush()
                extracted_data = textract.process(tempfile.name)
                extracted_data = extracted_data.decode('utf-8')
                extracted_data = CONTROL_CHAR_RE.sub('', extracted_data)
                tempfile.close()

                with open("scraped_content.txt", "a") as f:
                    f.write(response.url.upper())
                    f.write("\n")
                    f.write(extracted_data)
                    f.write("\n\n")
Here's what's happening: If the response object has a "text" attribute, we continue without dealing with it. Of course, during a typical scrape of a website, most responses will be HTML pages and they will have a "text" attribute. In that case, you probably wouldn't just throw those responses away! But that's what we're doing here because we're only interested in the binary types.
So moving on, we check the extension of this binary document. This one-liner will put it into the variable extension if it appears in our TEXTRACT_EXTENSIONS list. Note that we are using the filename extension as the sole determiner of the file type. If we run across a pdf that for some reason has been given a ".doc" extension, the process won't work. Making the determination in this way is naive, but as we will see, Textract makes its decisions in the same way.
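One limitation of testing the end of the whole URL is that a link such as "report.pdf?download=1" would not match. As a hedged alternative, the sketch below parses the URL first and looks only at the extension of its path. It still relies on the filename extension alone, and the helper name textract_extension and the example URLs are hypothetical, not part of the spider described here.

```python
import posixpath

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

TEXTRACT_EXTENSIONS = [".pdf", ".doc", ".docx"]


def textract_extension(url):
    """Return the URL's file extension if it is one we handle, else None."""
    # Look only at the path, so query strings and fragments are ignored.
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext if ext in TEXTRACT_EXTENSIONS else None


print(textract_extension("https://example.com/files/Report.PDF?download=1"))
print(textract_extension("https://example.com/index.html"))
```

The returned string could be passed straight to NamedTemporaryFile as the suffix.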
The next part feels slightly inefficient, as there is - at the time of writing - no way to stream data to Textract other than by creating a physical file in the file system. Furthermore, this file must have the correct extension if Textract is going to handle it properly. We leverage Python's NamedTemporaryFile class to create this file with minimum fuss, and write the response body - the binary content of the document - into it.
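The temporary-file handling can be tried on its own, without Scrapy or Textract. In this minimal sketch the bytes in fake_body are an invented stand-in for response.body, and the variable names are assumptions for illustration.

```python
import os
from tempfile import NamedTemporaryFile

fake_body = b"%PDF-1.4 placeholder bytes"  # stand-in for response.body

tmp = NamedTemporaryFile(suffix=".pdf")
tmp.write(fake_body)
tmp.flush()  # make sure the bytes are on disk before another program reads them

tmp_path = tmp.name
has_suffix = tmp_path.endswith(".pdf")
existed_while_open = os.path.exists(tmp_path)

tmp.close()  # closing deletes the file, because delete=True is the default
exists_after_close = os.path.exists(tmp_path)

print(has_suffix, existed_while_open, exists_after_close)
```

The call to flush() matters here: without it, some of the data may still be sitting in Python's write buffer when an external program opens the file by name.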
We then point Textract at our temporary file and let it do its thing. Data extracted from binary documents may contain control characters that determine formatting and display, but that do not contain human-readable content. We don't want these characters in our scrape, so we strip them out with our CONTROL_CHAR_RE regular expression. Afterwards, we call close() on the temporary file, which causes it to be deleted. And that's pretty much it!
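To make the effect of the regular expression concrete, here is a short sketch that applies it to a made-up string. The pattern is built the same way as in the spider. The sample text is invented for illustration, and the sketch assumes Python 3 string handling.

```python
import re
from itertools import chain

# Same construction as in the spider: character codes 0-8, 11-31, and 127-159.
# Tab (9) and newline (10) are deliberately left out, so they survive.
control_chars = "".join(map(chr, chain(range(0, 9), range(11, 32), range(127, 160))))
CONTROL_CHAR_RE = re.compile("[%s]" % re.escape(control_chars))

# Invented sample containing a form feed, a NUL, and a bell character.
sample = "Page 1\x0c\x00Hello,\tworld\x07!\nSecond line"
cleaned = CONTROL_CHAR_RE.sub("", sample)

print(repr(cleaned))
```

The form feed, NUL, and bell characters are removed, while the tab and the newline are kept.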
Now, still within the inner "scrapy_demo" directory, you can try to run your code with:
(my-venv)$ scrapy crawl itsy_bitsy
Afterwards, the scraped content of the binary documents should be written to "scraped_content.txt" in the current directory.
Some final notes: in a real project you would of course want to abstract some of this code away for unit testing, add a try/except block around the call to textract.process(), and do something with the output other than writing it to a file, but by and large this is how we handled the problem.
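As one possible shape for that try/except, the sketch below wraps the extraction step in a small helper that takes the extractor as an argument, which also makes it easy to unit test with a fake. The helper name safe_extract is hypothetical, and catching a broad Exception is an assumption made for brevity. In a real project you would likely catch the specific errors you expect.

```python
import logging

logger = logging.getLogger(__name__)


def safe_extract(path, extractor):
    """Run extractor(path); return the decoded text, or None if extraction fails."""
    try:
        raw = extractor(path)
    except Exception:  # assumption: treat any extractor failure as "no content"
        logger.exception("Could not extract text from %s", path)
        return None
    # The extractor is expected to return bytes; undecodable bytes are replaced.
    return raw.decode("utf-8", "replace")


# Usage with a fake extractor, so no real document is needed:
print(safe_extract("example.pdf", lambda path: b"hello from a fake extractor"))
```

In the spider, the call would look something like safe_extract(tempfile.name, textract.process), with the rest of parse_item() skipping the document when None comes back.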
I hope this explanation has been helpful. Happy coding!