New Blog Post: How to keep all your websites in sync with scraping technology

Technology Blog

The importance of randomness in online security

I recently visited random.org, a generator of random numbers that offers both free and fee-based services. It has been around since 1998. The visit made me revisit the concept of randomness and how oddly hard it is to achieve. Why is randomness important? It's amazing how many things rely…

Implementing dependency management with Python Poetry

One of the biggest issues with Python (our preferred development language) is dependency management.  Any individual deployment can result in minor differences in the versions of the files and libraries which make up the application.  And these minor variations introduce uncertainty and randomness into our deployment process. So, we've investigated…

Healthcare content too complex, a study

A new study accepted for publication in the British Journal of General Practice asserts that the majority of general practice websites contain content well above the recommended reading level for online content. The study analyzed 3,823 pages of content scraped from 813 Scottish general practice websites.  Analysis showed that 2,942 pages…

Migrate Away from cmsplugin-filer in a Few Easy Steps

Companion code for this post: https://github.com/ImaginaryLandscape/deprecate_cmsplugin_filer If you've been building projects using django CMS for any length of time, chances are you're familiar with Divio's cmsplugin-filer application, which provided image, link, file, folder and video plugins for interacting with django-filer. And if you're here, chances are you're aware that cmsplugin-filer has now...

Tuning Site Search for Covid-19

The recent spike in usage of the term covid-19 introduced some inconsistent results in Imaginary’s custom site search engine, iScraper, a tool that uses Elasticsearch as its indexing engine. We’ll take a closer look at these results and show how we corrected the problem with an Elasticsearch configuration change. iScraper...

Website Search using Django and PostgreSQL Trigrams

Over the years I've become increasingly wary of the word "easy" in software documentation. Pick a software project at random, and there's a good chance the documentation will lead off with something like "Booloogent makes the process of frobnifying your wakalixes easy!" And then you try to use the package...

The absurdity of the fight against accessible websites

2019 was an interesting year in website accessibility.  The simple and elegant premise of accessibility has been pushed aside by aggressive law firms, miserly corporations and apathetic regulatory agencies. In October the Supreme Court decided not to hear the appeal of Guillermo Robles v. Domino’s Pizza LLC.  The background is...

Multipage Forms in Django

Most online forms fit on a single page. Think of a "join our forum" or "contact us" form into which the user enters a name, email address, and maybe a few other pieces of information. If you're building this kind of functionality into a Django site, you can take...

Django shell_plus with Pandas and Jupyter Notebook

The Django shell provides an environment where developers can interact with the database via Django's ORM. While the shell is great for basic interactions, it quickly becomes laborious to work in, whether because of manual imports, scrolling through shell history to repeat commands, or viewing queries that return more than, say, 20 records. These issues, and more, can be remedied by interacting with the ORM in Jupyter notebooks, using Pandas.
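
As a rough illustration of the idea, the sketch below loads ORM-style records into a Pandas DataFrame for sorting and inspection. It is not taken from the post itself: the `rows` list is hypothetical data standing in for what a Django `values()` queryset would yield (an iterable of dicts), and `to_frame` is a helper name invented for this example.

```python
import pandas as pd

# Hypothetical rows standing in for the dicts that a Django
# `SomeModel.objects.values("id", "title", "views")` call would yield.
rows = [
    {"id": 1, "title": "First post", "views": 120},
    {"id": 2, "title": "Second post", "views": 340},
    {"id": 3, "title": "Third post", "views": 95},
]


def to_frame(records):
    """Build a DataFrame from an iterable of dicts, one dict per record."""
    return pd.DataFrame.from_records(list(records))


df = to_frame(rows)

# Sort and slice the whole result set at once instead of paging through
# shell output a screen at a time.
top = df.sort_values("views", ascending=False).head(2)
print(top)
```

In a notebook, the final expression of a cell renders the DataFrame as a table, which is where the approach pays off for larger result sets.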

Step-by-Step Guide to Setting up Django-RQ

Modern websites are complicated pieces of machinery. The requests they initiate can be intensive and take more time to complete than a typical HTTP request-response cycle. This is typically referred to as a blocking request, in that the browser (and user) waits until the response is received. If the request...

Scraping pdf, doc, and docx with Scrapy

In February 2017, Google announced its plans to discontinue its Google Site Search product. Those clients of Imaginary Landscape who had relied on Google to provide their users with a search engine service for their website looked to us for a new solution. Finding no obvious equivalent replacement, we decided to create our own website scraper and accompanying search app.

The case for a Django upgrade

It boils down to this. An upgrade costs money, sometimes a lot of money, but the result has no visible outcome. In fact, in many cases the only outcome is an assurance that you've reduced the probability of attack, intrusion, breach and related unpleasantness. By any measure, that's a tough...