Dynamically generating meta tags
April 16, 2010
<meta name="description" content="">
<meta name="keywords" content="">
I see the above empty meta tags on Web pages with disturbing frequency. If a page is going to bother having a placeholder for such things, it seems prudent to actually put something there.
The usefulness of the description and keywords meta tags is hotly contested and often speculative. Since search engine algorithms are protected as zealously as the formula for Coca-Cola, the search engine optimization (SEO) industry is left to infer the utility of meta tags. Perhaps certain "lesser known" search engines make use of them, some say. Every once in a while a luminary from the "inside" of a search engine company will hint about their place in search, but overall, it remains unclear.
What is clear is the following:
- Description meta tags have more utility than keyword meta tags.
- Keyword meta tags have little utility.
Given the possibility that meta tags might be somewhat useful, it follows that they should be used, right? After all, they can’t hurt, can they?
Creating a custom description and keyword list for every Web page on a site takes significant effort. If you're fast, you might get the process down to 3 minutes per page. That works out to 20 pages per hour. Now count how many pages you have and do the math: 500 pages equal 25 hours of heads-down, mind-numbing work, assuming you can maintain the pace.
You can begin to understand why the majority of Web pages have empty tags.
Our content management system, along with most others, has a field for description meta tags and a field for keyword meta tags. We suggest, cajole, recommend, educate and nudge clients to fill them out. No one argues about the importance of doing it, yet no one does it.
We recently addressed this issue and created a dynamic way to generate these tags. The challenge we faced was how to automatically create page-specific custom meta tags from existing content in a way that made sense and provided acceptable output. Here's our solution.
For description, we decided to use the first sentence within the body of the page content. That would provide a high degree of readability and a decent chance of relevance.
This, however, was easier said than done. Isolating the first sentence in the body proved elusive. Granted, it was somewhat easier within the context of a CMS, but troublesome nonetheless.
Here's the logic we settled upon:
- Ignore pages with existing meta content. This follows our Web Site Hippocratic Oath, "Do no harm."
- Find the first sentence. Here we look for the first <p> tag and grab everything after that up to the first sentence-ending punctuation mark (.?!). Content can be quirky sometimes and we ran into a number of pages absent an opening <p>. In these cases, we looked for a semaphore or other text pattern to determine the starting location of body content. Sometimes there were double break tags. In the case where server-side includes were inserted before body content, we used the common character pattern within the include as an anchor point.
- Refine the output. We exclude markup and calls so that our results don't include content between opening <> and closing < /> markup tags. This excludes headers as well.
- Review the output. We ran the operation and dumped the output into a spreadsheet to see how we did. It resulted in a 74% success rate (where success was measured by the fetching of a complete sentence).
- Accept the success rate. A 74% success rate also translates into a 26% failure rate. Why? Lots of reasons, it turns out. For example:
- Content doesn't always start with a strong opening sentence. Think about news releases where the first sentence of each release might read "for immediate release."
- Alternative use of punctuation. A period does not always mark the end of a complete sentence. We had several pages with output like:
To ratchet the success rate higher would mean layering in exception after exception. This solution was too complicated and the effort didn’t merit the payoff.
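The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code running in our CMS. The function names (`first_sentence`, `description_for`, `meta_tag`) are hypothetical, it assumes the page body is available as an HTML string, and it only handles the simple case of an opening <p> tag (no semaphores, double break tags, or include anchors). It strips only the tags themselves.

```python
import re
from html import escape, unescape
from typing import Optional


def first_sentence(body_html: str) -> Optional[str]:
    """Return the text from the first <p> up to the first '.', '?' or '!'."""
    # Start right after the first opening <p> tag; if there is none,
    # fall back to the start of the body (an assumption of this sketch).
    start = re.search(r"<p\b[^>]*>", body_html, flags=re.IGNORECASE)
    text = body_html[start.end():] if start else body_html

    # Drop markup, decode entities, and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", unescape(text)).strip()

    # Cut at the first sentence-ending punctuation mark.
    end = re.search(r"[.?!]", text)
    return text[:end.end()] if end else None


def description_for(existing: str, body_html: str) -> Optional[str]:
    """Do no harm: keep an existing description, otherwise generate one."""
    if existing and existing.strip():
        return existing
    return first_sentence(body_html)


def meta_tag(name: str, content: str) -> str:
    """Render a meta tag with the content safely escaped for an attribute."""
    return f'<meta name="{escape(name)}" content="{escape(content)}">'


if __name__ == "__main__":
    body = '<h1>Title</h1><p class="x">Our <b>CMS</b> builds meta tags. It saves time.</p>'
    print(meta_tag("description", description_for("", body) or ""))
```

A cut this naive also shows why the failure rate is hard to push down: given a hypothetical body such as "<p>Dr. Smith joined the staff.</p>", it stops at the period in the abbreviation.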
For keywords, as with description, the key is to isolate the body content of the page. However, instead of just the first sentence, we start with all the body content and refine it as follows:
- Remove markup and calls (as we did with description).
- Remove duplicate words.
- Remove stopwords. These are common words such as and, if, but, etc. We used an aggressive list of stopwords provided by our search engine software and removed them from the results.
The resulting keywords are single words rather than multiword strings. But, given that keywords are less important than descriptions, it didn't make sense to pour a lot of effort into figuring out that puzzle.
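The keyword refinement can be sketched in the same spirit. Again, this is a minimal Python illustration under stated assumptions: the function name `keywords` is hypothetical, and the tiny `STOPWORDS` set is a stand-in for the much more aggressive list supplied by our search engine software, which is not reproduced here.

```python
import re
from html import unescape
from typing import Iterable, List

# Hypothetical, deliberately short stopword list for illustration only.
STOPWORDS = frozenset(
    {"a", "an", "and", "but", "for", "if", "in", "is", "it", "of", "on", "the", "to", "with"}
)


def keywords(body_html: str, stopwords: Iterable[str] = STOPWORDS) -> List[str]:
    """Return unique, lowercased, non-stopword words in order of first appearance."""
    stop = set(stopwords)

    # Remove markup, decode entities, and normalize case.
    text = unescape(re.sub(r"<[^>]+>", " ", body_html)).lower()

    # Split into simple word tokens (letters and digits only in this sketch).
    words = re.findall(r"[a-z0-9]+", text)

    seen = set()
    result = []
    for word in words:
        # Skip stopwords and any word already collected (duplicate removal).
        if word in stop or word in seen:
            continue
        seen.add(word)
        result.append(word)
    return result


if __name__ == "__main__":
    body = "<p>The CMS and the search engine: CMS search!</p>"
    print(", ".join(keywords(body)))
```

Keeping the words in order of first appearance is a choice of this sketch, made so the output is repeatable from run to run.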
And so, there it is: our solution for dynamically generating meta tags. Push a button and 74% of your site's description/keyword combinations are created and customized for each page. Not bad for a day's work.