So you and your team have been talking about your new social web app. Development has been going well, and someone had the bright idea: “Hey, we should automatically populate some information from the links our users share. Maybe pull in an image or two, too!”
Now you have to build a web scraper. How hard could it be?
Don’t you worry friend. We’ll do it together.
(Mess around with the actual web scraper I built live!)
Make the Client Do the Work
Web scraping is not terribly resource intensive, but maybe you’re dreaming big and want to do a live scrape for every link anytime one of your users looks at it. No problem, add it to the client code and keep the extra load off your server!
Yeah . . . no. Unfortunately, because of browsers’ same-origin policy, the client can’t fetch HTML from anywhere but your own server (or other servers that have explicitly whitelisted it ahead of time). To avoid using your own server, you’re basically going to have to hook into some proxy service, which really isn’t helping anything.
So let’s just build the scraper on our own server.
Your Tools
You’re going to want to head over to your console and npm install request and cheerio.
Request is the standard for server-side HTTP requests. It makes it dead simple to send GET, POST, or whatever requests to wherever you want. Even better, install request-promise instead of request to get all that request goodness with some clean and clear Bluebird promises baked in.
Meanwhile, Cheerio is a server-side implementation of jQuery. You are going to be parsing through some HTML in a second, and you are definitely going to want access to jQuery syntax when you do it.
Fetching the HTML
Using request-promise to get the HTML you want couldn’t be simpler.
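Here’s a minimal sketch (the URL is just a stand-in, and scrape is the parsing function we’ll write next):

```javascript
const rp = require('request-promise');

// GET the page; rp() returns a Bluebird promise that resolves
// straight to the response body, i.e. the raw HTML.
rp('https://www.example.com')
  .then(html => scrape(html))
  .catch(err => console.error('Could not fetch the page:', err));
```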
That’s it. So now that we have HTML, what do we do with it? What goes into that little scrape function there?
Cheerio And You
Cheerio requires a little bit of extra setup. After requiring it normally, you use the load method to create a jQuery-like $ object with all of the HTML you’re planning to scrape. So, if your goal was just to return the title of a webpage, your code can be just a few lines.
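For example, a sketch using nothing but Cheerio itself:

```javascript
const cheerio = require('cheerio');

// The scrape function from above: load the HTML into a jQuery-like
// $ object, then read the page title with a familiar selector.
function scrape(html) {
  const $ = cheerio.load(html);
  return $('title').text();
}
```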
Once properly loaded, Cheerio works identically to jQuery, giving you access to all the same selectors and methods you would normally have on the front end. So now that you have all of this power, what do you do with it?
meta Is So Meta
There’s a good chance that whatever information you want to scrape can be found in a site’s metadata. With the rise of social networks, most webpages are hoping to be shared, liked, tweeted, upvoted, pinned, or übermensched. Enter the meta tag. Designed to contain information about a site’s title, subject matter, authorship, and more, these tags are left in the head of a page for enterprising social media gurus like yourself (and Facebook, mostly Facebook) to find. For example, here’s the kind of HTML you’ll find on a Udemy course page.
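A representative sample (the content values are placeholders, and the udemy_com tag name is illustrative, not Udemy’s actual markup):

```html
<!-- Illustrative only: real pages mix og:, twitter:, and custom tags -->
<meta property="og:title" content="...">
<meta property="og:description" content="...">
<meta property="og:image" content="...">
<meta name="twitter:title" content="...">
<meta name="twitter:description" content="...">
<meta name="udemy_com:category" content="...">
```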
Well hello, metadata. So what is all of this? Well, there isn’t yet any standard metadata system, and since meta tags are basically roll-your-own, most social networks have done just that. The og tags are used by Facebook, the twitter tags by Twitter. Udemy has even built some of their own custom udemy_com tags. Most sites will try to cover all of their bases and include a hodge-podge of tags, many redundant, just to make sure they don’t spoil their chances of being the next viral sensation.
In all of these cases you’ll use attr('content') to get the information you need from the meta tag, but the selector is a little more complicated. Standard class or id selectors obviously won’t work, which is why you’ll want jQuery’s attribute selector. For the Twitter title, for example, that would look like $('meta[name="twitter:title"]'). For the OG title, it would be $('meta[property="og:title"]'). Armed with these tools, we could easily write a simple metadata scraper.
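A first pass might be a sketch along these lines, leaning entirely on Facebook’s Open Graph tags:

```javascript
const cheerio = require('cheerio');

// A naive metadata scraper: it assumes Open Graph tags are present
// and quietly returns undefined for any that are missing.
function scrape(html) {
  const $ = cheerio.load(html);
  return {
    title: $('meta[property="og:title"]').attr('content'),
    description: $('meta[property="og:description"]').attr('content'),
    image: $('meta[property="og:image"]').attr('content'),
  };
}
```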
This code will work assuming every site always uses Facebook metadata, which is not as reliable as you might think. And what if, heaven forbid, they don’t use meta tags at all? If your needs are simple, you may be fine with a scraper that only works most of the time, but if you’re planning on building something really robust, you’ll need a system.
Keep It Organized With JSON
What we want is an array of possible tags we can have Cheerio cycle through until it finds a match. This is exactly the sort of data we can store out of the way in a separate JSON file. For example, if we wanted to design a slightly more robust scraper for title and description, the JSON file might look something like the sketch below.
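One possible shape (the structure is an assumption, not a standard): each key maps to an ordered list of options to try, most preferred first.

```json
{
  "title": [
    { "selector": "meta[property='og:title']", "attribute": "content" },
    { "selector": "meta[name='twitter:title']", "attribute": "content" },
    { "selector": "title" }
  ],
  "description": [
    { "selector": "meta[property='og:description']", "attribute": "content" },
    { "selector": "meta[name='twitter:description']", "attribute": "content" },
    { "selector": "meta[name='description']", "attribute": "content" }
  ]
}
```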
With that in place, we can write a simple function to loop through the specified options until it finds something.
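A sketch of how that loop might go (scrapeField and tags.json are names I’m assuming, not anything standard):

```javascript
const cheerio = require('cheerio');
const tags = require('./tags.json'); // the option lists from above

// Work through the options for one field, most preferred first,
// returning the first one that produces any content.
function scrapeField($, field) {
  for (const option of tags[field]) {
    const el = $(option.selector);
    // Read the named attribute if one is specified, otherwise the
    // element's text (so a bare { "selector": "title" } works too).
    const content = option.attribute ? el.attr(option.attribute) : el.text();
    if (content) return content;
  }
  return null;
}

function scrape(html) {
  const $ = cheerio.load(html);
  return {
    title: scrapeField($, 'title'),
    description: scrapeField($, 'description'),
  };
}
```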
The big advantage of this architecture is that as you discover new pages with weird tags, adding them is as simple as adding a new object to your JSON file. Of course, if you ever want to scrape something other than meta tags, your code will have to get a fair amount more complicated (heaven help you if you start scraping text), but I’m sure you can hash out all those details for yourself.