What follows is a high-level analysis of screen-scraping / web-scraping strategies and frameworks for Go, current as of July 2019.
State of Affairs
Web scraping spans a very broad range of activity: from archiving content, search-engine indexing, spiders and crawlers, and ETL (extract, transform, load) workflows, through the parsing of public JSON, RSS, and XML feeds and HTML pages, to sophisticated bots and machine-learning protocols that emulate a human with a web browser, as well as acceptance-testing and QA (quality assurance) workflows.
There are consequently a large number of tools that support scraping, most of which target or specialise in a subset of these workflows. The available tools sit on a spectrum: from low-level libraries that target developers and coders (written in a variety of languages), to browser extensions that automate interactions with the DOM, to SaaS platforms that let a user design visual workflows without writing a line of code.
In this analysis I’m going to focus on generic strategies that can be fully scripted / automated in a Linux server environment, and the libraries and frameworks that target developers. These strategies apply broadly across all languages as well, although I’ll be limiting the discussion of libraries to the Golang ecosystem.
Browserless Strategy
This strategy consists of making HTTP requests directly and then parsing and processing the responses. It is a simple strategy with almost no extra overhead, save for a library to assist with DOM parsing or with staging complex sets of nested, iterative, or asynchronous requests. Cookies, sessions, headers, caching, and simple authentication can all be handled well with this strategy.
With 7,456 stars, this is the de-facto library for jQuery-like DOM parsing, and most other scraping libraries depend on it. It is relatively full-featured and supports most of the standard selectors. It started in 2012 and has remained active, with very few open issues thanks to its maturity and focus.
With 8,142 stars, this is one of the top libraries for screen scraping in Go. It is essentially a wrapper around goquery and the standard http.Client, with abstractions built in to facilitate things like cookie/session handling and making nested and parallel requests.
It is great for building things like spiders and crawlers, but it is of course subject to all the limitations of the browserless strategy, and it may begin to work against you when more focused, synchronous work needs to be done against highly complex websites. For some workflows you may find it better to ditch the library and use http.Client directly with goquery.
Selenium WebDriver Strategy
Selenium is a mature set of tools for web browser automation. It works across many platforms and browsers, and can be controlled from many different languages and testing frameworks. It is supported by the major browser vendors and is the core technology behind many browser-automation tools, APIs, and frameworks.
The upside to this strategy is that it provides robust scraping capabilities, is the de-facto industry standard, and is well supported. The downside is that it relies on a rather complex technology stack: in addition to Selenium itself (a Java application/server which can be run locally or remotely), you will need an actual web browser, a driver for Selenium to control that browser, and a Selenium WebDriver client for Go to talk to the Selenium server. In this strategy WebDriver is essentially a protocol for interfacing with Selenium, and Selenium is the middleman between your application and the browser (and in our case we want the browser to be headless, meaning it doesn't need a GUI).
As far as the web browser and driver go, you have a few options. In the past, using a scriptable headless browser like PhantomJS was quite common, but development on PhantomJS has been discontinued. That suits us fine, because the major browser vendors have since added official headless support that works with Selenium WebDriver.
Go also has a few clients available for interfacing with Selenium.
This is currently the most popular Selenium WebDriver client for Go, at 756 stars. It does not appear very active, and there are over 60 open issues, many of which may be deal breakers, for example: "Impossible download files in chromedriver headless mode". Documentation and examples are mediocre.
This is the second most popular Selenium WebDriver client for Go, at 632 stars. It appears a little more active than the previous one, with only 35 open issues, and may not have as many deal breakers. It is oriented more towards acceptance testing but can easily be used as a direct API to Selenium. Excellent documentation and examples.
Pure Headless-Browser Strategy
A little background on Brankas: they are based in Southeast Asia and use this library as a core part of their product, scraping bank-account and insurance data from various institutions. They started out with a Selenium WebDriver strategy but developed this library to solve some of that strategy's problems and to optimize the scraping process for their kind of usage. It would be a good choice for things like scraping bank accounts and insurance data, and a good project to contribute to. They gave a great presentation about the library at GopherCon Singapore 2017.
Average popularity at 908 stars, and it hasn't been updated for nearly a year, with quite a few open issues. This one also uses its own virtual browser engine. It is well documented and has a nice Go API to work with.
Quite small at 138 stars. Also, PhantomJS is discontinued, so this option will fall by the wayside for most, but it remains an option if you really need or want PhantomJS.