What follows is a high-level analysis of screen scraping / web scraping strategies and frameworks for golang, current as of July 2019.

State of Affairs

Web scraping spans a very broad range of activity: archiving content, search engine indexing, spiders and crawlers, ETL (extract, transform, load) workflows, parsing public JSON, RSS, and XML feeds and HTML pages, sophisticated bots and machine learning systems that emulate a human using a web browser, and acceptance testing and QA (quality assurance) workflows.

Consequently, there are a large number of tools that support scraping, most of which target or specialise in a subset of workflows. The available tools sit on a spectrum, from low-level libraries aimed at developers (written in a variety of languages), to browser extensions that automate interactions with the DOM, to SaaS platforms that let a user design visual workflows without writing a line of code.

In this analysis I’m going to focus on generic strategies that can be fully scripted / automated in a Linux server environment, and the libraries and frameworks that target developers. These strategies apply broadly across all languages as well, although I’ll be limiting the discussion of libraries to the Golang ecosystem.

Browserless Strategy

This strategy consists of making HTTP requests directly and then parsing and processing the responses. It is a simple strategy with almost no extra overhead, save for a library to assist with DOM parsing or with staging complex sets of nested, iterative, or asynchronous requests. Cookies, sessions, headers, caching, and simple authentication can all be handled well with this strategy.
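As a rough illustration of the browserless approach, the sketch below uses only the Go standard library: an http.Client with a cookie jar, a custom header, and two requests where the second depends on a session cookie set by the first. The local test server, the /login and /data paths, the cookie value, and the User-Agent string are all made up for the example and stand in for a real site.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"time"
)

// newClient returns an http.Client that remembers cookies between requests.
func newClient() (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	return &http.Client{Jar: jar, Timeout: 10 * time.Second}, nil
}

// fetch issues a GET with a custom User-Agent and returns the response body.
func fetch(c *http.Client, url string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", "example-scraper/0.1") // hypothetical value
	resp, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET %s: %s", url, resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demo runs against a throwaway local server (a stand-in for a real site):
// /login sets a session cookie, /data is only served when it comes back.
func demo() (string, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc123", Path: "/"})
		fmt.Fprint(w, "logged in")
	})
	mux.HandleFunc("/data", func(w http.ResponseWriter, r *http.Request) {
		if c, err := r.Cookie("session"); err != nil || c.Value != "abc123" {
			http.Error(w, "no session", http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, "private data for "+r.UserAgent())
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	client, err := newClient()
	if err != nil {
		return "", err
	}
	if _, err := fetch(client, srv.URL+"/login"); err != nil {
		return "", err
	}
	return fetch(client, srv.URL+"/data")
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Against a real site you would drop the test server and point fetch at the actual URLs; the cookie jar and header handling stay the same.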

The tasks best suited for this strategy include the parsing of various feeds and HTML data that is relatively well structured or designed to be machine readable. This strategy works best with publicly accessible data and can start to fall down when complex authentication schemes are required. It fails hard on pages that rely heavily on JavaScript and SPA-like (single page app) elements and offer no API (if there were an API you could just use that directly).
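Machine-readable feeds are the easy case. As a minimal sketch, an RSS document can be decoded into structs with the standard encoding/xml package. The feed content below is invented for the example, and only the title and link fields are mapped.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// item holds the fields we care about from each <item> element.
type item struct {
	Title string `xml:"title"`
	Link  string `xml:"link"`
}

// rss mirrors just enough of the RSS 2.0 layout to reach the items.
type rss struct {
	Channel struct {
		Title string `xml:"title"`
		Items []item `xml:"item"`
	} `xml:"channel"`
}

// sampleFeed is made-up data standing in for a fetched response body.
const sampleFeed = `<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>`

// parseFeed decodes an RSS document and returns its items.
func parseFeed(data []byte) ([]item, error) {
	var doc rss
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	return doc.Channel.Items, nil
}

func main() {
	items, err := parseFeed([]byte(sampleFeed))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, it := range items {
		fmt.Println(it.Title, it.Link)
	}
}
```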

goquery

github.com/PuerkitoBio/goquery

With 7,456 stars, this is the de facto library for jQuery-like DOM parsing, and most other scraping libraries depend on it. It is relatively full featured and supports most of the standard selectors. It started in 2012 and has remained active, with very few open issues due to its maturity and focus.

colly

github.com/gocolly/colly

With 8,142 stars, this is one of the top libraries for screen scraping in Go. It is essentially a wrapper built around goquery and the standard http.Client, with some abstractions built in to facilitate things like cookie/session handling and making nested and parallel requests.

It is great for building things like spiders and crawlers but is subject of course to all the limitations of the browserless strategy and may begin to work against you when more focused and synchronous work needs to be done for highly complex websites. For some workflows you may find it better to ditch this library and just use http.Client directly with goquery.

Selenium WebDriver Strategy

Selenium is a mature set of tools used for web browser automation. It is usable across many platforms and browsers, and can be controlled by many different languages and testing frameworks. It is supported by major browser vendors and is the core technology used in many browser automation tools, APIs and frameworks.

Selenium has been used primarily for acceptance testing and quality assurance workflows but is highly suited to scraping because it overcomes the primary limitations of the browserless strategy with the WebDriver API, which enables you to drive a browser natively as if you were a user. This makes tasks like clicking javascript links, submitting forms, and navigating SPA-like pages possible.

The upside to this strategy is that it provides robust scraping capabilities, is the de facto industry standard, and is well supported. The downside is that it relies on a rather complex technology stack. In addition to Selenium itself (a Java application/server that can be used locally or remotely), you will need an actual web browser and a driver that lets Selenium control it, as well as a Selenium WebDriver client for Go to talk to the Selenium server. In this strategy WebDriver is essentially a protocol to interface with Selenium, and Selenium is the middleman between your application and the browser (in our case we want the browser to be headless, meaning it doesn’t need a GUI).
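To make the "WebDriver is a protocol" point concrete, here is a rough sketch of the JSON-over-HTTP exchange a client performs, written with only the standard library. The endpoint paths and payload shapes follow the W3C WebDriver specification (new session, navigate, get title, delete session). The fakeDriver server is a stand-in invented for this example so the code runs without a browser; with a real Selenium server the base URL would typically be something like http://localhost:4444/wd/hub, which is an assumption about your setup. A client library such as the ones below wraps these calls for you.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// call sends one WebDriver command and decodes the "value" field of the
// JSON reply into out (decoding is skipped when out is nil).
func call(method, url string, body, out interface{}) error {
	var buf bytes.Buffer
	if body != nil {
		if err := json.NewEncoder(&buf).Encode(body); err != nil {
			return err
		}
	}
	req, err := http.NewRequest(method, url, &buf)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s %s: %s", method, url, resp.Status)
	}
	var reply struct {
		Value json.RawMessage `json:"value"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&reply); err != nil {
		return err
	}
	if out == nil {
		return nil
	}
	return json.Unmarshal(reply.Value, out)
}

// pageTitle opens a headless Chrome session, loads pageURL, reads the
// document title, and closes the session again.
func pageTitle(base, pageURL string) (string, error) {
	caps := map[string]interface{}{
		"capabilities": map[string]interface{}{
			"alwaysMatch": map[string]interface{}{
				"browserName":        "chrome",
				"goog:chromeOptions": map[string]interface{}{"args": []string{"--headless"}},
			},
		},
	}
	var session struct {
		ID string `json:"sessionId"`
	}
	if err := call("POST", base+"/session", caps, &session); err != nil {
		return "", err
	}
	s := base + "/session/" + session.ID
	defer call("DELETE", s, nil, nil) // always release the browser
	if err := call("POST", s+"/url", map[string]string{"url": pageURL}, nil); err != nil {
		return "", err
	}
	var title string
	err := call("GET", s+"/title", nil, &title)
	return title, err
}

// fakeDriver is a canned stand-in for a Selenium server (illustration only).
func fakeDriver() *httptest.Server {
	reply := func(v interface{}) http.HandlerFunc {
		return func(w http.ResponseWriter, r *http.Request) {
			json.NewEncoder(w).Encode(map[string]interface{}{"value": v})
		}
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/session", reply(map[string]string{"sessionId": "s1"}))
	mux.HandleFunc("/session/s1", reply(nil))
	mux.HandleFunc("/session/s1/url", reply(nil))
	mux.HandleFunc("/session/s1/title", reply("Example Domain"))
	return httptest.NewServer(mux)
}

func main() {
	srv := fakeDriver()
	defer srv.Close()
	title, err := pageTitle(srv.URL, "https://example.com/")
	fmt.Println(title, err)
}
```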

As far as the web browser and driver go, you have a few options. In the past, using a scriptable headless browser like PhantomJS was quite common, but development on PhantomJS has been discontinued. That suits us fine, because major browser vendors have started to provide official headless-browser support for Selenium WebDriver.

Go also has a few clients available for interfacing with Selenium.

selenium

github.com/tebeka/selenium

This is currently the most popular Selenium WebDriver client for Go, at 756 stars. It does not appear very active, and there are over 60 open issues, many of which may be deal breakers, for example: “Impossible download files in chromedriver headless mode”. Documentation and examples are mediocre.

agouti

github.com/sclevine/agouti

This is the second most popular Selenium WebDriver client for Go, at 632 stars. It appears a little more active than the previous one, with only 35 open issues, and may not have as many deal breakers. It is more oriented towards acceptance testing but can easily be used as a direct API to Selenium. Excellent documentation and examples.

Pure Headless-Browser Strategy

There is a smattering of Go libraries that function as a headless browser with little to no overhead. Many of them are small and not well maintained, but there is at least one gem among them. With this strategy you would ideally get most or all of the benefits of driving a browser without the overhead of Selenium. There is quite a spread here. Some of them use their own virtual browser (which is going to be limited in areas like JavaScript support) while others rely on PhantomJS or WebKitGTK+.

chromedp

github.com/chromedp/chromedp

This one is a really good find. Quite popular at 3,440 stars and very active. It was developed internally at Brankas and is maintained by them. Documentation is good with a lot of examples. This library sits directly on top of Chrome DevTools running Chrome (headless or not), which means that anything you could do in Chrome DevTools, such as waiting for a div to load, starting or stopping JavaScript execution, or setting breakpoints, you can now control directly from a clean Go API. It is written in idiomatic Go and they have also dockerized the dependencies.

A little background on Brankas: they are based in Southeast Asia and use this as a core part of their product for scraping bank account and insurance data from various institutions. They started out using a Selenium WebDriver strategy but developed this library to solve some problems with that strategy and to optimize the scraping process for their type of usage. It would be a good one to use for things like scraping bank accounts and insurance data, and a good project to contribute to. They gave a great presentation about the library at GopherCon Singapore 2017.

webloop

github.com/sourcegraph/webloop

Somewhat popular at 1,184 stars. Has not had any updates in nearly a year. Average documentation and features. Reliant on WebKitGTK+ and go-webkit2.

surf

github.com/headzoo/surf

Average popularity at 908 stars, and hasn’t been updated for nearly a year, with quite a few open issues. This one uses its own virtual browser engine. Well documented and it has a nice Go API to work with.

go-phantomjs

github.com/urturn/go-phantomjs

Quite small at 138 stars. Also, PhantomJS is discontinued. This option will fall by the wayside for most, but it remains available if you really need or want PhantomJS.