Distributed web scraping using Electron.js and Supabase edge functions

At the beginning of this year I started a new project that needed to scrape job listings from various online boards like LinkedIn, Indeed, and Glassdoor. Knowing the limitations of web scraping all too well, I had to find a way to do it without purchasing IP proxies or scraping APIs, since I was on a budget.

The solution I came up with was to distribute the web scraping part to every user's machine, but keep the data extraction logic centralized. In this article I want to explain how this works.

Scraping

Usually people reach for tools like Selenium, Puppeteer, or Playwright, which are essentially headless browsers. Why? Because most sites are JavaScript heavy, and if you only made an HTTP request and tried to parse the HTML response, you would probably see a blank page.

My approach was to build the app with Electron.js and leverage its bundled Chromium instance to do the web scraping locally on the user's device. This has several benefits that get around most anti-scraping guards sites have in place:

  • each user scrapes from their own IP address, so there is no need for proxies
  • Cloudflare bot protection is bypassed 99% of the time, since the requests look just like a user browsing the site with their regular Chrome or Firefox browser
  • it's basically free, since it doesn't require running any backend servers with headless browsers

And the most fun part is that it's dead simple to implement. This code snippet is all you need to download the HTML of a web page using Electron:

import { BrowserWindow } from 'electron'

// small helper to pause execution for a given number of milliseconds
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

const window = new BrowserWindow({
  // do not show the window to the user
  show: false,
  // set the window size
  width: 1600,
  height: 1200,
  webPreferences: {
    // disable the same origin policy
    webSecurity: false,
  },
})

const url = 'https://...'
await window.loadURL(url)

// give the window a bit of time to run the JavaScript code (mainly needed for single page apps)
await sleep(5_000)

const html: string = await window.webContents.executeJavaScript(
  'document.documentElement.innerHTML'
)

// clean up the hidden window once we have the HTML
window.destroy()

That's it ... mostly. Besides that, I also had to add some retries to cover loading failures, as well as handle rate limiting (429 HTTP response codes), but that's the gist of it.
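
To give a rough idea of that retry logic, here is a minimal sketch. The loadPageHtml helper, the attempt count, and the backoff delays are assumptions for illustration, not the exact code from my app:

// sketch: retry wrapper around the BrowserWindow-based loader shown above,
// where loadPageHtml(url) is assumed to return the HTML string or throw on failure
const loadPageWithRetries = async (url: string, maxAttempts = 3): Promise<string> => {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await loadPageHtml(url)
    } catch (error) {
      // give up after the last attempt
      if (attempt === maxAttempts) throw error
      // back off before retrying; rate-limited (429) pages benefit from a longer wait
      await sleep(attempt * 10_000)
    }
  }
  throw new Error(`failed to load ${url}`)
}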

Now that we have the HTML string of the page, it's time to extract the data we're interested in.

Data extraction

Building the HTML parsing logic into the Electron app wouldn't be a great idea, because sites can push updates that break it, and I would be forced to ship new app versions with fixed parsing logic. That means users would be left with a non-functional app while waiting for the update. Also, a lot of users don't install updates at all, so the app would stay broken for them.

A more stable approach is to send the HTML string to a backend that extracts the data and saves it in the database. Why? Because this way, whenever a site pushes an update that breaks the parser, I can easily fix it, deploy the server, and all users get the fixed version immediately, with no need to wait for an app update.
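
On the app side, sending the scraped HTML to the backend can be as simple as the following sketch. The function name 'parse-jobs' and the body shape are illustrative placeholders, not the actual names from my project:

import { createClient } from '@supabase/supabase-js'

// placeholders: use your project's URL and anon key
const supabase = createClient('https://<project>.supabase.co', '<anon-key>')

// send the scraped HTML to the backend for parsing and storage
const { data, error } = await supabase.functions.invoke('parse-jobs', {
  body: { html },
})
if (error) {
  console.error('failed to parse jobs', error)
}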

This sounds like a great fit for a serverless function. Fortunately, I was already using Supabase for user authentication, and they offer a serverless functions product too. They call it Edge Functions, and it's pretty much the same thing as AWS Lambda, with one small catch: it runs on Deno instead of Node.js. The main difference is that you need to either import Deno-specific packages or use a different syntax for importing npm packages. Other than that, the differences are negligible.
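
To make that concrete, here is what the two import styles look like in Deno (the cheerio import is only there to show the npm: specifier syntax, not something my function actually uses):

// importing a Deno-specific package from deno.land
import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts'

// importing an npm package via the npm: specifier
import * as cheerio from 'npm:cheerio'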

In Node.js I would normally use cheerio to parse an HTML string into a virtual DOM and extract data from it. But Deno has a better option: deno-dom-wasm. Yep, that's right, a JS DOM parser compiled to WebAssembly, which makes it lightning fast.

So I wrote a small edge function that parses the incoming HTML string and saves the results in the DB for the requesting user. At its heart, the code looks like this:

import { DOMParser, Element } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts'

// shape of the data we extract for each job listing
type ParsedJob = {
  title: string
  companyName: string
}

const document = new DOMParser().parseFromString(html, 'text/html')

const jobsList = document?.querySelector('.jobs-search__results-list')
if (!jobsList) {
  return []
}

const jobElements = Array.from(jobsList.querySelectorAll('li')) as Element[]
const jobs = jobElements
  .map((el): ParsedJob | null => {
    const title = el.querySelector('.base-search-card__title')?.textContent?.trim()
    if (!title) return null

    const companyName = el
      .querySelector('.base-search-card__subtitle')
      ?.querySelector('a')
      ?.textContent?.trim()
    if (!companyName) return null

    return {
      title,
      companyName,
    }
  })
  // drop listings we couldn't parse so only valid rows reach the DB
  .filter((job): job is ParsedJob => job !== null)

// save jobs in the DB
await supabaseClient.from('jobs').insert(jobs)
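
For context, that parsing code runs inside an edge function handler roughly like the sketch below. It is simplified: the parseJobs helper stands in for the parsing logic above, and the exact way the request body is read and the client is scoped to the caller (by forwarding their Authorization header) is an assumption, not my verbatim function:

import { createClient } from 'npm:@supabase/supabase-js'

Deno.serve(async (req) => {
  // the scraped page sent by the Electron app
  const { html } = await req.json()

  // client scoped to the requesting user by forwarding their auth header
  const supabaseClient = createClient(
    Deno.env.get('SUPABASE_URL')!,
    Deno.env.get('SUPABASE_ANON_KEY')!,
    { global: { headers: { Authorization: req.headers.get('Authorization')! } } }
  )

  // parseJobs is a hypothetical wrapper around the DOM parsing shown above
  const jobs = parseJobs(html)
  await supabaseClient.from('jobs').insert(jobs)

  return new Response(JSON.stringify({ inserted: jobs.length }), {
    headers: { 'Content-Type': 'application/json' },
  })
})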

Conclusion

This approach helped me keep the costs of the project really low (I only pay for Supabase hosting) by distributing the web scraping to every user. It probably won't work for most web scraping projects, but it worked like a charm for my use case: every user only needs to scrape their own job feed, so we don't have to scrape every job that exists on those sites.

One other major downside to this approach is that you're forced to distribute the app as a desktop installer rather than a website, which could scare away users who don't like installing apps.