Building a Background Web Scraper in ASP.NET Core With Hangfire and Puppeteer Sharp

One of my favorite things about .NET is its package ecosystem. The more than 374K public packages available on NuGet provide a vast and rich collection of libraries and technical possibilities to consider when composing solutions with .NET.

One idea I was curious to explore was the possibility of adding automated web scraping inside a .NET Web API by leveraging two popular packages: Hangfire and Puppeteer Sharp.

The spoiler is that it is possible, and I'll explain how I did it in this post.

The Plan

Make a proof of concept using Hangfire and Puppeteer Sharp to add a background web scraping function within an ASP.NET Web API to scrape data from a remote web page. We'll parse and store the scraped data in an SQL Server database and implement standard API endpoints to access it.

Hangfire provides a simple way to perform background tasks in a .NET application without requiring an extra service or process. In other words, the term background here means we can run some code outside of the context of the ASP.NET Core request pipeline - but still within the overall ASP.NET Core application.

Puppeteer Sharp is a .NET port of the popular Puppeteer Node.js library. It lets developers automate headless Chrome and is commonly used for scraping, testing, and generating PDFs and screenshots of web pages.

I'm a hockey fan, so I created a static page on this site listing the career stats of NHL superstar Alex Ovechkin. This page will serve as the demo's scraping target and data source.

Here's a sketch of the solution.


Getting Started

Here's a link to the GitHub repo for the project.

I created a new ASP.NET Core Web API project (.NET 8.0 at the time of writing) and then bootstrapped the .csproj file with the required dependencies.
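For reference, the dependency list boils down to a handful of PackageReference entries along these lines (the exact set and versions in the repo may differ; the floating versions below are only illustrative):

<ItemGroup>
  <!-- Background job scheduling and SQL Server job storage -->
  <PackageReference Include="Hangfire.AspNetCore" Version="1.8.*" />
  <PackageReference Include="Hangfire.SqlServer" Version="1.8.*" />
  <!-- Headless Chrome automation for the scraping itself -->
  <PackageReference Include="PuppeteerSharp" Version="*" />
  <!-- Lightweight data access -->
  <PackageReference Include="Dapper" Version="2.*" />
  <PackageReference Include="Microsoft.Data.SqlClient" Version="5.*" />
</ItemGroup>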

Hangfire Configuration

Hangfire requires minimal configuration to get going. A database is needed to store serialized information about your background jobs, so the biggest consideration is the type of database you'll provide. According to the docs, the storage mechanism has the proper abstractions to work with RDBMS and NoSQL solutions.

I'm using a single SQL Server database for this project, which will be shared between Hangfire and the Web API. I added the AppDb connection string along with the other configuration bits to the startup code in Program.cs.

// Add Hangfire services.
builder.Services.AddHangfire(configuration =>
    configuration
        .SetDataCompatibilityLevel(CompatibilityLevel.Version_180)
        .UseSimpleAssemblyNameTypeSerializer()
        .UseRecommendedSerializerSettings()
        .UseSqlServerStorage(builder.Configuration.GetConnectionString("AppDb"))
);

// Add the processing server as IHostedService.
builder.Services.AddHangfireServer();

...

// Enable the web dashboard tool.
app.UseHangfireDashboard(
    "/hangfire",
    new DashboardOptions() { DashboardTitle = "Hangfire Dashboard" }
);

Database Configuration

I'm running on Windows using a local SQL Server Developer server, but you could also use SQL Server Express.

The DatabaseContext class has the bootstrapping code to create the database and tables if they don't exist. This is called in the Program.cs startup code.

// ensure database and tables exist
using var scope = app.Services.CreateScope();
var databaseContext = scope.ServiceProvider.GetRequiredService<DatabaseContext>();
await databaseContext.Init();
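To give a sense of what Init() does, here's a stripped-down sketch using Dapper; the table definition is simplified, and the real class in the repo also creates the database itself first if it's missing:

using Dapper;
using Microsoft.Data.SqlClient;

public class DatabaseContext
{
    private readonly IConfiguration _configuration;

    public DatabaseContext(IConfiguration configuration) => _configuration = configuration;

    public async Task Init()
    {
        // Create the application tables if they don't already exist.
        // Only Players is shown here; SeasonStatistics follows the same pattern.
        using var connection = new SqlConnection(_configuration.GetConnectionString("AppDb"));
        await connection.ExecuteAsync(@"
            IF OBJECT_ID('Players', 'U') IS NULL
            CREATE TABLE Players (
                Id INT IDENTITY PRIMARY KEY,
                Name NVARCHAR(100) NOT NULL,
                HomeTown NVARCHAR(100) NULL
            );");
    }
}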

Depending on your environment, you may need to adjust the connection strings in appsettings.json.
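The relevant bit of my appsettings.json looks something like this (the server and security settings are just my local defaults; substitute your own):

{
  "ConnectionStrings": {
    "AppDb": "Server=localhost;Database=WebApiScraper;Integrated Security=true;TrustServerCertificate=true;"
  }
}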

Once you have some flavor of SQL Server running and configured, run the project to create the database. If everything is set up correctly, you should see a new WebApiScraper database. You'll see Hangfire's tables, which it creates automatically, along with the application's tables created by DatabaseContext.

Hangfire Dashboard

At this point, you should also be able to see the Hangfire dashboard.

When the project is running, try to hit localhost:xxxx/hangfire in your browser to access it. You must change xxxx to whatever local port your service runs on.

Data Model & Repositories

In addition to SQL Server, we need data access support, including a few basic model classes and repositories to help us save and fetch data from the DB.

Model Classes

I added a couple of simple model classes to represent a Player and SeasonStatistic which capture the data elements we'll scrape from the web page. These structures map to tables in the DB where their data is stored. The tables get created during the database setup mentioned above 👆.
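As a rough sketch, the two classes look something like this (the SeasonStatistic properties below are illustrative; the real ones mirror the columns of the stats table on the page):

public class Player
{
    public int Id { get; set; }
    public string Name { get; set; } = string.Empty;
    public string HomeTown { get; set; } = string.Empty;
}

public class SeasonStatistic
{
    public int Id { get; set; }
    public int PlayerId { get; set; }                     // FK back to the Player row
    public string Season { get; set; } = string.Empty;    // e.g. "2008-09"
    public string Team { get; set; } = string.Empty;
    public int GamesPlayed { get; set; }
    public int Goals { get; set; }
    public int Assists { get; set; }
    public int Points { get; set; }
}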

Repositories

The PlayerRepository and StatisticsRepository classes provide all the data access code to read and write to the database. They're built on Dapper, a popular, lightweight micro-ORM.
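A trimmed-down PlayerRepository gives the flavor of the Dapper usage; the class in the repo has more members and its connection handling may differ:

using Dapper;
using Microsoft.Data.SqlClient;

public class PlayerRepository
{
    private readonly string _connectionString;

    public PlayerRepository(IConfiguration configuration) =>
        _connectionString = configuration.GetConnectionString("AppDb")!;

    public async Task Create(Player player)
    {
        using var connection = new SqlConnection(_connectionString);
        await connection.ExecuteAsync(
            "INSERT INTO Players (Name, HomeTown) VALUES (@Name, @HomeTown)",
            player);
    }

    public async Task<Player?> GetById(int id)
    {
        using var connection = new SqlConnection(_connectionString);
        return await connection.QuerySingleOrDefaultAsync<Player>(
            "SELECT Id, Name, HomeTown FROM Players WHERE Id = @id",
            new { id });
    }
}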

Web Scraping With Puppeteer Sharp

Puppeteer Sharp is easy to work with and offers a simple API for controlling the browser and capturing web content. There are other tools that can be used for scraping in .NET, like the out-of-the-box HttpClient class paired with the HTML Agility Pack, or other browser automation tools like Playwright and Selenium.

The biggest constraint when using HttpClient versus browser-based tools is that it only provides the raw HTTP response content from the server. Therefore, no JavaScript gets executed, meaning no dynamic markup or content is available.

With the data layer plumbed in, we can add the scraping mechanism to fetch and parse the data from the web page.

The web scraping code is within the StatisticsCrawler class.

The CrawlPage() method executes the entire scrape operation.

First, it downloads a local instance of Chrome if one is not found on the machine.

var options = new LaunchOptions { Headless = true };
Console.WriteLine("Downloading chromium");
using var browserFetcher = new BrowserFetcher();
await browserFetcher.DownloadAsync();

Next, a headless browser instance is created, and the GoToAsync() method is called to navigate to the given URL.

Console.WriteLine($"Navigating {url}");
     await using var browser = await Puppeteer.LaunchAsync(options);
     await using var page = await browser.NewPageAsync();

     await page.GoToAsync(url);

After the navigation is complete, the page object provides some APIs for examining and manipulating the page. I'm using the QuerySelectorAsync() method to extract the player's name and hometown from the page's DOM elements.

var nameElement = await page.QuerySelectorAsync(
"body > div > div:nth-child(1) > div > div > div > div.col-md-8 > div > h5"
);

var homeTownElement = await page.QuerySelectorAsync(
    "body > div > div:nth-child(1) > div > div > div > div.col-md-8 > div > table > tbody > tr:nth-child(2) > td"
);

var name = await GetInnerText(nameElement);
var homeTown = await GetInnerText(homeTownElement);

...
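GetInnerText() is a small private helper on the crawler. Something along these lines should be close, though the null handling and trimming are my assumptions:

private static async Task<string> GetInnerText(IElementHandle? element)
{
    // The selector may not have matched anything.
    if (element is null)
    {
        return string.Empty;
    }

    // Run a tiny JS function against the element handle to read its innerText.
    return await element.EvaluateFunctionAsync<string>("el => el.innerText.trim()");
}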

Shortcut for Generating Selectors

The selectors in the previous snippet to find the name and hometown elements are lengthy and relative to the page's body. I didn't write those by hand but was able to quickly generate them using the browser's developer tools.

Simply inspect each element with the browser's developer tools and then choose Copy selector from its context menu.

You can use any method to write or generate these, but I've found this approach useful for quickly obtaining the exact selector query or something very close to it that I can use to locate elements within the page.

Saving Data

As the data is fetched from the web page and parsed, the PlayerRepository and StatisticsRepository classes are called to save it to the database.

Console.WriteLine($"Attempting to create {name} in the database.");
await _playerRepository.Create(new Player() { Name = name, HomeTown = homeTown });
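The season rows follow the same pattern. Conceptually it's something like the sketch below, although the table selector, column order, and repository method name here are assumptions rather than the exact code from the repo (playerId is assumed to be the ID of the player row created above):

// Grab every row of the stats table and map its cells to a SeasonStatistic.
var rows = await page.QuerySelectorAllAsync("table tbody tr"); // illustrative selector

foreach (var row in rows)
{
    var cells = await row.QuerySelectorAllAsync("td");

    var stat = new SeasonStatistic
    {
        PlayerId = playerId,
        Season = await GetInnerText(cells[0]),
        Team = await GetInnerText(cells[1]),
        GamesPlayed = int.Parse(await GetInnerText(cells[2])),
        Goals = int.Parse(await GetInnerText(cells[3])),
        Assists = int.Parse(await GetInnerText(cells[4])),
        Points = int.Parse(await GetInnerText(cells[5]))
    };

    await _statisticsRepository.Create(stat);
}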

Web Scraping in the Background

The last part of tying everything together is scheduling the scraping job with Hangfire so it can run the CrawlPage() operation to fetch and save the data.

Hangfire provides several different ways to register background methods. There are fire-and-forget jobs, scheduled jobs that run at a specified time, and recurring tasks that can be scheduled using a cron expression to run on a regular basis.
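For comparison, the fire-and-forget and delayed flavors look like this (a quick sketch reusing the same crawler class):

// Fire-and-forget: queued immediately and executed once by the Hangfire server.
BackgroundJob.Enqueue<StatisticsCrawler>(x =>
    x.CrawlPage("https://fullstackmark.com/static/posts/29/hockey-stats.html"));

// Delayed: executed once after the given delay.
BackgroundJob.Schedule<StatisticsCrawler>(x =>
    x.CrawlPage("https://fullstackmark.com/static/posts/29/hockey-stats.html"),
    TimeSpan.FromMinutes(10));

The demo uses the recurring flavor: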

RecurringJob.AddOrUpdate<StatisticsCrawler>(
    "CrawlStats",
    x => x.CrawlPage("https://fullstackmark.com/static/posts/29/hockey-stats.html"),
    "0 * * ? * *" // Every minute
);

Registering a recurring task with the RecurringJob.AddOrUpdate() API is one line with the following arguments:

  • A unique string ID for the job.
  • An expression tree describing the method to execute along with any arguments to pass. The expression tree provides the type information about the target class, method, and arguments that get serialized and stored in the database. Hangfire uses this information to reconstitute the class and execute the method call in the background context.
  • A cron expression specifying when to run it - every minute in this demo.

I wrapped the registration step in a hosted service task so it would get triggered during app startup.

public class SchedulerTask : IHostedService
{
    Task IHostedService.StartAsync(CancellationToken cancellationToken)
    {
        RecurringJob.AddOrUpdate<StatisticsCrawler>(
            "CrawlStats",
            x => x.CrawlPage("https://fullstackmark.com/static/posts/29/hockey-stats.html"),
            "0 * * ? * *" // Every minute
        );
        return Task.CompletedTask;
    }

    Task IHostedService.StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}

Finally, we must register the service in Program.cs so the task registration will run when the app boots up.

...
builder.Services.AddHostedService<SchedulerTask>();

Demo Time

Now, I can run the app directly from my terminal and see it successfully crawl the page 🚀.

Notice that no interaction is required with the API itself. The recurring task is registered with Hangfire on app startup and executed every minute as scheduled.

After checking the database, we can also see the data saved in the tables.

API Endpoint

To use the scraped data now available in the database, I added a tiny API controller providing a single GET method to fetch the info for a given player ID, e.g., GET /Players?id=1.
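Stripped down, the controller is basically the following (the action shape in the repo may differ slightly, and the real one would also return the player's season statistics):

using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("[controller]")]
public class PlayersController : ControllerBase
{
    private readonly PlayerRepository _playerRepository;

    public PlayersController(PlayerRepository playerRepository) =>
        _playerRepository = playerRepository;

    // GET /Players?id=1
    [HttpGet]
    public async Task<IActionResult> Get(int id)
    {
        var player = await _playerRepository.GetById(id);
        return player is null ? NotFound() : Ok(player);
    }
}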

Takeaways

  • Hangfire makes it easy and reliable to kick off methods in the background, outside of the standard request-processing pipeline, and it isn't tied to any specific .NET application type.
  • Puppeteer Sharp provides a headless Chrome API for .NET that is useful for automated scraping and testing web pages.
  • The legal implications of web scraping vary based on many factors. Web scrapers should never cause excessive load or otherwise interfere with the operation of the servers they target. Be respectful, limit the frequency and volume of requests your scraping application makes, and use a web scraping sandbox to test scraping tools.
  • Playwright is a newer browser automation tool by Microsoft. It looks particularly valuable for testing as it supports Chromium, WebKit, and Firefox and runs cross-platform.

If you've built a web scraper in .NET, I'd love to hear about your approach or any ideas you have for improving this one in the comments below!👇

Thanks for reading!
