
A Simple Example of Scraping a Web Page Using Visual FA

20 Apr 2024 · MIT · 2 min read
Scraping the web is easy with Visual FA. Here's an example of how.
Here I present a simple example of scraping a web page looking for URLs using the Visual FA engine.

Introduction

I wrote this tip to demonstrate how easy it can be to use Visual FA for tasks like scraping the web. A simple example seemed like the most helpful way to show it in use.

Background

Visual FA is my lexing/tokenizing engine for C#. It is essentially an augmented regular expression engine. Unlike the one built into .NET, it is built for performance rather than features, so it doesn't support things like backtracking or capturing, and it operates more efficiently than .NET's as a result. Furthermore, it can tokenize, whereas .NET's engine is simply a matcher.

Here we use it to scrape the web. This is very simple, and normally you'd be lexing/tokenizing the result instead of doing simple flat matches.

Using the code

The Scrape project is included with Visual FA.

Here's a simple example of pulling all of the URLs from google.com:

C#
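using System;
using System.IO;
using System.Net.Http;
using VisualFA;

// Build a state machine from the URL expression
var expr = FA.Parse(@"https?\://[^"";\)]+");
var client = new HttpClient();
using (var msg = new HttpRequestMessage(HttpMethod.Get, "https://www.google.com"))
{
    // Send() is the synchronous counterpart of SendAsync()
    using (var resp = client.Send(msg))
    {
        // Wrap the response stream in a TextReader for the runner
        using (var reader = new StreamReader(resp.Content.ReadAsStream()))
        {
            foreach (var match in expr.Run(reader))
            {
                // The runner also reports unmatched text, so keep only the hits
                if (match.IsSuccess)
                {
                    Console.WriteLine(match.Value);
                }
            }
        }
    }
}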

This prints every http or https URL on the page to the console. The main thing is that we're spinning up a state machine from the regular expression https?\://[^";\)]+. That says: find http:// or https://, then keep matching until we hit a double quote, a semicolon, or a closing parenthesis. Once we've called Parse(), we can use the FA instance's Run() method to return a series of FAMatch objects. This can be done over a TextReader, as shown above, or over a string. Runners in a lexer like Visual FA return all of the content, including the text between matches, but we only care about successful matches, so we check the IsSuccess property to decide whether to print the Value.
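Running over a string looks much the same. The following is a minimal sketch, assuming the VisualFA package is referenced by the project; the sample markup and the urls list are made up for illustration and are not part of the Scrape project.

```csharp
using System;
using System.Collections.Generic;
using VisualFA;

// Hypothetical sample markup standing in for a downloaded page
string html = "<a href=\"https://example.com/a\">x</a> <div style=\"background:url(http://example.org/b);\"></div>";

// Same expression as the scraping example
var expr = FA.Parse(@"https?\://[^"";\)]+");

// Collect only the successful matches; the runner also yields
// the unmatched text in between as non-successful FAMatch entries
var urls = new List<string>();
foreach (var match in expr.Run(html))
{
    if (match.IsSuccess)
    {
        urls.Add(match.Value);
    }
}

foreach (var url in urls)
{
    Console.WriteLine(url);
}
```

Collecting into a list rather than printing directly is only a convenience here; it makes it easier to de-duplicate or post-process the URLs afterward.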

You can, of course, do this with .NET's engine, but it requires reading the whole page into memory before matching, and the result will be marginally less performant than with Visual FA. That alone doesn't really justify using Visual FA, but normally you'd be using it to lex content, which I've covered in the Visual FA series.
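For comparison, here is a rough sketch of the same flat match using System.Text.RegularExpressions. The sample markup is a made-up stand-in for a page that has already been read fully into a string, and the pattern is my translation of the expression above into .NET syntax, where the colon needs no escape.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample markup; with a real page, the whole response
// body would have to be read into this string before matching
string html = "<a href=\"https://example.com/a\">x</a> <div style=\"background:url(http://example.org/b);\"></div>";

// http:// or https://, then everything up to a quote, semicolon, or ')'
var regex = new Regex(@"https?://[^"";)]+");

var urls = new List<string>();
foreach (Match m in regex.Matches(html))
{
    // Matches() yields only successful matches, so no IsSuccess-style check is needed
    urls.Add(m.Value);
}

foreach (var url in urls)
{
    Console.WriteLine(url);
}
```

Note the difference in shape: .NET's matcher hands back only the hits, while a Visual FA runner walks the entire input and flags each piece as a match or not.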

History

  • 21st April, 2024 - Initial submission

License

This article, along with any associated source code and files, is licensed under The MIT License