Here I present a simple example of scraping a web page looking for URLs using the Visual FA engine.
Introduction
I produced this tip to demonstrate how easy it can be to use Visual FA for tasks like scraping the web. A simple example seemed like the most helpful way to show it in use.
Background
Visual FA is my lexing/tokenizing engine for C#. It is essentially an augmented regular expression engine. Unlike the one built into .NET, this one is built for performance rather than features, so it doesn't support things like backtracking or capturing. It also operates more efficiently than .NET's as a result. Furthermore, it can tokenize, whereas .NET's engine is simply a matcher.
Here we use it to scrape the web. This is very simple, and normally you'd be lexing/tokenizing the result instead of doing simple flat matches.
Using the code
The Scrape project is included with Visual FA.
Here's a simple example of pulling all of the URLs from google.com:
using VisualFA;

// Build a state machine from the regular expression
var expr = FA.Parse(@"https?\://[^"";\)]+");
var client = new HttpClient();
using (var msg = new HttpRequestMessage(HttpMethod.Get, "https://www.google.com"))
{
    using (var resp = client.Send(msg))
    {
        // Match directly over the response stream rather than a buffered string
        using (var reader = new StreamReader(resp.Content.ReadAsStream()))
        {
            foreach (var match in expr.Run(reader))
            {
                // The runner also returns non-matching content, so filter it out
                if (match.IsSuccess)
                {
                    Console.WriteLine(match.Value);
                }
            }
        }
    }
}
This will print every http or https URL to the console. The main thing is we're spinning up a state machine from the regular expression https?\://[^";\)]+. That says find http:// or https:// and keep matching until we hit a quote, a semicolon, or a closing parenthesis. Once we've called Parse() we can use the FA instance's Run() method to return a series of FAMatch objects. This can be done over a TextReader, as shown above, or over a string. Lexers like Visual FA's runners return all of the content, including the text that didn't match, but we only care about successful matches, so we check the IsSuccess property to decide whether or not to print the Value.
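Since Run() can also be given a string, here is a minimal sketch of the same match loop over an in-memory string. The sample markup and the urls list are made up for illustration; only Parse(), Run(), IsSuccess and Value come from the example above.

```csharp
using System;
using System.Collections.Generic;
using VisualFA;

// Same expression as in the example above.
var expr = FA.Parse(@"https?\://[^"";\)]+");

// Hypothetical sample markup standing in for a downloaded page.
var html = "<a href=\"https://example.com/a\">x</a> url(http://example.org/b.png);";

var urls = new List<string>();
foreach (var match in expr.Run(html))
{
    // Skip the non-matching stretches of text the runner hands back.
    if (match.IsSuccess)
    {
        urls.Add(match.Value);
    }
}

foreach (var url in urls)
{
    Console.WriteLine(url);
}
```

The first URL should end at the closing quote and the second at the closing parenthesis, since both characters are excluded by the expression.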
Obviously, you can do this with .NET's engine, but that requires reading the whole page into memory before matching, and the result will be marginally slower than doing so with Visual FA. That doesn't really justify using Visual FA in and of itself, but normally you'd be using it to lex content, which I've covered in the Visual FA series.
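For comparison, a rough sketch of the .NET approach might look like the following. The ExtractUrls helper and the sample markup are hypothetical names introduced here for illustration. Regex works over a string, so for a real page you would first buffer the whole body, for example with HttpClient.GetStringAsync().

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample markup; a real page would have to be read
// fully into a string before it could be passed to Regex.
var page = "<a href=\"https://example.com/a\">x</a> url(http://example.org/b.png);";

var found = ExtractUrls(page);
foreach (var url in found)
{
    Console.WriteLine(url);
}

// Hypothetical helper: applies the same pattern using .NET's Regex class.
static List<string> ExtractUrls(string html)
{
    var urls = new List<string>();
    foreach (Match m in Regex.Matches(html, @"https?\://[^"";\)]+"))
    {
        urls.Add(m.Value);
    }
    return urls;
}
```

Regex.Matches() only yields successful matches, so there is no IsSuccess check here. The trade-off is that the input has to be a complete string rather than a stream.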
History
- 21st April, 2024 - Initial submission