Modern Screen Scraping With HtmlAgilityPack

In the early days of the Internet, before web services were as common as Starbucks, one of the few ways to pull data from other systems was to screen scrape web pages. I helped with at least one of these apps, which scraped stock prices from Yahoo. To say it was clunky and hard to maintain is an understatement. Nowadays most systems expose an API through a web service, or even through data brokers like BizTalk, and we’re able to orchestrate beautiful data flows...

But what happens when we don’t have a clean API to extract data from a remote system? Sometimes the best route is the most direct route. It may not be pretty, but screen scraping still gets the job done. And as Alec said, “Always Be Closing”. Luckily, there are much better options now for HTML screen scraping.

Today there’s a great framework available on CodePlex called HTML Agility Pack. You can also find updates on their Twitter feed. This .NET library offers a simple way to parse and even modify HTML files, but let’s focus on how we can use it to extract data from a web page.

Let’s assume we have a page with the following HTML.

<TABLE ALIGN=center border=1 bgcolor=lightblue width=80%>
<tr><th>ATTRIBUTE</th><th>VALUE</th><th>ATTRIBUTE DESCRIPTION</th></tr>
<tr><td>Name</td><td>HR System</td><td>Name of Application</td></tr>
<tr> <td>Version</td><td>4.2.101</td><td>Code Version</td> </tr>
<tr> <td>HOST</td><td>ATL0WAPP001</td><td>Name of machine</td> </tr>
</TABLE>

So let’s grab the latest NuGet package. You can search for it in Visual Studio, or find it here: HTMLAgilityPack 1.4.6. Some VB.NET code that parses the table cell data using HTML Agility Pack would look like this:

Using client As New Net.WebClient

    Dim filename As String = IO.Path.GetTempFileName

    ' Qualify CredentialCache so it resolves without an Imports System.Net line
    client.Credentials = Net.CredentialCache.DefaultNetworkCredentials
    client.DownloadFile(_URL, filename)
    
    Dim doc = New HtmlAgilityPack.HtmlDocument

    doc.Load(filename)

    Dim root As HtmlAgilityPack.HtmlNode = doc.DocumentNode

    Dim nodes As List(Of HtmlAgilityPack.HtmlNode) = root.Descendants("tr").ToList

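    For Each node In nodes
        Dim tdlist As List(Of HtmlAgilityPack.HtmlNode)

        tdlist = node.Descendants("td").ToList

        ' The header row contains only <th> cells, so skip rows without two <td> cells
        If tdlist.Count < 2 Then Continue For

        Console.WriteLine(tdlist(0).InnerText & ": " & tdlist(1).InnerText)
    Next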
End Using

Or if you’d rather go the PowerShell route, you can write something like this:

$HAPDllPath = "C:\Users\...\packages\HtmlAgilityPack.1.4.6\lib\Net45\HtmlAgilityPack.dll"

$a = [Reflection.Assembly]::LoadFile($HAPDllPath)

$source = "http://localhost/mytable.html"
$destination = "c:\temp\myfile.htm"

$client = New-Object System.Net.WebClient
$client.Credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials 
$client.DownloadFile($source, $destination)

$doc = New-Object HtmlAgilityPack.HtmlDocument

$doc.Load($destination)

$root = $doc.DocumentNode

$rows = $root.Descendants("tr")

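foreach ($row in $rows){
    # Wrap in @() so the cells can be indexed like an array
    $cells = @($row.Descendants("td"))

    # The header row contains only <th> cells, so skip rows without two <td> cells
    if ($cells.Count -lt 2) { continue }

    $cell0 = $cells[0].InnerText
    $cell1 = $cells[1].InnerText

    Write-Host "$cell0 - $cell1"
}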
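
Both versions above walk the document with Descendants. The library also exposes XPath queries through SelectNodes, which can express "rows that actually contain data cells" in a single query. What follows is only a minimal sketch of that approach, not the one true way to do it: the TableScraper module and the ExtractRows function are hypothetical names made up for illustration, and the HTML is passed in as a string through LoadHtml instead of being downloaded to a file first.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module TableScraper

    ' Hypothetical helper: returns one "attribute: value" string per data row.
    Function ExtractRows(html As String) As List(Of String)
        Dim results As New List(Of String)

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Match only <tr> elements that have at least one <td> child,
        ' so the <th> header row is never selected.
        Dim rows = doc.DocumentNode.SelectNodes("//tr[td]")

        ' SelectNodes can return Nothing when no node matches.
        If rows Is Nothing Then Return results

        For Each row As HtmlNode In rows
            ' Relative XPath: the <td> children of the current row.
            Dim cells = row.SelectNodes("td")

            If cells IsNot Nothing AndAlso cells.Count >= 2 Then
                results.Add(cells(0).InnerText.Trim() & ": " & cells(1).InnerText.Trim())
            End If
        Next

        Return results
    End Function

End Module
```

To use it with the download code from earlier, you could read the saved file with IO.File.ReadAllText(filename) and pass the result to ExtractRows, then write each returned string to the console.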

This is a very robust parsing engine and I’ve only scratched the surface. There are several ways to interact with HTML content using the HTML Agility Pack. It’s definitely a great addition to anyone’s toolbox.
