Modern Screen Scraping With HtmlAgilityPack
In the early days of the Internet, before web services were as common as Starbucks, one of the few ways to pull data from other systems was screen scraping web pages. I helped with at least one of these apps, which scraped stock prices from Yahoo. To say it was clunky and hard to maintain is an understatement. Nowadays most systems expose an API through a web service, or even through data brokers like BizTalk, and we’re able to orchestrate beautiful data flows.
But what happens when we don’t have a clean API to extract data from a remote system? Sometimes the best route is the most direct route. It may not be pretty, but screen scraping still gets the job done. And as Alec said, “Always Be Closing”. Luckily, there are much better options now for HTML screen scraping.
Today there’s a great framework available on CodePlex called HTML Agility Pack. You can also find updates on their Twitter feed. This .NET library offers a simple way to parse and even modify HTML files. But let’s focus on how we can use it to extract data from a web page.
Let’s assume we have a page with the following HTML:
    <TABLE ALIGN=center border=1 bgcolor=lightblue width=80%>
      <tr><th>ATTRIBUTE</th><th>VALUE</th><th>ATTRIBUTE DESCRIPTION</th></tr>
      <tr><td>Name</td><td>HR System</td><td>Name of Application</td></tr>
      <tr><td>Version</td><td>4.2.101</td><td>Code Version</td></tr>
      <tr><td>HOST</td><td>ATL0WAPP001</td><td>Name of machine</td></tr>
    </TABLE>
So let’s grab the latest NuGet package. You can search for it in Visual Studio, or find it here: HTMLAgilityPack 1.4.6. Some VB.NET code that parses the table cell data using HTML Agility Pack would look like this:
    Using client As New Net.WebClient
        ' Download the page to a temporary file using the current Windows credentials
        Dim filename As String = IO.Path.GetTempFileName
        client.Credentials = Net.CredentialCache.DefaultNetworkCredentials
        client.DownloadFile(_URL, filename)

        ' Load the saved HTML and walk every table row
        Dim doc = New HtmlAgilityPack.HtmlDocument
        doc.Load(filename)
        Dim root As HtmlAgilityPack.HtmlNode = doc.DocumentNode
        Dim nodes As List(Of HtmlAgilityPack.HtmlNode) = root.Descendants("tr").ToList
        For Each node In nodes
            Dim tdlist As List(Of HtmlAgilityPack.HtmlNode)
            tdlist = node.Descendants("td").ToList
            ' The header row holds th cells only, so skip rows without two td cells
            If tdlist.Count >= 2 Then
                Console.WriteLine(tdlist(0).InnerText & ": " & tdlist(1).InnerText)
            End If
        Next
    End Using
Or if you’d rather go the PowerShell route, you can write something like this:
    # Load the HTML Agility Pack assembly from the NuGet packages folder
    $HAPDllPath = "C:\Users\...\packages\HtmlAgilityPack.1.4.6\lib\Net45\HtmlAgilityPack.dll"
    $a = [Reflection.Assembly]::LoadFile($HAPDllPath)

    # Download the page to a local file using the current Windows credentials
    $source = "http://localhost/mytable.html"
    $destination = "c:\temp\myfile.htm"
    $client = New-Object System.Net.WebClient
    $client.Credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
    $client.DownloadFile($source, $destination)

    # Parse the saved file and print the first two cells of each data row
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.Load($destination)
    $root = $doc.DocumentNode
    $rows = $root.Descendants("tr")
    foreach ($row in $rows) {
        # @() forces the lazy result into an array so it can be indexed
        $cells = @($row.Descendants("td"))
        # The header row holds th cells only, so skip it
        if ($cells.Count -ge 2) {
            $cell0 = $cells[0].InnerText
            $cell1 = $cells[1].InnerText
            Write-Host "$cell0 - $cell1"
        }
    }
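The same table can also be queried with XPath instead of walking Descendants. The following is only a minimal sketch, assuming the HtmlAgilityPack package is referenced; the module name XPathExample and the inline HTML string are made up for illustration, so no web server is needed. It uses the library's LoadHtml, SelectNodes and SelectSingleNode methods.

```vbnet
Imports System
Imports HtmlAgilityPack

' Hypothetical module name, used only for this illustration
Module XPathExample
    Sub Main()
        ' Inline, trimmed copy of the sample table (an assumption for this sketch)
        Dim html As String = "<table>" &
            "<tr><th>ATTRIBUTE</th><th>VALUE</th></tr>" &
            "<tr><td>Name</td><td>HR System</td></tr>" &
            "<tr><td>Version</td><td>4.2.101</td></tr>" &
            "</table>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Select only rows with at least two td cells; the th-only header row is skipped
        Dim rows As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//tr[count(td) >= 2]")

        ' SelectNodes returns Nothing when no node matches
        If rows Is Nothing Then
            Console.WriteLine("No data rows found.")
            Return
        End If

        For Each row As HtmlNode In rows
            ' XPath positions are 1-based and relative to the current row
            Dim name As String = row.SelectSingleNode("td[1]").InnerText.Trim()
            Dim value As String = row.SelectSingleNode("td[2]").InnerText.Trim()
            Console.WriteLine(name & ": " & value)
        Next
    End Sub
End Module
```

Putting the row filter in the XPath expression keeps the loop body free of cell-count checks, at the cost of having to handle the Nothing result when the page has no matching rows.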
This is a very robust parsing engine and I’ve only scratched the surface. There are several ways to interact with HTML content using the HTML Agility Pack. It’s definitely a great addition to anyone’s toolbox.