Web scraping tools are useful for gathering data from various webpages. For example, price comparison sites that share the best deals usually grab their information from specific feeds e-tailers set up for that purpose. However, not all online sellers make price feeds available. In these instances, comparison sites can use web scraping to grab the information they need.
Because website design varies and websites all have unique structures, you must create customized scrapers. Luckily, scripting languages like PowerShell help you build reliable web scraping tools. Use PowerShell modules to extract the information you need.
Keep an eye on your competitors’ prices by creating a web scraper and using Windows Task Scheduler to monitor prices once daily. To run your scraper as part of a web application, host it on an IIS server and manage it with IIS application pools.
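As a rough sketch of the daily schedule mentioned above, the ScheduledTasks cmdlets included with Windows 8 / Windows Server 2012 and later can register a script to run once a day. The script path, task name and 6 a.m. start time below are placeholders assumed for illustration, not values from this article.

```powershell
# Assumption: C:\Scripts\Get-CompetitorPrices.ps1 is a hypothetical scraper script.
$action = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument '-NoProfile -File "C:\Scripts\Get-CompetitorPrices.ps1"'

# Run once a day at 6 a.m. (an arbitrary example time).
$trigger = New-ScheduledTaskTrigger -Daily -At 6am

# Register the task under a hypothetical name.
Register-ScheduledTask -TaskName 'Daily Price Scrape' -Action $action -Trigger $trigger
```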
Web scraping is the art of parsing an HTML webpage and gathering elements in a structured manner. Because HTML pages have specific structures, it’s possible to parse through them and retrieve semi-structured output. Note the use of the qualifier “semi.” Most pages aren’t perfectly formatted behind the scenes and may hold website design mistakes, so your output may not be perfectly structured.
Still, scripting languages like Microsoft PowerShell – along with a little ingenuity and some trial and error – help you build reliable web scraping tools to pull information from many different webpages.
It’s important to remember that webpage structures vary wildly. If even a small element is changed, your web scraping tool may no longer work. Focus on the basics first and then build more specific tools for particular webpages.
The command of choice is Invoke-WebRequest. This command should be a staple in your web scraping arsenal. It simplifies pulling down webpage data and allows you to focus on parsing the data you need.
To get started, let’s use a simple web page everyone is familiar with — Google.com — and see how a web scraping tool views it.
First, pass Google.com to the Uri parameter of Invoke-WebRequest and inspect the output.
$google = Invoke-WebRequest -Uri google.com
This is a representation of the entire Google.com page, all wrapped up in an object for you.
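To get a feel for what that object holds, you can inspect a few of its members directly. This is a minimal sketch that assumes the $google variable from the command above; the exact set of properties differs between Windows PowerShell 5.1 and PowerShell 7.

```powershell
# Assumes $google holds the response from the Invoke-WebRequest call above.
$google.StatusCode        # HTTP status code of the response, e.g. 200 on success
$google.Headers           # response headers
$google.Content.Length    # size of the raw HTML, in characters

# List every property and method the response object exposes.
$google | Get-Member
```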
The Invoke-WebRequest command is highly versatile. In Windows PowerShell, it works on HTTP, HTTPS and FTP sites, which gives you more choices on where to source information and data.
Now, let’s see what information you can pull from this webpage. For example, say you need to find all the links on the page. To do this, you’d reference the Links property. This will enumerate various properties of each link on the page.
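Assuming the $google variable from earlier, that looks something like the following. Which properties appear for each link depends on the PowerShell version and parsing mode, but href should be present either way.

```powershell
# Enumerate every link found on the page, with all of its properties.
$google.Links
```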
Perhaps you just want to see the URL that it links to:
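One way to do that, again assuming the $google variable from earlier, is to read the href property across all of the links.

```powershell
# Return only the target URL of each link.
$google.Links.href
```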
How about the anchor text and the URL? Since this is just an object, it’s easy to pull information like this:
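A minimal sketch of that, under the assumption that you are running Windows PowerShell 5.1 without -UseBasicParsing, might be:

```powershell
# Show the anchor text next to the URL for each link.
# Assumption: innerText is filled in only by the full DOM parser in
# Windows PowerShell 5.1; with basic parsing (the only mode in PowerShell 7)
# each link exposes outerHTML, tagName and the tag's attributes such as href.
$google.Links | Select-Object -Property innerText, href
```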
You can also see what the infamous Google.com form with the input box looks like under the hood:
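As a sketch, and assuming Windows PowerShell 5.1 without -UseBasicParsing (the Forms property is not available on basic-parsing responses), the form can be inspected like this:

```powershell
# List the forms found on the page.
$google.Forms

# Drill into the input fields (names and values) of the first form.
$google.Forms[0].Fields
```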
If your scraper stops working, the website structure has likely changed. Unfortunately, you’ll have to rework the scraper, or build a new one, to match the new structure.
Let’s take this one step further and download information from a webpage. For example, perhaps you want to download all images on the page. To do this, you’ll also use the -UseBasicParsing parameter. With this parameter, the command is faster because Invoke-WebRequest doesn’t crawl the DOM.
For another example, here’s how to use PowerShell to enumerate all images on the CNN.com website and download them to your local computer.
$cnn = Invoke-WebRequest -Uri cnn.com -UseBasicParsing
Now let’s figure out each URL that the image is hosted on.
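Assuming the $cnn variable from the previous command, the image URLs can be listed by reading the src attribute of each image the command found.

```powershell
# List the src attribute of every <img> tag on the page.
# Some entries may be empty or relative, depending on how the page is written.
$cnn.Images.src
```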
Once you have the URLs, all you need to do is use Invoke-WebRequest again. However, this time, you’ll use the -OutFile parameter to send the response to a file.
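# Loop over each image URL found on the page and save the file under C:\
$cnn.Images.src | ForEach-Object {
    $fileName = $_ | Split-Path -Leaf
    Write-Host "Downloading image file $fileName"
    Invoke-WebRequest -Uri $_ -OutFile "C:\$fileName"
    Write-Host 'Image download complete'
}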
In this case, the images are saved directly to the root of the C: drive, but you can easily change this location to a different one. With PowerShell’s ability to manage file system ACLs, you get the freedom to save images to your directory of choice.
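To illustrate saving to a different location, here is a minimal sketch built around two hypothetical helper functions, Get-ImageFileName and Save-WebImage. Neither is a built-in cmdlet, and both assume absolute image URLs. The first strips any query string so the result is a valid file name; the second creates the destination folder if needed and downloads each image into it.

```powershell
# Hypothetical helper: derive a local file name from an absolute image URL,
# ignoring any query string such as "?w=300".
function Get-ImageFileName {
    param([Parameter(Mandatory)][string]$Url)
    Split-Path -Path ([System.Uri]$Url).AbsolutePath -Leaf
}

# Hypothetical helper: download a list of image URLs into a folder of your choice.
function Save-WebImage {
    param(
        [Parameter(Mandatory)][string[]]$Url,
        [Parameter(Mandatory)][string]$Destination
    )
    # Create the target folder if it does not exist yet.
    if (-not (Test-Path -Path $Destination)) {
        New-Item -Path $Destination -ItemType Directory | Out-Null
    }
    foreach ($u in $Url) {
        $outFile = Join-Path -Path $Destination -ChildPath (Get-ImageFileName -Url $u)
        Write-Host "Downloading $u to $outFile"
        Invoke-WebRequest -Uri $u -OutFile $outFile -UseBasicParsing
    }
}
```

A call might look like Save-WebImage -Url ($cnn.Images.src | Where-Object { $_ -match '^https?://' }) -Destination 'C:\Temp\ScrapedImages', where the folder is an example path and the filter skips empty or relative src values.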
If you’d like to test the images directly from PowerShell, use the Invoke-Item command to pull up the image’s associated viewer. In this example, one of the images that Invoke-WebRequest pulled down from CNN.com contains the word “bleacher.”
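For instance, assuming one of the downloaded files ended up at the placeholder path below, this opens it in whatever application is registered for that file type:

```powershell
# Assumption: C:\example.jpg stands in for one of the image files downloaded earlier.
Invoke-Item -Path 'C:\example.jpg'
```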
Use the code in this article as a template to build your own tool. For example, you could build a PowerShell function called Invoke-WebScrape with a few parameters like -Url or -Links. Once you have the basics down, you can easily create a customized tool to apply in many different ways.
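One possible shape for such a function is sketched below. It is only a starting point: Invoke-WebScrape is a hypothetical function, not a built-in cmdlet, and the -Images switch is an extra assumption added here for illustration alongside the -Url and -Links parameters suggested above.

```powershell
# Hypothetical function sketched from the idea above; not a built-in cmdlet.
function Invoke-WebScrape {
    [CmdletBinding()]
    param(
        # Address of the page to scrape.
        [Parameter(Mandatory)]
        [string]$Url,

        # Return the URLs of the links on the page.
        [switch]$Links,

        # Return the URLs of the images on the page (an added assumption).
        [switch]$Images
    )

    $response = Invoke-WebRequest -Uri $Url -UseBasicParsing

    if ($Links)  { $response.Links.href }
    if ($Images) { $response.Images.src }

    # With no switch supplied, return the whole response object.
    if (-not ($Links -or $Images)) { $response }
}
```

For example, Invoke-WebScrape -Url 'https://www.google.com' -Links would return just the link URLs from that page.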
Mark Fairlie contributed to this article.