Write a tool in PowerShell to gather all the data from a webpage.
Web scraping tools are useful for gathering data from various webpages. For example, price comparison sites that share the best deals usually grab their information from specific feeds e-tailers set up for that purpose. However, not all online sellers make price feeds available. In these instances, comparison sites can use web scraping to grab the information they need.
Because website design varies and every website has its own structure, you must create customized scrapers. Luckily, scripting languages like PowerShell help you build reliable web scraping tools. Use PowerShell’s built-in cmdlets to extract the information you need.
Web scraping is the art of parsing an HTML webpage and gathering elements in a structured manner. Because HTML pages have specific structures, it’s possible to parse through them and retrieve semi-structured output. Note the use of the qualifier “semi.” Most pages aren’t perfectly formatted behind the scenes and may contain markup errors, so your output may not be perfectly structured.
Still, scripting languages like Microsoft PowerShell – along with a little ingenuity and some trial and error – help you build reliable web scraping tools to pull information from many different webpages.
It’s important to remember that webpage structures vary wildly. If even a small element is changed, your web scraping tool may no longer work. Focus on the basics first and then build more specific tools for particular webpages.
The command of choice is Invoke-WebRequest. This command should be a staple in your web scraping arsenal. It simplifies pulling down webpage data and allows you to focus on parsing the data you need.
To get started, let’s use a simple webpage everyone is familiar with, Google.com, and see how a web scraping tool views it.
First, pass Google.com to the Uri parameter of Invoke-WebRequest and inspect the output.
$google = Invoke-WebRequest -Uri google.com
This is a representation of the entire Google.com page, all wrapped up in an object for you.
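To get a feel for what that object holds, a small helper can summarize it. This is only a sketch: Get-ResponseSummary is a hypothetical name, not a built-in cmdlet, and it assumes the response exposes the StatusCode, Content, Links and Images properties that Invoke-WebRequest normally returns for an HTML page.

```powershell
function Get-ResponseSummary {
    param(
        # The object returned by Invoke-WebRequest
        [Parameter(Mandatory, ValueFromPipeline)]
        $Response
    )
    process {
        [pscustomobject]@{
            StatusCode    = $Response.StatusCode
            ContentLength = $Response.Content.Length   # size of the raw HTML string
            LinkCount     = @($Response.Links).Count
            ImageCount    = @($Response.Images).Count
        }
    }
}

# Usage (needs network access):
# Invoke-WebRequest -Uri google.com | Get-ResponseSummary
```

Piping the response to Get-Member is another quick way to list every property the object offers.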
Now, let’s see what information you can pull from this webpage. For example, say you need to find all the links on the page. To do this, you’d reference the Links property. This will enumerate various properties of each link on the page.
Perhaps you just want to see the URL each link points to. The href property of each link holds it.
How about the anchor text and the URL? Since this is just an object, it’s easy to pull out both.
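The link lookups described above could be sketched as follows. With a response stored in $google, $google.Links lists every link and $google.Links.href lists only the URLs. For anchor text plus URL, the helper below (Get-LinkSummary is a hypothetical name) strips the tags from each link’s outerHTML. That choice is an assumption made for portability: as far as I know, the innerText property is only populated by Windows PowerShell’s full DOM parsing, while outerHTML and href are available in basic parsing too.

```powershell
function Get-LinkSummary {
    param(
        # The object returned by Invoke-WebRequest
        [Parameter(Mandatory, ValueFromPipeline)]
        $Response
    )
    process {
        foreach ($link in $Response.Links) {
            # Remove every tag from the link markup to approximate its anchor text.
            $text = ($link.outerHTML -replace '<[^>]+>', '').Trim()
            [pscustomobject]@{
                Text = $text
                Url  = $link.href
            }
        }
    }
}

# Usage (needs network access):
# Invoke-WebRequest -Uri google.com | Get-LinkSummary
```

The regex is a rough approximation of text extraction, so links that wrap images or scripts may come back with empty or odd text.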
You can also see what the famous Google.com search form, with its input box, looks like under the hood.
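A hedged sketch of that inspection follows. In Windows PowerShell 5.1 with full parsing, $google.Forms and $google.Forms.Fields expose the form directly. To my knowledge, PowerShell 6 and later no longer return a Forms property, so the helper below (Get-FormInput, a hypothetical name) reads the InputFields collection instead, which should be present in both versions.

```powershell
function Get-FormInput {
    param(
        # The object returned by Invoke-WebRequest
        [Parameter(Mandatory, ValueFromPipeline)]
        $Response
    )
    process {
        # InputFields holds one object per <input> tag found on the page.
        foreach ($field in $Response.InputFields) {
            [pscustomobject]@{
                Name  = $field.name
                Type  = $field.type
                Value = $field.value
            }
        }
    }
}

# Usage (needs network access):
# Invoke-WebRequest -Uri google.com | Get-FormInput
```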
Let’s take this one step further and download information from a webpage. For example, perhaps you want to download all images on the page. To do this, we’ll also use the -UseBasicParsing parameter. With it, the command runs faster because Invoke-WebRequest doesn’t crawl the DOM. (In PowerShell 6 and later, basic parsing is the only mode, so the parameter has no effect there.)
For another example, here’s how to use PowerShell to enumerate all images on the CNN.com website and download them to your local computer.
$cnn = Invoke-WebRequest -Uri cnn.com -UseBasicParsing
Now let’s find the URL where each image is hosted.
Once you have the URLs, all you need to do is use Invoke-WebRequest again. However, this time, you’ll use the -OutFile parameter to send the response to a file.
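With the response stored in $cnn, the image URLs are available as $cnn.Images.src. Some pages use relative or protocol-relative src values, which can’t be downloaded as they are. The sketch below, built around a hypothetical helper named Get-ImageUrl, resolves each value against a base address you supply. It assumes the base address is the URL the page was requested from.

```powershell
function Get-ImageUrl {
    param(
        # The object returned by Invoke-WebRequest
        [Parameter(Mandatory)]
        $Response,

        # Absolute address of the page, used to resolve relative src values
        [Parameter(Mandatory)]
        [uri] $BaseUri
    )
    foreach ($src in $Response.Images.src) {
        # Skip <img> tags that have no usable src attribute.
        if ([string]::IsNullOrWhiteSpace($src)) { continue }

        # The two-argument Uri constructor resolves relative paths against the base
        # and leaves absolute URLs unchanged.
        ([uri]::new($BaseUri, [string]$src)).AbsoluteUri
    }
}

# Usage (needs network access):
# $cnn = Invoke-WebRequest -Uri cnn.com -UseBasicParsing
# Get-ImageUrl -Response $cnn -BaseUri 'https://www.cnn.com/'
```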
# Download each image, naming the local file after the last segment of its URL
@($cnn.Images.src).ForEach({
    $fileName = $_ | Split-Path -Leaf
    Write-Host "Downloading image file $fileName"
    Invoke-WebRequest -Uri $_ -OutFile "C:\$fileName"
    Write-Host 'Image download complete'
})
In this case, the images were saved directly to the root of the C: drive, but you can easily change this location to a different one. Because PowerShell can also manage file system ACLs, you have the freedom to save images to your directory of choice.
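One way to change the save location is to build each target path from a destination folder and the image’s file name. The helper below is a sketch under assumptions: Get-DownloadPath is a hypothetical name, and it expects absolute image URLs. It uses only the path portion of the URL, so a query string does not end up in the file name.

```powershell
function Get-DownloadPath {
    param(
        # Absolute URL of the image to download
        [Parameter(Mandatory)]
        [uri] $ImageUrl,

        # Folder the image should be saved in
        [Parameter(Mandatory)]
        [string] $Destination
    )
    # AbsolutePath excludes the query string, so ?w=300 and similar are dropped.
    $fileName = [System.IO.Path]::GetFileName($ImageUrl.AbsolutePath)
    Join-Path -Path $Destination -ChildPath $fileName
}

# Usage (needs network access and a $cnn response; the folder name is an example):
# $target = Join-Path $env:TEMP 'cnn-images'
# New-Item -ItemType Directory -Path $target -Force | Out-Null
# foreach ($url in $cnn.Images.src) {
#     Invoke-WebRequest -Uri $url -OutFile (Get-DownloadPath -ImageUrl $url -Destination $target)
# }
```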
If you’d like to test the images directly from PowerShell, use the Invoke-Item command to pull up the image’s associated viewer. In this example, Invoke-WebRequest pulled down an image from CNN.com with the word “bleacher.”
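A minimal sketch of that check is shown here. Show-DownloadedImage is a hypothetical helper, and the file path in the usage line is only an example. The function confirms the file exists before handing it to Invoke-Item, which opens it in whatever application is registered for its extension.

```powershell
function Show-DownloadedImage {
    param(
        # Full path to a downloaded image file
        [Parameter(Mandatory)]
        [string] $Path
    )
    if (-not (Test-Path -LiteralPath $Path -PathType Leaf)) {
        Write-Warning "No file found at $Path"
        return $false
    }

    # Opens the file with the default viewer for its file type.
    Invoke-Item -LiteralPath $Path
    return $true
}

# Usage (example path only):
# Show-DownloadedImage -Path 'C:\example.jpg'
```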
Use the code in this article as a template to build your own tool. For example, you could build a PowerShell function called Invoke-WebScrape with a few parameters like -Url or -Links. Once you have the basics down, you can easily create a customized tool to apply in many different ways.
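As a starting point, such a function might look like the sketch below. It is one possible shape rather than a finished tool: the -Url and -Links parameters come from the suggestion above, while the -Images switch and the choice to return the whole response when no switch is given are assumptions added for illustration.

```powershell
function Invoke-WebScrape {
    [CmdletBinding()]
    param(
        # Address of the page to scrape
        [Parameter(Mandatory)]
        [string] $Url,

        # Return the URL of every link on the page
        [switch] $Links,

        # Return the src of every image on the page (hypothetical extra switch)
        [switch] $Images
    )
    $response = Invoke-WebRequest -Uri $Url -UseBasicParsing

    if ($Links)  { $response.Links.href }
    if ($Images) { $response.Images.src }

    # With no switch, hand back the full response for further inspection.
    if (-not ($Links -or $Images)) { $response }
}

# Usage (needs network access):
# Invoke-WebScrape -Url google.com -Links
```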
Mark Fairlie contributed to this article.