Write a tool in PowerShell to gather all the data from a web page.
Web scraping tools are helpful for gathering data from various web pages. For example, price comparison sites that share the best deals usually grab their information from specific feeds e-tailers set up for that purpose. However, not all online sellers make price feeds available. In these instances, comparison sites can use web scraping to grab the information they need.
Because website design varies and websites have unique structures, you must create customized scrapers to extract relevant data effectively. Luckily, scripting languages like PowerShell help you build reliable web scraping tools. You can use PowerShell modules to extract the information you need.
Web scraping is the process of parsing an HTML web page and gathering elements in a structured manner. Because HTML pages have specific structures, it’s possible to parse through them and retrieve semi-structured output. Note the use of the qualifier “semi.” Most pages aren’t perfectly formatted behind the scenes and may have website design mistakes, so your output may not be perfectly structured.
Still, scripting languages like Microsoft PowerShell — along with a little ingenuity and some trial and error — can help you build reliable web scraping tools that pull information from many different web pages.
It’s important to remember that web page structures vary widely. If even a small element is changed, your web scraping tool may no longer work. Focus on the basics first and then build more specific tools for particular web pages.
Federico Trotta, a technical writer and data scientist who has authored numerous articles on web scraping and data analysis, noted that PowerShell comes pre-installed on Windows, making it an accessible and flexible tool for users. “In particular, its integration with Windows makes it easily accessible without requiring additional installations or dependencies,” Trotta explained. “Additionally, its compatibility with .NET libraries provides a layer of extensibility for more advanced needs.”
Still, Trotta cautioned that PowerShell may not be suitable for more complex projects. “When it gets more complex, use different tools or technologies,” Trotta advised. “One of the main limitations is that with PowerShell, you can only scrape static HTML content. When pages have dynamically loaded content from JavaScript, you can overcome this by using Selenium.”
The command of choice is Invoke-WebRequest. This command should be a staple in your web scraping arsenal. It simplifies pulling down web page data and allows you to focus on parsing the data you need.
Trotta emphasized the importance of mastering this method. “To tie to PowerShell, in the beginning, I would suggest learning the methods Invoke-WebRequest and Invoke-RestMethod, as these cmdlets form the backbone of most PowerShell scraping scripts,” Trotta explained. “In particular, the Invoke-WebRequest cmdlet gets content from a web page on the internet; the Invoke-RestMethod cmdlet, instead, sends HTTP and HTTPS requests to REST web services that return richly structured data.”
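For a quick feel for the difference, you can point Invoke-RestMethod at any public JSON API and get back PowerShell objects instead of raw HTML. The endpoint below is just an illustrative example, not part of the scraping workflow covered in this article.

$repo = Invoke-RestMethod -Uri 'https://api.github.com/repos/PowerShell/PowerShell'
$repo.full_name          # The JSON response is converted to an object automatically
$repo.stargazers_count   # Individual fields can be read as properties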
With Invoke-WebRequest, let’s explore how a web scraper views a web page and extracts its content.
To get started, let’s use a simple web page everyone is familiar with — Google.com — and see how a web scraping tool views it.
First, pass Google.com to the Uri parameter of Invoke-WebRequest and inspect the output.
$google = Invoke-WebRequest -Uri google.com
This is a representation of the entire Google.com page, all wrapped up in an object for you.
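To confirm that for yourself, a quick sketch is to inspect a few of the response object's standard properties, or pipe it to Get-Member to list them all.

$google | Get-Member          # Lists every property and method on the response object
$google.StatusCode            # The HTTP status code, e.g. 200
$google.Content.Length        # The size of the raw HTML that came back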
Now, let’s see what information you can pull from this web page. For example, say you need to find all the links on the page. To do this, you’d reference the Links property. This will enumerate the various properties of each link on the page.
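In its simplest form, that's a single line; Links is a standard property on the object Invoke-WebRequest returns.

$google.Links   # One object per link, with properties such as href (and, in Windows PowerShell, innerText)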
Perhaps you just want to see the URLs the page links to.
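A minimal way to do that is to expand the href property of each link object (href is the standard property name here).

$google.Links.href   # Just the target URLs, one per link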
How about the anchor text along with the URL? Since this is just an object, it's easy to pull both at once.
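A short sketch using Select-Object (innerText and href are the property names exposed in Windows PowerShell 5.1; basic parsing in newer versions may not populate innerText).

$google.Links | Select-Object innerText, href   # Anchor text alongside the URL it points to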
You can also see what the famous Google.com search form with its input box looks like under the hood.
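One sketch of how to get at it: the Forms property is available on the object Windows PowerShell 5.1 returns when you don't use -UseBasicParsing, as in the example above; newer PowerShell versions expose form inputs through InputFields instead.

$google.Forms          # The form's id, method, action and fields
$google.Forms.Fields   # The individual input fields, including the search box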
Let’s take this one step further and download information from a web page. For example, perhaps you want to download all images on the page. To do this, we’ll also use the -UseBasicParsing parameter. This command is faster because Invoke-WebRequest doesn’t crawl the DOM.
For another example, here’s how to use PowerShell to enumerate all images on the CNN.com website and download them to your local computer.
$cnn = Invoke-WebRequest -Uri cnn.com -UseBasicParsing
Now let’s figure out each URL on which the image is hosted.
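Because the earlier request used -UseBasicParsing, each entry in the Images collection exposes the attributes parsed from the tag, including src.

$cnn.Images.src   # The URL each image on the page is hosted at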
Once you have the URLs, you can use Invoke-WebRequest again. However, this time, you’ll use the -OutFile parameter to send the response to a file.
@($cnn.Images.src).foreach({
    # Grab just the file name from the image URL
    $fileName = $_ | Split-Path -Leaf
    Write-Host "Downloading image file $fileName"
    Invoke-WebRequest -Uri $_ -OutFile "C:\$fileName"
    Write-Host 'Image download complete'
})
In this example, the images were saved directly to the root of the C: drive, but you can easily change this location to a different one. With PowerShell’s ability to manage file system ACLs, you can save the images to the directory of your choice.
If you’d like to view the images directly from PowerShell, use the Invoke-Item command to pull up each image’s associated viewer. In this example, Invoke-WebRequest pulled down an image from CNN.com with the word “bleacher” in its file name.
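For instance, if one of the downloaded files were C:\bleacher.jpg (the exact file name will vary with whatever the page is serving that day), opening it takes a single command.

Invoke-Item -Path 'C:\bleacher.jpg'   # Opens the file in its default associated application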
Use the code in this article as a template to build your own tool. For example, you could build a PowerShell function called Invoke-WebScrape with a few parameters like –Url or –Links. Once you have the basics down, you can easily create a customized tool to apply in many different ways.
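As a starting point, here is a minimal sketch of what such a function might look like. The switch-based design and the -Images parameter are assumptions added for illustration; only -Url and -Links are mentioned above.

# A minimal sketch of a reusable scraping function (illustrative design, not a finished tool)
function Invoke-WebScrape {
    param (
        [Parameter(Mandatory)]
        [string]$Url,
        [switch]$Links,
        [switch]$Images
    )

    # One request, then return whichever slice of the page was asked for
    $response = Invoke-WebRequest -Uri $Url -UseBasicParsing

    if ($Links)  { return $response.Links.href }
    if ($Images) { return $response.Images.src }
    return $response.Content
}

# Example usage
Invoke-WebScrape -Url 'https://www.cnn.com' -Links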
If you’re new to web scraping, Trotta suggests starting with more foundational skills. “Familiarity with HTML structure and basic CSS selectors is fundamental to parsing web content. Without this familiarity, you cannot scrape web pages,” Trotta explained. “Focus on small, manageable projects at first, such as extracting headlines from a news website, to build confidence. Then, scale to improve.”
Trotta also suggests incorporating additional features to improve performance and reliability in PowerShell web scraping scripts. With these approaches, developers can create scripts that are both efficient and resilient.
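As one illustration (a generic sketch, not drawn from Trotta’s specific recommendations), a common resilience feature is wrapping Invoke-WebRequest in simple retry logic with error handling. The URL, retry count and delay below are arbitrary example values.

$uri        = 'https://example.com'
$maxRetries = 3
$response   = $null

for ($attempt = 1; $attempt -le $maxRetries; $attempt++) {
    try {
        $response = Invoke-WebRequest -Uri $uri -UseBasicParsing -ErrorAction Stop
        break   # Success, so stop retrying
    }
    catch {
        Write-Host "Attempt $attempt failed: $($_.Exception.Message)"
        Start-Sleep -Seconds 2   # Simple fixed pause before the next attempt
    }
}

if ($response) {
    Write-Host "Retrieved $($response.Content.Length) characters from $uri"
}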
Mark Fairlie contributed to this article.