Businesses rely on the internet for all kinds of critical information: contact details, shipment tracking, competitor pricing, data from portals, and more. While these tasks seem simple, searching websites and portals and copying and pasting the information into Excel can quickly eat up a lot of your time. Manually entering data into a spreadsheet is also highly prone to human error. But with robotic process automation (RPA), you can streamline these repetitive tasks with automated data scraping from websites.
Automated data scraping collects data from many sources and pulls it into one spot, such as an Excel spreadsheet, to eliminate errors and give you time back for more critical projects. Here is one example of how companies are using automated data scraping:
In the video above, you’ll see an Automate bot running a task that enters UPS tracking numbers into the UPS website, scrapes the delivery tracking information, and enters it into an Excel file. After the task runs, the video goes on to show how the task was built. All but step 1 are shown in the video.

Step 1: Download an Automate trial.
Step 2: Build the task by starting with variables. (If you need a basic primer on how to build Automate tasks, Automate Academy is a great place to learn.) In this task, you’ll add variables for file names, rows, etc. Notice that the task builder is drag and drop, with no coding required!
Step 3: Open the Excel workbook to get the tracking numbers. You’ll store these as a dataset to use later on.
Step 4: Add a step to create a report workbook to write the dataset to.
Step 5: Use the report workbook with tracking numbers and column headings in a web browser activity.
Step 6: Identify which pieces of information you need. This includes telling the Automate bot where to find the data you want scraped. Put this on a loop that goes through all the tracking numbers, scraping the data from the UPS website into Excel.
Step 7: For each piece of data scraped from the website, write the variable value to a cell in the workbook.

This is just one example of Excel automation. There are many other ways Automate and Excel can work together to take manual work off your plate.

Excel’s Power Query (known as Get & Transform since Excel 2016) is a great tool for building queries to get data from the web. Within a couple of minutes you can build a query that pulls data from a webpage and transforms it into the desired format. This is ideal for webpages that are updated frequently, as you can easily refresh your query to pull the new data. Remember, if you’re not using Excel 2016 or later, you’ll need to install the Power Query add-in.
Data to Extract

In this post we’re going to look at how we can pull data from a series of similar pages. I’m a big MMA fan, so the example we’re going to look at is getting a list of all UFC results from Wikipedia. If you visit the Wikipedia page for UFC events, there’s a table of past events. If you click on one of the events, you’ll see a results table. If you look at a few more events, you’ll notice the structure is exactly the same: they all have a results table. This is the data I want to get, but from all 400+ events listed in the past events section. If the number of pages were much larger, you might be better off using another tool like Python, but here we’re going to use Power Query.

Create a Query Function

First, we will create a query to extract the data on one page. We will then turn this into a query function whose input is an event page URL. This way we can apply the query to each URL in a list of all the URLs. Head to the Data tab in the ribbon and press the From Web button in the Get & Transform section. If you’re working with Excel 2013 or earlier via the add-in, this is found under the Power Query tab. Enter the URL and press the OK button.
Excel will connect with the page and the Navigator dialog box will open.
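Selecting the results table in the Navigator and loading it generates an M query behind the scenes. A minimal sketch of what that query might look like, assuming the results table is the first table on the page and using the UFC 217 page as an example URL (the auto-generated code may also include extra type-conversion steps):

```
// Sketch of the auto-generated query for a single event page.
let
    // Download the page and let Power Query parse its HTML tables
    Source = Web.Page(Web.Contents("https://en.wikipedia.org/wiki/UFC_217")),
    // Assumption: the results table is the first table found on the page
    Data = Source{0}[Data]
in
    Data
```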
Rename the query to fGetWikiResults. This is the name we will call later to use our query function. Now we can edit our query to turn it into a query function. Go to the View tab and press the Advanced Editor button. This lets us edit the M code that Excel has created to extract the data from this URL. We will need to edit this code to the following; the parts that need to be added or changed are highlighted in red.
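A minimal sketch of such a parametrized query in Power Query M, assuming the results table is the first table Web.Page returns (the original post’s exact code may differ), looks like this:

```
// Sketch, not the post's exact code: wrap the generated query in a
// function that takes the event page URL as a parameter.
(URL as text) =>
let
    // Download the page passed in as the parameter
    Source = Web.Page(Web.Contents(URL)),
    // Assumption: the results table is the first table on each event page
    Data = Source{0}[Data]
in
    Data
```

The only structural change from the generated query is the `(URL as text) =>` line and the replacement of the hard-coded address with the `URL` parameter.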
Press the Done button when finished editing the query. This turns our query into a parametrized query with the URL as an input. You should see that the data preview in the query editor has been replaced with a parameter input. We don’t need to enter anything here and can just leave it blank. We can then save our query function by going to the Home tab and pressing the Close & Load button. You should now see the fGetWikiResults query function in the Queries & Connections window.

Get a List of URLs

Now we need to get our list of event page URLs from the Past Events page. We could use Power Query to import this table, but that would pull in only the text, not the underlying hyperlinks. The best way to get the list of URLs is to parse the page’s source code. You can view any webpage’s source code by pressing Ctrl + U in the Chrome browser. You’ll need to be fairly familiar with HTML to find what you’re looking for. The first couple of lines of HTML we are interested in look like this. I have highlighted the hyperlinks we’re interested in to demonstrate where they are. You can parse these out in another Excel workbook using some filters and basic text formulas. We will also need to concatenate the starting part of the address onto each relative link to get a full URL (e.g. https://en.wikipedia.org/wiki/UFC_217).
Once we have the full list of event URLs, we can turn the list into an Excel Table using the Ctrl + T keyboard shortcut and name it URL_List.

Use the Query Function on Our URL List

We are now ready to use the fGetWikiResults query function on our list of event URLs. Create a query based on the URL_List table: select a cell in the table, go to the Data tab in the ribbon, and press the From Table/Range button in the Get & Transform section. Now we will add a custom column to the query; this is where we’ll invoke our fGetWikiResults query function. Go to the Add Column tab and press the Custom Column button. Add a new column name and the custom column formula fGetWikiResults([URL]). The new custom column will contain a Table for each URL, and we need to expand these tables to see the results. Left click on the filter icon in the Results Data column heading, select Expand from the menu, and press the OK button. Some of the column headings were missing in our source data, so we can rename them: double left click on a column heading to rename it. We can now Close & Load the query, and the results data will load into a new sheet. This can take a good few minutes across 400+ pages, so be patient. It is also why you should consider Python or a similar tool if you have many more pages than this example.
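Taken together, the steps above generate a query whose M code looks roughly like this sketch (the custom column is assumed to be named Results Data, and the expanded column names are hypothetical, since the post renames the missing headings by hand):

```
// Sketch of the combined query built through the UI steps above.
let
    // Read the URL_List table from the current workbook
    Source = Excel.CurrentWorkbook(){[Name = "URL_List"]}[Content],
    // Invoke the query function once per row, storing a nested Table
    AddedResults = Table.AddColumn(Source, "Results Data",
        each fGetWikiResults([URL])),
    // Expand the nested tables into rows; these column names are
    // placeholders for the real results-table headings
    Expanded = Table.ExpandTableColumn(AddedResults, "Results Data",
        {"Column1", "Column2"})
in
    Expanded
```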
About the Author

John is a Microsoft MVP and qualified actuary with over 15 years of experience. He has worked in a variety of industries, including insurance, ad tech, and most recently Power Platform consulting. He is a keen problem solver and has a passion for using technology to make businesses more efficient.

How do I get data from a URL into Excel?

Select Data > Get & Transform > From Web. Press Ctrl + V to paste the URL into the text box, and then select OK. In the Navigator pane, under Display Options, select the Results table.
Can Excel pull live data from a website?

You can easily import a table of data from a web page into Excel and regularly update the table with live data. Open a worksheet in Excel to get started.
How do I extract information from a link?

There are several ways to extract data from a website:
- Code a web scraper with Python, or with any general-purpose programming language such as Java, JavaScript, PHP, C, or C#.
- Use a data service.
- Use Excel for data extraction.
- Use web scraping tools.

How do I automatically extract data from a website into Excel at regular intervals?

To extract data from websites on a schedule, you can take advantage of data extraction tools like Octoparse. These tools can pull data from websites automatically and save it in many formats, such as Excel, JSON, CSV, or HTML, or send it to your own database via APIs.