Web Scraping

Python Tools
Author: Kerry Back

Published: May 19, 2025

Getting data from websites is a task for which “AI + coding” can really increase efficiency. I use two examples to illustrate web scraping. The first is to extract a table from a web page. The second is to download multiple csv files from a website. It is fair to warn students that there are other examples that are more difficult, a point to which I’ll return at the end.

This Wikipedia page maintains a list of the S&P 500 companies with information such as ticker, sector, and headquarters location. If you ask Julius to go to the page, extract the table, and save it as an Excel file, it is highly likely that the task will be quickly accomplished with the aid of the pandas read_html function. Compared to manually typing the information into an Excel worksheet, this is an impressive time saver.
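If it helps to show students what the generated code typically looks like, here is a minimal sketch. It assumes the constituents table is the first table on the page (which could change if Wikipedia reorganizes the page) and that the lxml and openpyxl packages are installed.

```python
import pandas as pd

# Wikipedia's list of S&P 500 companies
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# read_html returns a list of DataFrames, one per table found on the page;
# the constituents table is assumed to be the first one
tables = pd.read_html(url)
sp500 = tables[0]

# save to an Excel workbook (requires openpyxl)
sp500.to_excel("sp500_constituents.xlsx", index=False)
```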

It is worthwhile to explain to students how the read_html function is able to accomplish this. In my experience, many students have never seen HTML code and are unaware of how a web browser works. So, I right-click on the Wikipedia page, choose “View page source” to open a window showing the HTML code, and explain that a web browser is a program that interprets the code to render the page that we see. Pressing CTRL-F on the page of HTML code opens a search box as usual, and searching for “<table” locates the table tag. I explain that the read_html function does the same thing: it scans the HTML for the table tag and then interprets the tags that mark up rows and cells (<tr> for rows, and <td> or <th> for cells) to build the table that we save to Excel.
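To make the tag interpretation concrete, it can also help to show read_html parsing a tiny table written directly in HTML. The tags in this toy example, <table>, <tr>, <th>, and <td>, are the same ones read_html looks for in the Wikipedia source.

```python
from io import StringIO

import pandas as pd

# a minimal HTML table: <tr> marks a row, <th> a header cell, <td> a data cell
html = """
<table>
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>AAPL</td><td>Apple Inc.</td></tr>
  <tr><td>MSFT</td><td>Microsoft</td></tr>
</table>
"""

# read_html scans the text for <table> tags and returns one DataFrame per table
df = pd.read_html(StringIO(html))[0]
print(df)
```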

The second example I show students is getting short interest data from FINRA. FINRA has an API, but there is a solution that is easier for students to follow and that generalizes to other contexts. The page https://www.finra.org/finra-data/browse-catalog/equity-short-interest/files provides a point-and-click interface to download the data for all stocks on a single reporting date. There are two reporting dates each month. Links to both dates are shown for the month selected in the dropdown menu. Clicking a date initiates a download of a csv file containing the data for that date. The filename convention is very simple - for example, the link for April 30, 2025 initiates the download of a file named “shrt20250430.csv.” What is the full URL of that file? If we find it, then we can go to it directly to get the data and avoid the point-and-click.

I’ll use an example of the FINRA download page when April 2025 is selected from the dropdown menu, but other cases follow the same format. Right-click on the page and select “View page source” as discussed before. Use CTRL-F to bring up the search box and search for “20250430.” That brings us to the URL for the shrt20250430.csv file. We see that the URL is https://cdn.finra.org/equity/otcmarket/biweekly/shrt20250430.csv. Inserting that URL in the address bar of a browser initiates the download of the file.1
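Once the URL pattern is known, a single file can also be read directly in python, without the browser. This sketch assumes FINRA’s CDN serves the file to non-browser clients and that the file is comma-delimited; if it turns out to be pipe-delimited, as some FINRA files are, add sep="|" to read_csv.

```python
import pandas as pd

# direct URL for the April 30, 2025 reporting date
url = "https://cdn.finra.org/equity/otcmarket/biweekly/shrt20250430.csv"

# read the file straight from the URL into a DataFrame
df = pd.read_csv(url)
print(df.head())
```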

Our goal is to automate the downloading of these files by looping over dates. The reporting days appear to be the 15th and last day of each month, or the last trading days prior to those days. Instead of finding those days, we can ask Julius to try every possible day in a try-except block.2 That is, we can ask Julius to loop over all dates in a given date range, insert each date in YYYYMMDD format into the URL above inside a try-except block, download and concatenate the files, and save the complete data as a csv file.3
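A sketch of the kind of loop that Julius typically writes is below. The date range and output filename are placeholders, the added reportDate column is just for bookkeeping, and the pass in the except clause skips any date for which no file exists.

```python
import pandas as pd

# placeholder date range; adjust as needed
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")

frames = []
for date in dates:
    # insert the date in YYYYMMDD format into the URL pattern
    url = f"https://cdn.finra.org/equity/otcmarket/biweekly/shrt{date:%Y%m%d}.csv"
    try:
        df = pd.read_csv(url)
        df["reportDate"] = date   # record which reporting date the rows came from
        frames.append(df)
    except Exception:
        pass                      # no file for this date, so skip it

# stack all reporting dates and save as a single csv file
data = pd.concat(frames, ignore_index=True)
data.to_csv("short_interest_2024.csv", index=False)
```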

Unfortunately, web scraping is often not this easy, because many pages are dynamically generated using JavaScript. This means that the data we want is not part of the source code that we (and python) can see with “View page source” but instead is retrieved by executing the JavaScript code that appears alongside the HTML in the source. A good example is this page, which can be found by following links on the SEC EDGAR website to the 10-K filed by Tesla with the SEC on April 30, 2025. The first table on that page is a listing of the directors of Tesla. If we open the source code with “View page source” and use CTRL-F to search for “Elon Musk,” we get “0 results,” despite the fact that Musk appears in the table as a director. The reason is that the entire page is generated by the JavaScript code inside the <script language="JavaScript"> tag.
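A quick programmatic version of the same check is to fetch the raw HTML and search it for the text, mirroring the CTRL-F exercise. The URL below is only a placeholder for the filing page found on EDGAR, and the User-Agent header reflects EDGAR’s request that automated clients identify themselves.

```python
import requests

# placeholder: replace with the actual URL of the filing page found on EDGAR
url = "https://www.sec.gov/Archives/edgar/data/.../tesla-filing.htm"

# SEC EDGAR asks automated clients to identify themselves in the User-Agent header
headers = {"User-Agent": "Your Name your.email@example.com"}

html = requests.get(url, headers=headers).text
print("Elon Musk" in html)   # prints False because the name is not in the raw source
```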

There are python tools for executing the JavaScript code and retrieving the data, in particular the selenium library, but I have never had the patience to complete that type of task.4 Perhaps AI agents using Anthropic’s Model Context Protocol (MCP) to control a web browser are the best tool for that type of exercise now. More on that later.
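For reference, the basic selenium pattern looks roughly like the following. This is only a sketch: it assumes a Firefox driver (geckodriver) is available, uses a crude sleep to wait for the JavaScript to finish, and leaves the page URL as a placeholder.

```python
import time
from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

url = "https://..."   # placeholder: the dynamically generated page

options = Options()
options.add_argument("--headless")   # run the browser without opening a window

driver = webdriver.Firefox(options=options)
driver.get(url)
time.sleep(10)                       # crude wait for the JavaScript to finish running
html = driver.page_source            # the HTML after the JavaScript has executed
driver.quit()

# the rendered HTML now contains the tables, so read_html can find them
tables = pd.read_html(StringIO(html))
```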


First published on finance-with-ai.org
Also on substack at kerryback.substack.com

Footnotes

  1. By the way, the “biweekly” in the URL is a misnomer; the reporting dates are twice monthly, not every other week.↩︎

  2. When an error occurs inside the “try” block of a try-except statement (for example, from trying to download a nonexistent file), the program does not stop. Instead, the “except” clause is executed; here, the “except” action is simply “pass.”↩︎

  3. Excel can read csv files. You could ask instead for an Excel file, but the data may exceed Excel’s limit of roughly 1 million rows.↩︎

  4. It is not obvious that Julius is the best tool for this anyway, because it seems impossible to install the Chromium webdriver in Julius, and the available Firefox webdriver seems to have some limitations.↩︎