#140: Create a Basic Link Checker

With requests, Beautiful Soup, and the ultimate sitemap parser we have everything we need to create our own basic link checker. Let us combine the three parts into something useful and check whether our web site has broken links.

Preparation

Before we can use the modules, we need to import them into our application:

import requests
from bs4 import BeautifulSoup
from usp.tree import sitemap_tree_for_homepage
from typing import NamedTuple

I not only want to know which links do not work, I also want to be able to find and fix them on my web site. Therefore, I need to know which page contains the broken link. For that I use a named tuple like this one:

class Page(NamedTuple):
    url: str
    text: str

(The link URL we want to check will be the key in the dictionary we introduce in a moment.)
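
To make that structure a bit more tangible, here is a small sketch of how link URLs could map to the pages that contain them. The URLs and link texts are made up for illustration, and dict.setdefault is only one of several ways to fill such a dictionary:

```python
from typing import NamedTuple


class Page(NamedTuple):
    url: str   # page on our site that contains the link
    text: str  # visible text of the link


# Hypothetical sample data: each entry is (link target, page that links to it)
found = [
    ("https://example.com/docs", Page("https://example.org/", "Docs")),
    ("https://example.com/docs", Page("https://example.org/about", "Read more")),
    ("https://example.com/blog", Page("https://example.org/", "Blog")),
]

all_links = {}
for target, source in found:
    # setdefault creates the empty list the first time we see a target
    all_links.setdefault(target, []).append(source)

for target, sources in all_links.items():
    print(target, [source.url for source in sources])
```

Each link target shows up once as a key, no matter how many pages point to it.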

Fetch the sitemap

We can use the ultimate sitemap parser to fetch all pages of our web site:

def read_sitemap(domain):
    tree = sitemap_tree_for_homepage(domain)
    pages = [page.url for page in tree.all_pages()]
    return pages
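
Depending on how a site builds its sitemap, the same URL may show up more than once. If that happens, we would load the page several times for nothing. A minimal sketch for removing such duplicates while keeping the order could look like this; the helper name unique_pages and the sample URLs are assumptions and not part of the sitemap parser:

```python
def unique_pages(urls):
    """Return the URLs without duplicates, keeping their original order."""
    # dict keys are unique and keep their insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


# Made-up input; in the link checker this would be the list
# that read_sitemap() returns.
pages = [
    "https://example.org/",
    "https://example.org/about",
    "https://example.org/",
]
print(unique_pages(pages))
```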

Extract the links

We can now iterate through all our initial pages, use requests to load the page and Beautiful Soup to extract all links we want to check:

from urllib.parse import urljoin  # additional import to resolve relative links

def find_links(pages):
    all_links = {}

    for page in pages:
        content = requests.get(page)
        soup = BeautifulSoup(content.text, "html.parser")
        links = soup.find_all("a")

        for link in links:
            link_target = link.get("href")

            # skip anchors without href and links within the same page
            if link_target is None or link_target.startswith("#"):
                continue

            if link.get("rel") is not None and "nofollow" in link.get("rel"):
                continue

            link_text = link.get_text().strip()
            source = Page(page, link_text)

            # resolve relative links against the URL of the current page
            if not link_target.lower().startswith("http"):
                link_target = urljoin(page, link_target)

            if link_target in all_links:
                all_links[link_target].append(source)
            else:
                all_links[link_target] = [source]

    return all_links

We store the links to check in a dictionary and use a list of our pages as its value. That allows us to check each remote URL only once while keeping enough context to know all the places where we link to it.

We skip all links to other parts on the same page and stay away from links marked as "nofollow".
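
A relative link has to be combined with the URL of the page it was found on before we can check it. As a sketch of that step, and only one way to do it, the standard library function urljoin can take care of the details. The helper name resolve_link and the example URLs are assumptions made for illustration; skipping everything that is not http or https (like mailto: links) is an extra rule that goes beyond what we described above:

```python
from urllib.parse import urljoin, urlsplit


def resolve_link(page_url, href):
    """Return an absolute http(s) URL for href, or None if we skip it."""
    if not href or href.startswith("#"):
        return None  # missing href or a jump within the same page

    absolute = urljoin(page_url, href)
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None  # for example mailto: or tel: links

    return absolute


page = "https://example.org/docs/index.html"
print(resolve_link(page, "install.html"))
print(resolve_link(page, "/about"))
print(resolve_link(page, "https://example.com/"))
print(resolve_link(page, "#top"))
print(resolve_link(page, "mailto:me@example.org"))
```

The first call replaces the last path segment (index.html) with install.html, while the second one starts again at the root of the site.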

Check the links

We can now iterate through the keys of our dictionary with all the links and use a HEAD request to check if there is something behind each URL. We need a little bit of error handling so that our link checker keeps running if a URL is not reachable:

def check_links(all_links):
    status = {}

    for key in all_links:
        try:
            print(f"working on {key}")
            page = requests.head(key, timeout=5)
            code = page.status_code
        except requests.exceptions.RequestException as error:
            # requests wraps network errors such as a refused connection
            # in its own exception classes, so we store their names
            code = type(error).__name__
        except Exception:
            code = "Exception"

        if code in status:
            status[code].append(key)
        else:
            status[code] = [key]

    return status

Every request will get us a status code or the exception name. Whatever we get, we use it as a key in another dictionary and reuse the idea of a list of URLs as the value part of the dictionary.
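
Some servers do not answer HEAD requests properly. The following sketch shows one possible way to get a status code or an exception name for a single URL and to retry with GET when HEAD is rejected. The function name status_of is hypothetical, the two callables head and get are stand-ins for the requests functions, and treating 405 and 501 as "HEAD not supported" is an assumption that will not cover every server:

```python
def status_of(url, head, get):
    """Return the status code of url or the name of the exception raised."""
    try:
        response = head(url)
        # 405 (Method Not Allowed) and 501 (Not Implemented) suggest that
        # the server does not handle HEAD, so we retry with GET.
        if response.status_code in (405, 501):
            response = get(url)
        return response.status_code
    except Exception as error:
        return type(error).__name__


# Fake stand-ins so that the sketch runs without network access
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


def fake_head(url):
    if "down" in url:
        raise ConnectionError("simulated outage")
    return FakeResponse(405 if "no-head" in url else 200)


def fake_get(url):
    return FakeResponse(200)


print(status_of("https://example.org/", fake_head, fake_get))
print(status_of("https://example.org/no-head", fake_head, fake_get))
print(status_of("https://example.org/down", fake_head, fake_get))
```

With requests, the two callables could be something like lambda url: requests.head(url, timeout=5) and lambda url: requests.get(url, timeout=5).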

Report the result

As a final step we need to combine the result of the check with our pages that contain those links. For that we iterate through our dictionaries and their lists before we can write everything to a file:

def create_report(all_links, result):
    with open("report_link_status.txt", "w", encoding="utf-8") as f:
        for code in result:
            f.write(f"- {code}\n")
            for page in result[code]:
                f.write(f"\t - {page}\n")
                for source in all_links[page]:
                    f.write(f"\t\t - {source.url} [{source.text}]\n")
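
The report lists the groups in the order in which the codes were first seen. If you would rather see the problems at the top, you could sort the keys first. Because the keys mix integer status codes with exception names, a plain sorted(result) would raise a TypeError, so this sketch uses a small key function. The name problems_first and the sample data are assumptions for illustration:

```python
def problems_first(result):
    """Return the keys of result with non-2xx codes and exceptions first."""
    def rank(code):
        is_ok = isinstance(code, int) and 200 <= code < 300
        # False sorts before True; str() makes ints and names comparable
        return (is_ok, str(code))

    return sorted(result, key=rank)


# Made-up result in the shape that check_links() returns
result = {
    200: ["https://example.org/"],
    "Exception": ["https://example.org/unreachable"],
    404: ["https://example.org/missing"],
}
print(problems_first(result))
```

In create_report, the outer loop could then iterate over problems_first(result) instead of result.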

Orchestrate the different parts

The only thing left to do is to use the main block to orchestrate the different parts in the right order:

if __name__ == "__main__":
    pages = read_sitemap("https://requests.readthedocs.io/")
    all_links = find_links(pages)
    result = check_links(all_links)
    create_report(all_links, result)

When we run the script, it should create a report like this one and save it in the file report_link_status.txt:

- 200
     - https://requests.readthedocs.io/en/stable/user/install/#install
         - https://requests.readthedocs.io/en/stable/ [Installation]
     - https://pepy.tech/project/requests
         - https://requests.readthedocs.io/en/stable/ []
     - https://pypi.org/project/requests/
         - https://requests.readthedocs.io/en/stable/ []
         - https://requests.readthedocs.io/en/stable/ []
         - https://requests.readthedocs.io/en/stable/ []
         - https://requests.readthedocs.io/en/stable/ [Requests @ PyPI]

Extension points

If you want to continue with this basic link checker, you could address these shortcomings on your own:

Change the logger for the ultimate sitemap parser to write into a file instead of the console
For bigger sites you want to have a progress bar to see that your link checker is still alive
Not all web sites support HEAD requests; for that a fallback to GET may be in order
The minimalistic error handling has room for improvement
You can turn the domain and the report file into parameters that you can set when you call your script from the command line
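
As a starting point for the last item, here is a minimal sketch with argparse from the standard library. The argument names domain and --report are my own choice, and the default simply reuses the file name from above:

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means argparse reads sys.argv."""
    parser = argparse.ArgumentParser(description="Basic link checker")
    parser.add_argument("domain", help="homepage of the site to check")
    parser.add_argument(
        "--report",
        default="report_link_status.txt",
        help="file to write the report to",
    )
    return parser.parse_args(argv)


# Passing a list lets us try the parser without a real command line
args = parse_args(["https://example.org/", "--report", "links.txt"])
print(args.domain, args.report)
```

In the main block you would call parse_args() without arguments, pass args.domain to read_sitemap() and use args.report as the file name in create_report().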

The basic link checker we created in this post is a great example of how you can leverage different Python libraries to create something useful. You can take this code and add more features as you see fit. You will be surprised how far you can go with that. Next week we look at a much lower level of network traffic as we try to read the TLS/SSL certificates to figure out when they reach their end of life.