Bash 404 checker script for broken pages and backlinks
Bash SEO automation

Build a Bash 404 Checker to Find Broken Pages and Internal Backlinks

This guide shows how to build a practical Bash 404 checker using curl, grep and sed. The script scans your sitemap, checks page status codes, crawls internal links, reports broken backlinks and saves everything to a CSV file.

Overview

Why check for 404 pages and broken internal backlinks?

A broken page is bad enough, but a broken internal backlink is worse because it means your own site is actively sending visitors and search engines to a dead URL. That hurts user experience, wastes crawl budget and makes your site look less maintained than it really is.

This Bash script is designed for real site maintenance. It does not just check a single URL. It reads URLs from your sitemap, checks each page, extracts internal links from those pages, then checks those links too. The result is a simple terminal summary plus a CSV report you can keep, filter or share.

Find broken pages

Detect URLs from your sitemap returning 404, 500 or connection failures.

Find broken backlinks

Catch internal links pointing to old slugs, missing articles or renamed tools.

Save a report

Export broken pages and backlinks to site-404-report.csv.

Features

What this Bash 404 checker does

Check | What it means | Why it helps
Sitemap scan | Reads URLs from /sitemap.xml. | Starts from the pages search engines are likely to crawl.
Status check | Uses curl to check HTTP response codes. | Spots 404, 500 and timeout issues quickly.
Internal link crawl | Extracts <a href="..."> links from each page. | Finds backlinks inside your own content that point to broken URLs.
CSV output | Writes broken results to site-404-report.csv. | Makes it easier to review, sort and fix issues.
Summary file | Writes counts to site-404-summary.txt. | Gives you a quick before and after score when fixing links.

This is intentionally a Bash script, not a full crawler. It is lightweight, easy to understand and good enough for regular checks on small to medium content sites.

Setup

Install the script

Create a file called site-404-checker.sh.

nano site-404-checker.sh

Paste the full script from the next section, then make it executable.

chmod +x site-404-checker.sh

You need Bash 4 or later (the script uses associative arrays), plus curl, grep, sed, tr, wc, sort, tee and mktemp. These are available by default on most Linux servers.
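If you want to confirm the requirements before running anything, a quick pre-flight check can help. The following is a minimal sketch rather than part of the checker itself; the missing_tools function name is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of site-404-checker.sh.

# Print the name of every command that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools bash curl grep sed tr wc sort tee mktemp)"

if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi

# Associative arrays (declare -A) need Bash 4.0 or later.
if [ "${BASH_VERSINFO[0]}" -lt 4 ]; then
    echo "Bash $BASH_VERSION is too old; version 4 or later is needed."
fi
```

The check only reports problems; it does not stop you from running the checker.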

Full script

The complete Bash 404 and broken backlink checker

This version outputs to a CSV file and also prints the number of broken pages, unique 404 URLs and broken backlinks in the terminal.

#!/usr/bin/env bash

# Site 404 and Broken Backlink Checker
# Usage:
#   ./site-404-checker.sh yourdomain.com
#   ./site-404-checker.sh https://yourdomain.com
#   ./site-404-checker.sh urls.txt
#   ./site-404-checker.sh yourdomain.com --pages-only
#   ./site-404-checker.sh yourdomain.com --include-assets

set -u

INPUT="${1:-}"
MODE="${2:-}"
REPORT="site-404-report.csv"
SUMMARY="site-404-summary.txt"
CRAWL_INTERNAL_LINKS=1
INCLUDE_ASSETS=0
MAX_PAGES=1000

if [ -z "$INPUT" ]; then
    echo "Usage: $0 <domain|url|urls.txt> [--pages-only|--include-assets]"
    exit 1
fi

if [ "$MODE" = "--pages-only" ]; then
    CRAWL_INTERNAL_LINKS=0
fi

if [ "$MODE" = "--include-assets" ]; then
    INCLUDE_ASSETS=1
fi

TMP_DIR=$(mktemp -d)
URLS_FILE="$TMP_DIR/urls.txt"
trap 'rm -rf "$TMP_DIR"' EXIT

csv_escape() {
    local value="$1"
    value=${value//\"/\"\"}
    printf '"%s"' "$value"
}

write_csv_row() {
    csv_escape "$1"; printf ','
    csv_escape "$2"; printf ','
    csv_escape "$3"; printf ','
    csv_escape "$4"; printf ','
    csv_escape "$5"; printf '\n'
}

normalise_base() {
    local value="$1"

    if [[ "$value" != http://* && "$value" != https://* ]]; then
        value="https://$value"
    fi

    value="${value%/}"
    echo "$value" | sed -E 's#^(https?://[^/]+).*$#\1#'
}

get_domain() {
    echo "$1" | sed -E 's#^https?://([^/]+).*$#\1#'
}

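# When INPUT is a URL list file, take the base from its first URL.
# Otherwise the filename itself would be treated as the domain and
# no internal links would ever match it.
if [ -f "$INPUT" ]; then
    FIRST_URL="$(grep -m1 -E '^https?://' "$INPUT")"
    BASE="$(normalise_base "${FIRST_URL:-$INPUT}")"
else
    BASE="$(normalise_base "$INPUT")"
fi

DOMAIN="$(get_domain "$BASE")"
USER_AGENT="Site404Checker/1.0 (+https://$DOMAIN)"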

get_status() {
    local url="$1"

    curl -Ls -o /dev/null -w "%{http_code}" \
        --connect-timeout 10 \
        --max-time 20 \
        -A "$USER_AGENT" \
        "$url"
}

is_bad_status() {
    local status="$1"

    if [ "$status" = "000" ]; then
        return 0
    fi

    if [[ "$status" =~ ^[0-9]+$ ]] && [ "$status" -ge 400 ]; then
        return 0
    fi

    return 1
}

fetch_sitemap_urls() {
    local sitemap_url="$1"
    local depth="$2"

    if [ "$depth" -gt 3 ]; then
        return
    fi

    local sitemap_content
    sitemap_content=$(curl -Ls --connect-timeout 10 --max-time 20 -A "$USER_AGENT" "$sitemap_url")

    echo "$sitemap_content" |
        grep -oiE '<loc>[^<]+</loc>' |
        sed -E 's#</?loc>##Ig' |
        while read -r loc; do
            [ -z "$loc" ] && continue

            if [[ "$loc" == *.xml || "$loc" == *sitemap* ]]; then
                fetch_sitemap_urls "$loc" $((depth + 1))
            else
                echo "$loc"
            fi
        done
}

should_skip_url() {
    local url="$1"

    # Cloudflare email protection URLs are not real content pages.
    if echo "$url" | grep -qiE '/cdn-cgi/'; then
        return 0
    fi

    if [ "$INCLUDE_ASSETS" -eq 1 ]; then
        return 1
    fi

    if echo "$url" | grep -qiE '\.(jpg|jpeg|png|gif|webp|svg|ico|css|js|pdf|zip|mp4|mp3|woff|woff2|ttf)(\?|$)'; then
        return 0
    fi

    return 1
}

normalise_link() {
    local source_url="$1"
    local raw_link="$2"
    local url link_domain page_base

    raw_link="${raw_link%%#*}"
    raw_link="${raw_link//$'\r'/}"
    raw_link="${raw_link//$'\n'/}"
    raw_link="${raw_link//&amp;/&}"

    [ -z "$raw_link" ] && return

    if echo "$raw_link" | grep -qiE '^(mailto:|tel:|javascript:|data:)'; then
        return
    fi

    if [[ "$raw_link" == //* ]]; then
        url="https:$raw_link"
    elif [[ "$raw_link" == http://* || "$raw_link" == https://* ]]; then
        url="$raw_link"
    elif [[ "$raw_link" == /* ]]; then
        url="$BASE$raw_link"
    else
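        # Resolve relative links against the page's directory. Drop any
        # query string first, and handle a bare origin such as
        # https://example.com, which has no path to trim.
        page_base="${source_url%%\?*}"
        if [[ "$page_base" =~ ^https?://[^/]+$ ]]; then
            page_base="$page_base/"
        else
            page_base="${page_base%/*}/"
        fi
        url="$page_base$raw_link"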
    fi

    link_domain="$(get_domain "$url")"

    if [ "$link_domain" != "$DOMAIN" ]; then
        return
    fi

    if should_skip_url "$url"; then
        return
    fi

    echo "$url"
}

extract_internal_links() {
    local source_url="$1"

    curl -Ls --connect-timeout 10 --max-time 20 -A "$USER_AGENT" "$source_url" |
        tr '\n' ' ' |
        grep -oiE "<a[[:space:]][^>]*href=[\"'][^\"' >]+" |
        sed -E "s/^.*href=[\"']//I" |
        while read -r link; do
            normalise_link "$source_url" "$link"
        done |
        sort -u
}

write_csv_row "Type" "Source URL" "Checked URL" "HTTP Status" "Issue" > "$REPORT"

if [ -f "$INPUT" ]; then
    grep -E '^https?://' "$INPUT" | sort -u > "$URLS_FILE"
else
    fetch_sitemap_urls "$BASE/sitemap.xml" 0 | sort -u > "$URLS_FILE"

    if [ ! -s "$URLS_FILE" ]; then
        echo "$BASE/" > "$URLS_FILE"
    fi
fi

TOTAL_URLS=$(wc -l < "$URLS_FILE" | tr -d ' ')
PAGE_COUNT=0
BROKEN_PAGES=0
BROKEN_LINKS=0
NOT_FOUND_404=0
BROKEN_BACKLINKS=0
CONNECTION_ERRORS=0

declare -A STATUS_CACHE
declare -A UNIQUE_404

LAST_STATUS=""

# Look up a URL's status, requesting it only on the first call.
# The result is stored in LAST_STATUS instead of being echoed: calling
# this function inside $(...) would run it in a subshell, and the cache
# entry written there would be lost when the subshell exits.
check_status_cached() {
    local url="$1"

    if [[ -z "${STATUS_CACHE[$url]+x}" ]]; then
        STATUS_CACHE[$url]="$(get_status "$url")"
    fi

    LAST_STATUS="${STATUS_CACHE[$url]}"
}

count_404_once() {
    local url="$1"
    local status="$2"

    if [ "$status" = "404" ] && [[ -z "${UNIQUE_404[$url]+x}" ]]; then
        UNIQUE_404[$url]=1
        NOT_FOUND_404=$((NOT_FOUND_404 + 1))
    fi
}

echo "Checking $TOTAL_URLS page URL(s) for $DOMAIN"
echo

while read -r page_url; do
    [ -z "$page_url" ] && continue

    # Test the limit before counting, so "Pages checked" stays accurate.
    if [ "$PAGE_COUNT" -ge "$MAX_PAGES" ]; then
        echo "Reached page limit of $MAX_PAGES. Increase MAX_PAGES in the script if needed."
        break
    fi

    PAGE_COUNT=$((PAGE_COUNT + 1))

    check_status_cached "$page_url"
    page_status="$LAST_STATUS"
    count_404_once "$page_url" "$page_status"

    echo "[$page_status] $page_url"

    if [ "$page_status" = "000" ]; then
        CONNECTION_ERRORS=$((CONNECTION_ERRORS + 1))
    fi

    if is_bad_status "$page_status"; then
        BROKEN_PAGES=$((BROKEN_PAGES + 1))
        write_csv_row "Page" "" "$page_url" "$page_status" "Broken page" >> "$REPORT"
        continue
    fi

    if [ "$CRAWL_INTERNAL_LINKS" -eq 1 ]; then
        while read -r found_link; do
            [ -z "$found_link" ] && continue

            check_status_cached "$found_link"
            link_status="$LAST_STATUS"
            count_404_once "$found_link" "$link_status"

            if is_bad_status "$link_status"; then
                BROKEN_LINKS=$((BROKEN_LINKS + 1))
                BROKEN_BACKLINKS=$((BROKEN_BACKLINKS + 1))

                if [ "$link_status" = "000" ]; then
                    CONNECTION_ERRORS=$((CONNECTION_ERRORS + 1))
                fi

                write_csv_row "Broken backlink" "$page_url" "$found_link" "$link_status" "Internal link points to broken URL" >> "$REPORT"
                echo "  Broken backlink [$link_status]: $found_link"
            fi
        done < <(extract_internal_links "$page_url")
    fi

done < "$URLS_FILE"

{
    echo
    echo "404 check complete."
    echo "Domain checked: $DOMAIN"
    echo "Pages checked: $PAGE_COUNT"
    echo "Broken pages: $BROKEN_PAGES"
    echo "404 URLs found: $NOT_FOUND_404"
    echo "Broken backlinks: $BROKEN_BACKLINKS"
    echo "Connection errors: $CONNECTION_ERRORS"
    echo "CSV report: $REPORT"
    echo "Summary report: $SUMMARY"
} | tee "$SUMMARY"

Example domain

In the examples below, yourdomain.com is a placeholder. Replace it with your own website domain, for example example.com. The sample results are illustrative so you can see what the script output looks like before running it on your own site.

Usage

How to run the 404 checker

Run it against a domain. You can include or omit https://.

./site-404-checker.sh yourdomain.com

To check only the URLs from the sitemap and skip crawling internal links, use --pages-only.

./site-404-checker.sh yourdomain.com --pages-only

To include assets such as images, CSS and JavaScript, use --include-assets. For most SEO checks, page links are the priority, so assets are skipped by default.

./site-404-checker.sh yourdomain.com --include-assets

You can also provide a manual URL list.

cat > urls.txt <<EOF
https://yourdomain.com/
https://yourdomain.com/blog
https://yourdomain.com/bash-scripting-hub
EOF

./site-404-checker.sh urls.txt

Example

Example terminal output

The script prints each sitemap page as it checks it. If a page contains a bad internal link, it prints that as a broken backlink underneath the source page.

Example output
Checking 77 page URL(s) for yourdomain.com

[200] https://yourdomain.com/
  Broken backlink [404]: https://yourdomain.com/old-firewalld-guide
  Broken backlink [404]: https://yourdomain.com/old-systemd-guide
[200] https://yourdomain.com/linux-troubleshooting-guide
[404] https://yourdomain.com/old-missing-page
[200] https://yourdomain.com/apache-log-analysis-cheat-sheet

404 check complete.
Domain checked: yourdomain.com
Pages checked: 77
Broken pages: 1
404 URLs found: 3
Broken backlinks: 2
Connection errors: 0
CSV report: site-404-report.csv
Summary report: site-404-summary.txt

In this example, the homepage is live, but it links to two old slugs that return 404. The script also found one sitemap URL that is itself returning 404.

Report

CSV report output

The full details are saved to site-404-report.csv. This makes it easier to see the source page, the broken target and the HTTP status.

site-404-report.csv
"Type","Source URL","Checked URL","HTTP Status","Issue"
"Broken backlink","https://yourdomain.com/","https://yourdomain.com/old-firewalld-guide","404","Internal link points to broken URL"
"Broken backlink","https://yourdomain.com/","https://yourdomain.com/old-systemd-guide","404","Internal link points to broken URL"
"Page","","https://yourdomain.com/old-missing-page","404","Broken page"

You can read the report directly in the terminal.

cat site-404-report.csv

Or format it into columns for a quick look.

column -s, -t site-404-report.csv
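Because every field in the report is quoted, you can also summarise it with a one-line filter. The sketch below counts broken backlinks per source page. It assumes awk is installed (the checker itself does not need it) and that no URL contains the sequence "," which would confuse the field split. The broken_per_source function name and the sample rows are illustrative.

```shell
#!/usr/bin/env bash
# Sample rows in the same layout as site-404-report.csv.
sample_report='"Type","Source URL","Checked URL","HTTP Status","Issue"
"Broken backlink","https://yourdomain.com/","https://yourdomain.com/old-firewalld-guide","404","Internal link points to broken URL"
"Broken backlink","https://yourdomain.com/","https://yourdomain.com/old-systemd-guide","404","Internal link points to broken URL"
"Page","","https://yourdomain.com/old-missing-page","404","Broken page"'

# Read a report on stdin and print "<count> <source URL>", worst first.
# Splitting on the "," between fields leaves a leading quote on field 1.
broken_per_source() {
    awk -F'","' 'NR > 1 && $1 == "\"Broken backlink" { count[$2]++ }
        END { for (src in count) print count[src], src }' | sort -rn
}

echo "$sample_report" | broken_per_source
```

On a real report you would run broken_per_source < site-404-report.csv. Pages at the top of the list are the ones to fix first.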

Breakdown

How the script works

1. Normalises the domain

If you run ./site-404-checker.sh example.com, the script turns that into https://example.com. This avoids the classic mistake where a bare domain is treated as a local filename.
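To see the normalisation on its own, you can copy the two helper functions out of the script and feed them a few inputs. This is only a small demonstration; the loop and the sample inputs are not part of the checker.

```shell
#!/usr/bin/env bash
# The script's two URL helpers, copied here so they can be tried in isolation.

normalise_base() {
    local value="$1"

    # Assume HTTPS when no scheme is given.
    if [[ "$value" != http://* && "$value" != https://* ]]; then
        value="https://$value"
    fi

    value="${value%/}"
    # Keep only scheme and host; drop any path.
    echo "$value" | sed -E 's#^(https?://[^/]+).*$#\1#'
}

get_domain() {
    echo "$1" | sed -E 's#^https?://([^/]+).*$#\1#'
}

for input in example.com https://example.com/ http://example.com/blog/post; do
    base="$(normalise_base "$input")"
    printf '%-30s -> %s (domain: %s)\n' "$input" "$base" "$(get_domain "$base")"
done
```

All three inputs end up as a scheme plus host with no trailing slash or path, which is what the rest of the script expects in BASE.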

2. Reads the sitemap

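The fetch_sitemap_urls function reads /sitemap.xml and extracts every <loc> entry. It also supports sitemap indexes: any entry that ends in .xml or contains the word sitemap is fetched as a nested sitemap, up to three levels deep. If no sitemap URLs are found, the script falls back to checking the homepage only.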
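The extraction step can be tried offline against a small sample. The sketch below uses the same grep and sed approach, with one deliberate change: the case-insensitive match is spelled out as a bracket expression so it does not rely on GNU sed's I flag. The extract_locs name and the sample sitemap are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sample sitemap content; the URLs are placeholders.
sample_sitemap='<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>'

# Read sitemap XML on stdin and print one URL per line.
extract_locs() {
    grep -oiE '<loc>[^<]+</loc>' |
        sed -E 's#</?[Ll][Oo][Cc]>##g'
}

echo "$sample_sitemap" | extract_locs
```

This is pattern matching rather than real XML parsing, so entries wrapped in CDATA or split across lines would be missed.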

3. Checks HTTP status codes

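The get_status function uses curl with a 10-second connection timeout and a 20-second overall limit, and returns only the HTTP status code. Because it follows redirects, the code is that of the final destination. A status of 000 means curl received no HTTP response at all, usually because of a connection or timeout problem.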
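How a status code is then classified can be checked without touching the network. The snippet below copies is_bad_status from the script and runs it over a handful of sample codes; the loop is illustrative only.

```shell
#!/usr/bin/env bash
# is_bad_status as defined in the script: 000 and anything >= 400 is broken.
is_bad_status() {
    local status="$1"

    if [ "$status" = "000" ]; then
        return 0
    fi

    if [[ "$status" =~ ^[0-9]+$ ]] && [ "$status" -ge 400 ]; then
        return 0
    fi

    return 1
}

for code in 200 301 404 500 000; do
    if is_bad_status "$code"; then
        echo "$code broken"
    else
        echo "$code ok"
    fi
done
```

Only 200 and 301 come out as ok here. In practice a 3xx code is rarely seen, because get_status follows redirects before reporting.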

4. Extracts internal links

For every working page, the script extracts internal links from anchor tags and strips any #fragment. It ignores mailto:, tel:, javascript: and data: links, external domains, /cdn-cgi/ URLs and common asset files.
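The link extraction can also be tested against a snippet of HTML. This is a simplified sketch of the idea, not the script's exact code: it uses a case statement for the scheme filter and leaves out the domain and asset checks. The extract_hrefs name and the sample markup are assumptions.

```shell
#!/usr/bin/env bash
# Sample HTML for illustration; the links are placeholders.
sample_html='<p><a href="/about">About</a>
<a class="nav" href="https://example.com/blog#top">Blog</a>
<a href="mailto:hello@example.com">Email</a>
<a href="tel:+15550100">Call</a></p>'

# Read HTML on stdin and print the href of each anchor, minus fragments
# and non-page schemes.
extract_hrefs() {
    tr '\n' ' ' |
        grep -oiE "<a[[:space:]][^>]*href=[\"'][^\"' >]+" |
        sed -E "s/^.*[Hh][Rr][Ee][Ff]=[\"']//" |
        while read -r link; do
            link="${link%%#*}"
            case "$link" in
                ""|mailto:*|tel:*|javascript:*|data:*) continue ;;
            esac
            echo "$link"
        done
}

echo "$sample_html" | extract_hrefs
```

Of the four anchors, only /about and https://example.com/blog survive. Unquoted href values are not matched, which is one of the limits of doing this with grep.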

5. Caches checked URLs

If the same broken link appears on multiple pages, the script does not need to request it again every time. The STATUS_CACHE array stores results during the run.
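The caching pattern is easy to get subtly wrong in Bash, so here is a minimal offline sketch of it. One detail worth knowing: assignments made inside a command substitution are lost when it returns, so the sketch calls the lookup directly and hands the result back through a variable. The names fake_get_status, lookup_status, LAST_STATUS and FETCH_COUNT are hypothetical, and the stub stands in for a real curl request.

```shell
#!/usr/bin/env bash
declare -A STATUS_CACHE
FETCH_COUNT=0
LAST_STATUS=""

# Hypothetical stand-in for a real HTTP request, so the sketch runs offline.
fake_get_status() {
    echo "404"
}

# Look up a URL, fetching only when it is not cached yet.
# Must be called directly, not inside $(...), so the cache write persists.
lookup_status() {
    local url="$1"

    if [[ -z "${STATUS_CACHE[$url]+x}" ]]; then
        STATUS_CACHE[$url]="$(fake_get_status "$url")"
        FETCH_COUNT=$((FETCH_COUNT + 1))
    fi

    LAST_STATUS="${STATUS_CACHE[$url]}"
}

lookup_status "https://example.com/old-page"
lookup_status "https://example.com/old-page"
echo "status=$LAST_STATUS fetches=$FETCH_COUNT"
```

The second lookup is served from the array, so the fetch counter stays at 1.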

6. Counts unique 404 URLs

The UNIQUE_404 array prevents the same missing URL being counted repeatedly. You still see every broken backlink in the CSV, but the summary tells you how many unique 404 URLs exist.
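You can cross-check the unique count from the CSV after a run. The sketch below assumes awk is available and that no URL contains the sequence "," inside it. The unique_404_count name and the sample rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Sample rows: the same missing URL is linked from two different pages.
sample_rows='"Type","Source URL","Checked URL","HTTP Status","Issue"
"Broken backlink","https://yourdomain.com/","https://yourdomain.com/old-systemd-guide","404","Internal link points to broken URL"
"Broken backlink","https://yourdomain.com/blog","https://yourdomain.com/old-systemd-guide","404","Internal link points to broken URL"
"Page","","https://yourdomain.com/old-missing-page","404","Broken page"
"Page","","https://yourdomain.com/slow-page","000","Broken page"'

# Read a report on stdin and count distinct checked URLs with status 404.
unique_404_count() {
    awk -F'","' 'NR > 1 && $4 == "404" { print $3 }' |
        sort -u |
        wc -l |
        tr -d ' '
}

echo "Unique 404 URLs: $(echo "$sample_rows" | unique_404_count)"
```

The sample has three rows with status 404 but only two distinct targets, so the count is 2.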

Fixing results

How to fix the broken backlinks it finds

Once you have the report, there are usually three types of fix.

Restore the missing page

If the content should exist, recreate the article, quiz, cheat sheet or command builder.

Update the internal link

If the page moved, update the source page so it links directly to the new slug.

Add a 301 redirect

If the old URL has value, redirect it to the closest relevant live page.

For example, if your report finds an old slug called /systemd-guide, but the real guide is now /systemd-guide-services-timers-logs, you can add a redirect in .htaccess. This assumes Apache with mod_rewrite enabled and RewriteEngine On set earlier in the file.

RewriteRule ^systemd-guide/?$ /systemd-guide-services-timers-logs [R=301,L]

Then update the internal link in your HTML or PHP files so future crawls do not keep finding the old URL.

grep -RIn "systemd-guide" .

A redirect is a safety net. Updating the internal link is the real fix. Otherwise your site is still telling visitors to walk through the redirect revolving door.

Automation

Run the checker regularly with cron

For a small site, running the checker once a week is usually enough. Save the script somewhere sensible, then add a cron job.

crontab -e

Example weekly check every Monday at 07:30:

30 7 * * 1 cd /home/youruser/seo-tools && ./site-404-checker.sh yourdomain.com >> 404-check.log 2>&1

If you prefer systemd timers, see the systemd services and timers guide.

Tips

Useful improvements you can add later

  • Add a --max-pages argument instead of editing MAX_PAGES inside the script.
  • Add an allowlist for URLs you intentionally ignore.
  • Send the summary by email after a cron run.
  • Save timestamped reports such as site-404-report-2026-05-08.csv.
  • Track the number of broken backlinks over time to make sure site quality is improving.
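
As a starting point for the first idea, option parsing could look something like the sketch below. It is not part of the script above: the --max-pages flag and the parse_args function are hypothetical, and the variable names simply mirror the ones the script already uses.

```shell
#!/usr/bin/env bash
# Hypothetical option parsing; --max-pages does not exist in the script yet.
MAX_PAGES=1000
CRAWL_INTERNAL_LINKS=1
INCLUDE_ASSETS=0
INPUT=""

parse_args() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --pages-only)
                CRAWL_INTERNAL_LINKS=0
                ;;
            --include-assets)
                INCLUDE_ASSETS=1
                ;;
            --max-pages)
                # The value must be a whole number; consume it with an extra shift.
                if [[ "${2:-}" =~ ^[0-9]+$ ]]; then
                    MAX_PAGES="$2"
                    shift
                else
                    echo "--max-pages needs a number" >&2
                    return 1
                fi
                ;;
            --*)
                echo "Unknown option: $1" >&2
                return 1
                ;;
            *)
                INPUT="$1"
                ;;
        esac
        shift
    done
}

parse_args yourdomain.com --max-pages 250 --pages-only
echo "input=$INPUT max=$MAX_PAGES crawl=$CRAWL_INTERNAL_LINKS"
```

Parsing in a loop also lets the flags be combined in any order, which the current single MODE argument does not allow.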

FAQ

FAQ: Bash 404 checker

Is this the same as a full SEO crawler?

No. It is a lightweight Bash script for checking sitemap URLs and internal links. A full crawler will provide more data, but this is fast, transparent and easy to run from a Linux shell.

What is a broken backlink in this script?

In this script, a broken backlink means an internal link found on one of your pages points to a URL that returns a bad status, usually 404.

Why does the script ignore /cdn-cgi/?

Cloudflare can add email protection URLs under /cdn-cgi/. They are not normal content pages, so the checker skips them to avoid false positives.

Can I use this on any website?

Yes, as long as the site has a sitemap or you provide a URL list. Be polite with crawl frequency and avoid hammering sites you do not own.

Next steps

Keep improving your Bash SEO toolkit

This 404 checker pairs nicely with a title and meta description audit, a Bash uptime monitor and log searching scripts. Together, they give you a practical command-line toolkit for keeping a technical site healthy.
