
Best Ways to Find All URLs on Any Website

Valentina Skakun
Last update: 19 May 2025

Find all URLs on a domain by using a site crawler, parsing the sitemap file, exploring robots.txt, applying search engine queries with operators, or writing a custom scraping script. Each method provides different levels of control and depth, depending on your technical skills and data access needs. 

How to Find All URLs on a Domain

There are five main ways to get all the links from a site:

  1. Website Crawlers. Use a ready-made crawler that scans the whole site and lists all the links it finds.
  2. Sitemaps & robots.txt. If the site has a sitemap.xml, you can pull links directly from there.
  3. SEO Tools. Many SEO tools come with built-in features to collect site links.
  4. Search Engine Queries. If you only need links that match a specific pattern, you can scrape them from search engine results.
  5. Write Your Own Script. This option is ideal for developers or technical users who need custom scraping logic, precise control over link extraction, or integration with specific tools.

We’ll go through each method step by step. 

Method 1: Website Crawler

The most reliable way to collect all URLs from a website is to use a crawler. It doesn’t depend on a sitemap, and because a specialized tool does the crawling for you, it copes with most site protections.

For this example, we’ll use HasData’s web crawler, which is available after you sign up. You can find it in your dashboard under no-code scrapers.

To run it, fill in the main fields:

  1. Limit. Maximum number of pages to crawl (0 = no limit).
  2. URLs. Starting URLs.
  3. maxDepth. How many link levels to follow from each starting URL.

Optionally, you can set RegEx patterns to include or skip certain paths (for example, a pattern like /blog/ to keep only blog pages, or \.pdf$ to skip PDF files). You can also choose the output format.

Once launched, the task will start crawling the site (or multiple sites), and you just need to wait for it to finish. You can track progress on the right side of the screen. After the crawl finishes, you can download the output file.

If you prefer using a script or want to process and extract data from each page using AI, check out Method 5, where we cover that.

Method 2: Sitemaps & robots.txt

If the website has a sitemap that lists all its URLs, you can parse it. But keep in mind, not every site has a sitemap.

Using robots.txt

Unlike the sitemap, robots.txt always lives in the site’s root folder under exactly that name: 

https://demo.nopcommerce.com/robots.txt

Robots.txt holds instructions for bots visiting a website, and it often states where the sitemap is and what it’s called.
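For reference, a robots.txt that advertises its sitemap usually contains a Sitemap directive. A minimal illustrative example (not the site’s actual file):

User-agent: *
Disallow: /admin/
Sitemap: https://demo.nopcommerce.com/sitemap.xml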

However, many developers don’t add the sitemap path there, which makes it much harder to find.

Locating Sitemaps

If you don’t see a sitemap link in robots.txt, you can try to find it manually. In most cases, the sitemap is located at the domain’s root. But webmasters often split it by topic or compress it to save bandwidth. Based on the name and type, sitemaps usually fall into these categories:

  • Root sitemap. By convention, it’s usually here:
    https://domain.com/sitemap.xml
  • Index sitemap. On larger sites, you’ll often see sitemap_index.xml or sitemap-index.xml, which point to multiple smaller sitemaps.
  • Specialized sitemaps. Big e-commerce or news sites may have:
    sitemap-products.xml, sitemap-news.xml, sitemap-images.xml, etc.
  • Compressed sitemaps. It’s common to use .gz for compression:
    sitemap.xml.gz
  • Custom names. Technically, any name or extension is allowed, but then the path must be listed in robots.txt or submitted through the search engines’ webmaster tools (such as Google Search Console).

In general, start by checking the default sitemap URL.
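If you’d rather not check these paths by hand, here’s a small Python sketch that probes the common locations and prints the ones that respond (the candidate list and the requests-based check are just reasonable defaults, not an official method):

import requests

base = "https://demo.nopcommerce.com"
candidates = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap-index.xml",
    "/sitemap.xml.gz",
]

for path in candidates:
    url = base + path
    try:
        # HEAD keeps the check lightweight; some servers only answer GET
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code == 200:
            print(f"Found: {url}")
    except requests.RequestException:
        pass  # Unreachable or blocked, move on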

Parsing Sitemap XML

Let’s say you’ve found the sitemap. Now you need to parse it to extract the URLs. A sitemap is an XML file in a standard format; a trimmed example looks like this:
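<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://demo.nopcommerce.com/</loc>
  </url>
  <url>
    <loc>https://demo.nopcommerce.com/books</loc>
  </url>
  <!-- ...more <url> entries... -->
</urlset>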

You need to extract everything inside the <loc>...</loc> tags. You can use regular expressions or convert the XML to CSV, whatever works best for you. I find it easiest to write a small Python script that fetches the sitemap, extracts the links, and saves them to a TXT file.

We’ll need the requests library and the built-in XML parser:

import requests
import xml.etree.ElementTree as ET

Next, set the sitemap URL and the output file:

sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"
output_file = "sitemap_links.txt"

Send a request to get the sitemap:

response = requests.get(sitemap_url)
response.raise_for_status()

Parse the XML and extract all the <loc> links (the sitemap schema uses an XML namespace, so we register it for findall to match):

root = ET.fromstring(response.content)
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]

Save the links to a file:

with open(output_file, 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link + '\n')

The output is a plain text file with one URL per line. The exact contents depend on the sitemap, but it will look roughly like this:
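https://demo.nopcommerce.com/
https://demo.nopcommerce.com/books
https://demo.nopcommerce.com/jewelry
...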

Full script:

import requests
import xml.etree.ElementTree as ET


# Sitemap to fetch and the file the links will be saved to
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"
output_file = "sitemap_links.txt"


# Download the sitemap and fail fast on HTTP errors
response = requests.get(sitemap_url)
response.raise_for_status()


# Parse the XML and collect every <loc> value (the sitemap namespace is needed for findall)
root = ET.fromstring(response.content)
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]


# Write the links, one URL per line
with open(output_file, 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link + '\n')
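One caveat: the script above assumes a regular sitemap. If the URL you found is an index sitemap, its <loc> entries point to child sitemaps rather than pages. Here’s a minimal sketch of how you might handle both cases with the same parsing logic (an illustrative extension, not part of the original script):

import requests
import xml.etree.ElementTree as ET

sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"


def extract_locs(xml_content):
    # Collect <loc> values and report whether this is an index sitemap (<sitemapindex>)
    root = ET.fromstring(xml_content)
    namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    locs = [loc.text for loc in root.findall('.//ns:loc', namespace)]
    return locs, root.tag.endswith('sitemapindex')


locs, is_index = extract_locs(requests.get(sitemap_url).content)

if is_index:
    # The entries are child sitemaps: fetch each one and collect its page URLs
    links = []
    for child_sitemap in locs:
        child_locs, _ = extract_locs(requests.get(child_sitemap).content)
        links.extend(child_locs)
else:
    links = locs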

If you run into issues fetching the sitemap, the server might be blocking requests. In that case, you can use a web scraping API to get around it.

We’ll use HasData’s web scraping API as an example. You’ll need an API key, which you can get after signing up.

Since the API response comes in JSON format, let’s import one more library:

import requests
import json
import xml.etree.ElementTree as ET

Now set your HasData API key and the sitemap URL:

api_key = "YOUR-API-KEY"
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"

Prepare the API request, including the proxy type:

url = "https://api.hasdata.com/scrape/web"


payload = json.dumps({
  "url": sitemap_url,
  "proxyType": "datacenter",
  "proxyCountry": "US"
})
headers = {
  'Content-Type': 'application/json',
  'x-api-key': api_key
}

Send the request:

response = requests.post(url, headers=headers, data=payload)

Then handle the response the same way as before:

root = ET.fromstring(response.json().get("content"))
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]


with open("output.json", 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link + '\n')

HasData’s API returns the full page content in the “content” field, so you can extract the links from it exactly as before. 
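To make that concrete, the part of the response this code relies on looks roughly like this (an illustrative shape, with other response fields omitted):

{
  "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?><urlset ...>...</urlset>"
}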

Full script:

import requests
import json
import xml.etree.ElementTree as ET


# HasData API key and the sitemap to fetch through the API
api_key = "YOUR-API-KEY"
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"


url = "https://api.hasdata.com/scrape/web"


# Ask the API to fetch the sitemap through a datacenter proxy
payload = json.dumps({
  "url": sitemap_url,
  "proxyType": "datacenter",
  "proxyCountry": "US"
})
headers = {
  'Content-Type': 'application/json',
  'x-api-key': api_key
}


response = requests.post(url, headers=headers, data=payload)


# The raw page content (here, the sitemap XML) is returned in the "content" field
root = ET.fromstring(response.json().get("content"))
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]


# Write the links, one URL per line
with open("sitemap_links.txt", 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link + '\n')

The result is the same, but this method works even on sites that block direct requests.

Method 3: SEO Tools

Many SEO tools can collect links from a website. But they come with limitations. For example, the free version of Screaming Frog only allows up to 500 URLs.

To collect links with Screaming Frog, download it from the official website, install it, and launch it. Enter your domain and start the crawl.

Wait for the crawl to finish. If needed, export the data to a file.

Remember that if the site has anti-scraping protection, the tool might not reach all pages. In that case, try adjusting the crawl speed, user agents, and other headers. These settings are accessible in the configuration panel. 

The good thing about this method is that it gives you more than just URLs; you also get the status codes of the pages. So, it’s useful if you want to check for things like broken links on your site. 

Method 4: Search Engine Queries

If the previous examples didn’t work for you, or if you only want to collect specific pages that match certain criteria, you can try using search engine results instead.

The idea is simple: use search operators to get the right links from Google SERP, then scrape the search results. Here are the main operators you can use:

  • site: restricts results to a single domain. Example: site:demo.nopcommerce.com
  • inurl: the word must appear in the URL. Example: site:demo.nopcommerce.com inurl:electronics
  • intitle: the word must appear in the page title. Example: site:demo.nopcommerce.com intitle:“shopping cart”
  • intext: the word must appear in the page body. Example: site:demo.nopcommerce.com intext:“digital camera”
  • filetype: filters by file type (e.g. XML, PDF). Example: site:demo.nopcommerce.com filetype:xml
  • OR: either condition can match. Example: site:demo.nopcommerce.com inurl:books OR inurl:jewelry
  • “ ” (quotes): exact phrase match. Example: site:demo.nopcommerce.com intext:“free shipping”
  • ( ): groups multiple search terms. Example: site:demo.nopcommerce.com (inurl:gift OR inurl:accessories)
  • - (minus): excludes a term. Example: site:demo.nopcommerce.com -inurl:login
  • * (asterisk): wildcard for missing word(s). Example: site:demo.nopcommerce.com intitle:“best * gift”

You can mix and match these operators to filter for the exact pages you need.

Now let’s write a script to extract those links from the SERP. We’ll use HasData’s SERP API for that. You’ll need an API key, which you can find in your dashboard after signing up on our site.

First, import the libraries:

import requests
import json

Then set the API key and query parameters:

api_key = "YOUR-API-KEY"


query = "site:hasdata.com inurl:blog"
location = "Austin,Texas,United States"
device_type = "desktop"
num_results = 100

Set the headers and build the request:

base_url = "https://api.hasdata.com/scrape/google/serp"


params = {
    "q": query,
    "location": location,
    "deviceType": device_type,
    "num": num_results
}


headers = {
    "Content-Type": "application/json",
    "x-api-key": api_key
}

Make the request:

response = requests.get(base_url, headers=headers, params=params)
response.raise_for_status()

Extract the links from the organic results:

data = response.json().get("organicResults", [])
urls = [entry["link"] for entry in data if "link" in entry]

Save the results to a file:

output_file = "serp.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(urls, f, indent=2)

This will give you a JSON file with links to indexed pages that match your search criteria.
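For the query above, the output is simply a JSON array of URLs, something like this (placeholder entries; the real list depends on what Google has indexed):

[
  "https://hasdata.com/blog/example-post-1",
  "https://hasdata.com/blog/example-post-2"
]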

Full script:

import requests
import json


# HasData API key
api_key = "YOUR-API-KEY"


# Search query with operators, plus SERP parameters
query = "site:hasdata.com inurl:blog"
location = "Austin,Texas,United States"
device_type = "desktop"
num_results = 100


base_url = "https://api.hasdata.com/scrape/google/serp"


params = {
    "q": query,
    "location": location,
    "deviceType": device_type,
    "num": num_results
}


headers = {
    "Content-Type": "application/json",
    "x-api-key": api_key
}


# Request the SERP and fail fast on HTTP errors
response = requests.get(base_url, headers=headers, params=params)
response.raise_for_status()


# Keep only the links from the organic results
data = response.json().get("organicResults", [])
urls = [entry["link"] for entry in data if "link" in entry]


# Save the list of URLs as JSON
output_file = "serp.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(urls, f, indent=2)

This method won’t give you every page on the site, only the ones that are indexed and match your query, but that’s the point. It’s a filtered, targeted approach.

Method 5: Scrape All URLs with Custom Scripting

As mentioned earlier, in this section we’ll write a script that grabs all the links from a site and processes each page with an LLM during the crawl.

Here’s a ready-to-use script that does exactly that:

import time
import requests
import json


API_KEY = 'YOUR-API-KEY'


headers = {
    'x-api-key': API_KEY,
    'Content-Type': 'application/json'
}


def start_crawl():
    payload = {
        "limit": 50,
        "urls": ["https://demo.nopcommerce.com"],
        "aiExtractRules": {
            "products": {
                "type": "list",
                "output": {
                    "title": {
                        "description": "title of product",
                        "type": "string"
                    },
                    "price": {
                        "description": "price of the product",
                        "type": "string"
                    } 
                }
            }
        },
        "outputFormat": ["json"]
    }


    response = requests.post(
        f"https://api.hasdata.com/scrapers/crawler/jobs",
        json=payload,
        headers=headers
    )
    response.raise_for_status()
    job_id = response.json().get("id")
    print(f"Started job with ID: {job_id}")
    return job_id


def poll_job(job_id):
    while True:
        response = requests.get(
            f"https://api.hasdata.com/scrapers/jobs/{job_id}",
            headers=headers
        )
        data = response.json()
        status = data.get("status")
        print(f"Job status: {status}")


        if status in ["finished", "failed", "cancelled"]:
            return data
        time.sleep(10)


def download_and_extract_urls(json_url, job_id):
    output_path = f"results_{job_id}.json"
    response = requests.get(json_url)
    response.raise_for_status()
    raw_data = response.json()
    urls = [entry["url"] for entry in raw_data if "url" in entry]
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(urls, f, indent=2)


    print(f"Saved to {output_path}")


def download_and_extract_ai_data(json_url, job_id):
    output_path = f"parsed_ai_results_{job_id}.json"
    response = requests.get(json_url)
    response.raise_for_status()
    raw_data = response.json()


    parsed = []
    for entry in raw_data:
        result = {"url": entry.get("url")}
        ai_raw = entry.get("aiResponse")
        if ai_raw:
            try:
                ai_data = json.loads(ai_raw)
                result.update(ai_data)
            except json.JSONDecodeError:
                result["ai_parse_error"] = True
        parsed.append(result)


    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(parsed, f, indent=2, ensure_ascii=False)


    print(f"Saved parsed AI data to {output_path}")



job_id = start_crawl()
result = poll_job(job_id)


json_link = result["data"]["json"]
download_and_extract_urls(json_link, job_id)
download_and_extract_ai_data(json_link, job_id)

Add your HasData API key and define your LLM extraction rules, then run the script.

The script has four functions:

  1. start_crawl(). Sends a job to HasData to crawl all links on a site with your parameters.
  2. poll_job(). Keeps checking the job status until it’s finished.
  3. download_and_extract_urls(). Parses the JSON response and saves only the found URLs.
  4. download_and_extract_ai_data(). Parses the JSON and saves only the LLM output.

We split the extraction into two functions in case you only need URLs or only the AI results.

Let’s break down what each function does. First, we import the libraries and define constants like the API key and headers:

import time
import requests
import json


API_KEY = 'YOUR-API-KEY'


headers = {
    'x-api-key': API_KEY,
    'Content-Type': 'application/json'
}

Then we add placeholders for the main functions:

def start_crawl():
    # Code will be here
    # Function triggering a product crawl job via API
    pass

def poll_job():
    # Code will be here
    # Function polling crawl job status until completion
    pass

def download_and_extract_urls():
    # Code will be here
    # Function downloading crawl results and extracting URLs
    pass

def download_and_extract_ai_data():
    # Code will be here
    # Function downloading and parsing AI-extracted crawl data
    pass

We’ll call them later, after we define what arguments they’ll need.

Let’s start with start_crawl(). This one sends the request to crawl the site and returns the job ID:

def start_crawl():
    # Code will be here
    return job_id


job_id = start_crawl()

The function has three parts. First, it builds the payload (including the starting URL and the LLM extraction rules):

    payload = {
        "limit": 50,
        "urls": ["https://demo.nopcommerce.com"],
        "aiExtractRules": {
            "products": {
                "type": "list",
                "output": {
                    "title": {
                        "description": "title of product",
                        "type": "string"
                    },
                    "price": {
                        "description": "price of the product",
                        "type": "string"
                    }
                }
            }
        },
        "outputFormat": ["json"]
    }

Sending the request:

    response = requests.post(
        f"https://api.hasdata.com/scrapers/crawler/jobs",
        json=payload,
        headers=headers
    )
    response.raise_for_status()

And getting the job ID:

    job_id = response.json().get("id")

Once we have the job ID, we can track the job status. The function below runs in a loop until the job is “finished”, “failed”, or “cancelled”:

def poll_job(job_id):
    while True:
        # Code will be here
        if status in ["finished", "failed", "cancelled"]:
            return data
        time.sleep(10)


result = poll_job(job_id)

Here’s the request for checking the status and printing it to the console:

        response = requests.get(
            f"https://api.hasdata.com/scrapers/jobs/{job_id}",
            headers=headers
        )
        data = response.json()
        status = data.get("status")
        print(f"Job status: {status}")

After that, you’ll get a download link for the result file (in our case, JSON, as requested in outputFormat):

json_link = result["data"]["json"]
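Based on the fields the script reads, a finished job response has roughly this shape (an illustrative sketch; the real response includes more metadata):

{
  "status": "finished",
  "data": {
    "json": "https://.../results.json"
  }
}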

We could stop here, but let’s go further and save the links and the AI results to files. 

First, save URLs to a file:

def download_and_extract_urls(json_url, job_id):
    # Code will be here


download_and_extract_urls(json_link, job_id)

We pass in the JSON URL and job ID to avoid file name conflicts:

    output_path = f"results_{job_id}.json"
    response = requests.get(json_url)
    response.raise_for_status()
    raw_data = response.json()

Then we extract all the URLs:

    urls = [entry["url"] for entry in raw_data if "url" in entry]

And save them:

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(urls, f, indent=2)

Then save the LLM results. This part is similar, but here we also need to parse the AI output:

def download_and_extract_ai_data(json_url, job_id):
    output_path = f"parsed_ai_results_{job_id}.json"
    response = requests.get(json_url)
    response.raise_for_status()
    raw_data = response.json()


    # Code will be here
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(parsed, f, indent=2, ensure_ascii=False)


download_and_extract_ai_data(json_link, job_id)


Go through the raw AI results and convert them into a structured format:

    parsed = []
    for entry in raw_data:
        result = {"url": entry.get("url")}
        ai_raw = entry.get("aiResponse")
        if ai_raw:
            try:
                ai_data = json.loads(ai_raw)
                result.update(ai_data)
            except json.JSONDecodeError:
                result["ai_parse_error"] = True
        parsed.append(result)

The final file contains one entry per crawled page, combining the page URL with the AI-extracted fields. With the extraction rules above, each entry looks roughly like this:
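[
  {
    "url": "https://demo.nopcommerce.com/...",
    "products": [
      {
        "title": "...",
        "price": "..."
      }
    ]
  }
]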


This method saves you from having to crawl the links a second time just to extract data. You describe what you want, and the LLM handles the rest; no need to write selectors.

Conclusion

We covered five straightforward methods for finding all URLs on a website: ready-made crawlers, sitemaps and robots.txt, SEO tools like Screaming Frog, search operators (site:, inurl:, intitle:, intext:), and custom scripts. Together they give you a complete toolkit for URL discovery.

One of the simplest ways to get started is with our Website Crawler, which lets you quickly list all URLs on any site without technical hassle. For those who want more control, our Web Scraping API can be a game-changer.

Whether you’re a beginner or a developer, these methods give you the flexibility to find and collect URLs however you prefer. We’ve included working scripts for each method; give them a try and see which one works best for your next project.
