Best Ways to Find All URLs on Any Website

Find all URLs on a domain by using a site crawler, parsing the sitemap file, exploring robots.txt, applying search engine queries with operators, or writing a custom scraping script. Each method provides different levels of control and depth, depending on your technical skills and data access needs.
How to Find All URLs on a Domain
There are five main ways to get all the links from a site:
- Website Crawlers. Use a ready-made crawler that scans the whole site and lists all the links it finds.
- Sitemaps & robots.txt. If the site has a sitemap.xml, you can pull links directly from there.
- SEO Tools. Many SEO tools come with built-in features to collect site links.
- Search Engine Queries. If you only need links that match a specific pattern, you can scrape them from search engine results.
- Write Your Own Script. This case is ideal for developers or technical users who require custom scraping logic, precise control over link extraction, or integration with specific tools.
We’ll go through each method step by step.
Method 1: Website Crawler
The most reliable way to collect all URLs from a website is to use a crawler. It doesn’t rely on a sitemap, and because a specialized tool handles the crawling, it gets past most site protections.
For this example, we’ll use HasData’s web crawler, which is available after you sign up. You can find it in your dashboard under no-code scrapers.
To run it, fill in the main fields:
- Limit. Maximum number of pages to crawl (0 = no limit).
- URLs. Starting URLs.
- maxDepth. How many link levels to follow from each starting URL.
Optionally, you can set RegEx patterns to include or skip certain paths. You can also choose the output format.
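For example, a configuration along these lines (hypothetical values; the exact field names in the dashboard may differ slightly) should crawl the whole demo store up to three link levels deep while skipping anything outside the domain:

Limit: 0
URLs: https://demo.nopcommerce.com
maxDepth: 3
Include pattern (RegEx): ^https://demo\.nopcommerce\.com/.*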
Once launched, the task will start crawling the site (or multiple sites), and you just need to wait for it to finish. You can track progress on the right side of the screen. After the crawl finishes, you can download the output file.
If you prefer using a script or want to process and extract data from each page using AI, check out Method 5, where we cover that.
Method 2: Sitemaps & robots.txt
If the website has a sitemap that lists all its URLs, you can parse it. But keep in mind, not every site has a sitemap.
Using robots.txt
Unlike the sitemap, robots.txt is always in the root folder and named exactly that:
http://demo.nopcommerce.com/robots.txt
The robots.txt file holds instructions for bots visiting a website, and it often includes the sitemap’s location and file name.
However, many developers don’t add the sitemap path there, which makes it much harder to find.
Locating Sitemaps
If you don’t see a sitemap link in robots.txt, you can try to find it manually. In most cases, the sitemap is located at the domain’s root. But webmasters often split it by topic or compress it to save bandwidth. Based on the name and type, sitemaps usually fall into these categories:
- Root sitemap. By convention, it’s usually at https://domain.com/sitemap.xml.
- Index sitemap. On larger sites, you’ll often see sitemap_index.xml or sitemap-index.xml, which point to multiple smaller sitemaps.
- Specialized sitemaps. Big e-commerce or news sites may have sitemap-products.xml, sitemap-news.xml, sitemap-images.xml, etc.
- Compressed sitemaps. It’s common to use .gz for compression: sitemap.xml.gz.
- Custom names. Technically, any name or extension is allowed, but the path must be listed in robots.txt or submitted through Webmaster Tools.
In general, start by checking the default sitemap URL.
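If you’d rather automate that check, here’s a minimal Python sketch (using the same requests library as the scripts below) that looks for a Sitemap: directive in robots.txt and, if none is found, probes a few common default paths. It’s only a starting point: some sites answer every request with a 200 page, so treat the fallback results as candidates rather than guarantees.

import requests

def find_sitemaps(domain):
    sitemaps = []

    # 1. Look for "Sitemap:" lines in robots.txt
    robots = requests.get(f"{domain}/robots.txt")
    if robots.ok:
        for line in robots.text.splitlines():
            if line.lower().startswith("sitemap:"):
                sitemaps.append(line.split(":", 1)[1].strip())

    # 2. Fall back to common default paths
    if not sitemaps:
        for path in ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml", "/sitemap.xml.gz"]:
            if requests.get(domain + path).status_code == 200:
                sitemaps.append(domain + path)
                break

    return sitemaps

print(find_sitemaps("https://demo.nopcommerce.com"))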
Parsing Sitemap XML
Let’s say you’ve found the sitemap. Now you need to parse it to extract the URLs. A sitemap usually looks like this:
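Here’s a minimal example in the standard sitemap format (the URLs are placeholders; a real file lists many more <url> entries and may include optional tags such as <lastmod>):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://demo.nopcommerce.com/</loc>
  </url>
  <url>
    <loc>https://demo.nopcommerce.com/some-category</loc>
  </url>
</urlset>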
You need to extract everything inside the <loc>...</loc> tags. You can use regular expressions or convert the XML to CSV, whatever works best for you. I find it easiest to write a small Python script that scrapes the sitemap, extracts the links, and saves them to a TXT file.
We’ll need the requests library and the built-in XML parser:
import requests
import xml.etree.ElementTree as ET
Next, set the sitemap URL and the output file:
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"
output_file = "sitemap_links.txt"
Send a request to get the sitemap:
response = requests.get(sitemap_url)
response.raise_for_status()
Parse the XML and extract all the <loc> links:
root = ET.fromstring(response.content)
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]
Save the links to a file:
with open(output_file, 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link + '\n')
The output file will contain one URL per line.
Full script:
import requests
import xml.etree.ElementTree as ET
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"
output_file = "sitemap_links.txt"
response = requests.get(sitemap_url)
response.raise_for_status()
root = ET.fromstring(response.content)
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]
with open(output_file, 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link + '\n')
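One thing to watch for: if the URL you found is an index sitemap, each <loc> entry points to another sitemap rather than to a page. Here’s a small sketch of how the script above could be extended to recurse one level into child sitemaps (it assumes the same namespace and uncompressed XML; .gz children would need to be decompressed first):

def extract_locs(xml_content):
    root = ET.fromstring(xml_content)
    namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    is_index = root.tag.endswith("sitemapindex")
    return is_index, [loc.text for loc in root.findall('.//ns:loc', namespace)]

response = requests.get(sitemap_url)
response.raise_for_status()
is_index, links = extract_locs(response.content)

# If this is a sitemap index, fetch each child sitemap and collect its page URLs instead
if is_index:
    page_links = []
    for child_sitemap in links:
        child_response = requests.get(child_sitemap)
        child_response.raise_for_status()
        _, child_links = extract_locs(child_response.content)
        page_links.extend(child_links)
    links = page_links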
If you run into issues fetching the sitemap, the server might be blocking requests. In that case, you can use a web scraping API to get around it.
We’ll use HasData’s web scraping API as an example. You’ll need an API key, which you can get after signing up.
Since the API response comes in JSON format, let’s import one more library:
import requests
import json
import xml.etree.ElementTree as ET
Now set your HasData API key and the sitemap URL:
api_key = "YOUR-API-KEY"
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"
Prepare the API request, including the proxy type:
url = "https://api.hasdata.com/scrape/web"
payload = json.dumps({
    "url": sitemap_url,
    "proxyType": "datacenter",
    "proxyCountry": "US"
})
headers = {
    'Content-Type': 'application/json',
    'x-api-key': api_key
}
Send the request:
response = requests.post(url, headers=headers, data=payload)
Then handle the response the same way as before:
root = ET.fromstring(response.json().get("content"))
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]
with open("output.json", 'w', encoding='utf-8') as f:
for link in links:
f.write(link + '\n')
HasData’s API puts the full page content in the “content” field. You can extract links from it just like the previous output file.
Full script:
import requests
import json
import xml.etree.ElementTree as ET
api_key = "YOUR-API-KEY"
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"
url = "https://api.hasdata.com/scrape/web"
payload = json.dumps({
    "url": sitemap_url,
    "proxyType": "datacenter",
    "proxyCountry": "US"
})
headers = {
    'Content-Type': 'application/json',
    'x-api-key': api_key
}
response = requests.post(url, headers=headers, data=payload)
root = ET.fromstring(response.json().get("content"))
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]
with open("output.json", 'w', encoding='utf-8') as f:
for link in links:
f.write(link + '\n')
The result is the same, but this method works even on sites that block direct requests.
Method 3: SEO Tools
Many SEO tools can collect links from a website. But they come with limitations. For example, the free version of Screaming Frog only allows up to 500 URLs.
To collect links with Screaming Frog, download it from the official website, install it, and launch it. Enter your domain and start the crawl.
Wait for the crawl to finish. If needed, export the data to a file.
Remember that if the site has anti-scraping protection, the tool might not reach all pages. In that case, try adjusting the crawl speed, user agents, and other headers. These settings are accessible in the configuration panel.
The good thing about this method is that it gives you more than just URLs; you also get the status codes of the pages. So, it’s useful if you want to check for things like broken links on your site.
Method 4: Search Engine Queries
If the previous examples didn’t work for you, or if you only want to collect specific pages that match certain criteria, you can try using search engine results instead.
The idea is simple: use search operators to get the right links from Google SERP, then scrape the search results. Here are the main operators you can use:
| Operator | Description | Example Query |
|---|---|---|
| site: | Limits search to this domain | site:demo.nopcommerce.com |
| inurl: | Word must appear in the URL | site:demo.nopcommerce.com inurl:electronics |
| intitle: | Word must be in the page title | site:demo.nopcommerce.com intitle:"shopping cart" |
| intext: | Word must appear in the page body | site:demo.nopcommerce.com intext:"digital camera" |
| filetype: | Filter by file type (e.g. XML, PDF) | site:demo.nopcommerce.com filetype:xml |
| OR | Either condition can match | site:demo.nopcommerce.com inurl:books OR inurl:jewelry |
| "..." (quotes) | Exact phrase match | site:demo.nopcommerce.com intext:"free shipping" |
| () | Group multiple search terms | site:demo.nopcommerce.com (inurl:gift OR inurl:accessories) |
| - | Exclude a term | site:demo.nopcommerce.com -inurl:login |
| * | Wildcard for missing word(s) | site:demo.nopcommerce.com intitle:"best * gift" |
You can mix and match these operators to filter for the exact pages you need.
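For example, a combined query like this one (purely illustrative, using the demo store) should return gift- and accessory-related pages that mention free shipping while excluding the login page:

site:demo.nopcommerce.com (inurl:gift OR inurl:accessories) intext:"free shipping" -inurl:login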
Now let’s write a script to extract those links from the SERP. We’ll use HasData’s SERP API for that. You’ll need an API key, which you can find in your dashboard after signing up on our site.
First, import the libraries:
import requests
import json
Then set the API key and query parameters:
api_key = "YOUR-API-KEY"
query = "site:hasdata.com inurl:blog"
location = "Austin,Texas,United States"
device_type = "desktop"
num_results = 100
Set the headers and build the request:
base_url = "https://api.hasdata.com/scrape/google/serp"
params = {
    "q": query,
    "location": location,
    "deviceType": device_type,
    "num": num_results
}
headers = {
    "Content-Type": "application/json",
    "x-api-key": api_key
}
Make the request:
response = requests.get(base_url, headers=headers, params=params)
Extract the links from the organic results:
data = response.json().get("organicResults", [])
urls = [entry["link"] for entry in data if "link" in entry]
Save the results to a file:
output_file = "serp.json"
with open(output_file, "w", encoding="utf-8") as f:
json.dump(urls, f, indent=2)
This will give you a JSON file with links to indexed pages that match your search criteria.
Full script:
import requests
import json
api_key = "YOUR-API-KEY"
query = "site:hasdata.com inurl:blog"
location = "Austin,Texas,United States"
device_type = "desktop"
num_results = 100
base_url = "https://api.hasdata.com/scrape/google/serp"
params = {
    "q": query,
    "location": location,
    "deviceType": device_type,
    "num": num_results
}
headers = {
    "Content-Type": "application/json",
    "x-api-key": api_key
}
response = requests.get(base_url, headers=headers, params=params)
response.raise_for_status()
data = response.json().get("organicResults", [])
urls = [entry["link"] for entry in data if "link" in entry]
output_file = "serp.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(urls, f, indent=2)
This method won’t give you every page on the site, only the ones that are indexed and match your query, but that’s the point. It’s a filtered, targeted approach.
Method 5: Scrape all URLs with Custom Scripting
As mentioned earlier, in this section we’ll write a script that grabs all links from a site and processes each page with an LLM during the crawl. Here’s a ready-to-use script that does both:
import time
import requests
import json

API_KEY = 'YOUR-API-KEY'

headers = {
    'x-api-key': API_KEY,
    'Content-Type': 'application/json'
}

def start_crawl():
    payload = {
        "limit": 50,
        "urls": ["https://demo.nopcommerce.com"],
        "aiExtractRules": {
            "products": {
                "type": "list",
                "output": {
                    "title": {
                        "description": "title of product",
                        "type": "string"
                    },
                    "price": {
                        "description": "price of the product",
                        "type": "string"
                    }
                }
            }
        },
        "outputFormat": ["json"]
    }

    response = requests.post(
        "https://api.hasdata.com/scrapers/crawler/jobs",
        json=payload,
        headers=headers
    )
    response.raise_for_status()

    job_id = response.json().get("id")
    print(f"Started job with ID: {job_id}")
    return job_id

def poll_job(job_id):
    while True:
        response = requests.get(
            f"https://api.hasdata.com/scrapers/jobs/{job_id}",
            headers=headers
        )
        data = response.json()
        status = data.get("status")
        print(f"Job status: {status}")

        if status in ["finished", "failed", "cancelled"]:
            return data

        time.sleep(10)

def download_and_extract_urls(json_url, job_id):
    output_path = f"results_{job_id}.json"
    response = requests.get(json_url)
    response.raise_for_status()
    raw_data = response.json()

    urls = [entry["url"] for entry in raw_data if "url" in entry]

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(urls, f, indent=2)

    print(f"Saved to {output_path}")

def download_and_extract_ai_data(json_url, job_id):
    output_path = f"parsed_ai_results_{job_id}.json"
    response = requests.get(json_url)
    response.raise_for_status()
    raw_data = response.json()

    parsed = []
    for entry in raw_data:
        result = {"url": entry.get("url")}
        ai_raw = entry.get("aiResponse")
        if ai_raw:
            try:
                ai_data = json.loads(ai_raw)
                result.update(ai_data)
            except json.JSONDecodeError:
                result["ai_parse_error"] = True
        parsed.append(result)

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(parsed, f, indent=2, ensure_ascii=False)

    print(f"Saved parsed AI data to {output_path}")

job_id = start_crawl()
result = poll_job(job_id)
json_link = result["data"]["json"]

download_and_extract_urls(json_link, job_id)
download_and_extract_ai_data(json_link, job_id)
Add your HasData API key and define your LLM extraction rules, then run the script.
The script has four functions:
- start_crawl(). Sends a job to HasData to crawl all links on a site with your parameters.
- poll_job(). Keeps checking the job status until it’s finished.
- download_and_extract_urls(). Parses the JSON response and saves only the found URLs.
- download_and_extract_ai_data(). Parses the JSON and saves only the LLM output.
We split the extraction into two functions in case you only need URLs or only the AI results.
Let’s break down what each function does. First, we import the libraries and define constants like the API key and headers:
import time
import requests
import json
API_KEY = 'YOUR-API-KEY'
headers = {
    'x-api-key': API_KEY,
    'Content-Type': 'application/json'
}
Then we add placeholders for the main functions:
def start_crawl():
    # Code will be here
    # Function triggering a product crawl job via API
    pass

def poll_job():
    # Code will be here
    # Function polling crawl job status until completion
    pass

def download_and_extract_urls():
    # Code will be here
    # Function downloading crawl results and extracting URLs
    pass

def download_and_extract_ai_data():
    # Code will be here
    # Function downloading and parsing AI-extracted crawl data
    pass
We’ll call them later, after we define what arguments they’ll need.
Let’s start with start_crawl(). This one sends the request to crawl the site and returns the job ID:
def start_crawl():
    # Code will be here
    return job_id

job_id = start_crawl()
The function has three parts. Setting the payload (including the URL and LLM rules):
payload = {
    "limit": 20,
    "urls": ["https://demo.nopcommerce.com"],
    "aiExtractRules": {
        "products": {
            "type": "list",
            "output": {
                "title": {
                    "description": "title of product",
                    "type": "string"
                },
                "description": {
                    "description": "information about the product",
                    "type": "string"
                },
                "price": {
                    "description": "price of the product",
                    "type": "string"
                }
            }
        }
    },
    "outputFormat": ["text", "json"]
}
Sending the request:
response = requests.post(
    "https://api.hasdata.com/scrapers/crawler/jobs",
    json=payload,
    headers=headers
)
response.raise_for_status()
And getting the job ID:
job_id = response.json().get("id")
Once we have the job ID, we can track the job status. The function below runs in a loop until the job is “finished”, “failed”, or “cancelled”:
def poll_job(job_id):
    while True:
        # Code will be here
        if status in ["finished", "failed", "cancelled"]:
            return data
        time.sleep(10)

result = poll_job(job_id)
Here’s the request for checking the status and printing it to the console:
response = requests.get(
    f"https://api.hasdata.com/scrapers/jobs/{job_id}",
    headers=headers
)
data = response.json()
status = data.get("status")
print(f"Job status: {status}")
After that, you’ll get download links for the result files (in our case, JSON and text):
json_link = result["data"]["json"]
text_link = result["data"]["text"]
That’s enough to stop here, but let’s go further and save the links and the AI results to files.
First, save URLs to a file:
def download_and_extract_urls(json_url, job_id):
    # Code will be here

download_and_extract_urls(json_link, job_id)
We pass in the JSON URL and job ID to avoid file name conflicts:
output_path = f"results_{job_id}.json"
response = requests.get(json_url)
response.raise_for_status()
raw_data = response.json()
Then we extract all the URLs:
urls = [entry["url"] for entry in raw_data if "url" in entry]
And save them:
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(urls, f, indent=2)
Then save LLM results. This part is similar, but here we need to parse the AI output:
def download_and_extract_ai_data(json_url, job_id):
    output_path = f"parsed_ai_results_{job_id}.json"
    response = requests.get(json_url)
    response.raise_for_status()
    raw_data = response.json()
    # Code will be here
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(parsed, f, indent=2, ensure_ascii=False)

download_and_extract_ai_data(json_link, job_id)
Inside that placeholder, go through the raw AI results and convert them into a structured format:
parsed = []
for entry in raw_data:
    result = {"url": entry.get("url")}
    ai_raw = entry.get("aiResponse")
    if ai_raw:
        try:
            ai_data = json.loads(ai_raw)
            result.update(ai_data)
        except json.JSONDecodeError:
            result["ai_parse_error"] = True
    parsed.append(result)
The final file contains one object per crawled page, with the page URL and the structured data the LLM extracted.
This method saves you from having to crawl the links a second time just to extract data. You describe what you want, and the LLM handles the rest; there’s no need to write selectors.
Conclusion
We covered five straightforward methods for finding all URLs on a website: ready-made crawlers, sitemaps and robots.txt, SEO tools like Screaming Frog, search operators (site:, inurl:, intitle:, intext:), and custom scripts. Together they give you a complete toolkit for URL discovery.
One of the simplest ways to get started is with our Website Crawler, which lets you quickly list all URLs on any site without technical hassle. For those who want more control, our Web Scraping API can be a game-changer.
Whether you’re a beginner or a developer, these methods give you the flexibility to find and collect URLs however you prefer. We’ve included working scripts for each method, so give them a try and see which one works best for your next project.
