Web content wordlists
Note: The original article was posted on my company blog https://blog.sec-it.fr/.
Perimeter discovery is an important step during a web pentest and can, in some cases, lead to a website compromise. To carry out this reconnaissance, several tools are available, including web content wordlists for web fuzzing:
| Name | First release | Last update | Max size (lines) |
|---|---|---|---|
| Assetnote wordlists | 2020/11/16 | 2021/01/28 | 4,319,406 (httparchive_js_2020_11_18.txt) |
| Dirb wordlists | 2015/06/16 | 2015/06/16 | 20,469 (big.txt) |
| DirBuster wordlists | 2013/05/01 | 2013/05/01 | 220,560 (directory-list-2.3-medium.txt) |
| Dirsearch dicc.txt | 2013/05/22 | 2021/02/10 | 9,021 (dicc.txt) |
| Wfuzz wordlists | 2014/10/23 | 2019/03/14 | 45,459 (megabeast.txt) |
* This post was written in February 2021.
Note that this post only covers wordlists of routes, files and folders. Wordlists of passwords, such as rockyou.txt, are therefore not covered.
SecLists is a collection of multiple types of wordlists, including usernames, passwords, URLs, sensitive data patterns, fuzzing payloads, web shells, and many more.
SecLists is the security tester’s companion. […] The goal is to enable a security tester to pull this repository onto a new testing box and have access to every type of list that may be needed.
The repository is actively maintained: at the time of writing, its last commit was less than two weeks old. It is packaged by most pentesting Linux distributions, such as BlackArch and Kali Linux.
The wordlists covered here are located in Discovery/Web-Content/. There are many wordlists available (121 in the main folder). Some of them are specific to a given technology (oracle.txt …), others to a given language (common-and-dutch.txt …). The main wordlist family in SecLists is the “RAFT Word Lists”.
The RAFT wordlists were generated from the robots.txt files of 1.7 million websites and were originally provided with the RAFT tool in 2011. In this family, wordlists are organized as follows:
- 4 families (directories, extensions, files and words)
- 3 sizes per family (large, medium and small)
- 2 case options (normal and lowercase)
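The three axes above give 4 × 3 × 2 = 24 wordlists. As a minimal sketch, the file names can be enumerated as follows. The naming pattern `raft-<size>-<family>[-lowercase].txt` is an assumption inferred from names such as raft-large-files cited in this post, so check it against your local copy of SecLists.

```python
from itertools import product

# The three axes described in the list above.
FAMILIES = ("directories", "extensions", "files", "words")
SIZES = ("large", "medium", "small")
CASES = ("", "-lowercase")  # normal and lowercase variants


def raft_filenames():
    """Return every RAFT file name under the assumed naming pattern."""
    return [
        f"raft-{size}-{family}{case}.txt"
        for family, size, case in product(FAMILIES, SIZES, CASES)
    ]


if __name__ == "__main__":
    names = raft_filenames()
    print(len(names), "wordlists expected")
    for name in names:
        print(name)
```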
|Name||Size (lines) large||Size (lines) medium||Size (lines) small|
Looking at raft-*-files.txt, we get the following extension distribution:
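A distribution of this kind can be recomputed with a short script. The sketch below is not the code used for this post: the function name `extension_distribution` and the `(none)` label are hypothetical, and only the last extension of each entry is counted.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_distribution(words):
    """Return {extension: percentage} for an iterable of wordlist entries."""
    counts = Counter()
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip blank lines
        # Only the last suffix is kept ("a.tar.gz" counts as ".gz").
        suffix = PurePosixPath(word).suffix.lower()
        counts[suffix or "(none)"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ext: round(100 * n / total, 1) for ext, n in counts.most_common()}


if __name__ == "__main__":
    sample = ["index.php", "login.php", "admin", "Default.ASPX"]
    for ext, pct in extension_distribution(sample).items():
        print(f"{ext}: {pct}%")
```

To run it on a real file, pass the open file object instead of the sample list, for instance `extension_distribution(open(path, errors="ignore"))`, where `path` points to one of the raft-*-files.txt wordlists.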
SecLists also includes wordlists provided with dirbuster and dirb, covered in the rest of this post.
Assetnote is a company that provides security tools and services to measure exposure to external attack. The company also provides a repository named Assetnote Wordlist.
These wordlists are generated monthly from Google BigQuery datasets with their Go client named Commonspeak2, which produces content discovery and subdomain wordlists.
As these datasets are updated on a regular basis, the wordlists generated via Commonspeak2 reflect the current technologies used on the web.
Wordlists are generated per technology. In this post we focus on directories, API routes and the PHP, ASP.NET and JSP/JSPA languages.
Note: As the January 2021 wordlists seem less complete than previous ones, and the February 2021 wordlists are not available at this time, we focus on the November 2020 wordlists.
|Assetnote Directories||Assetnote API routes|
_ is treated as a wildcard in the previous graph.
Dirb is a web discovery tool already covered in a previous post. The tool is provided with multiple wordlists, including the following commonly used ones:
| common.txt (default wordlist for dirb) | 4,614 |
|Charsets in dirb family.|
These wordlists do not contain any extensions, and only 2% of the words contain capital letters. Note also that there are more “other” charsets in
common.txt than in
DirBuster is a web discovery tool that has also been covered in a previous post. The tool is provided with multiple wordlists, including the directory-list-2.3 wordlist family.
Some packaged versions may not include directory-list-2.3-big.txt.
Like the dirb wordlists, the directory-list-2.3 family doesn’t include any extensions.
|Charsets in directory-list-2.3 family.|
_ is treated as a wildcard in the previous graph.
dicc.txt is the wordlist provided with the dirsearch tool. Its particularity is that it uses the extension placeholder %EXT%. The wordlist must therefore be used with tools that support the %EXT% format (see the post about web discovery tools).
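To make the placeholder concrete, here is a simplified sketch of how a tool might expand %EXT% entries for a list of extensions chosen by the auditor. It is an illustration only, not dirsearch's actual implementation, and the function name `expand_ext` is hypothetical.

```python
def expand_ext(lines, extensions):
    """Yield candidate paths, replacing %EXT% with each given extension.

    Entries without the placeholder are yielded unchanged.
    """
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if "%EXT%" in entry:
            for ext in extensions:
                # Accept both "php" and ".php" as input.
                yield entry.replace("%EXT%", ext.lstrip("."))
        else:
            yield entry


if __name__ == "__main__":
    sample = ["index.%EXT%", "admin/", "backup.%EXT%"]
    for candidate in expand_ext(sample, ["php", "asp"]):
        print(candidate)
```

With two extensions, every %EXT% entry produces two candidates, so the number of requests grows with the number of extensions supplied.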
The wordlist has a total of 9021 lines, distributed as follows:
Note that there are “only” 500 words containing
The Wfuzz tool is provided with many wordlists. Some of them, in the “general” directory, are dedicated to directory and file enumeration. That’s the case of common.txt. None of these wordlists contain words with extensions. They are distributed as follows:
|Charsets in wfuzz family.|
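Charset breakdowns like the ones shown for these families can be approximated with a few lines of code. This is a sketch under assumptions: the category labels below are hypothetical and may not match the ones used in the original graphs, and the wildcard handling mentioned for some graphs is not reproduced.

```python
import string
from collections import Counter

LOWER = set(string.ascii_lowercase)
UPPER = set(string.ascii_uppercase)
DIGITS = set(string.digits)


def charset_of(word):
    """Classify a word by the smallest (hypothetical) charset containing it."""
    chars = set(word)
    if chars <= LOWER:
        return "a-z"
    if chars <= LOWER | UPPER:
        return "a-zA-Z"
    if chars <= DIGITS:
        return "0-9"
    if chars <= LOWER | UPPER | DIGITS:
        return "a-zA-Z0-9"
    return "other"  # anything with punctuation, spaces, non-ASCII, etc.


def charset_distribution(words):
    """Count how many non-empty entries fall into each charset."""
    stripped = (w.strip() for w in words)
    return Counter(charset_of(w) for w in stripped if w)


if __name__ == "__main__":
    sample = ["admin", "Admin", "2021", "v2", "wp-admin"]
    for label, count in charset_distribution(sample).items():
        print(label, count)
```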
In some cases, an auditor may look for a specific wordlist. Wordlistctl is a tool designed to fetch, install, update and search for wordlists. This Python script offers more than 6400 wordlists and is maintained by the BlackArch Linux distribution.
$ wordlistctl search wordpress
--==[ wordlistctl by blackarch.org ]==--
> wordpress (29.20 Kb)
> urls-wordpress-3 (36.62 Kb)
> wordpress-attacks-july2014 (88.00 B)
> wordpress_usernames (541.57 Mb)
> wordpress_attacks_july2014 (88 B)
$ wordlistctl fetch -l urls-wordpress-3
--==[ wordlistctl by blackarch.org ]==--
[*] downloading urls-wordpress-3.3.1.txt to /usr/share/wordlists/discovery/urls-wordpress-3.3.1.txt.part
[+] downloading urls-wordpress-3.3.1.txt completed
Without further ado, here is a comparative table of the different wordlists discussed in this post. Colored cells represent a high correlation between wordlists. The matrix should be read as follows:
"N% of the wordlist at row Y is contained in the wordlist at column X".
For example, 87% of wordlist n°17 (dirb - small) is contained in wordlist n°0 (seclists - raft-large-files).
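The reading rule above can be expressed as a small containment computation. The sketch below is not the code from the sec-it/WL-Comparison repository: function names are hypothetical, and exact, case-sensitive matching of entries is an assumption.

```python
def containment(row_words, col_words):
    """Percentage of the row wordlist's unique entries found in the column wordlist."""
    row, col = set(row_words), set(col_words)
    if not row:
        return 0.0
    return 100 * len(row & col) / len(row)


def containment_matrix(wordlists):
    """Build {row_name: {col_name: percent}} for a dict of name -> list of words."""
    names = list(wordlists)
    return {
        y: {x: round(containment(wordlists[y], wordlists[x])) for x in names}
        for y in names
    }


if __name__ == "__main__":
    # Tiny made-up lists, only to show the asymmetry of the matrix.
    lists = {
        "small": ["admin", "login"],
        "big": ["admin", "login", "backup", "test"],
    }
    for row_name, row in containment_matrix(lists).items():
        print(row_name, row)
```

The matrix is not symmetric: a small list can be fully contained in a large one while the large list is only partly contained in the small one, which is why rows and columns must be read in the stated order.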
The sources used to generate this chart are available on this repository: sec-it/WL-Comparison.
An interactive version of the chart is available online.
You can find SEC-IT at the address https://www.sec-it.fr.
Part of this content is under MIT License / © 2021 SEC-IT.