Introduction
We covered discovering and enumerating hidden content on a website. This room is part of the TryHackMe Junior Penetration Tester pathway.
Introduction to Content Discovery
Firstly, we should ask: in the context of web application security, what is content? Content can be many things: a file, a video, a picture, a backup, or a website feature. When we talk about content discovery, we're not talking about the obvious things we can see on a website; we mean the things that aren't immediately presented to us and that weren't always intended for public access.
This content could be, for example, pages or portals intended for staff usage, older versions of the website, backup files, configuration files, administration panels, etc.
There are three main ways of discovering content on a website, which we'll cover: manual, automated, and OSINT (Open-Source Intelligence).
Enumeration Techniques
robots.txt: This file specifies directories that web crawlers should ignore. Accessing it may reveal sensitive paths (e.g., /staff_portal) that the site's creators intended to keep private.
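The check is as simple as browsing to /robots.txt. Below is a minimal sketch of automating it, assuming Python with the requests library; the base URL is a placeholder, not the room's target.

```python
# Minimal sketch: fetch robots.txt and list the Disallow entries.
# The base URL is a placeholder, not the room's target.
import requests

def disallowed_paths(base_url):
    resp = requests.get(f"{base_url}/robots.txt", timeout=10)
    resp.raise_for_status()
    paths = []
    for line in resp.text.splitlines():
        if line.lower().startswith("disallow:"):
            paths.append(line.split(":", 1)[1].strip())
    return paths

for path in disallowed_paths("https://example.com"):
    print(path)
```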
sitemap.xml: Used by search engines for indexing, this file lists URLs on the website, providing an overview of its structure, including hidden or internal pages.
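A similar sketch for sitemap.xml is shown below, assuming a standard <urlset> sitemap (a sitemap index file would need extra handling); again the URL is a placeholder.

```python
# Minimal sketch: pull sitemap.xml and print the listed URLs.
import requests
import xml.etree.ElementTree as ET

def sitemap_urls(base_url):
    resp = requests.get(f"{base_url}/sitemap.xml", timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    # Sitemap entries live in the sitemaps.org namespace.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

print("\n".join(sitemap_urls("https://example.com")))
```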
Favicon Analysis: The site’s favicon can hint at the underlying framework or CMS. By hashing the favicon and comparing it with known hashes, the website’s software stack can sometimes be identified.
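Below is a minimal sketch of the hashing step, assuming Python with requests; the resulting MD5 digest can then be looked up in a favicon hash database such as the OWASP favicon database.

```python
# Minimal sketch: hash a site's favicon so it can be compared against
# known framework/CMS favicon hashes (e.g., the OWASP favicon database
# keys on MD5). The URL is a placeholder.
import hashlib
import requests

def favicon_md5(base_url):
    resp = requests.get(f"{base_url}/favicon.ico", timeout=10)
    resp.raise_for_status()
    return hashlib.md5(resp.content).hexdigest()

print(favicon_md5("https://example.com"))
```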
Examining HTTP Headers:
- HTTP headers reveal useful information about the server, such as the server software and version, and any unique custom headers (e.g., an X-Flag header used as a challenge flag). Headers also provide insight into the server's technology stack and any potential security misconfigurations. A short sketch of inspecting headers follows this list.
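A minimal sketch of dumping response headers, assuming Python with requests; the X-Flag header name is taken from the room's example, and most sites obviously won't have it.

```python
# Minimal sketch: print all response headers, then check for a custom one.
import requests

resp = requests.get("https://example.com", timeout=10)
for name, value in resp.headers.items():
    print(f"{name}: {value}")

# A custom header, if present, can be read directly:
print(resp.headers.get("X-Flag", "no X-Flag header"))
```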
Manual Framework Identification:
- Framework details can sometimes be found in comments, footer text, or metadata within the HTML source code, such as the mention of a "THM Framework." Often, documentation or default credentials (e.g., admin:admin) are accessible, and using these defaults can reveal sensitive sections of the application.
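A minimal sketch of pulling a page and searching it for HTML comments and a meta "generator" tag, assuming Python with requests; the regular expressions are deliberately simple and the URL is a placeholder.

```python
# Minimal sketch: look for framework hints in HTML comments and the
# meta "generator" tag of a page's source.
import re
import requests

resp = requests.get("https://example.com", timeout=10)
html = resp.text

comments = re.findall(r"<!--(.*?)-->", html, re.DOTALL)
generator = re.search(r"<meta[^>]+name=[\"']generator[\"'][^>]*>", html, re.IGNORECASE)

for c in comments:
    print("comment:", c.strip()[:120])
if generator:
    print("generator tag:", generator.group(0))
```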
Using Google Dorks:
- Google Dorking: Google dorks enhance content discovery by filtering search results with specific operators:
  - site:example.com — lists indexed pages under a specific domain.
  - inurl:keyword — retrieves pages whose URLs contain a specific keyword.
- These methods help locate publicly accessible but potentially sensitive URLs, directories, and files; see the example queries after this list.
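For illustration only (the domain, keywords, and file types below are placeholders, not targets from the room), operators can be combined to narrow results:

```
site:example.com inurl:admin
site:example.com filetype:log
site:example.com intitle:"index of"
```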
Room Answers | TryHackMe Content Discovery
What is the directory in the robots.txt that isn’t allowed to be viewed by web crawlers?
What is the name of the log file that was discovered?