Introduction

This post covers discovering and enumerating hidden content on a website. The room is part of the TryHackMe Junior Penetration Tester pathway.

Introduction to Content Discovery

First, we should ask: in the context of web application security, what is content? Content can be many things: a file, a video, a picture, a backup, or a website feature. When we talk about content discovery, we’re not talking about the obvious things we can see on a website; we mean the things that aren’t immediately presented to us and that weren’t necessarily intended for public access.

This content could be, for example, pages or portals intended for staff usage, older versions of the website, backup files, configuration files, or administration panels.

There are three main ways of discovering content on a website, which we’ll cover: manual, automated, and OSINT (Open-Source Intelligence).

Enumeration Techniques

robots.txt: This file tells web crawlers which paths they should not crawl. Because anyone can read it, it may reveal sensitive paths (e.g., /staff_portal) that the site’s creators intended to keep out of search results.
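
As a rough illustration, the Disallow entries can be pulled out of a robots.txt body with a few lines of Python. This is a minimal sketch: the function name `disallowed_paths` and the sample file contents are assumptions made for the example, and in practice the text would come from requesting /robots.txt on the target.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the paths named in Disallow rules, in order, without duplicates."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue  # blank line or something that is not a "field: value" rule
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow value means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths


# Hypothetical robots.txt body, using the example path from the text above.
sample = "User-agent: *\nAllow: /\nDisallow: /staff_portal\n"
print(disallowed_paths(sample))  # ['/staff_portal']
```

Each path returned this way is a candidate to visit by hand in the browser.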

sitemap.xml: Used by search engines for indexing, this file lists the URLs the site owner wants crawled. It gives an overview of the site’s structure and sometimes includes pages that are not linked from the visible site.
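
A sitemap is plain XML, so its URLs can be listed with the Python standard library. The sketch below assumes the common `<urlset>` / `<url>` / `<loc>` layout; the function name `sitemap_urls` and the sample document, including its page names, are made up for illustration.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Tags are usually namespaced, e.g. "{http://www.sitemaps.org/...}loc",
        # so compare only the part after the closing brace.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


# Hypothetical sitemap; the second URL stands in for a page not linked anywhere.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://example.com/</loc></url>"
    "<url><loc>http://example.com/hypothetical-hidden-page</loc></url>"
    "</urlset>"
)
for url in sitemap_urls(sample):
    print(url)
```

Comparing this list against the links visible on the site highlights pages worth a closer look.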

Favicon Analysis: The site’s favicon can hint at the underlying framework or CMS, typically when the developer left the framework’s default icon in place. By hashing the favicon and comparing it with known hashes, the website’s software stack can sometimes be identified.
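
The hash-and-compare step can be sketched as follows. This is only an illustration: it assumes MD5 as the hash (the format used by public lists such as the OWASP favicon database), and the lookup table, framework name, and icon bytes are placeholders, not real data.

```python
import hashlib


def favicon_md5(favicon_bytes: bytes) -> str:
    """Return the MD5 hex digest of the raw favicon file contents."""
    return hashlib.md5(favicon_bytes).hexdigest()


def identify_framework(favicon_bytes: bytes, known_hashes: dict[str, str]):
    """Look the favicon hash up in a table of known hashes; None if unknown."""
    return known_hashes.get(favicon_md5(favicon_bytes))


# Placeholder data: real use would read the downloaded favicon.ico from disk
# and load known_hashes from a published favicon hash list.
fake_icon = b"not a real icon"
known_hashes = {favicon_md5(fake_icon): "ExampleFramework (hypothetical)"}

print(identify_framework(fake_icon, known_hashes))
print(identify_framework(b"some other icon", known_hashes))  # None
```

A miss only means the icon is not in the table; a custom favicon says nothing about the framework either way.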

Examining HTTP Headers:

  • HTTP response headers can reveal information about the server, such as the web server software and version, the scripting language in use, and any custom headers (e.g., the X-FLAG header that carries a challenge flag in this room). Together they point to the server’s technology stack and, sometimes, to security misconfigurations.
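
To show what to look for, the sketch below filters a raw response header dump (for example, the text printed by a verbose curl request) down to the server-identifying and custom headers. The function name, the `WELL_KNOWN` set, and every header value in the sample are assumptions for the example.

```python
# Standard headers that commonly disclose server software (an illustrative set).
WELL_KNOWN = {"server", "via"}


def interesting_headers(raw_headers: str) -> dict[str, str]:
    """Keep headers that identify the server or are custom (X- prefixed)."""
    found = {}
    for line in raw_headers.splitlines():
        if ":" not in line:
            continue  # status line ("HTTP/1.1 200 OK") or blank line
        name, value = line.split(":", 1)
        name = name.strip()
        lowered = name.lower()  # header names are case-insensitive
        if lowered in WELL_KNOWN or lowered.startswith("x-"):
            found[name] = value.strip()
    return found


# Hypothetical response; the X-FLAG value is a placeholder, not the room's flag.
sample = (
    "HTTP/1.1 200 OK\n"
    "Server: nginx/1.18.0 (Ubuntu)\n"
    "X-Powered-By: PHP/7.4.3\n"
    "Content-Type: text/html\n"
    "X-FLAG: placeholder-value\n"
)
for name, value in interesting_headers(sample).items():
    print(f"{name}: {value}")
```

Version strings found this way can then be checked against known vulnerabilities for that software.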

Manual Framework Identification:

  • Framework details can sometimes be found in comments, footer text, or metadata within the HTML source code, such as a mention of the “THM Framework.” The framework’s documentation often lists default paths and default credentials (e.g., admin:admin), and trying these defaults can reveal sensitive sections of the application.
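
HTML comments are easy to miss when skimming page source, so it can help to list them programmatically. The following is a minimal sketch using Python's built-in `html.parser`; the class and function names and the sample page, including the wording of its comment, are invented for illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment fed to the parser."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data.strip())


def html_comments(page_source: str) -> list[str]:
    """Return all HTML comments in the page source, in document order."""
    collector = CommentCollector()
    collector.feed(page_source)
    collector.close()
    return collector.comments


# Hypothetical page source with a framework mention left in a comment.
sample = (
    "<html><body><p>Welcome</p>"
    "<!-- Page generated with the THM Framework -->"
    "</body></html>"
)
print(html_comments(sample))  # ['Page generated with the THM Framework']
```

A framework name recovered this way is the starting point for looking up its documentation and defaults.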

Using Google Dorks:

  • Google dorks narrow search results using operators such as:
    • site:example.com: Lists indexed pages under a specific domain.
    • inurl:keyword: Retrieves pages where URLs contain a specific keyword.
  • These methods help locate publicly accessible but potentially sensitive URLs, directories, and files.
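
The two operators above can be combined into a single query. The helper below is a small sketch that assembles the query string and a search URL to open in a browser; the function names are made up, and example.com and the keyword "admin" are placeholders.

```python
from urllib.parse import quote_plus


def build_dork(site=None, inurl=None, terms=""):
    """Join the chosen operators and free-text terms into one query string."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    if terms:
        parts.append(terms)
    return " ".join(parts)


def search_url(query: str) -> str:
    """Return a Google search URL for the query, with the query URL-encoded."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = build_dork(site="example.com", inurl="admin")
print(query)              # site:example.com inurl:admin
print(search_url(query))  # https://www.google.com/search?q=site%3Aexample.com+inurl%3Aadmin
```

The resulting URL is meant to be opened manually in a browser.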

Room Answers | TryHackMe Content Discovery

What is the directory in the robots.txt that isn’t allowed to be viewed by web crawlers?
What framework did the favicon belong to?
What is the path of the secret area that can be found in the sitemap.xml file?
What is the flag value from the X-FLAG header?
What is the flag from the framework’s administration portal?
What Google dork operator can be used to only show results from a particular site?
What online tool can be used to identify what technologies a website is running?
What is the website address for the Wayback Machine?
What is Git?
What URL format do Amazon S3 buckets end in?
What is the name of the directory beginning “/mo….” that was discovered?
What is the name of the log file that was discovered?

 
About the Author

Mastermind Study Notes is a group of authors and writers with experience across different fields. The group is led by Motasem Hamdan, a cybersecurity content creator and YouTuber.
