FME Hub user andreas_h just uploaded a new transformer to the FME Hub.

StaticWebCrawler

This custom transformer is a web crawler that extracts links from anchor tags and iframes within a specified domain. It follows common crawler etiquette by honoring robots.txt rules. It can also discover sitemap XML files and use their URL priority values to decide which pages to crawl first.
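
To illustrate the kind of link extraction described here, the following is a minimal sketch using only the Python standard library. It is not the transformer's actual code: the class name LinkExtractor and the function extract_links are hypothetical, and resolving relative links against the page URL is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (url, link_type) pairs from <a href> and <iframe src> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Assumption: relative links are resolved against the page URL.
            self.links.append((urljoin(self.base_url, attrs["href"]), "anchor"))
        elif tag == "iframe" and attrs.get("src"):
            self.links.append((urljoin(self.base_url, attrs["src"]), "iframe"))


def extract_links(html, base_url):
    """Return the links found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Tags without an href or src value are skipped, so only usable links are reported.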

Key Features:

Robots.txt Compliance: Automatically fetches and respects Disallow and Allow rules from a domain's robots.txt file.

Sitemap Integration: Discovers sitemaps from robots.txt and uses them to prioritize URLs for crawling based on their <priority> value.

Domain Scoping: Restricts crawling to a specific domain so the crawler does not follow links to external sites.

Configurable Limits: Set a maximum number of pages to crawl and a delay between requests to be respectful to servers.

Comprehensive Output: Provides detailed attributes for each discovered link, including its source, type (anchor, iframe, or sitemap), and metadata from sitemaps.
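
The robots.txt, sitemap, and domain-scoping features above could be approximated as in the sketch below. It uses Python's urllib.robotparser and xml.etree.ElementTree and is not the transformer's implementation. The helper names (in_domain, load_robots, sitemap_urls_by_priority) are hypothetical, treating subdomains as part of the search domain is an assumption, and URLs without a <priority> element fall back to 0.5, the default defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def in_domain(url, search_domain):
    """True if url's host is search_domain or one of its subdomains (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    domain = search_domain.lower()
    return host == domain or host.endswith("." + domain)


def load_robots(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def sitemap_urls_by_priority(sitemap_xml):
    """Return sitemap URLs sorted from highest to lowest <priority>."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        # The sitemap protocol defines 0.5 as the default priority.
        priority = float(
            node.findtext("sm:priority", default="0.5", namespaces=SITEMAP_NS)
        )
        if loc:
            entries.append((priority, loc))
    # sort() is stable, so URLs with equal priority keep their sitemap order.
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [loc for _, loc in entries]
```

A crawler built this way would call robots.can_fetch(user_agent, url) and in_domain(url, search_domain) before each request, and robots.site_maps() to find the sitemap URLs listed in robots.txt.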

Input Attributes:

Required:

target_url (string): The starting URL for the crawl.

search_domain (string): The domain to confine the crawl to (e.g., "example.com").

Optional:

max_iterations (integer, default: 10): The maximum number of pages to visit.

delay_seconds (float, default: 1.0): The delay between HTTP requests.

user_agent (string, default: '*'): The user-agent string to use for robots.txt checks.

respect_robots (boolean, default: True): Set to False to disable robots.txt checks.

use_sitemaps (boolean, default: False): Set to True to enable sitemap discovery and prioritized crawling.
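
For readers who want to mirror these inputs in their own Python code, the attributes and defaults listed above can be gathered into a small settings object. This is only a sketch: the class name CrawlSettings is hypothetical, and the range checks are assumptions rather than documented behavior of the transformer.

```python
from dataclasses import dataclass


@dataclass
class CrawlSettings:
    # Required inputs
    target_url: str
    search_domain: str
    # Optional inputs, with the defaults listed in the post
    max_iterations: int = 10
    delay_seconds: float = 1.0
    user_agent: str = "*"
    respect_robots: bool = True
    use_sitemaps: bool = False

    def __post_init__(self):
        # Assumed sanity checks; the transformer may validate differently.
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        if self.delay_seconds < 0:
            raise ValueError("delay_seconds must not be negative")
```

For example, CrawlSettings("https://example.com/", "example.com", use_sitemaps=True) keeps every other default and only switches on sitemap discovery.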

Output Attributes:

Each output feature represents a discovered link and will contain the following attributes:

link: The discovered URL.

source_url: The URL where the link was found.

link_type: The type of link (anchor, iframe, or sitemap).

crawl_order: The order in which the link was discovered.

priority: The priority of the link from the sitemap (if applicable).

lastmod: The last modification date from the sitemap (if applicable).

changefreq: The change frequency from the sitemap (if applicable).
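
To show how records with these attributes might be produced, here is a minimal breadth-first crawl loop. It is a sketch under stated assumptions, not the transformer's logic: the function crawl and the fetch_links callable are hypothetical, crawl_order is assumed to start at 1, every discovered link is reported even if it was seen before, and the sitemap attributes are left empty. The request delay and robots.txt checks are omitted to keep the example short.

```python
from collections import deque


def crawl(start_url, fetch_links, max_iterations=10):
    """Visit up to max_iterations pages and return one record per discovered link.

    fetch_links is a hypothetical callable: url -> list of (link, link_type).
    """
    queue = deque([start_url])
    visited = set()
    records = []
    while queue and len(visited) < max_iterations:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link, link_type in fetch_links(url):
            records.append({
                "link": link,
                "source_url": url,
                "link_type": link_type,
                "crawl_order": len(records) + 1,  # assumed to be 1-based
                # Sitemap metadata stays empty for links found in page HTML.
                "priority": None,
                "lastmod": None,
                "changefreq": None,
            })
            if link not in visited:
                queue.append(link)
    return records
```

Passing the page fetcher in as a callable keeps the loop testable without network access; a real crawler would wrap an HTTP request and a link parser behind it.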


