Using the WASP crawler
The crawler lets you retrieve tag information by recursively following links from a starting page.
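As a rough illustration of what "recursively following links" means, the sketch below walks a small in-memory link graph instead of real pages. The `LINKS` dictionary and the `crawl` function are hypothetical and are not part of WASP; the real crawler loads each page in the browser and records the tags it finds. The sketch uses a queue rather than literal recursion, which gives the same coverage.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/products", "/about"],
    "/products": ["/products/a", "/products/b", "/"],
    "/about": ["/"],
    "/products/a": [],
    "/products/b": ["/products/a"],
}

def crawl(start, links):
    """Return pages in the order visited, following links breadth-first."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited

print(crawl("/", LINKS))
# ['/', '/products', '/about', '/products/a', '/products/b']
```

Tracking the pages already seen is what keeps the crawl from looping when pages link back to each other.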
Getting ready for a crawl
Crawling is an intensive task that can take anywhere from a few minutes to several hours, depending on the crawl parameters and the nature of the site being analyzed.
Before launching a crawl:
- Free resources
- Disable unnecessary Firefox extensions, especially those that inspect loaded pages. Extensions consume memory, CPU and network bandwidth. Some popular extensions that should be disabled:
- Firebug
- Greasemonkey
- Google Toolbar
- Adblock Plus
- Exit other applications. The more computer resources you can free, the more efficient the crawler will be.
- Site architecture
- You should have a good understanding of the site to analyze:
- What is the content structure?
- How many pages are there?
- How "deep" is the content?
- How are URLs built? Are pages referenced with parameters passed on the query string?
- Are pages using templates? If so, you may only need to check a sample of those pages.
- Are there any secured areas (see below)?
- Are there any transactions (see below)?
- Secured areas
- If there are secured areas and you want the crawler to visit them, make sure to authenticate before launching the crawler.
WASP's behavior depends on the type of authentication:
- Standard HTTP authentication: a typical username/password dialog is shown. WASP will wait for your input unless you authenticated before launching the crawler.
- Form-based authentication: WASP does not automatically fill in forms. WASP will not visit any secured page unless you authenticated before launching the crawler.
- Transactions and workflows
- WASP will not automatically fill in forms and submit the data.
To test a checkout process, a subscription or any other transaction or workflow, you can use one of these strategies:
Step 1: Introduction
- Crawl from:
- URL address: Indicate the starting URL address, or simply click the "Here" button to start from the currently loaded page.
- Browse: If you list URLs in a text file, one per line, and point to that file as the starting point, WASP will check only those URLs. Also see WASP Market Research for a slight variation on this option.
- WASP Tag File:
- When crawling, WASP stores tag information and crawl state in a lightweight database. This field indicates the location of that database file. The default value is usually fine, but if you want to test multiple websites you might want to use a distinct database for each crawl session.
- Data policy:
- This parameter can take one of the following values:
- Flush: Delete existing data from the WASP Tag File (WTF) and start afresh with a new crawl.
- Resume: Check for any unscanned pages and resume from there. This option is especially useful if a previous crawl was stopped.
- Refresh: Rescan the URLs already stored in the WTF file but do not look for new ones.
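The three data policies can be pictured as three ways of choosing the initial work queue. The sketch below is only an illustration under stated assumptions: the `stored` dictionary stands in for the WTF file, the `urls_to_scan` function is hypothetical, and the idea that Resume keeps discovering new links is inferred from the description rather than confirmed.

```python
def urls_to_scan(policy, stored, start_url):
    """Pick the URLs a crawl would begin with under each data policy.

    `stored` maps URL -> True if that URL was already scanned
    (a stand-in for the WTF file, not its real format).
    Returns (queue, discover_new_links).
    """
    if policy == "flush":
        stored.clear()                      # drop existing data
        return [start_url], True            # start afresh from the start URL
    if policy == "resume":
        pending = [url for url, done in stored.items() if not done]
        return pending, True                # assumption: new links are still followed
    if policy == "refresh":
        return list(stored), False          # rescan known URLs, look for no new ones
    raise ValueError(f"unknown data policy: {policy!r}")

stored = {"/": True, "/a": True, "/b": False}
print(urls_to_scan("flush", dict(stored), "/"))    # (['/'], True)
print(urls_to_scan("resume", dict(stored), "/"))   # (['/b'], True)
print(urls_to_scan("refresh", dict(stored), "/"))  # (['/', '/a', '/b'], False)
```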
Step 2: Optimization
- Tools:
- You can either:
- Check specific tools: tell the crawler to check only for the tools found on the first page, which speeds up the process.
- Check for all known tools: look for every known tool, which takes a little longer. Note that Market Research forces this option.
- Approximate number of pages to scan:
- This is a rough estimate based on a Google "site:" search for the given starting URL. Take it with a grain of salt: at best, it gives you a rough idea of the crawl size and duration.
- Max pages to crawl:
- Most of the time you don't have to crawl the whole site to identify common problems. Use this value to limit the number of pages that will be crawled or set it to a higher number to crawl the whole site.
- Max depth:
- If you think of your website as a structured hierarchy of folders, sub-folders and files, this parameter is the number of sub-folder levels you want to descend. Depending on your site's content structure, use this value to limit the depth of the crawl, or set it to 0 to go as deep as possible.
- Requests delay:
- When set to 0, WASP will proceed with the next URL as soon as the current page is loaded. If you want to spread out the crawl and lower the load on your server, you can set a number of seconds to wait between requests. Usually the load generated by a single browser is low enough that you can leave this value at 0.
- Page load timeout:
- Sometimes a page will "hang" or load indefinitely. This value specifies how long you are willing to wait for a full page load. After this delay, WASP logs an error in the WTF file and moves on to the next URL.
- Page load retries:
- If a page fails to load because of a timeout, how many times should WASP retry the same URL?
- Follow robots.txt and META nofollow rules
- Regex include rule:
- Regex exclude rule:
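To make the Max depth and Page load retries settings above more concrete, here is a minimal sketch. It assumes depth is measured by counting sub-folders in the URL path, as the folder analogy suggests; WASP's actual rule may differ. The function names and the `load` callable, which is expected to raise `TimeoutError` when a page takes too long, are hypothetical.

```python
from urllib.parse import urlparse

def folder_depth(url):
    """Count the sub-folders in a URL path; a trailing file name is not counted."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if segments and not path.endswith("/"):
        segments = segments[:-1]  # treat the last segment as a file
    return len(segments)

def within_depth(url, max_depth):
    """A max_depth of 0 means no limit, as described for the Max depth setting."""
    return max_depth == 0 or folder_depth(url) <= max_depth

def load_with_retries(load, url, retries):
    """Call load(url) once, then up to `retries` more times after a timeout.

    Returns the loaded page, or None once every attempt has timed out
    (the point where an error would be logged and the crawl would move on).
    """
    for _ in range(1 + retries):
        try:
            return load(url)
        except TimeoutError:
            continue
    return None

print(folder_depth("http://example.com/a/b/page.html"))     # 2
print(within_depth("http://example.com/a/b/page.html", 1))  # False
print(within_depth("http://example.com/a/b/page.html", 0))  # True
```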
Step 3: Filter
- Your IP address:
- Modify User Agent string:
- Log DCMI META data:
- Increased intelligence:
- Quick load:
- Stealth mode:
Step 4: Summary
Step 5: Status
Step 6: Completion
Last updated: 2009-12-07