Automation Action: Web Spider
Crawls a web site and returns a list of all URLs found.
Crawls (spiders) a URL and returns a list of all URLs found. The list can be returned either as text with one URL per line, or as CSV or JSON containing each URL with its Title, Description and Keywords.
The Web Spider Action only crawls pages on the site at the specified URL. It does not crawl outbound links.
Specify the URL to spider.
Specify any Avoid Patterns (separated by semicolons). These are wildcard patterns that prevent matching URLs from being spidered. For example, if "*/assets/*" is added, then any URL containing "/assets/" is not spidered. The "*" character matches zero or more of any character.
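The wildcard rule described above can be sketched outside the product as follows. This is an illustration only, not the Action's actual implementation: the function names pattern_to_regex and is_avoided are hypothetical, and the sketch assumes that matching is case-sensitive and that a pattern must match the whole URL.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a regex where only '*' is special."""
    # Escape every literal piece, then join the pieces with '.*'
    # so that '*' matches zero or more of any character.
    pieces = (re.escape(piece) for piece in pattern.split("*"))
    return re.compile(".*".join(pieces), re.DOTALL)


def is_avoided(url: str, avoid_patterns: str) -> bool:
    """Return True if the URL matches any pattern in a semicolon-separated list."""
    patterns = [p.strip() for p in avoid_patterns.split(";") if p.strip()]
    # Assumption: a pattern must match the entire URL (fullmatch).
    return any(pattern_to_regex(p).fullmatch(url) for p in patterns)


if __name__ == "__main__":
    avoid = "*.js;*/assets/*"
    for url in (
        "https://www.mysite.com/assets/logo.png",
        "https://www.mysite.com/app.js",
        "https://www.mysite.com/page2.htm",
    ):
        print(url, "avoided" if is_avoided(url, avoid) else "spidered")
```

With the pattern list "*.js;*/assets/*", the first two URLs in the demo are avoided and the third is spidered.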
Set the Maximum URLs that you want to spider for the site.
Enable the Chop Querystrings option to remove the ?query portion from any URLs. This can help avoid spidering auto-generated content.
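To show what removing the ?query portion means in practice, here is a minimal sketch using Python's standard urllib.parse module. The helper names chop_querystring and chop_and_dedupe are hypothetical, and the de-duplication step is an assumption about why chopping helps, not a statement of how the Action behaves internally.

```python
from urllib.parse import urlsplit


def chop_querystring(url: str) -> str:
    """Return the URL with its ?query portion removed."""
    return urlsplit(url)._replace(query="").geturl()


def chop_and_dedupe(urls):
    """Chop every URL, then drop duplicates while keeping first-seen order."""
    # Assumption: URLs that differ only by query string count as the same page.
    return list(dict.fromkeys(chop_querystring(u) for u in urls))


if __name__ == "__main__":
    found = [
        "https://www.testsite.com/page2.htm?id=1",
        "https://www.testsite.com/page2.htm?id=2",
        "https://www.testsite.com/",
    ]
    for url in chop_and_dedupe(found):
        print(url)
```

In this sketch, the two page2.htm URLs collapse into one entry, which is the kind of auto-generated variation that chopping is meant to suppress.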
The Web Spider Action will check any robots.txt file. It will not download pages denied by robots.txt.
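If you want to predict which pages a robots.txt file would deny before running the Action, the check can be approximated with Python's standard urllib.robotparser module. This is a sketch under assumptions: the sample robots.txt content and the user agent name "MySpider" are invented for illustration, and the Action's own parser and user agent may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "MySpider") -> bool:
    """Return True if robots.txt permits the given user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = build_parser(SAMPLE_ROBOTS_TXT)
    for url in (
        "https://www.testsite.com/page2.htm",
        "https://www.testsite.com/private/report.htm",
    ):
        print(url, "allowed" if is_allowed(robots, url) else "denied")
```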
The Return As option can be set to:
URLs one per line
For example:
https://www.testsite.com/
https://www.testsite.com/page2.htm
CSV Containing URL, Title, Description, Keywords
For example:
URL,Title,Description,Keywords
https://www.testsite.com/,Title 1,Test Description 1,"keyword1,keyword2"
https://www.testsite.com/page2.htm,Title 2,Test Description 2,"keyword1,keyword2"
JSON Array Containing URL, Title, Description, Keywords
For example:
[
{
"URL": "https://www.testsite.com/",
"Title": "Title 1",
"Description": "Test Description 1",
"Keywords": "keyword1,keyword2"
},
{
"URL": "https://www.testsite.com/page2.htm",
"Title": "Title 2",
"Description": "Test Description 2",
"Keywords": "keyword1,keyword2"
}
]
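If the results are passed on to an external script, the CSV and JSON formats shown above can be read with Python's standard csv and json modules. The sketch below assumes the output has exactly the column and key names shown in the examples; the function names parse_csv_result and parse_json_result are hypothetical.

```python
import csv
import io
import json


def parse_csv_result(text: str) -> list:
    """Parse the CSV return format into a list of dicts keyed by the header row."""
    # csv handles the quoted Keywords field, which itself contains commas.
    return list(csv.DictReader(io.StringIO(text)))


def parse_json_result(text: str) -> list:
    """Parse the JSON Array return format into a list of dicts."""
    return json.loads(text)


def keywords_of(page: dict) -> list:
    """Split a page's comma-separated Keywords value into a list."""
    return [k.strip() for k in page["Keywords"].split(",") if k.strip()]


if __name__ == "__main__":
    csv_text = (
        "URL,Title,Description,Keywords\n"
        'https://www.testsite.com/,Title 1,Test Description 1,"keyword1,keyword2"\n'
    )
    for page in parse_csv_result(csv_text):
        print(page["URL"], page["Title"], keywords_of(page))
```

Both parsers return the same shape, a list of dictionaries with URL, Title, Description and Keywords entries, so later processing does not need to know which Return As option was used.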
Select the variable to receive the results from the Assign To list.
You can also assign a list of outbound links found across all URLs spidered. Select the variable to receive outbound links from the Assign Outbound Links to list. Outbound links are returned as a text string with one link per line.
This Action is useful when you need to load content for an entire site, for example when adding a site to a Knowledge Store. You could first spider the site and then use the For..Each.. Line In action to loop through the returned URLs, adding the content of each page to a Knowledge Store Collection and using the page title as the article title. For example:
// add site to knowledge store
URL =
URLS =
Title =
Content =
URLS = Web Spider URL https://www.mysite.com Avoid *.js;*/assets/*
For Each Line In %URLS% [Assign To URL]
    Content = HTTP Get From %URL% Convert To Markdown [Assign Title To: Title]
    If %Title% Is Not Blank Then
        Embedded Knowledge Store MyKnowledgeBase Update Title = %Title% %Content%
    End If
Next Loop
Note: This action may take several minutes for large sites.