Panscient operates a large-scale web crawler that visits millions of websites on a regular basis. Like the crawlers used by the large search engines, it scans public websites for specific types of information to include in vertical search engines.
Panscient primarily crawls the web looking for corporate information, such as company names, addresses, executive biographies, job openings and product information. We also crawl the web to locate genealogy pages, such as birth, marriage and death records, obituaries and census records.
Our web crawler only accesses publicly available information published on websites. We respect the rights of website owners to control what content our crawler analyzes. Our crawler obeys the Robots Exclusion Standard, and will not collect content from any pages that are off-limits to robots.
We crawl the entire list of registered .com domain names, which is publicly available through Verisign. Once you register a domain name, our crawler will periodically check it for business information.
The Panscient web crawler identifies itself using the user-agent "panscient.com", and obeys the Robots Exclusion Standard. To exclude the Panscient web crawler from portions of your site, please modify your website's robots.txt file to identify the directories and files which the crawler should not request. Our web crawler also obeys the robots meta tag directives "noindex" and "nofollow", which can be placed in the head section of individual web pages.
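As an illustration of the robots meta tag mentioned above, here is a minimal Python sketch of how a crawler could read the "noindex" and "nofollow" directives from a page. It is not Panscient's actual code; the class name RobotsMetaParser, the function robots_directives and the sample page are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks robots not to index it or follow its links.
page = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "</head><body>Private listing</body></html>"
)
print(robots_directives(page))
```

A crawler that honors these directives would skip indexing the page when "noindex" is present and would not queue its outgoing links when "nofollow" is present.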
To completely exclude our web crawler from your site, add the following entry to your robots.txt file:
User-Agent: panscient.com
Disallow: /
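If you would like to check the entry before publishing it, one option is Python's standard urllib.robotparser module. The sketch below parses the two lines above and asks whether a given user-agent may fetch a URL. The helper name is_allowed, the second user-agent string and the example.com URL are assumptions made for illustration, and the standard library parser is only an approximation of how any particular crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt entry shown above.
ROBOTS_TXT = """\
User-Agent: panscient.com
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://www.example.com/index.html"  # placeholder URL
print(is_allowed(ROBOTS_TXT, "panscient.com", url))   # False: blocked everywhere
print(is_allowed(ROBOTS_TXT, "some-other-bot", url))  # True: entry does not apply
```

Because the entry names only the "panscient.com" user-agent, other robots are unaffected by it.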
Our web crawler attempts to extract links to valid web pages from JavaScript and other scripting languages. The crawler may misinterpret the information in these scripts and request a page that does not actually exist. These requests are attempts to retrieve valid web content, and are not attempts to circumvent your web server security.
The Panscient web crawler will request at most one page per second from any single domain name or IP address.
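To make the rate limit concrete, here is a minimal Python sketch of one way such a limit could be enforced: remember the time of the last request per domain name and per IP address, and wait until both are at least one second old. This is an assumption about the general technique, not a description of Panscient's implementation; the class PolitenessLimiter, its methods and the sample hosts are hypothetical.

```python
class PolitenessLimiter:
    """Tracks the last request time per domain name and per IP address."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval  # seconds between requests
        self._last_request = {}           # (kind, value) -> timestamp

    def wait_time(self, domain, ip, now):
        """Seconds to wait before a request to (domain, ip) is allowed."""
        wait = 0.0
        for key in (("domain", domain), ("ip", ip)):
            last = self._last_request.get(key)
            if last is not None:
                # The limit applies if EITHER the domain or the IP was hit recently.
                wait = max(wait, self.min_interval - (now - last))
        return wait

    def record(self, domain, ip, now):
        """Note that a request to (domain, ip) was sent at time now."""
        self._last_request[("domain", domain)] = now
        self._last_request[("ip", ip)] = now


limiter = PolitenessLimiter()
limiter.record("a.example", "192.0.2.1", now=0.0)

# Same domain on a different IP, 0.25 s later: still has to wait.
print(limiter.wait_time("a.example", "192.0.2.9", now=0.25))
# Different domain hosted on the same IP, 0.5 s later: still has to wait.
print(limiter.wait_time("b.example", "192.0.2.1", now=0.5))
# Unrelated domain and IP: no wait.
print(limiter.wait_time("b.example", "192.0.2.9", now=0.5))
```

Timestamps are passed in explicitly so the behavior is easy to test; a real crawler would typically supply time.monotonic() and sleep for the returned duration.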
Contact us at crawler@panscient.com and we'll do our best to respond to your query promptly.