Basic Information About the Search Appliance used by Cal Poly

- What is the appliance
- What is crawling
- What gets indexed within the Cal Poly "Web Space"
- How the Google ESA crawls and finds Web pages
- What types of documents get indexed
- When does the search engine get updated (what is the crawl schedule)
- What's a collection
- How are the page descriptions (the snippet) created on the search results page
What is the appliance
The Cal Poly search service (CP Search) uses a Google Enterprise Search Appliance (ESA) for crawling and indexing documents on Cal Poly Web sites. The Google ESA, which is located at the Office of the Chancellor, is independent of the commercial Google search engine at www.google.com, yet it uses the same search technology that powers the commercial version of Google. Consequently, using CP Search to search Cal Poly websites feels similar to using the Google.com website, and users should expect the same quality and search accuracy found at Google.com. The primary difference between the two is that the Cal Poly search service only includes websites that fall inside the several internet domains used by Cal Poly (with minor exceptions; see below). This page provides an overview of which sites the search service indexes and how the appliance indexes those sites.
What is crawling
Crawling is what happens when a search engine sends a robot program (a crawler) to a start URL (web address) to discover the files (web pages and other documents) that are linked, directly or indirectly, from that start URL. The files found by the crawler are returned to the search server and parsed for indexable documents (e.g. certain image files are not indexable). The search engine then indexes (see below) each file by extracting the document's location, a summary of the page's content, all of the keywords, and special fields and other information that Google does not disclose. This information is then inserted into the master index collection for searching.
Cal Poly's search service crawls the campus "Web space" on a 24x7 schedule.
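The discovery step described above can be pictured as a breadth-first walk over links. The following is a minimal sketch in Python, not the appliance's actual implementation: the `crawl` function and the in-memory `links_from` dictionary are hypothetical stand-ins for fetching pages and extracting their links.

```python
from collections import deque


def crawl(start_url, links_from):
    """Return every URL reachable from start_url, in discovery order.

    links_from is a hypothetical dict mapping a URL to the URLs it links to;
    a real crawler would fetch each page and parse its links instead.
    """
    seen = {start_url}
    order = [start_url]
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for target in links_from.get(url, []):
            if target not in seen:  # visit each URL only once
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order
```

A page that nothing links to is never reached by this walk, which is why an unlinked site does not appear in the index.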
What gets indexed within the Cal Poly "Web Space"
The Cal Poly search service has three important parameters that define what websites it will and won't crawl: a list of URLs to start crawling from, a list of domains within which it is allowed to crawl, and a list of URL patterns that it shouldn't crawl. The most salient of these lists is the list of domains (the Cal Poly "Web Space") within which the crawler is allowed to crawl (see table below). The crawler crawls nearly all pages (URLs) contained on all webservers within these domains. These domains are considered to be strongly related to the Cal Poly academic mission:
| Search Domain "Web Space" | Web Sites Associated with the Search Domain |
|---|---|
| calpoly.edu | University Web sites in the calpoly.edu domain (e.g. includes www.afd.calpoly.edu and mustangdaily.calpoly.edu) |
| calpolyarts.org | Cal Poly Arts Program |
| calpolycorporation.org | Cal Poly Corporation |
| calpolyfoundation.org | Cal Poly Foundation |
| pacslo.org | Performing Arts Center of San Luis Obispo |
| elcorralbookstore.com | El Corral Bookstore |
| gopoly.com | Cal Poly Athletics/Sports Information |
| spranch.org | Swanton Pacific Ranch |
| educationalwebservices.com | Cal Poly Corporation Web Services |
| cphousingcorp.org | Cal Poly Housing Corporation |
| itrc.org | Cal Poly Irrigation Training and Research Center |
| xerxes.calstate.edu/slo/ | Article Database front-end for Cal Poly's Kennedy Library |
If the domain name of a Web site in question contains one of the entries from this list (e.g. www.cafes.calpoly.edu or ceng.calpoly.edu ), then it is very likely it is getting indexed by the search service. If the domain name of a website in question does not match one of the entries from the list above (e.g. www.calpolyjobs.org), then it won't be included in the search service index.
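As a rough illustration of the domain rule, the following Python sketch checks whether a URL falls inside the Web space listed above. The function name `in_web_space` and the exact matching rule (host equals a listed domain or is a subdomain of it) are assumptions made for this example; the appliance's own pattern matching is configured separately and may differ.

```python
from urllib.parse import urlsplit

# Domains copied from the table above.
SEARCH_DOMAINS = [
    "calpoly.edu", "calpolyarts.org", "calpolycorporation.org",
    "calpolyfoundation.org", "pacslo.org", "elcorralbookstore.com",
    "gopoly.com", "spranch.org", "educationalwebservices.com",
    "cphousingcorp.org", "itrc.org",
]
# The one table entry that is a host plus a path prefix, not a whole domain.
PATH_PREFIXES = [("xerxes.calstate.edu", "/slo/")]


def in_web_space(url):
    """Return True if url appears to fall inside the listed Web space."""
    # Accept bare host names such as "ceng.calpoly.edu" as well as full URLs.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    for domain in SEARCH_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return True
    for prefix_host, prefix_path in PATH_PREFIXES:
        if host == prefix_host and parts.path.startswith(prefix_path):
            return True
    return False
```

Under these assumptions, `in_web_space("www.cafes.calpoly.edu")` is true and `in_web_space("www.calpolyjobs.org")` is false, matching the examples in the text.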
This Web space is crawled continuously (24x7) by the Cal Poly search service, which means that a new site will often be searchable within one to three days of its go-live date (assuming it has been linked to; see How to Get Your Site Indexed (or Not)).
How the Google ESA crawls and finds web pages
One of the more common questions asked is "how does the Google ESA find and crawl my site?" While Google certainly keeps the magic of its search accuracy a secret, it has made no secret of how the search engine crawls through Web pages.
Using the Googlebot indexing agent, the crawling process begins at the home page of each of the above domains and proceeds to crawl ALL URLs that are linked from these pages and contained within the above domains. It may take 100 or even 1000 jumps for the crawler to find a page, but it will find it if it is ultimately linked from another page that gets indexed:
- The Googlebot crawler can only find URLs by following links... The crawler can follow normal HTML links and links embedded in Flash content, MS Word documents and PDF files (see What Types of Documents Get Indexed below).
- The Google crawler cannot follow links embedded in JavaScript code. If you have JavaScript links on your site, provide alternative links using plain HTML.
- The Google crawler cannot submit HTML forms.
In other words, if you want your site to be crawled and indexed by the appliance, make sure it's getting linked to by a page or site that has already been indexed, and make sure your site provides adequate HTML-based links for the appliance to follow.
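To check which links on a page are plain HTML links of the kind a crawler can follow, a short script along these lines may help. It is a sketch using Python's standard `html.parser` module; the class and function names are hypothetical, and it only models ordinary `<a href>` links (not links inside Flash, Word, or PDF files).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect plain <a href="..."> links, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs, same-page anchors, and JavaScript pseudo-links.
        if not href or href.startswith("#") or href.lower().startswith("javascript:"):
            return
        self.links.append(urljoin(self.base_url, href))


def plain_html_links(html, base_url):
    """Return the list of followable links found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Navigation that exists only in `onclick` handlers or `javascript:` URLs yields nothing here, which mirrors the limitation described above.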
For more information on how to get a Web site indexed by the search appliance see the section How to Get Your Site Indexed (or not).
What types of documents get indexed
The Googlebot indexing agent will do a full index on several types of documents. Although the Google ESA is capable of indexing WELL OVER 200 different file-types, the Office of the Chancellor has limited the variety of file-types indexed in order to maintain a high quality user experience. Too many file-types will tend to clutter search results with "noise" links.
The Google Enterprise Search Appliance is set to index URLs that end with the following file-type extensions: .html, .htm, .ihtml, .ghtml, .phtml, .shtml, .asp, .jsp, .pl, .php, .cfm, .xml, .doc, .dot, .xls, .pdf, .ppt.
These include the following standard and vendor formats:
| File Type | File Format |
|---|---|
| HTML | HTML output based Web documents (this includes .html, .htm, .ihtml, .ghtml, .phtml, .shtml, .asp, .jsp, .pl, .php, .cfm, and .xml) |
| PDF | Adobe Acrobat PDF documents |
| DOC, DOT | Microsoft Word (and other formats that use the .doc file-type extension) |
| XLS | Microsoft Excel |
| PPT | Microsoft PowerPoint |
NOTE: As stated above in the How the Google ESA crawls and finds web pages section, a URL can only be indexed if it is ultimately linked from the home page of one of the Web Space domains identified above (even if the link is 1000 links deep from the home page).
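The extension rule can be approximated with a small Python check. This is a sketch only: the function name is hypothetical, the comparison is assumed to be case-insensitive, and this page does not say how the appliance treats URLs with no extension (such as a bare directory URL), so the sketch simply returns False for them.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions copied from the list above.
INDEXED_EXTENSIONS = {
    ".html", ".htm", ".ihtml", ".ghtml", ".phtml", ".shtml",
    ".asp", ".jsp", ".pl", ".php", ".cfm", ".xml",
    ".doc", ".dot", ".xls", ".pdf", ".ppt",
}


def has_indexed_extension(url):
    """Return True if the URL path ends with one of the listed extensions."""
    path = urlsplit(url).path  # drops any ?query or #fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in INDEXED_EXTENSIONS
```

For example, a URL ending in `report.PDF` passes this check, while one ending in `logo.png` does not.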
When does the search engine get updated (what is the crawl schedule)
The search service (the Google ESA) sends out the Googlebot web crawler to crawl campus Web sites continuously, 24 hours a day, 7 days a week. You can tell that the crawler has visited your Web site by looking in your Web server log files for the User Agent name csu-gsa-crawler.
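Checking a log for crawler visits could be scripted roughly as follows. This Python sketch assumes a text access log in which each request is one line that includes the User Agent string; the function name and the `access.log` path in the usage comment are hypothetical, and log locations and formats vary by Web server.

```python
CRAWLER_AGENT = "csu-gsa-crawler"


def crawler_hits(log_lines):
    """Return the log lines that mention the crawler's User Agent name."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if CRAWLER_AGENT in line.lower()  # case-insensitive match (an assumption)
    ]


# Example usage (the path is a placeholder for your server's access log):
# with open("access.log", encoding="utf-8", errors="replace") as log_file:
#     for hit in crawler_hits(log_file):
#         print(hit)
```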
The Google crawler includes changed, newly created, or newly removed pages automatically during its crawls.
If your web page has been changed, the search results should reflect the updated page within 24 to 72 hours. If your page rarely changes, it could take up to a week before the search appliance notices the change. See the "Can't find a document that I know is on a webserver" section if your changed document is NOT reflected in the search results.
If the search engine is generating too much traffic on your site during peak hours, please contact the Web Coordination Team to customize the traffic.
What is the master index collection versus a collection
The Google ESA master index collection is a repository (index) of all of the URLs that were crawled and accepted by the search process. It contains, among other things, document locations, document attributes, and keywords.
Collections (or sub-sets of the master index collection) can be created to provide the basis for a customized search. A collection contains specific URLs pulled from the master index collection. A manager account can be used on the Google search appliance to customize collection and front-end combinations to provide custom search results. To learn more about customized searches, see the section on Customizing Your Search.
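The relationship between the master index collection and a collection can be sketched as a simple filter. In the Python example below, the dictionary layout, the function name, and the use of URL prefixes to select members are all assumptions made for illustration; the appliance's own collection configuration is not described on this page.

```python
def build_collection(master_index, url_prefixes):
    """Return the subset of master_index whose URLs start with any prefix.

    master_index is a hypothetical dict mapping each indexed URL to its
    stored document record.
    """
    prefixes = tuple(url_prefixes)
    return {
        url: record
        for url, record in master_index.items()
        if url.startswith(prefixes)
    }
```

A search restricted to such a collection would then only consider the URLs in the returned subset rather than the whole master index.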
How are the page descriptions (the snippet) created on the search results page
The summary, or "the snippet" as Google calls it, is determined by word concentration in the document, keyword proximity, and other factors. Information Technology Services has no control over how the summary is generated.
