Basic Information About the Search Appliance Used by Cal Poly

[Image: a Google Search Appliance unit, bright yellow with holes in the front panel]

What is the appliance

The Cal Poly search service (CP Search) uses a Google Enterprise Search Appliance (ESA) to crawl and index documents on Cal Poly Web sites. The Google ESA, which is located at the Office of the Chancellor, is independent of the commercial Google search engine at www.google.com, yet it uses the same search technology. Consequently, using CP Search to search Cal Poly Web sites feels similar to using Google.com, and users should expect the same quality and search accuracy found there. The primary difference between the two is that the Cal Poly search service only includes Web sites that fall inside the several internet domains used by Cal Poly (with some exceptions; see below). This page provides an overview of which sites the search service indexes and how the appliance indexes them.

What is crawling

Crawling is what happens when a search engine sends a robot program (a crawler) to a start URL (web address) to discover the files (web pages and other documents) that are linked, directly or indirectly, from that start URL. The files found by the crawler are returned to the search server and checked for indexable content (e.g., certain image files are not indexable). The search engine then indexes (see below) each file by extracting the document's location, a summary of the page's content, all of its keywords, and additional fields that Google does not disclose. This information is then inserted into the master index collection for searching.
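
The crawl-and-index cycle described above can be sketched in a few lines of Python. This is only an illustration, not the appliance's actual implementation: the PAGES dictionary, the example.edu URLs, and the crawl function are all hypothetical, and the "index" here is just a keyword-to-URL map.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to (page text, outgoing links).
PAGES = {
    "http://example.edu/": ("Welcome to campus",
                            ["http://example.edu/a", "http://example.edu/b"]),
    "http://example.edu/a": ("Admissions information", ["http://example.edu/b"]),
    "http://example.edu/b": ("Campus map", ["http://example.edu/"]),
    "http://example.edu/orphan": ("Nobody links here", []),
}


def crawl(start_url, pages):
    """Breadth-first crawl from start_url.

    Returns the set of URLs reached and a toy index (keyword -> set of URLs).
    """
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        # "Indexing": record which URLs contain each keyword.
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
        # "Crawling": follow links to pages that have not been visited yet.
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen, index
```

In this sketch the page at /orphan is never reached, because no crawled page links to it.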

Cal Poly's search service crawls the campus "Web space" on a 24x7 schedule.

What gets indexed within the Cal Poly "Web Space"

The Cal Poly search service has three important parameters that define what websites it will and won't crawl: a list of URLs to start crawling from, a list of domains within which it is allowed to crawl, and a list of URL patterns that it shouldn't crawl. The most salient of these lists is the list of domains (the Cal Poly "Web Space") within which the crawler is allowed to crawl (see table below). The crawler crawls nearly all pages (URLs) contained on all webservers within these domains. These domains are considered to be strongly related to the Cal Poly academic mission:

The Cal Poly "Web Space"

Search Domain "Web Space"   Web Sites Associated with the Search Domain
calpoly.edu                 University Web sites in the calpoly.edu domain (e.g., www.afd.calpoly.edu and mustangdaily.calpoly.edu)
calpolyarts.org             Cal Poly Arts Program
calpolycorporation.org      Cal Poly Corporation
calpolyfoundation.org       Cal Poly Foundation
pacslo.org                  Performing Arts Center of San Luis Obispo
elcorralbookstore.com       El Corral Bookstore
gopoly.com                  Cal Poly Athletics/Sports Information
spranch.org                 Swanton Pacific Ranch
educationalwebservices.com  Cal Poly Corporation Web Services
cphousingcorp.org           Cal Poly Housing Corporation
itrc.org                    Cal Poly Irrigation Training and Research Center
xerxes.calstate.edu/slo/    Article Database front-end for Cal Poly's Kennedy Library

If the domain name of a Web site in question contains one of the entries from this list (e.g. www.cafes.calpoly.edu or ceng.calpoly.edu ), then it is very likely it is getting indexed by the search service. If the domain name of a website in question does not match one of the entries from the list above (e.g. www.calpolyjobs.org), then it won't be included in the search service index.
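
The domain check described above can be approximated as follows. This is a rough sketch under stated assumptions, not the appliance's configuration: the function name in_web_space is hypothetical, matching on the host name suffix is an assumption about how "contains one of the entries" is applied, and the start URLs and excluded URL patterns are omitted because this page does not list them.

```python
from urllib.parse import urlsplit

# Domains taken from the "Web Space" table on this page.
ALLOWED_DOMAINS = [
    "calpoly.edu", "calpolyarts.org", "calpolycorporation.org",
    "calpolyfoundation.org", "pacslo.org", "elcorralbookstore.com",
    "gopoly.com", "spranch.org", "educationalwebservices.com",
    "cphousingcorp.org", "itrc.org",
]
# The one table entry that is a host plus a path prefix.
ALLOWED_PREFIXES = [("xerxes.calstate.edu", "/slo/")]


def in_web_space(url):
    """Return True if url falls inside the listed Web space (assumed suffix match)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    for domain in ALLOWED_DOMAINS:
        # Match the domain itself or any host beneath it.
        if host == domain or host.endswith("." + domain):
            return True
    return any(host == h and parts.path.startswith(p)
               for h, p in ALLOWED_PREFIXES)
```

With these assumptions, in_web_space("http://www.cafes.calpoly.edu/") is true and in_web_space("http://www.calpolyjobs.org/") is false, matching the two examples in the text.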

This Web space is crawled continuously (24x7) by the Cal Poly search service, which means that a new site will often be searchable within one to three days of its go-live date, assuming it has been linked to (see How to Get Your Site Indexed (or Not)).

How the Google ESA crawls and finds web pages

One of the more common questions asked is "how does the Google ESA find and crawl my site?" While Google certainly keeps the magic of its search accuracy a secret, it has made no secret of how the search engine crawls through Web pages.

Using the Googlebot indexing agent, the crawling process begins at the home page of each of the above domains and proceeds to crawl ALL URLs that are linked from these pages and contained within the above domains. It may take 100 or even 1000 jumps for the crawler to find a page, but it will find it if it is ultimately linked from another page that gets indexed:

The Googlebot crawler can only find URLs by following links... The crawler can follow normal HTML links and links embedded in Flash content, MS Word documents and PDF files (see What Types of Documents Get Indexed below).

The Google crawler cannot follow links embedded in JavaScript code. If you have JavaScript links on your site, provide alternative links using plain HTML.

The Google crawler cannot submit HTML forms.

In other words, if you want your site to be crawled and indexed by the appliance, make sure it's getting linked to by a page or site that has already been indexed, and make sure your site provides adequate HTML-based links for the appliance to follow.
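
To see why plain HTML links matter, here is a small sketch of a link extractor that, like the behavior described above, only follows ordinary HTML anchors. It is an illustration and not Googlebot's code: the LinkExtractor class, the extract_links helper, the SAMPLE markup, and the www.example.edu address are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href targets from plain <a> tags only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # forms, spans with onclick handlers, etc. are ignored
        href = dict(attrs).get("href")
        # Skip javascript: pseudo-links; they cannot be followed as URLs.
        if href and not href.lower().startswith("javascript:"):
            self.links.append(urljoin(self.base_url, href))


# Hypothetical page fragment with one plain link and three things a
# link-following crawler cannot use.
SAMPLE = """
<a href="/about.html">About</a>
<a href="javascript:openPage('hidden.html')">Hidden</a>
<span onclick="location.href='also-hidden.html'">Also hidden</span>
<form action="/search"><input name="q"></form>
"""


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Only the "About" link is returned for SAMPLE; the JavaScript links and the form produce nothing to follow.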

For more information on how to get a Web site indexed by the search appliance see the section How to Get Your Site Indexed (or not).

What types of documents get indexed

The Googlebot indexing agent will do a full index on several types of documents. Although the Google ESA is capable of indexing well over 200 different file types, the Office of the Chancellor has limited the variety of file types indexed in order to maintain a high-quality user experience. Too many file types tend to clutter search results with "noise" links.

The Google Enterprise Search Appliance is set to index URLs that end with the following file-type extensions: .html, .htm, .ihtml, .ghtml, .phtml, .shtml, .asp, .jsp, .pl, .php, .cfm, .xml, .doc, .dot, .xls, .pdf, .ppt.

These include the following standard and vendor formats:

File Types that are Indexed

File Type  File Format
HTML       HTML output based Web documents (this includes .html, .htm, .ihtml, .ghtml, .phtml, .shtml, .asp, .jsp, .pl, .php, .cfm, and .xml)
PDF        Adobe Acrobat PDF documents
DOC, DOT   Microsoft Word (and other formats that use the .doc file-type extension)
XLS        Microsoft Excel
PPT        Microsoft PowerPoint

NOTE: As stated above in the section How the Google ESA crawls and finds web pages, a URL can only be indexed if it is ultimately linked from the start page at the top of one of the Web Space domains identified above (even if the link is 1000 links deep from the home page).
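
The file-type filter described in this section can be sketched as a simple extension check. This is an approximation under assumptions, not the appliance's configuration: the function name has_indexed_extension is hypothetical, matching extensions case-insensitively is an assumption, and the sketch does not cover URLs with no extension (such as a bare directory URL), which this page does not address.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions listed on this page as being indexed.
INDEXED_EXTENSIONS = {
    ".html", ".htm", ".ihtml", ".ghtml", ".phtml", ".shtml",
    ".asp", ".jsp", ".pl", ".php", ".cfm", ".xml",
    ".doc", ".dot", ".xls", ".pdf", ".ppt",
}


def has_indexed_extension(url):
    """Return True if the URL path ends with one of the listed extensions."""
    path = urlsplit(url).path          # drops any ?query or #fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in INDEXED_EXTENSIONS
```

For example, a URL ending in report.pdf passes the check, while one ending in photo.jpg does not.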

When does the search engine get updated (what is the crawl schedule)

The search service (the Google ESA) sends out the Googlebot web crawler to crawl campus Web sites continuously, 24 hours a day, 7 days a week. To confirm that the Googlebot has visited your Web site, look in your Web server log files for the User Agent name: csu-gsa-crawler
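
One way to look for those visits is to scan the log for the user agent string. The sketch below assumes a text access log in which the user agent appears somewhere on each request line; the crawler_hits function and the SAMPLE_LOG lines (addresses, dates, and paths) are made up for illustration, and your server's log format may differ.

```python
CRAWLER_AGENT = "csu-gsa-crawler"

# Hypothetical log lines in a common "combined" style; real logs will differ.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2010:10:00:00 -0800] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "csu-gsa-crawler"',
    '192.0.2.55 - - [01/Jan/2010:10:00:05 -0800] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0"',
]


def crawler_hits(log_lines):
    """Return the log lines whose text mentions the crawler's user agent."""
    return [line for line in log_lines if CRAWLER_AGENT in line.lower()]
```

To check a real file, the same function can be given an open file object, for example crawler_hits(open("access.log")), where access.log stands in for your server's log path.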

The Google crawler includes changed, newly created, or newly removed pages automatically during its crawls.

If your web page has been changed, the search results should reflect the updated page within 24 to 72 hours. If your page changes only rarely, it could take up to a week before the search appliance notices the change. See the "Can't find a document that I know is on a webserver" section if your changed document is NOT reflected in the search results.

If the search engine is generating too much traffic on your site during peak hours, please contact the Web Coordination Team to customize the traffic.

What is the master index collection versus a collection

The Google ESA master index collection is a repository (index) of all of the URLs that were crawled and accepted by the search process. It contains, among other things, document locations, document attributes, and keywords.

Collections (or sub-sets of the master index collection) can be created to provide the basis for a customized search. A collection contains specific URLs pulled from the master index collection. A manager account can be used on the Google search appliance to customize collection and front-end combinations to provide custom search results. To learn more about customized searches, see the section on Customizing Your Search.
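
Conceptually, a collection is a filtered view of the master index. The sketch below illustrates that idea only; it does not reflect how the appliance stores or configures collections. The MASTER_INDEX contents, the page paths, and the build_collection function are hypothetical, and selecting by URL prefix is an assumption.

```python
# Hypothetical master index: URL -> extracted information about the document.
MASTER_INDEX = {
    "http://www.afd.calpoly.edu/index.html": {"keywords": {"administration", "finance"}},
    "http://mustangdaily.calpoly.edu/news.html": {"keywords": {"news", "students"}},
    "http://www.calpolyarts.org/events.html": {"keywords": {"arts", "events"}},
}


def build_collection(master_index, url_prefixes):
    """Return the subset of master_index whose URLs start with any given prefix."""
    prefixes = tuple(url_prefixes)
    return {url: entry for url, entry in master_index.items()
            if url.startswith(prefixes)}
```

For instance, build_collection(MASTER_INDEX, ["http://mustangdaily.calpoly.edu/"]) keeps only the entry under that host, so a search against the collection would see just that site.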

How are the page descriptions (the snippet) created on the search results page

The summary, or "the snippet" as Google calls it, is determined by word concentration in the document, keyword proximity, and other factors. Information Technology Services has no control over how the summary is generated.