Can't find a document that I know is on a Web server
If you perform a search query and the specific page you are looking for cannot be found, perform the following checks to determine where the problem may exist:
Check the search query terms that you're using
Your content pages may not be considered relevant to the query you entered. Ensure that the query terms you are using exist on your target page.
Tune up your Web site
Web sites need to be made search engine friendly. Following the tips and guidelines for making pages more search friendly makes your site easier for search engine crawlers to find and index, and so easier to reach for Web visitors who search the Cal Poly Web space.
Is your site really linked up?
Make sure your site is ultimately linked from a Web page that is within the list of crawled domains and is also being indexed by the Google Enterprise Search Appliance. In other words, if you want your site to be crawled and indexed by the appliance, make sure there are links pointing to it from a page or site that has already been indexed, and make sure your site provides adequate internal links (plain HTML, not JavaScript) for the appliance to follow.
- The Google crawl robot may not be able to find your Web page, in which case the page is neither crawled nor indexed. For a Web page to be crawled and indexed by the Google Search Appliance, there must be a valid path of links to your site or page from the list of domains the appliance uses as starting points.
- The Google Search Appliance can only find URLs by following links... The crawler can follow normal HTML links and links embedded in Flash content, MS Word documents, and PDF files (see What Types of Documents Get Indexed).
- The Google crawler cannot follow links embedded in JavaScript code. If you have JavaScript links on your site, provide alternative links using plain HTML.
- The crawler cannot submit HTML forms.
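As a rough way to see which links on a page are plain HTML, the following Python sketch lists the `href` values of ordinary `<a>` tags and skips JavaScript-only links. It is only an approximation of the points above, not a description of how the appliance itself works: the class name, function name, and sample markup are hypothetical, and it does not look inside Flash, MS Word, or PDF content.

```python
from html.parser import HTMLParser


class PlainLinkCollector(HTMLParser):
    """Collect href values from ordinary <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs and javascript: pseudo-links; neither is a plain HTML link.
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)


def plain_links(html):
    """Return the plain HTML link targets found in an HTML string."""
    collector = PlainLinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page fragment: one plain link and two JavaScript-only links.
SAMPLE_PAGE = """
<a href="/about/index.html">About</a>
<a href="javascript:openPage('contact')">Contact</a>
<span onclick="window.location='/news/'">News</span>
"""

print(plain_links(SAMPLE_PAGE))
```

For the sample fragment only `/about/index.html` is reported. If a page you expect to be crawled is reachable solely through links that a check like this skips, adding plain HTML alternatives is the likely fix.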
Is something blocking the search engine crawler?
Your Web site or pages may have been intentionally blocked. The Web server on which the expected page is located may have a robots.txt file, or the individual Web page may have a ROBOTS meta tag, specifying that the page should not be indexed by the search service. The Google Enterprise Search Appliance crawler honors these requests and does not index the documents.
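Both kinds of block can be checked by hand, or with a short script such as the Python sketch below. It uses only the standard library and works on text you have already downloaded, so it makes no network requests. The user-agent name `gsa-crawler`, the function names, and the sample robots.txt and page are assumptions for illustration; confirm the crawler's actual user-agent name before relying on the result.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumption: the crawler's user-agent name. Confirm the real value for your appliance.
CRAWLER_AGENT = "gsa-crawler"


def blocked_by_robots_txt(robots_txt, url, agent=CRAWLER_AGENT):
    """Return True if the given robots.txt text disallows `url` for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


class RobotsMetaFinder(HTMLParser):
    """Record the directives of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # The tag name is matched case-insensitively, so ROBOTS and robots both count.
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def blocked_by_meta_tag(html):
    """Return True if the page asks not to be indexed (noindex, or none)."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives or "none" in finder.directives


# Hypothetical robots.txt and page, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
SAMPLE_PAGE = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'

print(blocked_by_robots_txt(SAMPLE_ROBOTS_TXT, "https://www.example.edu/private/report.html"))
print(blocked_by_robots_txt(SAMPLE_ROBOTS_TXT, "https://www.example.edu/news/index.html"))
print(blocked_by_meta_tag(SAMPLE_PAGE))
```

With the sample data, the first URL is blocked by robots.txt, the second is not, and the sample page is blocked by its meta tag.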
Was your Web site available when the crawler came knocking?
Your Web site may have been unavailable when the crawl robot attempted to access it because of a network or server outage. If this happens, the Google Enterprise Search Appliance will retry after a certain time interval, for no more than 3 weeks. If the site still cannot be crawled, it will not be included in the index.
Is your Web site on the outs?
Check to see if the Web server is part of the Cal Poly Web space. If it's not, it can't be crawled by the Google Enterprise Search Appliance. Make sure the site is hosted in a Web domain that is included in the list of crawled domains.
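If you have many URLs to check, a comparison against the list of crawled domains can be scripted. The Python sketch below is illustrative only: `example.edu` is a placeholder, not the actual list of crawled domains, and the function name is hypothetical. Substitute the real list before using it.

```python
from urllib.parse import urlsplit

# Placeholder for the list of crawled domains; replace with the real list.
CRAWLED_DOMAINS = ["example.edu"]


def in_crawled_domains(url, domains=CRAWLED_DOMAINS):
    """Return True if the URL's host equals, or is a subdomain of, a listed domain."""
    # urlsplit needs a scheme (http:// or https://) to recognize the host part.
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in domains)


print(in_crawled_domains("https://www.example.edu/dept/index.html"))
print(in_crawled_domains("https://www.example.com/dept/index.html"))
```

The match is made on whole domain labels, so a host such as `notexample.edu` is not mistaken for a subdomain of `example.edu`.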
Are you expecting a file type that isn't being indexed?
Only certain file types are crawled and included in the index, in order to maintain a high level of integrity in the search results and a quality user experience.
Okay, so maybe we've excluded your site from the index
Lastly, though it is not likely, we may have blocked either the directory or the Web server that contains the page you are looking for. (We have blocked some pages at the request of campus Web Coordinators because they were inappropriate to the campus mission or redundant with other pages.)
If you have looked into these issues and are still having problems with files not appearing in the search results, please contact the Web Coordination Team for assistance.
