Make a Search Engine-Friendly Web Site
Tips from Google
The following tips and guidelines have been provided by Google for appliance customers (and everybody else):
- Publishing Best Practices
- Using Googleon/Googleoff tags
- Robots.txt and the Google Search Appliance
- Robots Meta Tags and the Google Search Appliance
- How To Prevent Indexing (Keeping an entire Web site or a folder or an individual page out of the Google Index)
- Other Google Resources
Google's Publishing Best Practices
When working with the Google Search Appliance, use these tips and guidelines to improve the search experience for users trying to find your content.
Content and Design
Make web pages for users, not for search engines
Create a useful, information-rich site. Write pages that clearly and accurately describe your content. Don't load pages with irrelevant words. Think about the words users would type to find your pages, and make sure that your site actually includes those words.
Focus on text
Focus on the text on your site. Make sure that your TITLE tag (for page titles) and ALT attributes (for describing images) are descriptive and accurate. Since the Google crawler doesn't recognize text contained in images, avoid using graphical text and instead place information within the alt and title attributes for images. When linking to non-HTML documents, use descriptive anchor text that explains what the link points to, and identify the document type in the link text (DOC, PDF, XLS, PPT, etc.).
Make your site easy to navigate
Make a site with a clear hierarchy of hypertext links. Every page should be reachable from at least one hypertext link. Offer a site map to your users with hypertext links that point to the important parts of your site. Keep the links on a given page to a reasonable number (fewer than 100).
Ensure that your site is linked
Ensure that your site is linked from all relevant sites within your network. Interlinking between sites and within sites gives the Google crawler more ways to find content and improves the quality of the search.
Make sure that the Google crawler can read your content
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in multiple copies of the same document being indexed for your site, as crawl robots will see each unique URL (including session ID) as a unique document.
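As an illustration of why session IDs multiply documents, the following minimal Python sketch strips session-tracking parameters so that URL variants collapse to one address. The parameter names (sessionid, sid, phpsessid) and the host www.example.edu are assumptions chosen for illustration; they are not a list used by the Google crawler.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; adjust to whatever your site emits.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}

def strip_session_params(url):
    """Return url with session-tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

# Two visits to the same page, each with a different session ID.
urls = [
    "http://www.example.edu/catalog.html?sessionid=abc123",
    "http://www.example.edu/catalog.html?sessionid=xyz789",
]

# A crawler sees two distinct URLs; without the session ID there is only one.
unique = {strip_session_params(u) for u in urls}
```

A site that serves bots the plain URL in the first place avoids the duplicate entries altogether.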
Ensure that your site's internal link structure provides a hypertext link path to all of your pages. The Google search engine follows hypertext links from one page to the next, so pages that are not linked to by others may be missed. Additionally, you should consult the administrator of your Google Search Appliance to ensure that your site's home page is accessible to the search engine.
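One way to check for pages that a link-following crawler would miss is to walk the site's link graph from the home page. The sketch below assumes you already have a mapping of each page to the pages it links to; the page names are hypothetical.

```python
from collections import deque

def unreachable_pages(links, start):
    """Return pages that cannot be reached by following links from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return set(links) - seen

# Hypothetical site: each key is a page, each value the pages it links to.
site = {
    "/index.html": ["/about.html", "/news.html"],
    "/about.html": ["/index.html"],
    "/news.html": [],
    "/orphan.html": ["/index.html"],  # links out, but nothing links to it
}

missing = unreachable_pages(site, "/index.html")
```

Any page reported here needs at least one inbound hypertext link before a crawler can discover it.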
Use robots standards to control search engine interaction with your content
Make use of the robots.txt file on your web server. This file tells crawlers which files and directories can or cannot be crawled, including various file types. If the search engine receives an error other than 404 (Not Found) when requesting this file, no content will be crawled on that server. The robots.txt file is checked on a regular basis, but changes may not take effect immediately. Each port (including the HTTP and HTTPS ports) requires its own robots.txt file.
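To see how a set of robots.txt rules would be interpreted, you can test them with Python's standard urllib.robotparser module. This is a sketch only: the rules, the directory names, and the host www.example.edu are made up for illustration, and the standard-library parser is not the appliance's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that keeps two folders out of the crawl.
rules = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a crawler identifying itself as gsa-crawler may fetch each URL.
allowed = parser.can_fetch("gsa-crawler", "http://www.example.edu/index.html")
blocked = parser.can_fetch("gsa-crawler", "http://www.example.edu/private/notes.html")
```

Here allowed is True and blocked is False, because only URLs under the disallowed folders are excluded.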
Use robots meta tags to control whether individual documents are indexed, to control whether the links on a document should be crawled, and to control whether the document should be cached. The "NOARCHIVE" value for robots meta tags is supported by the Google search engine to block cached content, even though it is not mentioned in the robots standard.
For information on how robots.txt files and ROBOTS meta tags work, see Cal Poly's Web Authoring Resource Center.
Let the search engine know how fresh your content is
Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell the Google Search Appliance whether your content has changed since it last crawled your site. Supporting this feature saves you bandwidth and overhead.
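The decision a web server makes when it receives an If-Modified-Since header can be sketched as follows. This is a simplified illustration, not the behavior of any particular server: the function name and the dates are hypothetical, and real servers handle more cases.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_status(if_modified_since, last_modified):
    """Return 304 if the document is unchanged since the given date, else 200."""
    if if_modified_since is None:
        return 200  # no header sent: return the full document
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: return the full document
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200

# Hypothetical timestamps: the page changed on 10 January 2024.
page_changed = datetime(2024, 1, 10, 12, 0, 0, tzinfo=timezone.utc)
crawl_after_change = format_datetime(datetime(2024, 2, 1, tzinfo=timezone.utc), usegmt=True)
crawl_before_change = format_datetime(datetime(2024, 1, 1, tzinfo=timezone.utc), usegmt=True)

status_unchanged = conditional_status(crawl_after_change, page_changed)
status_changed = conditional_status(crawl_before_change, page_changed)
status_first_visit = conditional_status(None, page_changed)
```

A 304 (Not Modified) response carries no body, which is where the bandwidth saving comes from.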
Avoid using frames
The Google search engine supports frames to the extent that it can. Frames tend to cause problems with search engines, bookmarks, e-mail links and so on, because frames don't fit the conceptual model of the web (where every document corresponds to a single URL). Consequently, the use of frames on Web sites is discouraged.
Searches that return framed pages will most likely only produce hits against the "body" HTML page and present it back without the original framed "Menu" or "Header" pages. Google recommends that you use tables or dynamically generate content into a single page (using ASP, JSP, PHP, etc.), instead of using FRAME tags. This will ultimately maintain the content owner's originally intended look and feel, as well as allow most search engines to properly index your content.
Avoid placing content and links in script code
Most search engines do not read any information found in SCRIPT tags within an HTML document. This means that content within script code will not be indexed, and hypertext links within script code will not be followed when crawling. When using a scripting language, make sure that your content and links are outside SCRIPT tags. Investigate HTML-based alternatives for dynamic web pages, such as HTML layers.
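The effect can be demonstrated with a simple link extractor built on Python's standard html.parser module, which, like many crawlers, treats the contents of a SCRIPT element as opaque text. The page and file names below are hypothetical, and this sketch does not reproduce the Google crawler itself.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the HTML markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# One link is plain HTML; the other is written out by script code.
page = """
<html><body>
<a href="/visible.html">Visible page</a>
<script>
document.write('<a href="/hidden.html">Hidden page</a>');
</script>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
collector.close()
found = collector.links
```

Only /visible.html is collected; the link generated inside the SCRIPT element is never seen as a link.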
Using Googleon/Googleoff tags
The GSA used for CP Search supports googleon and googleoff tags embedded in the HTML of crawled documents.
The googleoff/googleon tags disable the indexing of a part of a web page. The result is that those pages do not appear in search results when users search for the tagged word or phrase. For example, some customers use googleoff/googleon tags to keep a navigation bar in static HTML pages out of the index.
You can use googleoff/googleon to tell the Google Search Appliance to ignore portions of a page. Insert <!--googleoff: index--> at the point you want the Google Search Appliance to stop indexing, then insert <!--googleon: index--> where you want it to resume indexing the page.
You can also use the tags to avoid indexing anchor links leading to another web page.
You can use either of the following to prevent the words "chocolate pudding" from appearing in the snippets.
<!--googleoff: snippet--> chocolate pudding <!--googleon: snippet-->
<!--googleoff: all--> chocolate pudding <!--googleon: all-->
The googleoff/googleon tags accept four attributes: index, anchor, snippet, and all. Here's how they are used:
index
Words surrounded by the googleoff/googleon tags will not be indexed as occurring on the current page. A page containing:
fish <!--googleoff: index--> shark <!--googleon: index--> mackerel
has the terms "fish" and "mackerel" indexed for that page, but not "shark". The page could still be a search result for the search term "shark", however, since "shark" may occur elsewhere on the page or in anchor text for links to the page.
anchor
Anchor text surrounded by the googleoff/googleon tags and occurring in links to other pages will not be indexed as words associated with the linked-to pages. A page containing:
<!--googleoff: anchor--> <a href="linked_to_page.html"> shark </a> <!--googleon: anchor-->
will not cause the word "shark" to be associated with the page "linked_to_page.html". Without the tags, this hyperlink could cause the page "linked_to_page.html" to be a search result for the search term "shark".
snippet
The text surrounded by the googleoff/googleon tags will not be used to create snippets for search results.
<!--googleoff: snippet--> come to the fair! <!--googleon: snippet-->
all
Turns on all of the attributes. The text surrounded by the googleoff/googleon tags will not be indexed, associated with a linked-to page as anchor text, or used for a snippet.
<!--googleoff: all--> come to the fair! <!--googleon: all-->
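To preview roughly which words on a page remain indexable, you can strip the tagged regions yourself. The sketch below is a simplified approximation using a regular expression; it handles only the index and all attributes, and it is not the appliance's actual parsing logic.

```python
import re

# Matches a region from a googleoff tag to the matching googleon tag.
# Only "index" and "all" are handled, since both suppress indexing of words.
OFF_ON = re.compile(
    r"<!--\s*googleoff:\s*(index|all)\s*-->.*?<!--\s*googleon:\s*\1\s*-->",
    re.DOTALL | re.IGNORECASE,
)

def indexable_text(html):
    """Return the text left after removing googleoff/googleon regions."""
    return " ".join(OFF_ON.sub(" ", html).split())

page = "fish <!--googleoff: index--> shark <!--googleon: index--> mackerel"
terms = indexable_text(page).split()
```

For the fish/shark/mackerel page, terms contains "fish" and "mackerel" but not "shark".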
Robots.txt and the Google Search Appliance
Before crawling any URLs on www.myserver.com, the appliance will fetch http://www.myserver.com/robots.txt. The crawler will not fetch any other URLs on this host unless the web server responds to the request for /robots.txt with a 200 or 404 status code. If your /robots.txt file requires authentication, you must configure the appliance to provide correct credentials. The appliance always obeys the rules in /robots.txt, and it is not possible to override this feature.
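The status-code rule above can be summarized in a few lines of Python. The function name and the wording of the outcomes are illustrative assumptions; the only behavior taken from the text is that crawling proceeds on a 200 or 404 response and stops otherwise.

```python
def crawl_decision(robots_status):
    """Describe what happens to a host given the status of its /robots.txt request."""
    if robots_status == 200:
        return "crawl, obeying the rules in /robots.txt"
    if robots_status == 404:
        return "crawl, no /robots.txt rules to apply"
    return "do not crawl this host"

# A few representative responses, including an authentication failure (401)
# and a server error (500).
decisions = {code: crawl_decision(code) for code in (200, 404, 401, 500)}
```

A 401 response is the case to watch for when /robots.txt sits behind authentication and the appliance has not been given credentials.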
Each web server's robots.txt file is cached on the appliance and periodically re-crawled. If you change /robots.txt to exclude files that have already been indexed, the change takes effect only after the affected documents have been scheduled for recrawl and the remove doc ripper has run. You can use the Remove URLs feature if you want the URL removals to take effect immediately.
Robots Meta Tags and the Google Search Appliance
The crawler obeys the noindex, nofollow, and noarchive meta-tags. If you place these tags in the head of your HTML document, you can cause the appliance to not index, not follow, and/or not archive particular documents on your site. The tags to include and their effects are:
<META NAME="robots" CONTENT="noindex">
The crawler will retrieve the document, but it will not index the document. The document will count towards the license limit.
<META NAME="robots" CONTENT="nofollow">
The crawler will not follow any links that are present on the page to other documents. The document will count towards the license limit.
<META NAME="robots" CONTENT="noarchive">
The appliance maintains a cache of all the documents that it fetches, to permit users to access the content that is indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish to archive a document from your site, you can place this tag in the head of the document, and the appliance will not provide an archive copy for the document. The document will count towards the license limit.
You can also combine any or all of these tags into a single meta tag. For example:
<META NAME="robots" CONTENT="noarchive,nofollow">
Currently, it is not possible to set NAME="gsa-crawler" to specify some of these restrictions just for the appliance.
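To check which robots directives a page actually declares, you can read its meta tags with Python's standard html.parser module. This is a minimal sketch under the assumption that directives are comma-separated in the CONTENT attribute, as in the combined example above; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <META NAME="robots" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for value in attr.get("content", "").split(","):
                if value.strip():
                    self.directives.add(value.strip().lower())

def robots_directives(html):
    """Return the set of robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives

page = ('<html><head><META NAME="robots" CONTENT="noarchive,nofollow">'
        '</head><body>Hello</body></html>')
flags = robots_directives(page)
```

For this page, flags contains "noarchive" and "nofollow" but not "noindex", so the document would still be indexed.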
How To Prevent Indexing (Keeping an entire Web site or a folder or an individual page out of the Google Index)
To keep the Google Search Appliance from indexing an entire Web site or a folder, see the robots.txt section in the Web Authoring Resource Center for a complete discussion of the robots.txt exclusion method.
To keep the Google Search Appliance from indexing an individual page, see the META Tag Robot Control section in the Web Authoring Resource Center for a complete discussion of the Meta Tag exclusion method.
Other Resources from Google
The following resource provides additional information on getting the most out of the Google Search Appliance and your Web site:
- Google XML Reference (http://code.google.com/gsa_apis/xml_reference.html)