Data on the internet is stored in a large number of files: ASCII files, binary files, or databases. Search engines vary according to the way the data is stored. If the data is stored in a database, the database itself can be used to build the search engine. For HTML files, graphics, and PDFs, the search engine is an additional program.
A search engine that does not hold a given piece of content itself has to obtain it from elsewhere. This data comes from a program that crawls many pages and reads their contents. Such a program is called a robot or a spider. It crawls the URLs specified by the search engine and records each new URL it finds. Google.com distinguishes between pages that have been crawled and pages that have not. For a page that has been crawled, the results page displays the page's title. For a page that has not been crawled, it displays the URL of the page.
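The crawling loop described above can be sketched roughly as follows. This is a minimal illustration, not how any particular search engine works: the names `LinkAndTitleParser`, `crawl`, and `SITE` are hypothetical, and the `fetch` argument stands in for a real HTTP client so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    queue = deque(seed_urls)
    seen = set(seed_urls)   # every URL discovered so far
    crawled = {}            # URL -> page title, for pages that were read
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue        # known URL, but its content was never read
        parser = LinkAndTitleParser()
        parser.feed(page)
        crawled[url] = parser.title.strip()
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:    # record each newly found URL once
                seen.add(link)
                queue.append(link)
    return crawled, seen - set(crawled)


# Hypothetical two-page site; /b.html is linked but cannot be fetched.
SITE = {
    "http://example.test/": '<title>Home</title><a href="/a.html">A</a><a href="/b.html">B</a>',
    "http://example.test/a.html": '<title>Page A</title><a href="/">Home</a>',
}
crawled, uncrawled = crawl(["http://example.test/"], SITE.get)
```

A results page built on this data could show the stored title for each entry in `crawled` and fall back to the bare URL for each entry in `uncrawled`.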
When a user searches, they are not actually searching the content itself. Instead, they are searching an index of the content that the spider has found. In a database-driven site, by contrast, the user performs a query on the content.
1. Simple data queries::
When data is stored in a database, simple queries are possible. A simple query is a call to the database made by a middleware program on the basis of user input. The query looks at a selected number of fields in the database. If it finds a match for the input, the database returns the information to the middleware program, which generates a useful HTML display of the content that was found.
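A simple query of this kind might look like the sketch below. It assumes an in-memory SQLite database with a hypothetical `articles` table; the function names `build_demo_db` and `simple_search` are illustrative only.

```python
import html
import sqlite3


def build_demo_db():
    """Create a small in-memory database to search (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles (title, body) VALUES (?, ?)",
        [
            ("Search engine basics", "How spiders crawl pages."),
            ("Cooking pasta", "Boil water and add salt."),
            ("Index design", "A search index maps words to pages."),
        ],
    )
    return conn


def simple_search(conn, user_input):
    """Middleware step: query two selected fields, return an HTML list."""
    # Note: % and _ typed by the user act as LIKE wildcards in this sketch.
    pattern = f"%{user_input}%"
    rows = conn.execute(
        "SELECT title, body FROM articles "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY id",
        (pattern, pattern),  # bound parameters, never string concatenation
    ).fetchall()
    items = "".join(
        f"<li><b>{html.escape(title)}</b>: {html.escape(body)}</li>"
        for title, body in rows
    )
    return f"<ul>{items}</ul>"


conn = build_demo_db()
page = simple_search(conn, "search")
```

The user input is passed as a bound parameter and the output is HTML-escaped, which are the two places where a middleware program most often goes wrong.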
For complex queries the database is indexed, so that the search runs against the index instead of the content. The index also supports noise word reduction, stemming, and lookup tables for content mapping.
2. Complex data queries::
Nielsen's recent summary of searching behavior suggests that users who are not successful with their first search will not improve their results on a second or third query (Nielsen). Since finding the correct piece of information quickly is important, complex queries are appropriate for keyword searching. They allow the user to ask that a series of conditions be met for their specific query.
3. Boolean searching::
Some search engines allow the user to specify conditions that the search results must meet. Boolean searching allows the user to specify groups of words that should or should not appear, and whether the search should be case sensitive. The operators AND and OR can be used to refine the search; these are the logical operators used in Boolean expressions.
Most search engines allow some form of Boolean searching. This may include syntax for case-sensitive searching; however, some databases store their information in case-insensitive field types.
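As an illustration, Boolean conditions can be evaluated with set operations over a word-to-document mapping. The sketch below is one possible approach, not the syntax of any real search engine; `DOCS`, `build_index`, and `boolean_search` are hypothetical names.

```python
DOCS = {
    1: "Spiders crawl web pages",
    2: "An index maps words to pages",
    3: "Spiders build the index",
}


def build_index(docs, case_sensitive=False):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            key = word if case_sensitive else word.lower()
            index.setdefault(key, set()).add(doc_id)
    return index


def boolean_search(index, all_of=(), any_of=(), none_of=(), case_sensitive=False):
    """all_of terms are ANDed, any_of terms are ORed, none_of terms are excluded."""
    norm = (lambda w: w) if case_sensitive else str.lower
    result = set().union(*index.values())  # start from every known document
    for word in all_of:                    # AND: must contain every word
        result &= index.get(norm(word), set())
    if any_of:                             # OR: must contain at least one word
        result &= set().union(*(index.get(norm(w), set()) for w in any_of))
    for word in none_of:                   # NOT: must contain none of the words
        result -= index.get(norm(word), set())
    return sorted(result)


index = build_index(DOCS)
case_index = build_index(DOCS, case_sensitive=True)
```

For example, `boolean_search(index, all_of=["spiders", "index"])` returns `[3]`, while the same lowercase query against `case_index` with `case_sensitive=True` returns nothing, because the stored word is "Spiders".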
4. Pre-processed data::
In most search engines, the data that the user searches is not the actual pages of information but a dataset of information about what is contained in the pages. This dataset is called an index. The original content is held in a database, and the index is a second dataset.
Content indexing creates a document index that contains information about where each word is found on each page. The user performs a search on this index. The results page then translates the information found in the document index back into the information that is on the actual pages.
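A document index of this kind can be sketched as a mapping from each word to the pages and word positions where it occurs. The example below is a simplified illustration; the tokenization rule and the names `build_document_index`, `lookup`, and `PAGES` are assumptions.

```python
import re
from collections import defaultdict

PAGES = {
    "/a": "The spider reads the page",
    "/b": "A page about the index",
}


def build_document_index(pages):
    """Return {word: {url: [positions]}}, with positions counted in words."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word][url].append(position)
    return index


def lookup(index, pages, word):
    """Search the index, then map each hit back to the actual page text."""
    hits = index.get(word.lower(), {})
    return [(url, positions, pages[url]) for url, positions in sorted(hits.items())]


index = build_document_index(PAGES)
results = lookup(index, PAGES, "Page")
```

The search itself touches only `index`; the original page text is consulted only at the end, when the results are prepared for display.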
5. Indexing content::
Databases are sometimes indexed to improve performance. In the same way, a search engine can be made faster by using an index. An index can also be used to strip noise words out of the content.
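The effect of a database index can be observed directly in SQLite, as in the sketch below. The table, the index name `idx_articles_title`, and the helper `query_plan` are hypothetical, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [(f"title {i}", f"body {i}") for i in range(1000)],
)


def query_plan(connection):
    """Return SQLite's description of how it would run the lookup."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT body FROM articles WHERE title = ?",
        ("title 500",),
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)   # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_articles_title ON articles (title)")
plan_after = query_plan(conn)    # with the index: a search using idx_articles_title
```

Comparing `plan_before` with `plan_after` shows the lookup changing from reading every row to consulting the index.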
6. Document index::
A document index is a special kind of content index. Most search engines use it to find responses for keywords. Information about the words in the documents allows the search engine's relevancy calculation to return the best results.
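The text does not say how relevancy is calculated, so the sketch below uses the simplest possible assumption: a page scores one point for every occurrence of a query word. Real search engines use far more elaborate formulas; `rank` and `PAGES` are hypothetical names.

```python
from collections import Counter

PAGES = {
    "/a": "index index spider",
    "/b": "spider page",
    "/c": "cooking pasta",
}


def rank(pages, query):
    """Order matching pages by how often the query words occur in them."""
    terms = query.lower().split()
    scores = {}
    for url, text in pages.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in terms)  # assumed scoring rule
        if score:
            scores[url] = score
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


ranking = rank(PAGES, "index spider")
```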
7. Noise words::
To save space and time, search engines strip certain words out of a query before it is run against the database. The words that are stripped are called noise words. Some databases, such as MySQL, have noise word rules built in. These general rules can be modified, and additional rules can be placed on specific datasets to give the best results. Noise words may be stripped out on the basis of a specific list of words or on the basis of word length.
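Both kinds of rule, a word list and a length limit, can be sketched in a few lines. The list `NOISE_WORDS` and the default `min_length` below are illustrative assumptions, not the rules used by MySQL or any other product.

```python
# Hypothetical noise word list; real systems ship much longer ones.
NOISE_WORDS = {"a", "an", "and", "of", "the", "to"}


def strip_noise_words(query, noise_words=NOISE_WORDS, min_length=3):
    """Drop words that are on the noise list or shorter than min_length."""
    return [
        word
        for word in query.lower().split()
        if word not in noise_words and len(word) >= min_length
    ]


kept = strip_noise_words("The history of an AI search engine")
```

Here `kept` is `["history", "search", "engine"]`: "the", "of", and "an" are removed by the list, and "ai" is removed by the length rule, which shows how a length rule can also discard meaningful short terms.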
Search results display
1. Number of records displayed::
Search results should give relevant information. They can be divided over a number of pages. Nielsen states that users almost never look beyond the second page of search results (Nielsen). Results that fall on later pages may therefore effectively be lost through pagination.
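Dividing results over pages amounts to slicing the result list. The following is a minimal sketch; the function name `paginate` and the default of ten records per page are assumptions.

```python
import math


def paginate(results, page, per_page=10):
    """Return the records for one page (1-based) and the total page count."""
    total_pages = max(1, math.ceil(len(results) / per_page))
    page = min(max(page, 1), total_pages)  # clamp out-of-range page numbers
    start = (page - 1) * per_page
    return results[start:start + per_page], total_pages


all_results = [f"result {i}" for i in range(1, 26)]  # 25 hypothetical hits
first_page, total = paginate(all_results, 1)
last_page, _ = paginate(all_results, 3)
```

With 25 hits and ten records per page, the last five results only appear on the third page, which by Nielsen's observation few users will reach.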
2. Suggesting new spellings::
Users sometimes make spelling mistakes. To improve the chance of a relevant result, a search engine can suggest alternate spellings. A spellchecker can provide a list of alternate spellings of a word, and a synonym list can present the user with an alternative word.
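One way to sketch both ideas is with `difflib.get_close_matches` from the Python standard library for spellings, plus a plain dictionary for synonyms. The vocabulary, the synonym list, and the function names are hypothetical; a real engine would draw its vocabulary from the index.

```python
import difflib

# Hypothetical data for illustration only.
VOCABULARY = ["search", "spider", "index", "engine"]
SYNONYMS = {"spider": ["robot", "crawler"]}


def suggest_spellings(word, vocabulary=VOCABULARY, limit=3):
    """Return known words that closely resemble a possibly misspelled word."""
    word = word.lower()
    if word in vocabulary:
        return []  # spelled correctly, nothing to suggest
    return difflib.get_close_matches(word, vocabulary, n=limit, cutoff=0.6)


def suggest_synonyms(word, synonyms=SYNONYMS):
    """Return alternative words from the synonym list, if any."""
    return synonyms.get(word.lower(), [])


spellings = suggest_spellings("serch")
alternatives = suggest_synonyms("Spider")
```

The `cutoff` value controls how similar a candidate must be before it is offered; 0.6 is simply the library default and would need tuning against real queries.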
3. Hit highlighting::
On the search results page, the words you searched for are sometimes highlighted in some way. Usually the word is shown in bold. This is called hit highlighting.
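Hit highlighting can be sketched with a regular expression that wraps each matched keyword in bold tags. This is a simplified illustration under the assumption that results are rendered as HTML; the function name `highlight` is hypothetical.

```python
import html
import re


def highlight(text, keywords):
    """Wrap whole-word, case-insensitive keyword matches in <b> tags."""
    escaped = html.escape(text)  # make the plain text safe to embed in HTML
    words = [re.escape(html.escape(k)) for k in keywords if k]
    if not words:
        return escaped
    # Limitation of this sketch: a keyword such as "amp" would also match
    # inside entities like "&amp;" that the escaping step produces.
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", escaped)


snippet = highlight(
    "Search engines index pages; a search uses the index.",
    ["search", "index"],
)
```

The word boundaries keep "engines" from being marked when the keyword is "engine", and the original capitalization of each hit is preserved.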
4. Returning results for each successful query::
Often a user will search for multiple keywords, but each page should be returned only once. Rather than displaying a page once for every keyword it matches, the search engine must know whether a page has already been tagged as containing a result. If so, it should not be tagged again.
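The tagging rule can be sketched with a set that records which pages have already been returned. The index contents and the name `search_keywords` are hypothetical.

```python
# Hypothetical keyword-to-pages index.
INDEX = {
    "spider": ["/crawl", "/robots"],
    "index": ["/robots", "/indexing"],
}


def search_keywords(index, keywords):
    """Return each matching page once, in the order it was first found."""
    tagged = set()   # pages already marked as containing a result
    results = []
    for keyword in keywords:
        for page in index.get(keyword.lower(), []):
            if page in tagged:
                continue  # already tagged by an earlier keyword
            tagged.add(page)
            results.append(page)
    return results


pages = search_keywords(INDEX, ["spider", "index"])
```

Here "/robots" matches both keywords but appears only once in `pages`.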