FAQ: The Nature of Search Engines

Various notes on the nature of search engines (SE), their abilities, and their limits.

Search Engines as Tools

  • A tool performs an action on a subject. Each tool has its appropriate subjects. All other subjects are inappropriate for that tool.
  • Search engines (SEs) are tools. They have the same nature (positive and negative, or abilities and limits) as any other tool. There are some things they can find. There are many things they can't find.
  • For example, a hammer can hammer a nail into wood. But a hammer can't hammer a wine glass into wood.
  • There is not, and never will be, a search engine that can find anything, just as there will never be a tool that can do everything.

The Tight-Knit Community (TKC) Problem

  • Google uses link analysis to evaluate the significance of a webpage.
  • Link analysis works well when the topic is clearly defined, there are significant articles about it, and it has an interconnected community.
  • But when a tight-knit community links heavily among its own pages, those links can mislead SEs into ranking one of its pages as significant, even though the page is irrelevant because the community is wrong about the topic.
  • For example, a page may be highly ranked for evolution, when in fact the page is part of a biblical creationist community.
  • As another example, a webpage can rank highly because a bunch of kids create links to it as a spoof. In one case, a group of kids linked the phrase "miserable failure" to Bush's webpage. A search for "miserable failure" then returned Bush's website as the first result.
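The TKC effect can be illustrated with a toy model. The sketch below uses a simplified PageRank-style power iteration as a stand-in for "link analysis"; it is not a description of how Google actually ranks pages. The page names (expert, clique_hub, member1, independent1, and so on), the damping value, and the link graph are all made up for illustration. Both target pages receive five inbound links, but the clique's links come from pages that the hub itself links back to.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scoring by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web (all names are illustrative assumptions).
links = {
    "expert": [],                                          # links to nobody
    "clique_hub": [f"member{i}" for i in range(1, 6)],     # links back to its members
}
for i in range(1, 6):
    links[f"independent{i}"] = ["expert"]    # five unrelated pages link to the expert
    links[f"member{i}"] = ["clique_hub"]     # five community pages link to the hub

rank = pagerank(links)
for page in ("expert", "clique_hub"):
    print(page, round(rank[page], 3))
```

In this toy graph the hub ends up with a clearly higher score than the expert page, although both have the same number of inbound links. The score keeps circulating between the hub and its members, which is the kind of feedback that a tight-knit community creates.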

The Topic Drift Problem (TDP)

  • If the topic is vague, there aren't good webpages about it, and there aren't interconnected communities that discuss the issue, SEs produce weak, wrong, or random results.
  • To avoid TDP, a webpage should have a clear theme, the theme should be familiar to SEs, and the webpage should be embedded within its community (with links to the webpage from significant members of the community).

Discover Already-Discovered Information

  • Search engines are good at finding what has already been discovered, identified, described, and summarized.
  • If others know about a topic, they understand the topic, and they convert that knowledge into written information, then search engines can find that knowledge.
  • However, search engines cannot discover new information. If something hasn't yet crystallized into an idea and there aren't articles, books, summaries, or discussions about it, then there is nothing there for search engines to find.
  • This means: Search engines are good for researching school homework. Search engines are poor for researching a graduate thesis. Search engines are useless for researching a doctoral thesis (which requires discovery of new information).

Search Engines Lose Information

  • SEs find information, but they also lose information. The results of a search are based on the SE's algorithm. If someone searches for a phrase, the algorithm matches the phrase against the SE's index of the web, ranks the matching pages, and returns the results.
  • But when the algorithm changes, the results change. A page that ranked well under the previous algorithm may rank poorly, or not appear at all, under the new one.
  • SE algorithms are updated frequently and without notice. If you searched in April and found a result, you may not find it again when you search in September. There is no way to use a previous algorithm. Users are not aware of this.
  • Commercial SEs are not interested in the reliability of results, the repeatability of results, or a comprehensive set of results. All of these factors are important for academic researchers.
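The following sketch shows how the same index can return different results once the ranking rule changes. Everything in it is hypothetical: the page names, the numbers, and the two ranking rules (term count for "April", inbound links for "September") are assumptions chosen to make the point, not the rules of any real SE.

```python
# Hypothetical index: page -> (times the query term appears, number of inbound links).
index = {
    "small-blog.example/post": (9, 2),
    "forum.example/thread": (5, 10),
    "big-portal.example/page": (2, 50),
    "news.example/story": (1, 30),
}


def rank_by_term_count(index, top_k):
    """Assumed 'April' rule: pages that mention the term most often come first."""
    return sorted(index, key=lambda page: index[page][0], reverse=True)[:top_k]


def rank_by_inbound_links(index, top_k):
    """Assumed 'September' rule: pages with the most inbound links come first."""
    return sorted(index, key=lambda page: index[page][1], reverse=True)[:top_k]


april = rank_by_term_count(index, top_k=2)
september = rank_by_inbound_links(index, top_k=2)

# Pages the user saw in April but can no longer see in September.
lost = [page for page in april if page not in september]

print("April:    ", april)
print("September:", september)
print("Lost:     ", lost)
```

The index itself never changed. The pages are still there, but a user who only sees the top results has no way to ask for the April ordering again.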

Nouns and Adjectives

  • SEs can work well if they get good input. The best search uses a noun with adjectives that qualify the noun.
  • For example, the noun is "cat". The adjectives qualify that noun by narrowing the set into smaller sets, such as "red cat" (not black cats, white cats, or calico cats). A yet smaller set would be "big red cat" (vs. small and medium red cats).
  • This happens to work well with English, which uses nouns and adjectives ("big red cats"). But if a language doesn't combine adjectives and nouns in this way, SEs may return poor results.
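The narrowing effect of adjectives can be sketched as a filter over a set. The list of cats, the field names (color, size), and the search helper below are all made up for illustration. Real SEs match text, not tidy records, so this only shows the set logic.

```python
# Hypothetical collection of cats; every entry is invented for this example.
cats = [
    {"name": "A", "color": "red", "size": "big"},
    {"name": "B", "color": "red", "size": "small"},
    {"name": "C", "color": "red", "size": "medium"},
    {"name": "D", "color": "black", "size": "big"},
    {"name": "E", "color": "white", "size": "small"},
]


def search(items, **qualifiers):
    """Return the items that match every qualifier (each one acts like an adjective)."""
    return [
        item for item in items
        if all(item.get(key) == value for key, value in qualifiers.items())
    ]


all_cats = search(cats)                               # "cat"
red_cats = search(cats, color="red")                  # "red cat"
big_red_cats = search(cats, color="red", size="big")  # "big red cat"

print(len(all_cats), len(red_cats), len(big_red_cats))
```

Each added qualifier can only keep the result set the same size or make it smaller, which is why a noun with well-chosen adjectives gives more precise results than the noun alone.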

Information Landscapes

  • Sparse information is a major challenge for SEs. SEs can find the easy stuff: you can search for "organic cat food" and so on. But it's very hard to find meaningful results for vague concepts such as "procedure enhancement".
  • How would you find information in areas where there is little information? How would you search for "something that few people know about"?
  • Information can be seen as a landscape: lots of related information appears as mountain ranges, with associated hills, valleys, cliffs, and so on. And there are deserts: vast areas where there is little information, or only scattered information.
  • See examples of information landscapes at cybergeography.org/atlas/info_landscapes.html
  • Google and other SEs use "themes" as a concept that clusters similar information together. The general concept "feline" is the cluster for house cats, tigers, and lions. Within the feline cluster, there is the house cats cluster. That holds tabby cats, Persian cats, calico cats, and so on.
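The nested clusters described above can be pictured as a small tree. The sketch below is only a picture of that idea using the examples from the text. The dictionary layout and the find_path helper are assumptions for illustration, not how any SE stores its themes.

```python
# Theme clusters from the text, written as a nested dictionary (an assumed layout).
themes = {
    "feline": {
        "house cat": {
            "tabby cat": {},
            "Persian cat": {},
            "calico cat": {},
        },
        "tiger": {},
        "lion": {},
    }
}


def find_path(tree, target, path=()):
    """Return the chain of clusters leading to target, or None if it is not in the tree."""
    for name, children in tree.items():
        current = path + (name,)
        if name == target:
            return current
        found = find_path(children, target, current)
        if found:
            return found
    return None


print(find_path(themes, "calico cat"))
print(find_path(themes, "procedure enhancement"))
```

A term that belongs to a cluster can be reached by walking down from the general concept. A vague term that belongs to no cluster returns nothing, which matches the "desert" areas of the information landscape.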



© 1994-2008 andreas.com