Article 6. The Classification of Search Engine Spam

Abstract

This document has been written to allow search engine marketers and other industry professionals to objectively evaluate actions to see whether those actions equate to spamming a search engine. It is hoped that quality search engines, ethical marketers and search industry professionals will agree that this document lays out standards which the industry should strive for.

With standards come definitions. Often in the search industry, the same terms are used by different people to mean different things. These different meanings can cause confusion and give spammers refuge. An objective of this paper, then, is to place absolute definitions on some important terminology.

The following terms are defined:

Search engine

Relevancy

Search Engine Spam

Not Search Engine Spam

Content spam

Meta spam

Link farm

Link content spam

Link meta spam

Agent-Based Spam

IP Cloaking

The first term we will define is "Search Engine". Generally, a search engine is any program that searches a database and produces a list of results. Working at such an abstract level would limit us to a very theoretical, general discussion. Therefore, for the purposes of this document, we apply a narrower definition of "search engine", as follows:

Search engine

A system that uses automated techniques, such as robots (a.k.a. spiders) and indexers, to create indexes of the Web, allows those indexes to be searched according to certain search criteria, and delivers a set of results ordered by relevancy to those search criteria. Examples of such search engines are AltaVista, Fast, Google and Inktomi. (Fast and Inktomi deliver their results solely through partners such as Lycos and MSN.)

The next term that needs defining is relevancy. Because this document is attempting to classify spam, and spam and relevancy are intertwined, it is essential that we define relevancy in an objective way. That is not to say that relevancy is objective. Far from it. Relevancy is extremely subjective. Every search engine uses its own algorithm to calculate relevancy. Therefore, we define relevancy as follows:

Relevancy

The search engine's measure of how well a particular resource matches the input search criteria. Each search engine measures relevancy using its own algorithm. Therefore, given the same set of resources and the same input search criteria, each search engine will produce a different set of results. This is because the results are ordered by relevancy, and each search engine calculates relevancy differently.
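The point that two engines can order the same resources differently can be illustrated with a toy Python sketch. Both scoring functions, the resource names and the sample texts are hypothetical and chosen only for illustration; neither function is any real search engine's algorithm.

```python
# Toy illustration only: neither function is any real engine's algorithm.

def relevancy_a(query, text):
    """Hypothetical engine A: raw count of query-word occurrences."""
    words = text.lower().split()
    terms = set(query.lower().split())
    return sum(1 for word in words if word in terms)


def relevancy_b(query, text):
    """Hypothetical engine B: the same count, normalised by document length."""
    words = text.lower().split()
    if not words:
        return 0.0
    return relevancy_a(query, text) / len(words)


def rank(relevancy, query, resources):
    """Return resource names ordered by descending relevancy."""
    return sorted(
        resources,
        key=lambda name: relevancy(query, resources[name]),
        reverse=True,
    )


# Hypothetical resources: the same set is given to both engines.
resources = {
    "long_page": (
        "spam filters block spam but most of this long page is about "
        "general email hygiene and archiving"
    ),
    "short_page": "search engine spam",
}

print(rank(relevancy_a, "spam", resources))  # engine A favours the raw count
print(rank(relevancy_b, "spam", resources))  # engine B favours the proportion
```

Given the same resources and the same query, engine A places long_page first while engine B places short_page first, because each measures relevancy in its own way.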

It should be clear that the algorithms that calculate relevancy are the lifeblood of search engines. Those search engines that deliver the most relevant results to the markets they have chosen to focus upon should be the most successful search engines in those markets.

So, what is search engine spam? We define it as follows:

Search Engine Spam

Any attempt to deceive a search engine's relevancy algorithm.

And what isn't spam?

Not Search Engine Spam

Anything that would still be done if search engines did not exist, or anything that a search engine has given written permission to do.

The remainder of this document assumes that the search engine has not given written permission. It elaborates upon the meaning of the previous two definitions and places them in a context that should be acceptable to all industry professionals.

In attempting to classify spam, we considered many different instances of spam and architectures for delivering spam. We gradually came to realise that there are only two types of search engine spam:

Content Spam

Data within a part of a Web resource designed for humans (e.g. the <body> of an HTML document) where that data is designed only for search engines to see.

Meta Spam

Data within a Web resource that describes that resource or another Web resource inaccurately or (when the data should be readable by humans) incoherently.

There are only two types of search engine spam because search engine algorithms use only two basic kinds of factor to calculate relevancy: on-the-page factors and off-the-page factors. An example of an on-the-page factor is keyword density, i.e. how early and how often the keywords (the words searched for) appear in the body copy of a page. An example of an off-the-page factor is link popularity, i.e. how many other pages on the Web link to a particular page. In fact, depending on the link popularity algorithm, link popularity can be spammed with either content spam or meta spam. This will be described in more detail later.
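The two example factors can be made concrete with a minimal Python sketch. It assumes a simple word tokeniser, measures keyword density as the fraction of body words equal to the keyword, and counts link popularity as the number of distinct other pages linking to a page. All function names, the sample text and the link list are hypothetical; real engines compute these factors in their own, undisclosed ways.

```python
import re


def tokenize(text):
    """Split text into lowercase words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(keyword, body):
    """On-the-page: fraction of body words equal to the keyword ("how often")."""
    words = tokenize(body)
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def first_position(keyword, body):
    """On-the-page: index of the first occurrence ("how early"), or None."""
    words = tokenize(body)
    key = keyword.lower()
    return words.index(key) if key in words else None


def link_popularity(links):
    """Off-the-page: number of distinct *other* pages linking to each page.

    `links` is an iterable of (source, target) pairs. Self-links and
    repeated links from the same source are ignored (an assumption).
    """
    inbound = {}
    for source, target in links:
        if source != target:
            inbound.setdefault(target, set()).add(source)
    return {page: len(sources) for page, sources in inbound.items()}


# Hypothetical sample data.
body = "Spam is unwanted. Fighting spam takes effort."
links = [
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
    ("a.html", "c.html"),  # duplicate link, counted once
]

print(keyword_density("spam", body))  # 2 of the 7 words
print(first_position("spam", body))   # first word of the body
print(link_popularity(links))
```

The first two functions look only at the page itself, while the third looks only at other pages, which is the distinction between on-the-page and off-the-page factors drawn above.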

Content spam

First of all, we should consider why content spam is possible. It is possible because the same URL can deliver different content (or the same content displayed in different ways) to different visitors to that URL. Even the simplest versions of HTTP and HTML support this, and therefore offer the opportunity to deliver spam. For example, IMG support and ALT text within HTML means that image-enabled visitors to a URL will see different content to those visitors that, for various reasons, cannot view images. Whether the ability to deliver spam results in the delivery of spam is largely a matter of knowledge and ethics.

This document is not designed to provide exhaustive examples of spam. To do so would be counterproductive, as it could become a reference source for those who wish to spam. Suffice it to say that the following techniques are among those that may be subverted to deliver content spam: tiny text, invisible text, noframes text, noscript text, alt text, longdesc text.

It is extremely important to note that none of the above techniques were designed to deliver spam. Therefore, the use of the technique does not imply that spamming is taking place. So, how can we determine whether the use of the technique constitutes spam? It is relatively simple - apply this test:

Suppose search engines did not exist. Would the technique still be used in the same way?

If the answer to the above question is no, then clearly the content is designed only for search engines to see. Therefore it is spam. If you are a search engine marketer or search engine optimization (SEO) specialist, don't panic at this statement. Consider what it really means.

Take, as an example, ALT text. Why was the ALT attribute invented? Not to deliver spam, but to provide a readable version of the page to browsers without graphical capabilities. These include phones, PDAs and screen readers for the visually impaired. This last example is especially important as disability legislation in many countries (e.g. USA, UK, Australia) requires that content is accessible to all. Stuffing the ALT text of clear pixels with lists of keywords is a common SEO technique. Consider a page that includes an image clear.gif, a 1x1 transparent pixel, whose ALT text repeats the word "spam" in an attempt to rank higher for that word.

This turns a page into meaningless garbage when it is read out loud or displayed on a non-graphical browser.
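The effect can be approximated in a short Python sketch that collects what a non-graphical browser or screen reader would present: the text of the page plus the ALT text of its images. The two HTML fragments are illustrative assumptions built from the description above (a 1x1 clear.gif and the keyword "spam"), and the class and function names are hypothetical. The sketch uses only the standard library's html.parser module.

```python
from html.parser import HTMLParser


class TextOnlyRenderer(HTMLParser):
    """Collects text nodes plus the ALT text of images, in document order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A text-only user agent substitutes the ALT text for the image.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def render_text_only(html):
    """Return roughly what would be read aloud or shown without images."""
    renderer = TextOnlyRenderer()
    renderer.feed(html)
    renderer.close()
    return " ".join(renderer.parts)


# Illustrative markup (an assumption): a clear pixel stuffed with keywords.
STUFFED = (
    '<p>Welcome to our site.'
    '<img src="clear.gif" width="1" height="1" alt="spam spam spam spam">'
    'Read our guide.</p>'
)

# The same spacer image with empty ALT text, so it is skipped entirely.
CLEAN = (
    '<p>Welcome to our site.'
    '<img src="clear.gif" width="1" height="1" alt="">'
    'Read our guide.</p>'
)

print(render_text_only(STUFFED))
print(render_text_only(CLEAN))
```

In the stuffed version the keywords are interleaved with the real sentences, which is what a visitor relying on ALT text would encounter. In the clean version the purely decorative pixel contributes nothing.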

Tags and attributes that have been designed to improve access for the disabled, or for less capable platforms, are often subverted to deliver spam. Yet it is possible - and professionally essential - to use them in the manner for which they were invented. Consider the impact of doing so. The site is usable by more visitors, from more platforms. If marketing is your goal, then you are reaching a wider market. This is an ethically sound policy. It improves access for all and improves your overall marketing capability. At the same time, it does not deliver spam, which spoils a search engine's ability to calculate relevancy and makes a page meaningless to visitors with lower capabilities.
