What is an SEO information fingerprint and how is the duplication of website pages calculated?



What is an SEO information fingerprint? On the path of SEO, we often face the challenge of writing original articles. Many people think that paragraphs extracted from various articles on the Internet can simply be combined to produce completely new, original content. In reality it is not that simple, especially as search engine algorithms keep improving: search engines have many strategies for dealing with this kind of content, such as calculating an information fingerprint of the document.

So, what is an SEO information fingerprint, and how is the duplication of website pages calculated?


Keywords: search engine, web page duplication, algorithm, information fingerprint, keywords

Search engines usually detect duplicate web pages based on the following idea: for every web page, a set of information fingerprints (Fingerprint) is calculated. If two web pages share a certain number of identical information fingerprints, their content is considered to overlap heavily, i.e. the two pages are duplicates of each other.

Search engines use different methods to evaluate content duplication, and these methods mainly differ in two respects:

  • Algorithm for calculating the information fingerprint (Fingerprint);
  • Parameter for judging the similarity of information fingerprints.

Before describing a specific algorithm, let's clarify two points:

  1. What is an information fingerprint? An information fingerprint is built by extracting certain information from the text of a web page — keywords, words, sentences, or paragraphs together with their weights on the page — and hashing it, for example with MD5, to form a string (a minimal sketch follows this list). Information fingerprints are similar to human fingerprints: if the content is different, the information fingerprints are different.
  2. The information extracted by the algorithm does not cover the entire web page, but only the text that remains after filtering out the common parts of the site, such as the navigation bar, logo, copyright, and other elements (the so-called "noise" of the site or page).
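A minimal sketch of this idea in Python, assuming the "noise" filter is a simple tag-stripping step and the fingerprint is the MD5 hash of the remaining text (real engines work with extracted keywords, sentences, or paragraphs and their weights rather than the raw text):

```python
import hashlib
import re

def strip_noise(html_text: str) -> str:
    """Crude noise filter: drop HTML tags and collapse whitespace (an assumption)."""
    text = re.sub(r"<[^>]+>", " ", html_text)
    return re.sub(r"\s+", " ", text).strip()

def fingerprint(text: str) -> str:
    """The information fingerprint here is simply the MD5 hash of the cleaned text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

page = "<nav>Menu</nav><p>Original article body goes here.</p>"
print(fingerprint(strip_noise(page)))
```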

Segment signature algorithm

This algorithm cuts a web page into N segments according to certain rules, signs each segment, and forms an information fingerprint for each segment. If M of these N fingerprints match between two pages (M is a system-defined threshold), the two pages are considered duplicates.

This algorithm works well for small-scale duplicate detection, but for a search engine on the scale of Google the computational cost is quite high.
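Here is a hedged sketch of the segment signature idea: the text is cut into equal-length segments, each segment is hashed with MD5, and two pages are treated as duplicates when at least M segment fingerprints coincide. The segmentation rule and the default values of N and M are assumptions; a real system chooses both according to its own rules.

```python
import hashlib

def segment_fingerprints(text: str, n_segments: int = 8) -> set[str]:
    """Cut the text into roughly equal segments and MD5 each one."""
    size = max(1, len(text) // n_segments)
    segments = [text[i:i + size] for i in range(0, len(text), size)]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in segments}

def is_duplicate(text_a: str, text_b: str, m_threshold: int = 6) -> bool:
    """Two pages are duplicates if at least m_threshold segment fingerprints match."""
    shared = segment_fingerprints(text_a) & segment_fingerprints(text_b)
    return len(shared) >= m_threshold
```

Note that cutting by character position makes this sketch sensitive to small insertions that shift every later segment; real implementations segment by more robust rules.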

Keyword-based duplicate page detection algorithm

When search engines such as Google crawl a web page, they record the following information about it:

  1. The keywords found on the page (extracted with word-segmentation technology) and the weight of each keyword (its keyword density);
  2. The meta description, or 512 bytes of valid text, for each web page.

Regarding the second point, search engines differ here: Google, for example, pulls the meta description when the page does not have 512 bytes of text associated with the query keyword.
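A small sketch of that fallback logic, assuming "valid text" means the cleaned body text and the cut is made at 512 bytes of UTF-8; the exact rule a real engine applies is not public, this merely mirrors the description above:

```python
def summary_for_fingerprint(body_text: str, meta_description: str) -> str:
    """Des(P): 512 bytes of valid body text, or the meta description when there is not enough."""
    encoded = body_text.encode("utf-8")
    if len(encoded) >= 512:
        return encoded[:512].decode("utf-8", errors="ignore")
    return meta_description.strip() or body_text
```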

In the algorithm description that follows, we define several information fingerprint variables:

Pi represents the i-th web page;

The N keywords with the highest weight on the web page make up the set Ti = {t1, t2, …, tN}, and the corresponding weights are Wi = {w1, w2, …, wN};

The summary information is denoted Des(Pi), the string formed by concatenating the first N keywords (in weight order) is denoted Con(Ti), and the string formed by sorting those N keywords is denoted Sort(Ti).

Each of these values is then hashed with the MD5 function to obtain its information fingerprint.

There are five variants of the keyword-based page duplication check (a combined sketch follows the list):

  1. MD5(Des(Pi)) = MD5(Des(Pj)): the summary information is exactly the same, so web pages i and j are considered duplicates;
  2. MD5(Con(Ti)) = MD5(Con(Tj)): the first N keywords and their weight ordering are the same for the two pages, so they are considered duplicates;
  3. MD5(Sort(Ti)) = MD5(Sort(Tj)): the first N keywords of the two pages are the same, although their weights may differ, and they are also considered duplicates;
  4. MD5(Con(Ti)) = MD5(Con(Tj)), and the sum of the squared weight differences divided by the sum of the squared weights, sum((wik − wjk)²) / sum(wik² + wjk²), is less than a certain threshold a: the two pages are considered duplicates;
  5. MD5(Sort(Ti)) = MD5(Sort(Tj)), and the same weight-difference ratio is less than the threshold a: the two pages are considered duplicates.
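The following self-contained sketch puts the five conditions together. It assumes Des(P) is the meta description, Con(T) joins the top-N keywords in descending weight order, Sort(T) joins them alphabetically, keyword weights are plain term frequencies, and the threshold a is set arbitrarily; none of these choices comes from an actual search engine implementation.

```python
import hashlib
import re
from collections import Counter

N = 10             # number of top keywords considered
A_THRESHOLD = 0.1  # threshold "a" in conditions 4 and 5 (assumed value)

def md5(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

def keyword_weights(text: str) -> dict:
    """Top-N words and their relative frequencies (a stand-in for real keyword weights)."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words).most_common(N)
    total = sum(c for _, c in counts) or 1
    return {w: c / total for w, c in counts}

def weight_distance(wi: dict, wj: dict) -> float:
    """sum((w_i - w_j)^2) / sum(w_i^2 + w_j^2) over the combined keyword set."""
    keys = set(wi) | set(wj)
    num = sum((wi.get(k, 0.0) - wj.get(k, 0.0)) ** 2 for k in keys)
    den = sum(wi.get(k, 0.0) ** 2 + wj.get(k, 0.0) ** 2 for k in keys) or 1.0
    return num / den

def duplicate_checks(desc_i: str, text_i: str, desc_j: str, text_j: str) -> dict:
    """Return which of the five duplicate conditions hold for pages i and j."""
    kw_i, kw_j = keyword_weights(text_i), keyword_weights(text_j)
    con_i = " ".join(sorted(kw_i, key=kw_i.get, reverse=True))       # Con(Ti)
    con_j = " ".join(sorted(kw_j, key=kw_j.get, reverse=True))       # Con(Tj)
    sort_i, sort_j = " ".join(sorted(kw_i)), " ".join(sorted(kw_j))  # Sort(Ti), Sort(Tj)
    close = weight_distance(kw_i, kw_j) < A_THRESHOLD
    return {
        1: md5(desc_i) == md5(desc_j),
        2: md5(con_i) == md5(con_j),
        3: md5(sort_i) == md5(sort_j),
        4: md5(con_i) == md5(con_j) and close,
        5: md5(sort_i) == md5(sort_j) and close,
    }
```

Variant 4 only adds the weight test on top of variant 2 (and variant 5 on top of variant 3), so in practice an engine would pick the variant or combination that matches the speed/accuracy balance discussed below.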

As for the threshold a in variants 4 and 5: even with the earlier conditions, many web pages would still be flagged as duplicates by mistake, so search engines tune this factor according to the weight distribution to keep such false positives down.

Of course, the more of these checks you apply, the more accurate the judgment, but the slower the computation. A balance therefore has to be struck between computation speed and deduplication accuracy. According to the results of the Skynet tests, about 10 keywords is the most suitable number.

P.S.

The above, of course, does not cover every aspect of how search engines detect large-scale copying of web pages; they undoubtedly apply additional auxiliary judgments alongside information fingerprints.
