Bots vary widely in their goals, their share of Internet traffic, and the complexity of their software. Even so, identifying them comes down to a standard classification problem, and usually to its simplest case, binary classification, in which each object is assigned to one of two classes: humans or bots. There are many classification methods, and their effectiveness depends primarily on the information available about the object to be classified.

The following methods are worth mentioning:

  • k-nearest neighbors algorithm,
  • Bayesian networks,
  • logistic regression,
  • decision trees,
  • neural networks,
  • random forests.
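
As an illustration of the first method on the list, the following is a minimal sketch of a k-nearest neighbors classifier for the human/bot case. The two features (requests per minute and mean seconds between clicks), the sample values, and the function name knn_predict are assumptions made for this example only; they do not come from any real dataset.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort labeled samples by Euclidean distance to the query and keep k of them.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical sessions: (requests per minute, mean seconds between clicks).
sessions = [
    ((2.0, 8.5), "human"),
    ((3.0, 6.0), "human"),
    ((1.5, 12.0), "human"),
    ((60.0, 0.2), "bot"),
    ((45.0, 0.5), "bot"),
    ((80.0, 0.1), "bot"),
]

print(knn_predict(sessions, (50.0, 0.3)))  # all three nearest neighbors are bots
print(knn_predict(sessions, (2.5, 7.0)))   # all three nearest neighbors are humans
```

The features are left unscaled here for brevity. In practice, features measured in different units would normally be rescaled first, because otherwise the one with the largest range dominates the distance.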

Most of them are described, with examples, in the next chapter; for more detail, it is worth consulting the extensive literature on the subject. A practical treatment can be found, for example, in Charu C. Aggarwal’s book Data Classification: Algorithms and Applications.

Classification is a fundamental problem in machine learning, and in particular in supervised learning, one of its main branches. Supervised learning relies on a set of predefined features computed for labeled training data. A model whose parameters are tuned on the training set is expected to fit new, previously unseen cases (the test set) well too.
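
The relationship between the training set and the test set can be sketched with a very simple classifier: a single threshold on one feature (in effect, a decision tree of depth one). The feature (requests per minute), the hand-picked values, and the names fit_threshold and accuracy are assumptions for illustration, not a reference implementation.

```python
def fit_threshold(samples):
    """Choose the threshold that makes the fewest errors on `samples`.

    Each sample is (value, label), with label 1 = bot and 0 = human.
    The rule predicts "bot" when value > threshold.
    """
    values = sorted({value for value, _ in samples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(samples) + 1
    for threshold in candidates:
        errors = sum((value > threshold) != (label == 1) for value, label in samples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def accuracy(samples, threshold):
    """Fraction of samples that the threshold rule labels correctly."""
    correct = sum((value > threshold) == (label == 1) for value, label in samples)
    return correct / len(samples)


# Hypothetical labeled data: (requests per minute, label).
train = [(2, 0), (4, 0), (5, 0), (7, 0), (12, 0),
         (9, 1), (30, 1), (45, 1), (60, 1), (80, 1)]
test = [(3, 0), (6, 0), (10, 0), (25, 1), (50, 1), (70, 1)]

threshold = fit_threshold(train)           # tuned on the training set only
print("threshold:", threshold)
print("training accuracy:", accuracy(train, threshold))
print("test accuracy:", accuracy(test, threshold))
```

The threshold is chosen using the training samples alone, and the test samples are used only to measure how well it transfers. In this toy data the two accuracies differ slightly, which is the usual reason for keeping a separate test set.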

There is also unsupervised learning, of which clustering is a typical example. Here the algorithm finds regularities or common features in the data and groups the observations accordingly; unlike in supervised learning, the groups (clusters) are not defined in advance. The key to the classification problem is choosing an appropriate classifier, one that maximizes the chance that conclusions drawn from the training set will carry over to observations outside that set.
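
To show how clusters can emerge without predefined labels, here is a minimal sketch of k-means, one common clustering algorithm (the text above does not prescribe a particular one). The session features, the sample values, and the simplification of starting from the first k points are all assumptions of this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; returns (centroids, labels)."""
    # Simplification: start from the first k points instead of a random draw.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Hypothetical unlabeled sessions: (requests per minute, mean seconds between clicks).
points = [(1.0, 9.0), (2.0, 8.0), (1.5, 10.0),
          (50.0, 0.5), (55.0, 0.3), (60.0, 0.2)]

centroids, labels = kmeans(points, k=2)
print(labels)
```

The algorithm only returns group indices. Deciding that one group corresponds to bots and the other to humans still requires interpretation by an analyst or some labeled examples.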

Classification problems are common in the modern world. A few selected examples follow:

  • Customer scoring is calculated by a bank for people applying for a loan. The customer profile is analyzed by machine learning algorithms that assess credit risk. Applicants with a high risk score are considered bad borrowers and are unlikely to be granted a loan.
  • The list of videos suggested to a user on platforms such as Filmweb or Netflix is built from the user's rating and viewing history by classification algorithms, in this context more commonly known as recommendation systems.
  • The contract-renewal offers that customers receive from telephone operators before their current contract expires are the result of classification algorithms aimed at customer retention.