Researchers at HSE in St Petersburg Develop Superior Machine Learning Model for Determining Text Topics

The researchers also revealed the poor performance of neural networks on such tasks

Topic models are machine learning algorithms designed to analyse large text collections based on their topics. Scientists at HSE Campus in St Petersburg compared five topic models to determine which ones performed better. Two models, including GLDAW developed by the Laboratory for Social and Cognitive Informatics at HSE Campus in St Petersburg, made the fewest errors. The paper has been published in PeerJ Computer Science.

Determining the topic of a publication is usually not difficult for the human brain. For example, any editor can easily tag this article with science, artificial intelligence, and machine learning. However, sorting information manually is time-consuming, and this becomes a critical limitation when dealing with a large volume of data. A modern computer can perform the task much faster, but it requires solving a challenging problem: identifying the meaning of documents based on their content and categorising them accordingly.

This is achieved through topic modelling, a branch of machine learning that aims to categorise texts by topic. Topic modelling is used to facilitate information retrieval, analyse mass media, identify community topics in social networks, detect trends in scientific publications, and address various other tasks. For example, analysis of financial news makes it possible to predict trading volumes on the stock exchange quite accurately, as these are significantly influenced by politicians' statements and economic events.

Here's how working with topic models typically unfolds: the algorithm takes a collection of text documents as input. At the output, each document is assessed for its degree of belonging to specific topics. These assessments are based on the frequency of word usage and the relationships between words and sentences. Thus, words such as ‘scientists,’ ‘laboratory,’ ‘analysis,’ ‘investigated,’ and ‘algorithms’ found in this text categorise it under the topic of ‘science.’

However, many words can appear in texts covering various topics. The word ‘work,’ for example, is often used in texts about industrial production or the labour market, yet in the phrase ‘scientific work’ it marks the text as pertaining to ‘science.’ Such relationships, expressed mathematically through probability matrices, form the core of these algorithms.
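To make this input-output pattern concrete, here is a minimal sketch using a generic LDA implementation from scikit-learn on a few toy documents. The library, the example texts, and the hand-picked number of topics are our own illustrative assumptions; this is not one of the models compared in the study.

```python
# Minimal illustration of a topic model's input and output (generic LDA,
# not the ETM/GLDAW/GSM/WTM-GMM/W-LDA models compared in the study).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "scientists in the laboratory analysed the new algorithms",
    "the stock exchange reacted to statements about the labour market",
    "researchers investigated machine learning models for text analysis",
]

# Input: a document-word frequency matrix built from the collection.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a model with a hand-picked number of topics (here 2).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # output: each document's degree of belonging to each topic
topic_words = lda.components_       # word-topic weights; normalising rows gives the probability matrix

vocab = vectorizer.get_feature_names_out()
print(doc_topics.round(2))
for k, weights in enumerate(topic_words):
    top = [vocab[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")      # the most characteristic words of each topic
```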

Topic models can be enhanced by creating embeddings—fixed-length vectors that describe a specific entity based on various parameters. These embeddings serve as additional information acquired through training the model on millions of texts. 

Any phrase or text, such as this news item, can be represented as a sequence of numbers—a vector in a vector space. In machine learning, these numerical representations are referred to as embeddings. The idea is that in such a space, distances are easy to measure and similarities easy to detect, allowing comparisons between two or more texts. If the embeddings describing two texts are sufficiently similar, the texts likely belong to the same category or cluster—a specific topic.
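A toy sketch of this comparison is shown below. The three-dimensional vectors and their values are made up for the example; real embeddings have hundreds of dimensions and come from models trained on millions of texts.

```python
# Toy illustration of comparing texts through embedding vectors.
# The numbers are invented for the example; real embeddings are produced by a trained model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embeddings: values close to 1 suggest the same topic cluster."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_science_article = np.array([0.8, 0.1, 0.3])
emb_lab_report      = np.array([0.7, 0.2, 0.4])
emb_market_news     = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(emb_science_article, emb_lab_report))   # high: likely the same cluster
print(cosine_similarity(emb_science_article, emb_market_news))  # low: likely different topics
```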

Scientists at the HSE Laboratory for Social and Cognitive Informatics in St Petersburg examined five topic models—ETM, GLDAW, GSM, WTM-GMM, and W-LDA—which are based on different mathematical principles:

  • ETM is a model proposed by the prominent mathematician David M. Blei, who is one of the founders of the field of topic modelling in machine learning. His model is based on latent Dirichlet allocation and employs variational inference to calculate probability distributions, combined with embeddings.
  • Two models—GSM and WTM-GMM—are neural topic models.
  • W-LDA is based on Gibbs sampling and incorporates embeddings, but also uses latent Dirichlet allocation, similar to the Blei model.
  • GLDAW relies on a broader collection of embeddings to determine the association of words with topics.

For any topic model to perform effectively, it is crucial to determine the optimal number of categories or clusters into which the information should be divided. This is an additional challenge when tuning algorithms.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

Typically, a person does not know in advance how many topics are present in the information flow, so the task of determining the number of topics must be delegated to the machine. To accomplish this, we proposed measuring the amount of information as the inverse of chaos: if there is a lot of chaos, there is little information, and vice versa. This allows us to estimate the number of clusters, or in our case topics, associated with the dataset. We applied these principles in the GLDAW model.
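The general idea can be sketched in a few lines: fit a model for several candidate topic counts, measure the 'chaos' of the result with a Renyi-type entropy, and keep the count where that entropy is lowest. The entropy below (order 2, over the word-topic weights) and the generic scikit-learn LDA are simplifications of our own; the GLDAW work defines the quantity differently, but the selection principle is the same.

```python
# Conceptual sketch of choosing the number of topics by minimising entropy ('chaos').
# The entropy used here is a simplified stand-in for the measure used in the GLDAW work.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def renyi_entropy(p: np.ndarray, alpha: float = 2.0) -> float:
    """Renyi entropy of order alpha for a (normalised) probability vector."""
    p = p / p.sum()
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def choose_topic_count(X, candidates=(2, 5, 10, 20)):
    """Fit a model per candidate topic count and keep the one with the least 'chaos'."""
    scores = {}
    for t in candidates:
        lda = LatentDirichletAllocation(n_components=t, random_state=0).fit(X)
        scores[t] = renyi_entropy(lda.components_.ravel())  # word-topic weights as one distribution
    return min(scores, key=scores.get), scores

# X is a document-word matrix, e.g. the one built with CountVectorizer in the earlier sketch.
```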

The researchers evaluated the models for stability (the number of errors), coherence (how well the words within a topic relate to one another), and Renyi entropy (the degree of chaos). The algorithms' performance was tested on three datasets: materials from the Russian-language news resource Lenta.ru and two English-language datasets, 20 Newsgroups and WoS. These sources were chosen because all of their texts had originally been assigned tags, which made it possible to evaluate how well the algorithms identified the topics.
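Coherence, for instance, can be estimated with off-the-shelf tools. The sketch below uses gensim's CoherenceModel on a tiny tokenised corpus; the library, the toy documents, and the 'u_mass' measure are illustrative choices of ours and do not reproduce the exact evaluation in the paper.

```python
# Rough illustration of scoring a topic model's coherence with gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [
    ["scientists", "laboratory", "analysis", "algorithms"],
    ["stock", "exchange", "trading", "volumes"],
    ["machine", "learning", "models", "text"],
    ["politicians", "statements", "economic", "events"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Higher coherence means the top words of each topic fit together better.
coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                           coherence="u_mass").get_coherence()
print(coherence)
```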

The experiment showed that ETM outperformed other models in terms of coherence on the Lenta.ru and 20 Newsgroups datasets, while GLDAW ranked first for the WoS dataset. Additionally, GLDAW exhibited the highest stability among the tested models, effectively determined the optimal number of topics, and performed well on shorter texts typical of social networks.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

We improved the GLDAW algorithm by incorporating a large collection of external embeddings derived from millions of documents. This enhancement enabled more accurate determination of semantic coherence between words and, consequently, more precise grouping of texts.

GSM, WTM-GMM and W-LDA demonstrated lower performance than ETM and GLDAW across all three measures. This finding surprised the researchers, as neural network models are generally considered superior to other types of models in many aspects of machine learning. The scientists have yet to determine the reasons for their poor performance in topic modelling.
