Understanding Generative AI Through Variational Autoencoders (VAEs)

For several years now, we have been witnessing rapid breakthroughs in artificial intelligence, or more precisely, generative artificial intelligence (GenAI). Many tools have emerged with enormous and constantly growing capabilities in various fields. GenAI can converse like a human, translate texts, write articles and legal pleadings, create program code, analyze data, generate forecasts, paint pictures, and in many tasks stand in for human intelligence. It does all this better with each passing month – sometimes surprisingly well. The question arises: how does it do it?

We will try to answer this question using the example of a relatively simple, easy-to-understand, and still widely used method: the Variational Autoencoder (VAE). Below is an illustration of how a VAE works – created, of course, by GenAI – and in the rest of the article, we will describe the principle of VAE operation step by step.

The illustration abstractly represents the architecture of an autoencoder with latent spaces, an encoder and decoder, and the main autoencoder in the center.

VAE illustration

The first step is to capture patterns and relationships between features of objects or phenomena. Things from the real world have many features. Importantly, these features are related. A good example of such a relationship is the proportions of the human body, beautifully presented in the famous drawing by Leonardo da Vinci.

Leonardo da Vinci's drawing 'The Vitruvian Man' on yellowed paper: a figure inscribed in a circle and a square, surrounded by handwritten notes.

Vitruvian Man, Leonardo da Vinci; photo by Luc Viatour / https://Lucnix.be

To put it simply, based on a person’s height, weight, age, and gender, we can accurately determine, for example, the length of their thumb (with some exceptions, of course). If a machine is to generate a realistic drawing of a human, it must consider such relationships.

Relationships between features are common: customer ratings of various products, stock prices of various companies, macroeconomic indicators, product sales, and parameters of a technological process are all related. In the simplest case, such a relationship is linear, as in the figure below.

A scatter plot showing a linear relationship between variables X and Y, illustrating the correlation of the data.

Linear relationship between variables

The graph above shows a strong relationship between two quantities – these could be, for example, pressure and temperature in a process. Looking at the graph, we can say that both variables are not needed to describe the process: measuring a single new variable along the orange axis would suffice, and what remains is random, irrelevant noise. The new variable is obtained by multiplying the original features by coefficients and adding them together. The coefficients are chosen to reproduce the slope of the orange line on the graph.


This is how principal component analysis (PCA) works: one of the most popular multivariate methods. For years, it has been used with good results in many fields, from the social sciences to batch-process control in the pharmaceutical industry. However, the world is not made of balls and springs, and to describe it realistically, we need non-linear relationships. This is where neural networks come to our aid, specifically the autoencoder.
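The projection onto the "orange axis" described above can be sketched in a few lines of NumPy. This is an illustration only: the data is synthetic, and the names (pressure/temperature) are just stand-ins for any two correlated quantities.

```python
import numpy as np

# Synthetic 2-D data with a strong linear relationship, like the
# pressure/temperature example above (the data here is made up).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.1, size=200)  # y follows x plus small noise
data = np.column_stack([x, y])

# PCA via SVD: center the data; the first right-singular vector gives the
# direction of the "orange axis".
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Projecting onto the first component yields one new variable that is a
# weighted sum of the original features (the coefficients are in vt[0]).
new_variable = centered @ vt[0]

# The first component captures almost all of the variance.
explained = singular_values**2 / np.sum(singular_values**2)
print(f"variance explained by first component: {explained[0]:.3f}")
```

What is left over after the projection (the second component) is the small, irrelevant noise mentioned above.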

The autoencoder scheme with 10 input neurons, 2 hidden, and 10 output neurons illustrates data compression and decompression.

Autoencoder scheme

The features of objects are provided as input to artificial neurons. In the figure above, we have 10 features feeding five neurons. Each neuron calculates its output using some general function; in particular, this may be the sum of the feature values multiplied by so-called weights, transformed by a non-linear function. The outputs calculated this way become the inputs of the next layer. In our case, these are two neurons that produce two new variables at their output, called latent (hidden) variables. This process compresses the data, or reduces the number of dimensions: we started with 10 features and now have only 2.

The right part of the network reverses this transformation: starting from the latent variables, it produces the autoencoder outputs, which should reproduce the values of the original features. The weights of all neurons are adjusted so that the reconstruction error is as small as possible – this is what training the network is all about. In practice, more complex architectures than the one presented above are used: encoders and decoders with many layers, and usually far more than 10 inputs.

An autoencoder has many applications – for example, detecting failures, fraud, or cyberattacks, or visualizing complex processes. It can also be used to generate new data that matches the pattern of the data on which it was trained. If we supply reasonable values for the latent variables and run them through the decoder, we should get new objects that fit the pattern, i.e., similar to those used in training, but not identical to any of them.
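A single forward pass through the 10-5-2-5-10 network in the figure can be sketched as below. The weights here are random and untrained – a real autoencoder would adjust them to minimize the reconstruction error – so this only illustrates the shapes and the flow of data, not a working model.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(inputs, weights, biases):
    # Weighted sum of the inputs, transformed by a non-linear function (tanh).
    return np.tanh(inputs @ weights + biases)

sizes = [10, 5, 2, 5, 10]  # encoder: 10 -> 5 -> 2, decoder: 2 -> 5 -> 10
weights = [rng.normal(scale=0.5, size=(a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

x = rng.normal(size=(1, 10))      # one object described by 10 features
h = x
activations = []
for w, b in zip(weights, biases):
    h = layer(h, w, b)
    activations.append(h)

latent = activations[1]           # the 2-dimensional latent representation
reconstruction = activations[-1]  # should approximate x after training
print(latent.shape, reconstruction.shape)
```

Training would repeatedly compare `reconstruction` with `x` and nudge every weight to shrink the difference; frameworks such as Keras (see the source below) do this automatically.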

In reality, this is done differently, using a very ingenious trick. Instead of producing specific latent values, the encoder finds the parameters of a normal distribution – mean and variance – for each latent variable. For the network in the figure above, we get two distributions. We then randomly draw values from the distributions found by the encoder and feed them into the decoder. Thanks to this, we can very easily generate new objects, similar to but not the same as the existing ones. In addition, this architecture is better in technical terms: it is easier to train, and the latent representation it learns is better behaved.
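This sampling step can be sketched as follows. The means and variances here are made-up stand-ins for what a trained encoder would produce, and the decoder is a random linear map, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters a trained encoder might output for the 2 latent variables
# (these numbers are invented for the example).
mu = np.array([0.3, -1.2])        # means of the two normal distributions
log_var = np.array([-0.5, 0.1])   # log-variances of the two distributions

# Draw a random latent sample: z = mu + sigma * eps, with eps ~ N(0, 1).
eps = rng.normal(size=2)
z = mu + np.exp(0.5 * log_var) * eps

# A stand-in decoder turns the 2 latent values into 10 feature values.
decoder_weights = rng.normal(size=(2, 10))
new_object = np.tanh(z @ decoder_weights)

print(z.shape, new_object.shape)
```

Each new draw of `eps` produces a slightly different `z`, and therefore a slightly different generated object – which is exactly how a VAE creates data that is similar to, but not identical with, its training examples.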

We call such a network a Variational Autoencoder (VAE). It has many variants, e.g., versions adapted to image processing that enable the creation of new, realistic images. The variational autoencoder is also good at detecting anomalies, because it lets us check why a given case was considered unusual.

An interesting application of the autoencoder is creating additional frames that improve the smoothness of an image (https://www.nvidia.com/en-us/geforce/news/dlss3-ai-powered-neural-graphics-innovations/; unfortunately, I was unable to find information on which variant of the autoencoder is used there).

The autoencoder demonstrates how generative AI operates. First, patterns are identified in the data. Then, the model is trained on data about real objects or events. Finally, the model is used to generate new objects. It is important to note that no step in the entire procedure verifies whether the result is reasonable. Complete solutions have many safeguards for correctness, but verifying the result remains the responsibility of the human using GenAI. Often, somewhere in small print, there is a note along the lines of: 'AI can make mistakes – take this into account!'

Source:
Chollet F., Deep Learning with Python, 2nd ed., Manning Publications, 2021.

Author: Tomasz Demski, Development Director at StatSoft
