Autoencoders – Part 2

Dawson Metzger-Fleetwood


8 min read
Visualization of autoencoder architecture

Identifying Outliers

# Reconstruction error for each training image: the sum of squared pixel differences
reconstruction_errors = np.sum(np.square(x_train - decoded_training_imgs), axis=1)
# Sort indices from smallest to largest error; the last index is the worst reconstruction
sorted_indices = np.argsort(reconstruction_errors)
# The image the autoencoder reconstructs worst is a likely outlier
plt.imshow(x_train[sorted_indices[-1]].reshape(28, 28), cmap='gray')
plt.show()
A grayscale image of a pixelated shoe
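Reconstruction error can also drive automatic outlier flagging rather than just visual inspection. Here is a minimal sketch using synthetic data in place of real reconstructions; the 3-standard-deviation threshold is an assumption introduced for illustration, not part of the original code:

```python
import numpy as np

# Fake "reconstructions": small noise around the inputs,
# plus one deliberately bad reconstruction to act as an outlier.
rng = np.random.default_rng(0)
x = rng.random((100, 784)).astype("float32")
reconstructed = x + rng.normal(0, 0.01, x.shape)
reconstructed[7] = 0.0  # image 7 reconstructs poorly

errors = np.sum(np.square(x - reconstructed), axis=1)

# Flag anything more than 3 standard deviations above the mean error
threshold = errors.mean() + 3 * errors.std()
outlier_indices = np.flatnonzero(errors > threshold)
print(outlier_indices)  # → [7]
```

With a trained autoencoder, `reconstructed` would be the decoder's output on `x`; everything else stays the same.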

Variational Autoencoders

latent_space_sample = np.random.rand(8,1) # random sample from the latent space
latent_space_sample = latent_space_sample.reshape(1,8).astype('float32') # reshape the sample
latent_space_sample_decoded = decoder.predict(latent_space_sample) # decode the sample
display([(latent_space_sample_decoded,"")]) # display the sample
generated_data = []
generated_data.append((latent_space_sample_decoded, "")) # run this line once for each generated sample you want to include
display(generated_data, padding=0.2, figsize=(16, 4))
An image of five grayscale numbers next to each other
While some examples looked amazing, I found that only about 25–33% were this high quality. The rest looked like combinations of digits (e.g. a mix between an 8 and a 3, a 4 and a 9, or a 0 and a 6). To solve this problem and build a more robust handwritten digit generator, we would need a better way of selecting points from the latent space to feed into the decoder.
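One such improvement is to match the sampling distribution to the prior the VAE was trained against: a VAE's latent codes are regularized toward a standard normal distribution, so drawing z ~ N(0, I) instead of uniform noise tends to land on regions the decoder has actually learned to cover. A quick sketch (the `decoder` call is commented out and refers to the model trained above):

```python
import numpy as np

rng = np.random.default_rng(42)
latent_dim = 8

# Sample from the standard normal prior the VAE was trained to match,
# rather than from a uniform distribution
z = rng.standard_normal((1, latent_dim)).astype("float32")
# digit = decoder.predict(z)  # decode exactly as before
print(z.shape)  # → (1, 8)
```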

Data Denoising

image_index = 3
original_image = x_test[image_index]
# Add Gaussian noise to the original image
sigma = 0.2
noise = np.random.normal(0, sigma, original_image.shape)
noisy_image = original_image + noise
noisy_image = np.clip(noisy_image, 0, 1) # Clip the values to be between 0 and 1 in case the Gaussian noise caused pixel values to be out of bounds
noisy_image = noisy_image.reshape(1,784).astype('float32')
encoded_noisy_image = encoder.predict(noisy_image)
decoded_noisy_image = decoder.predict(encoded_noisy_image)
images = [
   (original_image, "Original Image"),
   (noisy_image, "Noisy Image"),
   (decoded_noisy_image, "Decoded Image")
]
display(images)

Three pixelated images next to each other with captions
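The noise-and-clip step above can be wrapped into a small reusable helper. This is a sketch; the function name `add_gaussian_noise` is introduced here and is not part of the original code:

```python
import numpy as np

def add_gaussian_noise(image, sigma=0.2, seed=None):
    """Add zero-mean Gaussian noise, then clip back into the valid [0, 1] range."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((28, 28), 0.5)
noisy = add_gaussian_noise(clean, sigma=0.2, seed=0)
print(noisy.min() >= 0.0 and noisy.max() <= 1.0)  # → True
```

Keeping the clipping inside the helper guarantees every corrupted image stays a valid input for the encoder, no matter how large `sigma` is.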


The contrastive loss for the i-th image–text pair takes the following form:

L_i = -log( exp(similarity(I_i, T_i) / τ) / Σ_j exp(similarity(I_i, T_j) / τ) )
  • In this formula, “similarity” is a function that computes the similarity of an image–text pair, such as Euclidean distance, cosine similarity, or a dot product.
  • τ (tau) is the temperature parameter, which adjusts the scale of the similarity scores.
  • The denominator is what distinguishes the matching image–text pair from the mismatched pairs.
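The loss described above can be computed directly from a batch's similarity matrix. Below is a small NumPy sketch (not CLIP's actual implementation) where row i's matching text is assumed to be column i:

```python
import numpy as np

def contrastive_loss(sim_matrix, tau=0.07):
    """InfoNCE-style loss: sim_matrix[i, j] is similarity(image_i, text_j).
    The matching pair for row i is assumed to sit on the diagonal."""
    logits = sim_matrix / tau
    # Log-softmax over each row, evaluated at the matching (diagonal) entry
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# A perfectly aligned batch (identity similarity) gives a near-zero loss;
# a misaligned batch gives a much higher loss.
aligned = np.eye(3)
shuffled = np.eye(3)[[1, 2, 0]]
print(contrastive_loss(aligned) < contrastive_loss(shuffled))  # → True
```

Note how the temperature τ sharpens the softmax: with a small τ, even modest similarity gaps translate into confident matches.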
!pip install transformers
from PIL import Image
import requests
import matplotlib.pyplot as plt
from io import BytesIO
from transformers import CLIPProcessor, CLIPModel

class Result:
 def __init__(self, prediction, probs, image, text):
   self.prediction = prediction
   self.probs = probs
   self.image = image
   self.text = text

class myCLIP():
 def __init__(self):
   self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
   self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
   self.lastResult = None

 def forward(self, image, text):
   inputs = self.processor(text=text, images=image, return_tensors="pt", padding=True)
   outputs = self.model(**inputs)
   logits_per_image = outputs.logits_per_image  # this is the image/text similarity score
   probs = logits_per_image.softmax(dim=1)  # taking the softmax will give us the label probabilities
   prediction = text[probs.argmax(dim=1).item()]
   probs = probs.tolist()[0]
   self.lastResult = Result(prediction, probs, image, text)
   return prediction

 def displayResult(self, width_in_inches=5):

    # Display the image, preserving its aspect ratio
    image = self.lastResult.image
    aspect_ratio = image.size[1] / image.size[0]
    height_in_inches = width_in_inches * aspect_ratio
    plt.figure(figsize=(width_in_inches, height_in_inches))
    plt.imshow(image)
    plt.axis('off')
    plt.show()

    # Display a histogram of the text class probabilities
    plt.barh(self.lastResult.text, self.lastResult.probs)
    plt.show()

    print(f"Prediction: {self.lastResult.prediction}, {max(self.lastResult.probs) * 100:.2f}%")
# load the model
model = myCLIP()

# obtain an image
url = ""
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# these are our text classes
text = ["a photo of a cat", "a photo of a dog", "a photo of a beagle", "a photo of a hound"]

# run the model and then display the result
prediction = model.forward(image, text)
model.displayResult()
Output: the model predicts “a photo of a beagle”, followed by a histogram of the text-class probabilities.

Efficient Training

“We train a network that reduces the dimensionality of visual data. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.”
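Some quick arithmetic shows why training in a compressed latent space is so much cheaper. With hypothetical compression factors (the numbers below are illustrative, not from the Sora report), the number of positions the model must attend over shrinks dramatically:

```python
# Suppose a clip of 120 frames at 256x256 is compressed
# 4x temporally and 8x along each spatial dimension.
frames, height, width = 120, 256, 256
t_factor, s_factor = 4, 8

latent_shape = (frames // t_factor, height // s_factor, width // s_factor)
print(latent_shape)  # → (30, 32, 32)

# Overall reduction in the number of positions the model processes
compression = (frames * height * width) / (
    latent_shape[0] * latent_shape[1] * latent_shape[2]
)
print(compression)  # → 256.0
```

The decoder mentioned in the quote then maps these compact latents back up to full-resolution pixels after generation.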
