Autoencoders – Part 2

Dawson Metzger-Fleetwood | 8 min read
Visualization of autoencoder architecture


Identifying Outliers

reconstruction_errors = np.sum(np.square(x_train - decoded_training_imgs), axis=1)  # per-image squared reconstruction error
sorted_indices = np.argsort(reconstruction_errors)  # ascending, so the last index is the worst reconstruction
print(sorted_indices[-1])
plt.imshow(x_train[sorted_indices[-1]].reshape(28, 28), cmap='gray')  # the image the autoencoder struggled with most
plt.axis('off')
plt.show()
A grayscale image of a pixelated shoe


Variational Autoencoders

latent_space_sample = np.random.rand(1, 8).astype('float32')  # random sample from the 8-dimensional latent space
latent_space_sample_decoded = decoder.predict(latent_space_sample)  # decode the sample
display([(latent_space_sample_decoded, "")])  # display the sample

generated_data = []
generated_data.append((latent_space_sample_decoded, ""))  # run this line by itself for each generated sample you want to include
display(generated_data, padding=0.2, figsize=(16, 4))
An image of five grayscale numbers next to each other
While some examples looked amazing, I found that only about 25-33% were this high quality. The rest looked like combinations of digits (e.g. a mix between an 8 and a 3, a 4 and a 9, or a 0 and a 6). To solve this problem and build a more robust handwritten digit generator, we would need a better way of selecting a point from the latent space to feed into the decoder; one simple option is sketched below.
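Instead of decoding a uniform random vector, we can stay close to latent codes we already know decode well: encode two real digits and interpolate between their latent vectors. This sketch assumes the encoder, decoder, x_train, and display helper defined earlier; the choice of images and the number of interpolation steps are arbitrary.

z_start = encoder.predict(x_train[0:1])  # latent code of one real digit
z_end = encoder.predict(x_train[1:2])    # latent code of another
interpolated = []
for alpha in np.linspace(0, 1, 5):
    z = (1 - alpha) * z_start + alpha * z_end  # a point on the line between the two codes
    interpolated.append((decoder.predict(z), f"alpha={alpha:.2f}"))
display(interpolated, padding=0.2, figsize=(16, 4))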


Data Denoising

image_index = 3
original_image = x_test[image_index]
# Add Gaussian noise to the original image
sigma = 0.2
noise = np.random.normal(0, sigma, original_image.shape)
noisy_image = original_image + noise
noisy_image = np.clip(noisy_image, 0, 1) # Clip the values to be between 0 and 1 in case the Gaussian noise caused pixel values to be out of bounds
noisy_image = noisy_image.reshape(1, 784).astype('float32')
encoded_noisy_image = encoder.predict(noisy_image)  # compress the noisy image to its latent code
decoded_noisy_image = decoder.predict(encoded_noisy_image)  # reconstruct the image from its latent code
images = [
    (original_image, "Original Image"),
    (noisy_image, "Noisy Image"),
    (decoded_noisy_image, "Decoded Image")
]

display(images)
Three pixelated images next to each other with captions


CLIP

For a batch of N image-text pairs, the contrastive loss for the i-th pair is:

$$\mathcal{L}_i = -\log \frac{\exp\left(\text{similarity}(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\text{similarity}(I_i, T_j)/\tau\right)}$$
  • In this formula, “similarity” is a function that computes the similarity of an image-text pair, such as Euclidean distance, cosine similarity, or a dot product (CLIP uses cosine similarity).
  • τ (tau) is the temperature parameter, which adjusts the scale of the similarity scores.
  • The denominator sums over every candidate pairing, which is what pushes the model to distinguish the matching image-text pair from the mismatched pairs; a minimal implementation is sketched below.
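To make the formula concrete, here is a minimal sketch of this loss in PyTorch, assuming cosine similarity and a batch in which the i-th image matches the i-th text. The symmetric two-direction form follows the CLIP paper, but the function name and embedding shapes here are illustrative:

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, tau=0.07):
    # cosine similarity = dot product of L2-normalized embeddings
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / tau  # (N, N) similarity scores
    labels = torch.arange(logits.shape[0])  # matching pairs sit on the diagonal
    # cross-entropy in both directions: image -> text and text -> image
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

In practice we don't need to implement this ourselves: the Hugging Face transformers library ships a pretrained CLIP model we can load and use directly.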
!pip install transformers
from PIL import Image
import requests
import matplotlib.pyplot as plt
from io import BytesIO
from transformers import CLIPProcessor, CLIPModel


class Result:
    def __init__(self, prediction, probs, image, text):
        self.prediction = prediction
        self.probs = probs
        self.image = image
        self.text = text


class myCLIP():
    def __init__(self):
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.lastResult = None

    def forward(self, image, text):
        inputs = self.processor(text=text, images=image, return_tensors="pt", padding=True)
        outputs = self.model(**inputs)
        logits_per_image = outputs.logits_per_image  # this is the image/text similarity score
        probs = logits_per_image.softmax(dim=1)  # taking the softmax gives us the label probabilities
        prediction = text[probs.argmax(dim=1).item()]  # the class with the highest probability
        probs = probs.tolist()[0]
        self.lastResult = Result(prediction, probs, image, text)
        return prediction

    def displayResult(self, width_in_inches=5):
        # Display the image
        aspect_ratio = self.lastResult.image.size[1] / self.lastResult.image.size[0]
        height_in_inches = width_in_inches * aspect_ratio
        plt.figure(figsize=(width_in_inches, height_in_inches))
        plt.imshow(self.lastResult.image)
        plt.axis('off')
        plt.show()

        # Display a histogram of the text class probabilities
        plt.barh(self.lastResult.text, self.lastResult.probs)
        plt.xlabel('Probability')
        plt.show()

        print(f"Prediction: {self.lastResult.prediction}, {max(self.lastResult.probs) * 100:.2f}%")

# load the model
model = myCLIP()

# obtain an image
url = "https://upload.wikimedia.org/wikipedia/commons/5/55/Beagle_600.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# these are our text classes
text = ["a photo of a cat", "a photo of a dog", "a photo of a beagle", "a photo of a hound"]

# run the model and then display the result
prediction = model.forward(image, text)
model.displayResult()
A photo of a beagle
A histogram showing image classification results


Efficient Training

“We train a network that reduces the dimensionality of visual data. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.”
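OpenAI hasn't released Sora's code, but the pattern the quote describes is a spatiotemporal autoencoder. The sketch below is a hypothetical PyTorch illustration of that idea: the encoder compresses raw video in both time and space, the generative model would be trained on the resulting latents, and the decoder maps latents back to pixel space. Every layer size and compression factor here is an assumption for illustration only:

import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    def __init__(self):
        super().__init__()
        # compress 2x in time and 4x in each spatial dimension (factors are made up)
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(2, 4, 4), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 8, kernel_size=3, padding=1),  # 8-channel latent
        )
        # the corresponding decoder maps latents back to pixel space
        self.decoder = nn.Sequential(
            nn.Conv3d(8, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=(2, 4, 4), stride=(2, 4, 4)),
        )

    def forward(self, video):  # video: (batch, channels, time, height, width)
        latents = self.encoder(video)  # a generative model would train on these
        return self.decoder(latents), latents

video = torch.randn(1, 3, 16, 64, 64)  # a tiny stand-in for raw video
reconstruction, latents = VideoCompressor()(video)
print(video.shape, "->", latents.shape)  # the latent is far smaller than the input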

