The machine learning behind fluid.photo

fluid.photo was an AI image editing service based on StyleGAN2 that I developed in 2020. I’ll explain the ML behind it by going step-by-step through what happens when a user uploads a photo.

Detect and Align

Step 0 - Upload
A user uploads a photo.

Step 1 - Detect
A face detection model predicts the locations of any faces in the photo, and the user is prompted to select one to edit.
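As a rough sketch of this step (using MTCNN from facenet-pytorch as a stand-in detector; the exact detector and settings here are illustrative):

```python
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(keep_all=True)           # detect every face, not just the largest
image = Image.open("upload.jpg").convert("RGB")

boxes, probs = detector.detect(image)     # boxes: (N, 4) [x1, y1, x2, y2]; probs: (N,)
# The UI then draws these boxes and asks the user to pick one face to edit.
```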

Step 2 - Landmark
The locations of various facial landmarks are predicted for the selected face.

Step 3 - Align
The image is transformed to standardize the location of key facial points, then cropped.
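A minimal sketch of the alignment, assuming a simple similarity transform onto canonical eye/mouth positions (the reference coordinates and crop size below are illustrative):

```python
import numpy as np
from skimage import transform as tf

def align_face(image, landmarks, output_size=1024):
    """image: HxWx3 array; landmarks: dict of (x, y) points from Step 2."""
    # Source points: eye centers and mouth center from the landmark model.
    src = np.array([landmarks["left_eye"],
                    landmarks["right_eye"],
                    landmarks["mouth_center"]], dtype=np.float64)
    # Canonical destination points in the aligned crop (assumed values).
    dst = np.array([[0.38, 0.40], [0.62, 0.40], [0.50, 0.72]]) * output_size

    t = tf.SimilarityTransform()
    t.estimate(src, dst)                     # rotation + scale + translation
    aligned = tf.warp(image, t.inverse,      # warp expects a map from output to input
                      output_shape=(output_size, output_size))
    return aligned, t                        # keep t around for Step 11
```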

Encode and Finetune

Step 4 - Encode
The aligned image is passed to the encoder to predict w (a 512-dimensional vector in the disentangled latent space of StyleGAN2) and w+ (a code in the extended latent space, size 18x512, one style vector per generator layer).
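Shapes only, to make the relationship between the two codes concrete (PyTorch-style, illustrative):

```python
import torch

w = torch.zeros(512)            # a single point in W
w_plus = torch.zeros(18, 512)   # one 512-d style vector per generator layer
delta_w_plus = w_plus - w       # broadcasts w across the 18 layers; the regularizer
                                # described below keeps the magnitude of this delta small
```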

The encoder was trained to minimize the image distance between the input image and the generator output. There’s also a regularization term in the loss function that encourages the encoder to minimize the magnitude of ∆w+ (the difference between w+ and w), which preserves editability.

Image distance is defined as a weighted sum of the L2 distance in the pixel space, the L1 distance in the intermediate feature space of a pretrained VGG-19, and the negative cosine similarity in the output embedding space of a facial recognition model.
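Putting the two previous paragraphs together, the training objective looks roughly like this (the weights and the `vgg_features` / `face_embedding` feature extractors are stand-ins, not the exact values used):

```python
import torch
import torch.nn.functional as F

def image_distance(x, y, vgg_features, face_embedding,
                   w_pix=1.0, w_feat=0.8, w_id=0.5):
    pixel = F.mse_loss(x, y)                                  # L2 in pixel space
    feat = F.l1_loss(vgg_features(x), vgg_features(y))        # L1 in VGG-19 feature space
    ident = -F.cosine_similarity(face_embedding(x),
                                 face_embedding(y), dim=-1).mean()  # identity term
    return w_pix * pixel + w_feat * feat + w_id * ident

def encoder_loss(target, generator, w, w_plus,
                 vgg_features, face_embedding, w_reg=0.1):
    recon = generator(w_plus)                                 # G(w+)
    dist = image_distance(recon, target, vgg_features, face_embedding)
    reg = (w_plus - w.unsqueeze(-2)).norm(dim=-1).mean()      # ||Δw+||, keeps w+ near w
    return dist + w_reg * reg
```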

Step 5 - Finetune
The parameters of the generator are finetuned such that w+ results in a more faithful reconstruction of the target image.
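A sketch of the finetuning loop (step count and learning rate are illustrative; `image_distance` is the metric from Step 4 with the feature networks bound in):

```python
import torch

def finetune_generator(generator, w_plus, target, image_distance,
                       steps=200, lr=3e-4):
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    w_plus = w_plus.detach()              # the latent code stays fixed; only G changes
    for _ in range(steps):
        opt.zero_grad()
        loss = image_distance(generator(w_plus), target)
        loss.backward()
        opt.step()
    return generator
```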

Step 6 - Distill
The modulator is finetuned such that it can approximate finetuned StyleGAN2 image edits in the intermediate feature space of an adversarial autoencoder.

The autoencoder was trained to compress 256x256 aligned face images into a 512x4x4 encoding then reconstruct them according to the image distance metric mentioned earlier, plus an adversarial loss from a discriminator.

The modulator was trained to convert autoencoder encodings according to a delta in the latent space of StyleGAN2. It consists of a sequence of ResNet blocks modulated by Adaptive Layer-Instance Normalization.
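A simplified sketch of one modulator block (the channel counts, conditioning path, and AdaLIN details here are approximations, not the exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaLIN(nn.Module):
    """Adaptive Layer-Instance Normalization, conditioned on the latent delta."""
    def __init__(self, channels, cond_dim, eps=1e-5):
        super().__init__()
        self.rho = nn.Parameter(torch.full((1, channels, 1, 1), 0.9))  # IN/LN mix
        self.affine = nn.Linear(cond_dim, channels * 2)                # gamma, beta
        self.eps = eps

    def forward(self, x, cond):
        inst = (x - x.mean((2, 3), keepdim=True)) / (x.var((2, 3), keepdim=True) + self.eps).sqrt()
        layer = (x - x.mean((1, 2, 3), keepdim=True)) / (x.var((1, 2, 3), keepdim=True) + self.eps).sqrt()
        mixed = self.rho * inst + (1 - self.rho) * layer
        gamma, beta = self.affine(cond).chunk(2, dim=-1)
        return mixed * gamma[:, :, None, None] + beta[:, :, None, None]

class ModulatedResBlock(nn.Module):
    def __init__(self, channels=512, cond_dim=18 * 512):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm1 = AdaLIN(channels, cond_dim)
        self.norm2 = AdaLIN(channels, cond_dim)

    def forward(self, x, delta_w):
        cond = delta_w.flatten(1)                     # (B, 18*512) latent-space delta
        h = F.relu(self.norm1(self.conv1(x), cond))
        h = self.norm2(self.conv2(h), cond)
        return x + h                                  # residual over the 512x4x4 encoding
```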

Preview

Step 7 - Preview
The user adjusts sliders to produce edits to the aligned image.

To minimize latency, the images are generated by the modulator and the decoder half of the adversarial autoencoder. Together, they act as a lightweight student model that approximates the finetuned StyleGAN2 edits.
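The preview path then reduces to a couple of cheap forward passes (names are illustrative):

```python
import torch

@torch.no_grad()
def render_preview(encoding, delta_w, modulator, decoder):
    """encoding: (1, 512, 4, 4) cached from the autoencoder's encoder;
    delta_w: (1, 18, 512) edit predicted from the user's sliders."""
    edited = modulator(encoding, delta_w)   # approximate the edit in feature space
    return decoder(edited)                  # decode a 256x256 preview image
```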

The model “F” produces deltas in the disentangled latent space according to the starting point in the latent space and the user’s sliders. It was trained with a combination of techniques, including text prompts via CLIP, clustering with an attribute classifier, and conditional continuous normalizing flows.
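Since F combines several training signals, the sketch below is only a placeholder for its input/output contract (a plain MLP from the latent code and slider values to a delta; sizes are illustrative):

```python
import torch
import torch.nn as nn

class SliderToDelta(nn.Module):
    def __init__(self, n_sliders, latent_dim=512, n_layers=18, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_sliders, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_layers * latent_dim),
        )
        self.n_layers, self.latent_dim = n_layers, latent_dim

    def forward(self, w, sliders):
        delta = self.net(torch.cat([w, sliders], dim=-1))
        return delta.view(-1, self.n_layers, self.latent_dim)  # Δw+ applied to w+
```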

Finalize

Throughout this section, “inner” refers to the aligned 1024x1024 face that StyleGAN uses, “outer” refers to a zoomed out 1536x1536 version of the aligned image, and “original” refers to the unaligned image uploaded by the user.

Step 8 - Finalize
The finetuned StyleGAN2 generates the new inner image.

Step 9 - Segment
The segmenter creates segmentation maps for the new inner image and the old outer image. The outer segmentation predictor then predicts what the new outer segmentation map should look like.

Step 10 - Inpaint
The inpainter produces a new outer image from an inpaint mask and the naive combination of the new inner image and the old outer image.

inpaint mask = (a ∪ b) ∩ (1 - (c ∩ d)) ∩ (1 - e)
where:
a = dilation of the old outer face/hair segments
b = dilation of the new outer face/hair segments
c = new outer face/hair segments
d = inner mask
e = outer boundary mask
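Written out with boolean arrays (the dilation radius is illustrative):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def inpaint_mask(old_face_hair, new_face_hair, inner_mask, boundary_mask, iters=8):
    a = binary_dilation(old_face_hair, iterations=iters)  # dilated old outer face/hair
    b = binary_dilation(new_face_hair, iterations=iters)  # dilated new outer face/hair
    c, d, e = new_face_hair, inner_mask, boundary_mask
    return (a | b) & ~(c & d) & ~e
```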

Step 11 - Unalign
The new outer image is unaligned (the alignment transform from Step 3 is inverted) and inserted back into the original image.
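A sketch of the paste-back, assuming the similarity transform from Step 3 was kept around (the blending is simplified to a single soft mask):

```python
from skimage import transform as tf

def unalign(original, new_outer, align_transform, mask):
    """align_transform: the transform estimated in Step 3 (original -> aligned)."""
    h, w = original.shape[:2]
    # Warp the edited outer image (and its mask) back into the original's frame.
    restored = tf.warp(new_outer, align_transform, output_shape=(h, w))
    restored_mask = tf.warp(mask.astype(float), align_transform, output_shape=(h, w))
    return restored * restored_mask[..., None] + original * (1 - restored_mask[..., None])
```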

Production

In production, this whole process is split across 7 workers, so that the individual parts of the system can scale independently of each other.

The compute server coordinates everything and pushes worker jobs to the Redis queue. The workers pull jobs from Redis and read and write data through either the database server (small data, e.g. face boxes) or cloud storage (large files, e.g. images).
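A sketch of a worker’s main loop under this layout (queue names and payload fields are illustrative):

```python
import json
import redis

r = redis.Redis(host="redis", port=6379)      # the broker the compute server pushes to

def worker_loop(queue_name, handler):
    while True:
        _, raw = r.blpop(queue_name)           # block until the compute server pushes a job
        job = json.loads(raw)
        # The handler reads its inputs from the database server (small data, e.g.
        # face boxes) or cloud storage (large files, e.g. images), runs its model,
        # and writes its outputs back the same way.
        handler(job)
```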

To minimize latency, the preview workers interact directly with the compute server by sending base64-encoded preview images through the broker.

Coming Soon

I'm working on a follow-up post about the behind-the-scenes (how these models were trained, some of the different experiments that were tried before settling on this system, etc.).