Hide your messy video background using neural nets, Part 2
Using our trained model to blur the background of video frames with OpenCV.
- Load Learner
- Practicing Predictions
- Constructing the Image With Blurred Background
- Processing a Video Clip
- To Do
In Part 1 we created our own dataset of webcam pictures and trained a model that separates the person from the background. Now, we're going to use this model to blur the background of a webcam video.
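Before we start: the snippets in this post assume the usual fastai vision imports plus OpenCV and NumPy. If you are following along in a fresh notebook, something like this should cover it (assuming fastai v2 and opencv-python are installed):
from fastai.vision.all import *   # load_learner, PILImage, get_image_files, to_np, ...
import cv2                        # OpenCV for video I/O, resizing and blurring
import numpy as np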
The Learner expects to find all functions that were defined when creating it; in our case that is create_mask. We don't need any custom functionality here, however, so we define an empty create_mask function.
def create_mask(): pass
Load the Learner we exported in Part 1. If you have not trained a model in Part 1, you can download my model and play around with it. I can't guarantee that it works under any conditions other than my living room, though 😀
learn = load_learner('unet-resnet18-person-background.pkl')
Practicing Predictions
To get the model's predictions for a single image, we use the Learner's predict function.
Let's pick one of our training images to practice getting model predictions:
fnames = get_image_files('training')
image = fnames[0]
PILImage.create(image).show();
Get predictions of one training image:
preds = learn.predict(image)
preds
The prediction contains several tensors. preds[0] contains the output after argmax, i.e. for every pixel it picks the class with the higher probability. Every pixel is either a 0 or a 1, in line with our two classes.
preds[0].show(cmap='Blues', vmin=0, vmax=1);
print(f'''unique values: {np.unique(preds[0])}
type: {type(preds[0])}
data type: {preds[0].dtype}''')
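As a quick sanity check (not from the original walkthrough, just a hypothesis worth verifying): preds[0] should be exactly what you get by taking the argmax over the class dimension of the probability tensor in preds[2].
# Assumption: preds[2] has one probability map per class, stacked along dim 0.
# Taking the argmax over that dimension should reproduce the 0/1 mask in preds[0].
manual_mask = preds[2].argmax(dim=0)
print(np.array_equal(to_np(manual_mask), to_np(preds[0])))   # expected: True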
preds[1] contains the same values, just in a different type (TensorImage instead of TensorMask).
preds[1].show(cmap='Blues', vmin=0, vmax=1);
print(f'''unique values: {np.unique(preds[1])}
type: {type(preds[1])}
data type: {preds[1].dtype}''')
preds[2] is a tensor with three dimensions. It contains the probabilities of the two classes as float values.
preds[2].shape
print(f'''unique values: {np.unique(preds[2])}
type: {type(preds[2])}
data type: {preds[2].dtype}''')
Probabilities for the background class:
preds[2][0].show(cmap='Blues');
Probabilities for the person class:
preds[2][1].show(cmap='Blues');
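If the model was trained with fastai's default cross-entropy loss for segmentation, these two maps are softmax outputs, so they should sum to (roughly) one at every pixel. A quick check, under that assumption:
# Assumes preds[2] holds post-softmax probabilities (fastai applies the loss
# function's activation inside predict). Per pixel, the class probabilities sum to 1.
total = preds[2][0] + preds[2][1]
print(total.min().item(), total.max().item())   # both should be very close to 1.0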
We could use the clean predictions in preds[1], with just 0s and 1s, as a simple mask. I tried that initially and it worked, but it resulted in some rough edges. Instead, we will use the raw probabilities from preds[2][1], since they result in a smoother image. You can try for yourself which one you like better.
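If you want to compare the two approaches side by side, you can also build a hard mask yourself by thresholding the probabilities. A small sketch; the 0.5 threshold is an arbitrary choice of mine:
# Hard mask: threshold the person-probability map so every pixel is exactly 0 or 1
hard_mask = (preds[2][1] > 0.5).float()
# Soft mask: the raw probabilities, which fade out gradually at the person's edges
soft_mask = preds[2][1]
print(f'hard mask unique values: {np.unique(to_np(hard_mask))}')
print(f'soft mask value range: {soft_mask.min().item():.3f} to {soft_mask.max().item():.3f}')
Either tensor can be passed as the mask to the masked_blur function defined below; the soft version is what produces the smoother edges.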
Let's define a simple blur function.
def blur(img: np.ndarray, kernel_size=5, sigma_x=0) -> np.ndarray:
    # Make sure that kernel size is an odd number
    if kernel_size % 2 == 0:
        kernel_size += 1
    return cv2.GaussianBlur(img, (kernel_size, kernel_size), sigma_x)
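A quick note on the parameters: with sigma_x=0, OpenCV derives the Gaussian sigma from the kernel size, so the kernel size alone controls how strong the blur is. For example, on any image loaded with cv2.imread:
# Example usage: larger kernels give a stronger blur
img = cv2.imread(str(image))
slightly_blurred = blur(img, kernel_size=5)
heavily_blurred = blur(img, kernel_size=61)   # this is the strength we use for the background below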
We now define a function that blurs the background and blends in the original frame with an alpha mask. Thank you to learnopencv.com for their useful code!
def masked_blur(image: np.ndarray, mask: TensorImage) -> np.ndarray:
    "mask must have dimensions (360,640)"
    foreground = cv2.resize(image, (640,360), interpolation=cv2.INTER_AREA)
    background = blur(foreground, kernel_size=61)
    # Convert uint8 to float
    foreground = foreground.astype(np.float32)
    background = background.astype(np.float32)
    # Some transforms to match the dimensions and type of the cv2 image
    alpha = to_np(mask.unsqueeze(2).repeat(1,1,3)).astype(np.float32)
    # Multiply the foreground with the alpha matte
    foreground = cv2.multiply(alpha, foreground)
    # Multiply the background with ( 1 - alpha )
    background = cv2.multiply(1.0 - alpha, background)
    # Add the masked foreground and background
    result = cv2.add(foreground, background)
    # Convert back to uint8
    result = result.astype(np.uint8)
    return result
Read an image and create predictions:
frame = cv2.imread(str(image))
preds = learn.predict(image)
alpha = preds[2][1]
Create the resulting image and have a look:
output = masked_blur(frame, alpha)
output_rgb = cv2.cvtColor(output, cv2.COLOR_BGR2RGB)
PILImage.create(output_rgb)
Apart from my grumpy look, I think this is quite a nice result!
Setting up the video files: testclip.mp4 is a video I shot with my webcam. The arguments for the VideoWriter are the codec, frame rate, and dimensions. I chose a frame rate of 25 because I think that is the frame rate of my webcam, and 640x360 are the dimensions we used to train the neural net.
cap = cv2.VideoCapture('testclip.mp4')
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output/testclip-output.mp4', fourcc, 25, (640, 360))
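If you're not sure about your clip's frame rate, you can read it (and the original frame size) from the capture itself instead of hardcoding it. A small sketch:
# Read the clip's properties instead of guessing them
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
print(f'{width}x{height} at {fps} fps')
Note that we still resize every frame to 640x360 before writing, so the VideoWriter dimensions stay as they are.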
Main Loop
We use this while loop to capture every frame of the video. For every frame we
- Resize it to 640x360
- Convert it from cv2's BGR to RGB
- Use the model to predict the mask
- Create the image with blurred background
- Write this image to the output video
Additionally, we save some frames as jpg files to inspect them.
i = 0
while cap.isOpened():
    # Capture frame
    ret, frame = cap.read()
    # Break loop at end of video
    if not ret:
        break
    # Resize frame and convert to RGB
    frame = cv2.resize(frame, (640,360), interpolation=cv2.INTER_AREA)
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Run inference and create alpha mask from result
    preds = learn.predict(frame_rgb)
    mask = preds[2][1]
    # Blur background and convert it to integer type
    output = masked_blur(frame, mask)
    # Write frame to video
    out.write(output)
    # Save every 25th output as jpg, just to find a good thumbnail :)
    if i % 25 == 0:
        cv2.imwrite('output/output_' + str(i) + '.jpg', output)
    # Increase counter
    i += 1

# Release opened files
cap.release()
out.release()
PILImage.create('output/output_0.jpg')
And the resulting video:
I think that looks quite good. There are some rough edges and my arms are not recognized well, but overall I'm happy with the result for this little project.
To Do
There are many aspects which we could improve:
- The biggest thing to improve now is inference speed. As I mentioned, the current implementation works only with video files, not live video, and it runs at about 0.5 frames per second 🥴
- The U-Net is a pretty heavy model, even with the relatively small Resnet18 backbone. The saved weights are 167MB. This alone is reason enough for the model to run slowly. And since we run the model frame by frame, the GPU does not help much because there is no parallelization (see the batching sketch after this list).
- The next step would be better generalization. I suspect that this model is currently very much optimized for myself. If we wanted to roll this out as a feature for many people, we would have to include many people in our training dataset, as well as different backgrounds, cameras, and lighting situations.
- Aesthetics could be improved. There is a "shadow" around the person in the foreground, an artifact of blurring the whole picture including the person.
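One way to get more out of the GPU, purely as a sketch: collect a batch of frames first and run them through fastai's get_preds with a test DataLoader instead of calling predict frame by frame. Something along these lines, where frames_rgb is a hypothetical list of RGB numpy arrays collected from the clip (untested for this exact pipeline; the output shape depends on the transforms baked into your DataLoaders):
# Sketch: batched inference on a list of RGB frames (frames_rgb) instead of frame-by-frame predict.
# This lets the GPU process several frames in parallel.
items = [PILImage.create(f) for f in frames_rgb]
test_dl = learn.dls.test_dl(items, bs=8)
probs, _ = learn.get_preds(dl=test_dl)
# probs should have shape (n_frames, 2, 360, 640); the person-probability maps are:
masks = probs[:, 1]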
Let me know if you found this helpful or implemented something similar yourself, or if you're stuck. I'd be happy to hear from you on Twitter!