Python - Cutting out parts of a video using face detection (quick and dirty)

Sauron · November 26, 2020

@MS-DOS posted this topic a couple of days ago:

I was pretty sure this could be achieved with python using a couple of libraries... so I did it.

Here is the script:


import face_recognition
import cv2
import numpy as np
import ffmpeg
import argparse
import os
import shutil

parser = argparse.ArgumentParser()
parser.add_argument("--video", type=str, required=True, help="Input video")
parser.add_argument("--face", type=str, required=True, help="Image of the face to look for")
parser.add_argument("--name", type=str, default="target", help="Label for the target face")
parser.add_argument("--scale", type=float, default=1, help="Scale factor applied to frames before face recognition")
parser.add_argument("--framerate", type=float, default=29.97, help="Framerate of the input video")
parser.add_argument("--frameskip", type=int, default=30, help="Run face recognition on one frame out of every N")
parser.add_argument("--output", type=str, default="out.mp4", help="Output file name")
args = parser.parse_args()

video_capture = cv2.VideoCapture(args.video)

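# Load the reference picture and compute the encoding of the face in it.
target_image = face_recognition.load_image_file(args.face)
target_face_encodings = face_recognition.face_encodings(target_image)
if not target_face_encodings:
    raise SystemExit("No face found in " + args.face)
target_face_encoding = target_face_encodings[0]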

# Create arrays of known face encodings and their names
known_face_encodings = [
    target_face_encoding,
]
known_face_names = [
    args.name
]

segments = [{'start': 0, 'end': 0}]
segment_count = 0
last_frame_saved = 0
count = 1
framerate = args.framerate
scale = args.scale

# Initialize some variables
face_locations = []
face_encodings = []
face_names = []
process_this_frame = True

while True:
    # Grab a single frame of video
    ret, frame = video_capture.read()
    if not ret:
        break

    # Only run face recognition on every Nth frame (N = --frameskip) to save time
    if count % args.frameskip == 0:
        # Resize the frame for faster face recognition processing
        small_frame = cv2.resize(frame, (0, 0), fx=scale, fy=scale)

        # Convert from BGR (which OpenCV uses) to RGB (which face_recognition uses).
        # cvtColor returns a contiguous array; a [:, :, ::-1] slice is a
        # non-contiguous view that some dlib versions reject.
        rgb_small_frame = cv2.cvtColor(small_frame, cv2.COLOR_BGR2RGB)

        # Find all the faces and face encodings in the current frame of video
        face_locations = face_recognition.face_locations(rgb_small_frame)
        face_encodings = face_recognition.face_encodings(rgb_small_frame, face_locations)

        face_names = []
        for face_encoding in face_encodings:
            # See if the face is a match for the known face(s)
            matches = face_recognition.compare_faces(known_face_encodings, face_encoding)
            name = "Unknown"

            # Or instead, use the known face with the smallest distance to the new face
            face_distances = face_recognition.face_distance(known_face_encodings, face_encoding)
            best_match_index = np.argmin(face_distances)
            if matches[best_match_index]:
                name = known_face_names[best_match_index]

            face_names.append(name)

    # Record the frame if the target was seen at the last check, extending the current segment or starting a new one
    if args.name in face_names:
        if count == last_frame_saved + 1:
            segments[segment_count]['end'] = count
        else:
            segments.append({'start': count, 'end': count})
            segment_count = segment_count + 1
        last_frame_saved = count
        print(segments)
        cv2.imshow('Video3', frame)
    
    count = count + 1

    # Hit 'q' on the keyboard to quit!
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
video_capture.release()
cv2.destroyAllWindows()

os.makedirs("tmp", exist_ok=True)
inputs = []
for segment in segments:
    segment['start'] = segment['start'] / framerate
    segment['end'] = segment['end'] / framerate
    length = segment['end']-segment['start']
    if length > 0.1:
        ffmpeg.input(
            args.video,
            ss=segment['start'],
            t=length
        ).output(
            "tmp/" + str(segment['start']) + ".mp4"
        ).overwrite_output().run()
        print(ffmpeg.probe("tmp/" + str(segment['start']) + ".mp4"))
        inputs.append(ffmpeg.input("tmp/" + str(segment['start']) + ".mp4").video)
        inputs.append(ffmpeg.input("tmp/" + str(segment['start']) + ".mp4").audio)

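if inputs:
    # Concatenate the clips (alternating video and audio streams) into the output file
    ffmpeg.concat(
        *inputs, v=1, a=1
    ).output(args.output).overwrite_output().run()
else:
    print("No segments containing the target face were found.")
shutil.rmtree("tmp")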

It accepts a video and an image of the person you want to isolate and outputs a video file containing only the parts where that person is on screen.
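
For anyone wondering what the matching step boils down to: face_recognition turns each face into a list of numbers (an encoding), and as far as I know two faces count as a match when the Euclidean distance between their encodings is at most a tolerance, 0.6 by default. Here is a rough sketch of that logic in plain Python. `best_match` and the tiny 3-number vectors are made up for illustration (real encodings have 128 numbers); this is not the library's code.

```python
import math


def best_match(known_encodings, known_names, encoding, tolerance=0.6):
    """Return the name of the closest known encoding, or "Unknown".

    Same idea as the script: pick the smallest distance, then accept it
    only if it is within the tolerance.
    """
    if not known_encodings:
        return "Unknown"
    distances = [math.dist(known, encoding) for known in known_encodings]
    best_index = min(range(len(distances)), key=distances.__getitem__)
    if distances[best_index] <= tolerance:
        return known_names[best_index]
    return "Unknown"


# Toy 3-dimensional "encodings", purely hypothetical values.
known = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
names = ["target", "someone_else"]
print(best_match(known, names, [0.1, 0.2, 0.1]))  # close to the first entry
print(best_match(known, names, [5.0, 5.0, 5.0]))  # far from everything
```

A lower tolerance makes the match stricter, at the cost of missing the person more often.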

To run it you'll need to install ffmpeg, python 3.8 and the following python packages using pip:

pip3.8 install dlib face_recognition opencv-python ffmpeg-python

then pass the command line arguments, e.g.

py.exe -3.8 .\autocut.py --video .\bvt.mp4 --face .\b.PNG --name biden --frameskip 30 --scale 1

As a proof of concept I took this video with highlights from the recent US presidential debates:

and fed this screencap of Biden to the script:

and this was the result:

(this is just a gif for demo purposes, the script produces an mp4 file with the same quality as the input video)

Face recognition is pretty slow on a high resolution video (because of the linearity of the task it would be pretty challenging to parallelize this), so there's the option to skip frames and only check every so often whether the person is still in the frame. This is controlled via the frameskip parameter: a value of 30 means that one frame in every 30 is checked, which is roughly once per second for this video. Skipping frames means the cuts won't be as accurate; for instance, you can see the moderator for a few frames. There is also an option to scale the frames down (a scale value below 1) to speed up processing, but bear in mind this lowers the accuracy of the face detection algorithm.
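
To put rough numbers on the frameskip trade-off, here is a small arithmetic sketch. The helper names are made up for this example, and it assumes the same counting as the script (frames numbered from 1, checked when the count is a multiple of frameskip). A cut can be off by up to roughly one check interval.

```python
def check_interval_seconds(frameskip, framerate=29.97):
    """Seconds of video between two frames that actually get checked."""
    return frameskip / framerate


def frames_checked(total_frames, frameskip):
    """Number of frames that go through face recognition.

    The script counts frames from 1 and checks those where
    count % frameskip == 0, i.e. every multiple of frameskip.
    """
    return total_frames // frameskip


# A 10 minute clip at the script's default 29.97 fps (rounded to whole frames).
total_frames = round(10 * 60 * 29.97)
for frameskip in (1, 10, 30):
    print(
        f"frameskip={frameskip}: {frames_checked(total_frames, frameskip)} frames checked, "
        f"one check every {check_interval_seconds(frameskip):.2f} s"
    )
```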

It's possible this would be faster in a higher performance language like C++, but it would take longer to get working, whereas this is a fairly short script and runs on everything.

There are some edge cases where it doesn't work well due to inherent limitations of face recognition, e.g. when the person is turned as in this frame:

however, for a quick montage from, say, an interview where people always stare at the camera it shouldn't be a problem.
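
One way to soften those misses (not something the script above does) would be to bridge short gaps between segments before cutting, so a single failed check doesn't split a shot in two. A minimal sketch, assuming segments are the frame-number dicts built by the detection loop; `merge_close_segments` and `max_gap` are hypothetical names.

```python
def merge_close_segments(segments, max_gap):
    """Merge segments whose gap is at most max_gap frames.

    segments: list of {'start': int, 'end': int} dicts in frame numbers,
    sorted by start. Returns a new list and leaves the input untouched.
    """
    merged = []
    for segment in segments:
        if merged and segment['start'] - merged[-1]['end'] <= max_gap:
            # Close enough to the previous segment: extend it.
            merged[-1]['end'] = max(merged[-1]['end'], segment['end'])
        else:
            merged.append(dict(segment))
    return merged


segments = [
    {'start': 30, 'end': 119},
    {'start': 150, 'end': 299},   # frames 120-149 missing: one failed check
    {'start': 900, 'end': 1049},  # far away: a genuinely separate shot
]
print(merge_close_segments(segments, max_gap=60))
```

The catch is that whatever was on screen during the bridged gap ends up in the output too, so `max_gap` would need to stay small.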

With that said it's probably just as fast and less error prone to just do this manually.

I hope this is useful to somebody; if not, at least it was interesting for me.

Windows7ge · November 26, 2020

It's cool that you made this and maybe OP from that thread will find it somewhat helpful but I think what they were asking for was a way to isolate one person talking in a group of people.

That means people talking over each other not one at a time like in a debate.

trag1c · November 26, 2020

Last I heard nvidia contributed a module to openCV for GPU face recognition. According to them there was ~6x perf increase over cpu based approaches.

Sauron · November 26, 2020

2 minutes ago, Windows7ge said:

It's cool that you made this and maybe OP from that thread will find it somewhat helpful but I think what they were asking for was a way to isolate one person talking in a group of people.

That means people talking over each other not one at a time like in a debate.

I'm not sure about that, OP didn't mention audio at all... that would definitely be more complex

Windows7ge · November 26, 2020

2 minutes ago, Sauron said:

I'm not sure about that, OP didn't mention audio at all... that would definitely be more complex

Quote

Lets say you have a TV program where people talk and you have like 6 people, and you want to make a video consisting of only one of the people and cut the rest.

Maybe he can clarify this a little bit more for us then, because that's how I interpreted it. Depending on what kind of show he's talking about, if anyone talks over anyone else the file would have to be chopped to exclude those parts unless their voice can be isolated from everyone else's.

trag1c · November 26, 2020

Actually this is a lot older than I thought. It was a keynote at GTC 2011 using viola-jones algorithm. But iirc all the new algorithms are learning based anyways so using tensor cores for deep learning would probably be the next step.

Sauron · November 26, 2020

6 minutes ago, trag1c said:

Actually this is a lot older than I thought. It was a keynote at GTC 2011 using viola-jones algorithm. But iirc all the new algorithms are learning based anyways so using tensor cores for deep learning would probably be the next step.

I don't have an nvidia card around to test that, the recognition part of the script only consists of a couple of lines so it should be pretty simple to do a drop in replacement.
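
To show what I mean by a drop-in replacement, here is a sketch of the frame loop with the recognition step passed in as a function, so a GPU-based detector could be swapped in without touching the rest. `find_segments` and `contains_target` are names invented for this example, and the "frames" are just strings so it runs without OpenCV.

```python
def find_segments(frames, contains_target, frameskip=30):
    """Return [{'start', 'end'}] frame ranges where the target is on screen.

    frames: any iterable of frames (their type only matters to the detector).
    contains_target: callable taking a frame and returning True or False.
        This is the only piece that knows about a face recognition library.
    """
    segments = []
    visible = False
    for count, frame in enumerate(frames, start=1):
        # Re-check only every `frameskip` frames; reuse the last answer otherwise.
        if count % frameskip == 0:
            visible = contains_target(frame)
        if visible:
            if segments and segments[-1]['end'] == count - 1:
                segments[-1]['end'] = count
            else:
                segments.append({'start': count, 'end': count})
    return segments


# Stand-in detector for the demo: frames are plain strings here.
fake_video = ["biden"] * 4 + ["moderator"] * 4 + ["biden"] * 4
print(find_segments(fake_video, lambda frame: frame == "biden", frameskip=2))
```

With a real video, `contains_target` would wrap whatever library does the recognition.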

trag1c · November 26, 2020

14 minutes ago, Sauron said:

I don't have an nvidia card around to test that, the recognition part of the script only consists of a couple of lines so it should be pretty simple to do a drop in replacement.

I am kinda curious now, so if I get time today I might try to whip together a demo that performs the same task as your script.

piemadd · November 26, 2020

~~smh shoulda embedded the code with repl.it~~

sincerely,

definitely not a repl.it employee

piemadd · November 26, 2020

Curious, what if you had the script use ffmpeg to bring the quality of the video down to something lower (lets say 360p), have it analyze that footage, but apply the changes to the original video. Theoretically should go a lot faster with the same quality output.

Sauron · November 26, 2020

5 minutes ago, pierom_qwerty said:

Curious, what if you had the script use ffmpeg to bring the quality of the video down to something lower (lets say 360p), have it analyze that footage, but apply the changes to the original video. Theoretically should go a lot faster with the same quality output.

That's what the scale parameter does (I scale it using opencv), unfortunately the accuracy isn't quite the same. Of course it depends on the video - in this case there are a few shots where the person is pretty far from the camera so lowering the resolution hurts the output.
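
Some back-of-the-envelope numbers on why that happens: scaling both axes cuts the pixel count by the square of the scale, but it shrinks faces by the same factor. The helper names and the 48 px face below are hypothetical, just to illustrate the arithmetic.

```python
def scaled_size(width, height, scale):
    """Approximate frame size after resizing by `scale` on both axes."""
    return round(width * scale), round(height * scale)


def pixel_fraction(scale):
    """Fraction of the original pixels left after scaling both axes."""
    return scale * scale


# Hypothetical numbers: a 1080p frame with a distant face 48 px wide.
width, height, face_width = 1920, 1080, 48
for scale in (1.0, 0.5, 0.25):
    w, h = scaled_size(width, height, scale)
    print(
        f"scale={scale}: {w}x{h}, {pixel_fraction(scale):.0%} of the pixels, "
        f"face about {face_width * scale:.0f} px wide"
    )
```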

Mark Kaine · November 26, 2020

1 hour ago, Sauron said:

I'm not sure about that, OP didn't mention audio at all... that would definitely be more complex

Yeah, and as I understood it (or at least what I thought would be very challenging), the idea is to cut out, i.e. "isolate", only one person, like you can do in picture editing programs (I think even Paint 3D can do that). I'm pretty sure there are programs to do this with movies too, so you could basically replace one person with another, also known as a "deepfake".

And as such whatever the op of the other thread actually wanted, it should definitely be possible (since you can cut out whatever you want and leave only what you want anyway)

I don't really know what programs would be used for that tho, and what are the requirements for them to run...

Still cool you did that with the python program you made, even though the purpose remains unclear.
