Python - Cutting out parts of a video using face detection (quick and dirty)

Sauron

@MS-DOS posted this topic a couple of days ago:

I was pretty sure this could be achieved with python using a couple of libraries... so I did it.

 

Here is the script:


import face_recognition
import cv2
import numpy as np
import ffmpeg
import argparse
import os
import shutil

parser = argparse.ArgumentParser()
parser.add_argument("--video", type=str, required=True, help="Input video")
parser.add_argument("--face", type=str, required=True, help="Image of the face to look for")
parser.add_argument("--name", type=str, default="target", help="Input name")
parser.add_argument("--scale", type=float, default=1, help="Video scaling for processing")
parser.add_argument("--framerate", type=float, default=29.97, help="Video framerate")
parser.add_argument("--frameskip", type=int, default=30, help="Run face recognition on every Nth frame")
parser.add_argument("--output", type=str, default="out.mp4", help="Output file name")
args = parser.parse_args()

video_capture = cv2.VideoCapture(args.video)

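# Load a sample picture and learn how to recognize it.
target_image = face_recognition.load_image_file(args.face)
target_encodings = face_recognition.face_encodings(target_image)
if not target_encodings:
    # face_encodings returns an empty list when no face is detected
    raise SystemExit("No face found in " + args.face)
target_face_encoding = target_encodings[0]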

# Create arrays of known face encodings and their names
known_face_encodings = [
    target_face_encoding,
]
known_face_names = [
    args.name
]

segments = [{'start': 0, 'end': 0}]
segment_count = 0
last_frame_saved = 0
count = 1
framerate = args.framerate
scale = args.scale

# Initialize some variables
face_locations = []
face_encodings = []
face_names = []
process_this_frame = True

while True:
    # Grab a single frame of video
    ret, frame = video_capture.read()
    if not ret:
        break

    # Resize frame of video for faster face recognition processing
    small_frame = cv2.resize(frame, (0, 0), fx=scale, fy=scale)

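    # Convert the image from BGR color (which OpenCV uses) to RGB color (which face_recognition uses).
    # cvtColor returns a contiguous array; the [:, :, ::-1] slice is a non-contiguous view that some dlib builds reject.
    rgb_small_frame = cv2.cvtColor(small_frame, cv2.COLOR_BGR2RGB)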

    # Only run recognition on every Nth frame (N = --frameskip) to save time
    if count % args.frameskip == 0:
        # Find all the faces and face encodings in the current frame of video
        face_locations = face_recognition.face_locations(rgb_small_frame)
        face_encodings = face_recognition.face_encodings(rgb_small_frame, face_locations)

        face_names = []
        for face_encoding in face_encodings:
            # See if the face is a match for the known face(s)
            matches = face_recognition.compare_faces(known_face_encodings, face_encoding)
            name = "Unknown"

            # Use the known face with the smallest distance to the new face
            face_distances = face_recognition.face_distance(known_face_encodings, face_encoding)
            best_match_index = np.argmin(face_distances)
            if matches[best_match_index]:
                name = known_face_names[best_match_index]

            face_names.append(name)

    # If the target was seen at the last check, extend the current segment or start a new one
    if args.name in face_names:
        if count == last_frame_saved + 1:
            segments[segment_count]['end'] = count
        else:
            segments.append({'start': count, 'end': count})
            segment_count = segment_count + 1
        last_frame_saved = count
        print(segments)
        cv2.imshow('Video3', frame)
    
    count = count + 1

    # Hit 'q' on the keyboard to quit!
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
video_capture.release()
cv2.destroyAllWindows()

os.makedirs("tmp", exist_ok=True)  # don't fail if a previous run left tmp behind
inputs = []
for segment in segments:
    segment['start'] = segment['start'] / framerate
    segment['end'] = segment['end'] / framerate
    length = segment['end']-segment['start']
    if length > 0.1:
        ffmpeg.input(
            args.video,
            ss=segment['start'],
            t=length
        ).output(
            "tmp/" + str(segment['start']) + ".mp4"
        ).overwrite_output().run()
        print(ffmpeg.probe("tmp/" + str(segment['start']) + ".mp4"))
        inputs.append(ffmpeg.input("tmp/" + str(segment['start']) + ".mp4").video)
        inputs.append(ffmpeg.input("tmp/" + str(segment['start']) + ".mp4").audio)

ffmpeg.concat(
    *inputs, v=1, a=1
).output(args.output).overwrite_output().run()
shutil.rmtree("tmp")

 

It accepts a video and an image of the person you want to isolate and outputs a video file containing only the parts where that person is on screen.

 

To run it you'll need to install ffmpeg, Python 3.8 and the following Python packages using pip (argparse ships with Python, so it doesn't need to be installed separately):

pip3.8 install dlib face_recognition opencv-python ffmpeg-python

 

then pass the command line arguments, e.g.

py.exe -3.8 .\autocut.py --video .\bvt.mp4 --face .\b.PNG --name biden --frameskip 30 --scale 1

 

As a proof of concept I took this video with highlights from the recent US presidential debates:

and fed this screencap of Biden to the script:

[image: b.PNG]

and this was the result:

[image: output.gif]

 

(this is just a gif for demo purposes, the script produces an mp4 file with the same quality as the input video)

 

Face recognition is pretty slow on high-resolution video (and since the frames are processed one after another, parallelizing it would be fairly challenging), so there's an option to skip frames and only check every so often whether the person is still in the frame. This is controlled via the frameskip parameter: a value of 30 means one frame in every 30 is checked, i.e. roughly once per second for this video. Skipping frames means the cuts won't be as accurate; for instance, you can see the moderator for a few frames. There is also a scale option: values below 1 shrink the frames to speed up processing, but bear in mind that this lowers the accuracy of the face detection algorithm.
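
To put a number on that trade-off, here is a minimal sketch (not part of the script above) of how a list of frames where the person was seen turns into time segments, and of roughly how late a cut can be for a given frameskip. The helper names `group_hits` and `max_cut_error` are made up for illustration.

```python
def group_hits(hit_frames, framerate):
    """Group consecutive frame numbers into (start_seconds, end_seconds) segments."""
    runs = []
    for frame in sorted(hit_frames):
        if runs and frame == runs[-1][1] + 1:
            runs[-1][1] = frame          # frame continues the current run
        else:
            runs.append([frame, frame])  # gap found, start a new run
    return [(start / framerate, end / framerate) for start, end in runs]


def max_cut_error(frameskip, framerate):
    """Upper bound, in seconds, on how long a stale recognition result is reused."""
    return frameskip / framerate


# Example: the person is seen on frames 30-32 and 90-91 of a 30 fps video.
print(group_hits([30, 31, 32, 90, 91], framerate=30.0))
print(max_cut_error(frameskip=30, framerate=29.97))
```

With the defaults used in the example command, a cut can lag by about one second, which matches the few stray frames of the moderator in the demo.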

 

It's possible this would be faster in a higher-performance language like C++, but it would take longer to get working, whereas this is just a few dozen lines long and runs on everything.

 

There are some edge cases where it doesn't work well due to inherent limitations of face recognition, e.g. when the person is turned as in this frame:

[image: image.png]

however, for a quick montage from, say, an interview where people always stare at the camera it shouldn't be a problem.
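
One possible way to soften this (again, not something the script does) is to merge segments that are only separated by a short gap, so a brief missed detection doesn't split a shot. A rough sketch, where `merge_close_segments` and `max_gap` are hypothetical names and segments are `(start, end)` pairs in seconds:

```python
def merge_close_segments(segments, max_gap):
    """Merge (start, end) segments whose gap to the previous one is <= max_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous segment: extend it instead of cutting
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(segment) for segment in merged]


print(merge_close_segments([(0.0, 4.0), (5.0, 9.0), (20.0, 25.0)], max_gap=2.0))
# prints [(0.0, 9.0), (20.0, 25.0)]
```

The catch is that whatever was on screen during the bridged gap ends up in the output too, so a large `max_gap` can let other people back in.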

 

That said, it's probably just as fast and less error-prone to do this manually.

 

I hope this is useful to somebody ;) if not, at least it was interesting for me.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*


It's cool that you made this and maybe OP from that thread will find it somewhat helpful but I think what they were asking for was a way to isolate one person talking in a group of people.

 

That means people talking over each other not one at a time like in a debate.


Last I heard, NVIDIA contributed a module to OpenCV for GPU face recognition. According to them there was a ~6x performance increase over CPU-based approaches.

CPU: Intel i7 - 5820k @ 4.5GHz, Cooler: Corsair H80i, Motherboard: MSI X99S Gaming 7, RAM: Corsair Vengeance LPX 32GB DDR4 2666MHz CL16,

GPU: ASUS GTX 980 Strix, Case: Corsair 900D, PSU: Corsair AX860i 860W, Keyboard: Logitech G19, Mouse: Corsair M95, Storage: Intel 730 Series 480GB SSD, WD 1.5TB Black

Display: BenQ XL2730Z 2560x1440 144Hz


2 minutes ago, Windows7ge said:

It's cool that you made this and maybe OP from that thread will find it somewhat helpful but I think what they were asking for was a way to isolate one person talking in a group of people.

 

That means people talking over each other not one at a time like in a debate.

I'm not sure about that, OP didn't mention audio at all... that would definitely be more complex


2 minutes ago, Sauron said:

I'm not sure about that, OP didn't mention audio at all... that would definitely be more complex

Quote

Lets say you have a TV program where people talk and you have like 6 people, and you want to make a video consisting of only one of the people and cut the rest.

Maybe he can clarify this a little bit more for us then, because that's how I interpreted it. Depending on what kind of show he's talking about, if anyone talks over anyone else the file would have to be chopped to exclude those parts, unless their voice can be isolated from everyone else's.


Actually this is a lot older than I thought. It was a keynote at GTC 2011 using the Viola-Jones algorithm. But IIRC all the new algorithms are learning-based anyway, so using tensor cores for deep learning would probably be the next step.


6 minutes ago, trag1c said:

Actually this is a lot older than I thought. It was a keynote at GTC 2011 using the Viola-Jones algorithm. But IIRC all the new algorithms are learning-based anyway, so using tensor cores for deep learning would probably be the next step.

I don't have an nvidia card around to test that, the recognition part of the script only consists of a couple of lines so it should be pretty simple to do a drop in replacement.
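
To show what such a drop-in replacement could look like, here is a rough sketch in which the frame loop only depends on a callable that answers "is the target in this frame?". The names `find_hit_frames` and `contains_target` are made up for this example; the script above calls face_recognition inline instead.

```python
from typing import Callable, Iterable, List


def find_hit_frames(
    frames: Iterable,
    contains_target: Callable[[object], bool],
    frameskip: int = 30,
) -> List[int]:
    """Return the 1-based numbers of frames where the target is considered visible."""
    hits = []
    visible = False
    for count, frame in enumerate(frames, start=1):
        if count % frameskip == 0:
            # The only place the recognition backend is called
            visible = contains_target(frame)
        if visible:
            hits.append(count)
    return hits


# Stand-in backend for demonstration: "frames" are plain integers here,
# and the target counts as visible from frame 4 onwards.
print(find_hit_frames(range(1, 11), lambda frame: frame >= 4, frameskip=2))
```

Swapping backends then means passing a different callable. As far as I know, face_recognition's `face_locations` also accepts `model="cnn"`, which can use the GPU when dlib is built with CUDA, but I haven't tested that here.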


14 minutes ago, Sauron said:

I don't have an nvidia card around to test that, the recognition part of the script only consists of a couple of lines so it should be pretty simple to do a drop in replacement.

I am kinda curious now, so if I get time today I might try to whip together a demo that performs the same task as your script.


smh shoulda embedded the code with repl.it

 

sincerely,

definitely not a repl.it employee

i like trains 🙂


Curious, what if you had the script use ffmpeg to bring the quality of the video down to something lower (let's say 360p), have it analyze that footage, but apply the changes to the original video? Theoretically it should go a lot faster with the same quality output.


5 minutes ago, pierom_qwerty said:

Curious, what if you had the script use ffmpeg to bring the quality of the video down to something lower (let's say 360p), have it analyze that footage, but apply the changes to the original video? Theoretically it should go a lot faster with the same quality output.

That's what the scale parameter does (I scale it using opencv), unfortunately the accuracy isn't quite the same. Of course it depends on the video - in this case there are a few shots where the person is pretty far from the camera so lowering the resolution hurts the output.
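
For reference, a small sketch of how a "360p" target maps to a value for the scale parameter. `scale_for_height` is a hypothetical helper, not something in the script. Since the script stores segments as frame numbers and cuts the original file by timestamp, the analysis resolution never affects the quality of the output.

```python
def scale_for_height(source_height, target_height=360):
    """Return a scale factor that shrinks frames to about target_height pixels tall."""
    if source_height <= 0:
        raise ValueError("source_height must be positive")
    # Never upscale: footage already at or below the target stays at 1.0
    return min(1.0, target_height / source_height)


# In the script the source height could be read with
# video_capture.get(cv2.CAP_PROP_FRAME_HEIGHT); plain numbers are used here.
print(scale_for_height(1080))
print(scale_for_height(240))
```

For a 1080p source this comes out at about 0.33, i.e. roughly a ninth of the pixels per frame.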


1 hour ago, Sauron said:

I'm not sure about that, OP didn't mention audio at all... that would definitely be more complex

Yeah, and as I understood it (or at least this is the part I thought would be very challenging), the goal was to cut out, i.e. "isolate", only one person, like you can do in picture editing programs (I think even 3D Paint can do that). I'm pretty sure there are programs that do this with video too, so you could basically replace one person with another, also known as a "deepfake".

So whatever the OP of the other thread actually wanted, it should definitely be possible (since you can cut out whatever you want and keep only what you want anyway).

 

 

I don't really know what programs would be used for that tho, and what are the requirements for them to run... 

 

Still cool you did that with the python program you made, even though the purpose remains unclear. ;)

 

 

The direction tells you... the direction

-Scott Manley, 2021

 

Softwares used:

Corsair Link (Anime Edition) 

MSI Afterburner 

OpenRGB

Lively Wallpaper 

OBS Studio

Shutter Encoder

Avidemux

FSResizer

Audacity 

VLC

WMP

GIMP

HWiNFO64

Paint

3D Paint

GitHub Desktop 

Superposition 

Prime95

Aida64

GPUZ

CPUZ

Generic Logviewer

 

 

 
