Real-Time Head Pose Estimation With OpenCV and Dlib

by Sabbir Ahmed


Posted on: 4 years ago


Photo Courtesy: Joey Tribbiani from F.R.I.E.N.D.S — (I’m not even sorry)

Head pose estimation is widely used in computer vision applications, such as VR, hands-free gesture-controlled interfaces, driver attention detection, gaze estimation, and many more. In this post, we are going to learn how to estimate head pose with OpenCV and Dlib.

Pose Estimation in Computer Vision

In computer vision, pose estimation specifically refers to the orientation of an object relative to the camera; the reference frame here is the camera's coordinate frame. Pose estimation is often referred to as the Perspective-n-Point, or PnP, problem. The problem definition is simple. Given a set of 3D points in a world reference frame and the corresponding 2D points in an image taken by the camera —

  • Reconstruct the 6DOF pose of the reference — the Euler angles (roll, pitch, and yaw) and the camera movement from one 3D point to another. In plain terms, we have to find the translation and rotation.
  • Reconstruct the 3D model from the 2D reference

Let’s examine the definition of the problem as an equation —

[Figure: PnP problem statement]

The left side of the equation, s[u v 1]ᵗ, denotes a 2D point in the image taken by the camera, scaled by s. On the right side, the first portion, which looks like an upper-triangular matrix, is our camera matrix: fₓ and f_y are the focal lengths, γ is the skew parameter (which we will leave as 1 in our code), and (u₀, v₀) is the center of the image. The middle portion, r and t, represents rotation and translation, and the final portion denotes the 3D model of the face, which we will see in a bit. For now, we will leave our theory and maths in their grace.
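To make the equation concrete, here is a minimal NumPy sketch that projects a single 3D point to 2D. The intrinsics are made up for illustration (focal length 500 px, image center (320, 240), skew left as 1 to match the camera matrix we will build later), and the pose is deliberately trivial:

```python
import numpy as np

# Hypothetical intrinsics: focal length 500 px, image center (320, 240), skew 1
K = np.array([[500.0, 1.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

R = np.eye(3)                           # identity rotation: no head turn
t = np.array([[0.0], [0.0], [1000.0]])  # model 1000 units in front of the camera

# 3D model origin (the nose tip of our model) in homogeneous coordinates
X = np.array([[0.0], [0.0], [0.0], [1.0]])

s_uv = K @ np.hstack([R, t]) @ X    # the right side of the equation: s[u v 1]^T
uv = (s_uv / s_uv[2]).ravel()[:2]   # divide by the scale s to get pixel coordinates
print(uv)  # → [320. 240.]: the model origin projects to the image center
```

A point straight in front of the camera lands exactly on the principal point (u₀, v₀), which is a quick sanity check that the matrices are wired up in the right order.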


Get The Code

by-sabbir/HeadPoseEstimation

Understanding The “WHY and How!?”

It is obvious that our application needs to detect a face and predict its shape in order to solve PnP and eventually estimate pose. For detecting faces and predicting shapes we will use Dlib.


import os
import sys

import dlib

# use models/downloader.sh to fetch the predictor
PREDICTOR_PATH = os.path.join("models", "shape_predictor_68_face_landmarks.dat")

if not os.path.isfile(PREDICTOR_PATH):
    print("[ERROR] USE models/downloader.sh to download the predictor")
    sys.exit()

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

We are now set to solve PnP. But before that, one last look at the equation above: we need

  • 3D face model
  • Reference 2D matrix
  • Camera matrix

to solve PnP and get the rotation and translation of the image (the face, in our case). Let’s code the prerequisites —


import numpy as np


def ref3DModel():
    # Generic 3D face model points, in an arbitrary world unit
    modelPoints = [[0.0, 0.0, 0.0],           # nose tip
                   [0.0, -330.0, -65.0],      # chin
                   [-225.0, 170.0, -135.0],   # left corner of the left eye
                   [225.0, 170.0, -135.0],    # right corner of the right eye
                   [-150.0, -150.0, -125.0],  # left corner of the mouth
                   [150.0, -150.0, -125.0]]   # right corner of the mouth
    return np.array(modelPoints, dtype=np.float64)


def ref2dImagePoints(shape):
    # The matching 2D landmarks from dlib's 68-point shape predictor
    imagePoints = [[shape.part(30).x, shape.part(30).y],  # nose tip
                   [shape.part(8).x, shape.part(8).y],    # chin
                   [shape.part(36).x, shape.part(36).y],  # left corner of the left eye
                   [shape.part(45).x, shape.part(45).y],  # right corner of the right eye
                   [shape.part(48).x, shape.part(48).y],  # left corner of the mouth
                   [shape.part(54).x, shape.part(54).y]]  # right corner of the mouth
    return np.array(imagePoints, dtype=np.float64)


def cameraMatrix(fl, center):
    # Pinhole camera matrix; skew is left as 1, as discussed above
    matrix = [[fl, 1, center[0]],
              [0, fl, center[1]],
              [0, 0, 1]]
    return np.array(matrix, dtype=np.float64)

The six coordinate points of ref3DModel() and ref2dImagePoints() refer to the nose tip, the chin, the left corner of the left eye, the right corner of the right eye, and the left and right corners of the mouth. You can visualize all of the points, and add more as you want, by manipulating Visualize3DModel.py.
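For reference, the landmark indices used above come from dlib's 68-point model; a small lookup table makes the correspondence explicit:

```python
# Indices into dlib's 68-point shape predictor and the features they map to
LANDMARK_NAMES = {
    30: "nose tip",
    8: "chin",
    36: "left corner of the left eye",
    45: "right corner of the right eye",
    48: "left corner of the mouth",
    54: "right corner of the mouth",
}

for idx, name in sorted(LANDMARK_NAMES.items()):
    print(f"{idx:2d} -> {name}")
```

If you add more points to ref3DModel(), the corresponding indices must be appended to ref2dImagePoints() in the same order, since solvePnP matches the two arrays row by row.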

OpenCV provides two simple APIs to solve PnP

  • solvePnP
  • solvePnPRansac

In our case, we will use solvePnP. By convention, this API needs 4 input parameters —

  • objectPoints → ref3DModel()
  • imagePoints → ref2dImagePoints()
  • cameraMatrix → cameraMatrix()
  • distCoeffs → np.zeros((4, 1))

By solving PnP, the API returns a success flag, a rotation vector, and a translation vector.


face3Dmodel = world.ref3DModel()
refImgPts = world.ref2dImagePoints(shape)

height, width, channels = img.shape
focalLength = args.focal * width
# the principal point (u0, v0) is the image center: (width / 2, height / 2)
cameraMatrix = world.cameraMatrix(focalLength, (width / 2, height / 2))

mdists = np.zeros((4, 1), dtype=np.float64)

# calculate rotation and translation vectors using solvePnP
success, rotationVector, translationVector = cv2.solvePnP(
    face3Dmodel, refImgPts, cameraMatrix, mdists)

Strictly speaking, the focal length is an intrinsic property of the camera hardware and should come from calibration; here we approximate it as a multiple of the image width.
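If you cannot calibrate, a rough focal length in pixels can also be derived from an assumed field of view. The helper below is a sketch, and the 60° default is a guess that is plausible for a typical webcam, not a measured value:

```python
import math

def focal_from_fov(width_px, hfov_deg=60.0):
    """Approximate the focal length in pixels from an assumed horizontal
    field of view. hfov_deg is a guess; calibrate the camera for accuracy."""
    return (width_px / 2) / math.tan(math.radians(hfov_deg) / 2)

f = focal_from_fov(640)
print(round(f))  # → 554 for a 640-px-wide frame at a 60-degree FOV
```

An error of a few percent in the focal length mostly affects the translation estimate; the Euler angles are comparatively forgiving.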

Getting the Euler Angles

Finally, we can extract the roll, pitch, and yaw. In other words, we have to extract the Euler angles with OpenCV's RQDecomp3x3 API. But it needs a rotation matrix rather than a rotation vector. Fortunately, we have already calculated the rotation vector using solvePnP, and another OpenCV API called Rodrigues converts a rotation vector to a rotation matrix (and vice versa). This is how we can implement it —


# calculating angle
rmat, jac = cv2.Rodrigues(rotationVector)
angles, mtxR, mtxQ, Qx, Qy, Qz = cv2.RQDecomp3x3(rmat)
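For intuition, the Rodrigues conversion itself is a closed-form formula. The function below is a simplified plain-NumPy stand-in for the rotation-vector-to-matrix direction of cv2.Rodrigues:

```python
import numpy as np

def rodrigues_to_matrix(rvec):
    """Convert a rotation vector (axis * angle) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)   # the rotation angle is the vector's magnitude
    if theta < 1e-12:
        return np.eye(3)           # zero rotation
    k = (np.asarray(rvec, dtype=np.float64) / theta).ravel()  # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])  # skew-symmetric cross-product matrix of k
    # Rodrigues' rotation formula
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# Sanity check: a 90-degree rotation about z maps the x-axis onto the y-axis
R = rodrigues_to_matrix(np.array([0.0, 0.0, np.pi / 2]))
print(np.round(R @ np.array([1.0, 0.0, 0.0]), 6))  # → [0. 1. 0.]
```

In practice you should keep using cv2.Rodrigues, which also returns the Jacobian and handles the matrix-to-vector direction.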

The final output should look something like this —

[Figure: Sample head pose estimation]

Summary

We started by detecting a face and predicting its shape. Then we calculated the rotation and translation vectors with solvePnP. Finally, we got the rotation angles with RQDecomp3x3. Yet another easy three-step process. Computer vision is fun.