Webcam Eye Tracker: An End-to-end Deep Learning Project

Recently, I wanted to learn PyTorch and needed to find a project to help focus my learning. I have always been interested in the idea of creating a webcam eye tracker, so that seemed like a good fit. Eye trackers typically rely on infrared for accurate tracking, but performing the same task using purely vision-based techniques seemed like an interesting challenge.

What follows is a series of posts on the process of creating a webcam eye tracker from scratch. As always, we should start by clarifying the main problems we’re trying to address by going through this process.

The eye tracker problem

I spent a few years working with desktop and mobile eye trackers during my undergraduate degree and throughout graduate school (see a paper we published on dyadic eye tracking here). The problem, however, is that those devices are quite expensive. That is fine in an academic research setting, but far from ideal for use at home.

Additionally, eye trackers are a relatively niche product with only a single purpose. As much as people would love to play around with eye tracking, the need to purchase specialty hardware presents a major barrier. I have been interested in this barrier for a long time. In fact, most of my PhD dissertation was focused on finding ways to use consumer-grade technology to perform the same tasks that would traditionally require some very expensive hardware (in that case, it was replacing 3D motion capture with smartphones for the task of gait analysis).

In the case of an eye tracker, webcams might be a good replacement for infrared cameras in some situations. This is not a particularly novel idea as many have tried the same thing (e.g., see here and here), but I don’t think there’s any requirement to push the boundaries of science in a PyTorch learning project.

The Kaggle problem

Now that we know why a webcam eye tracker might be an interesting project to take on, we have to consider how we should actually approach the problem. The simplest thing to do, of course, would be to find a dataset online and start modelling. This is where websites like Kaggle come in handy.

Now, I love Kaggle, but it presents only a narrow slice of what data science is about. By that I mean you can usually jump straight into data modelling, without having to worry about any of the steps that come before or after. For example, there is no need to identify a research problem, figure out what data you need to solve the problem, or make decisions about how to collect the data. Kaggle is a fantastic website for experimenting with pre-existing datasets, but I’m more interested in the entire process.

To that end, I wanted to take an end-to-end approach. I wanted to start with nothing (i.e., no data) and end with a simple application that can track my gaze while I play some video games. But because I’m not completely insane, I’ll allow the use of some pre-existing libraries and frameworks so I’m not doing absolutely everything from scratch. Overall, this process will require making all of the decisions along the way, and it’s those high-level decisions (and some detailed implementations) that are the focus of this series. A bit overkill just to learn PyTorch? Maybe. More fun? Definitely.

A preface and some caveats…

A few things to keep in mind before we get started:

This series is not a step-by-step guide. It only presents a high-level overview of how to create a webcam eye tracker. Going into every detail would take far too long. However, we will be going through some of the most important steps of the project.
In terms of the eye tracker itself, the goal was to very rapidly prototype the idea and test whether it is even minimally viable. As such, the application is not a final product. It is barely even a prototype! At the end of the day, this is still just a PyTorch learning project. So don’t expect the application to be ready for widespread use.
The full source code will be provided, but may not work on all computers due to hardware differences. However, my hope is that people will take the code and experiment, improve, and adapt it for their own purposes.
The series is aimed at people who have some familiarity with machine learning and convolutional models. I won’t be explaining the details of how these things work, as the focus of the series is really on project structure and the types of things that needed to be considered at each stage.

Research plan

So, what steps do we need to take to go from nothing to eye tracking video games? Big picture view:

Build a framework for capturing and processing video from a webcam
Perform face detection, face alignment, eye detection, and calculate any other necessary features
Create a platform for collecting the dataset as efficiently and robustly as possible
Model the data
Deploy the model to a simple application that records the screen and makes gaze predictions
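
To make the shape of that plan a little more concrete, here is a minimal sketch of how the capture, feature extraction, and prediction stages might be chained together. It is not the code from this project: every name in it (Frame, Features, run_pipeline, fake_extract, fake_predict) is a hypothetical placeholder, the frames are plain nested lists rather than real webcam images, and the "model" just returns the centre of the face box. The only point is that each stage can sit behind a simple function boundary and be swapped out later.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# All names below are illustrative assumptions, not the project's actual API.
Box = Tuple[int, int, int, int]   # (x, y, width, height) in pixels
Frame = List[List[int]]           # stand-in for a grayscale webcam image


@dataclass
class Features:
    """Face and eye locations extracted from a single frame."""
    face: Box
    eyes: List[Box]


def run_pipeline(
    frames: Iterable[Frame],
    extract: Callable[[Frame], Optional[Features]],
    predict: Callable[[Features], Tuple[float, float]],
) -> Iterator[Tuple[float, float]]:
    """Chain capture, feature extraction and prediction: frames in, (x, y) out."""
    for frame in frames:
        features = extract(frame)
        if features is None:      # no face found in this frame, so skip it
            continue
        yield predict(features)


# Placeholder stages so the sketch runs without a webcam or a trained model.
def fake_extract(frame: Frame) -> Optional[Features]:
    if not frame:                 # treat an empty frame as "no face detected"
        return None
    return Features(face=(10, 20, 100, 100),
                    eyes=[(30, 50, 20, 10), (70, 50, 20, 10)])


def fake_predict(features: Features) -> Tuple[float, float]:
    x, y, w, h = features.face
    return (x + w / 2, y + h / 2)  # face centre, standing in for a model output


# Three "frames", the middle one empty: only two predictions come out.
predictions = list(run_pipeline([[[0]], [], [[0]]], fake_extract, fake_predict))
```

In a real version, the frame source would be a webcam, fake_extract would be replaced by actual face and eye detection, and fake_predict by a trained model. Data collection would reuse the same extraction stage, pairing each Features object with a known on-screen target instead of a prediction.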

Along the way we will need to think about issues like how to manually collect and label a large enough dataset without going insane, or how to deal with corneal reflections caused by a triple-monitor setup. All that and more in the next few blog posts.

As a quick tease of the final application, here is the webcam eye tracker working on a blank screen:

We’ll kick things off in the next post, where we’ll be looking at webcam capture and face detection.