Some time ago I bought a Parrot AR.Drone 2.0 and thought it would be nice to make it more autonomous. The idea came from these guys who made a drone follow a trail (paper, youtube) and from having a dog. Since the drone doesn’t have many sensors, I’d have to make the most of its front camera.


I thought: How about making the drone follow my hand? Hand recognition is something established, isn’t it?

I started to research hand recognition, but most approaches didn’t live up to my expectations. The detector should recognise the hand independently of the background and under bad light conditions, and it should not trigger on the face or on hands not facing the camera. All of this should run smoothly and, on top of that, not use too much computational power.

Things that I had to lay aside:

  • Background subtraction: The background changes.
  • Looking for skin color tones or something like that (a rough sketch of this idea follows right after this list): under bad light conditions the hand turns gray or darker, especially if the camera is held against the light. It’s a cheap cam.
  • Cascade classifiers: I think this could work, judging by what I achieved with OpenCV’s Haar classifier so far. It’s just that training takes too long (at least several hours; sadly I don’t have an NVIDIA GPU, and OpenCL turned out slower than the CPU), which makes it hard to iterate on the image data set. It actually did a good job of distinguishing the face from the hand, but often didn’t recognise anything at all (early-stage training). Maybe I was too impatient.
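
For illustration, here is a minimal sketch of the skin-color idea with OpenCV; the HSV range is just an example value, not something I tuned. Once the hand is in shadow or back-lit, it falls outside any fixed range.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr):
    """Return a binary mask of 'skin-coloured' pixels using a fixed HSV range."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Example range for light skin tones. It breaks down as soon as the hand
    # is in shadow or the camera faces a light source and everything turns gray.
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 180, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Remove small speckles.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

cap = cv2.VideoCapture(0)            # webcam here; the drone stream in practice
ok, frame = cap.read()
if ok:
    cv2.imwrite("mask.png", skin_mask(frame))
cap.release()
```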

It came down to convolutional neural networks. I was aware that localisation of an object in an image is still an area of active research, but the sliding window technique seemed promising. Basically it’s a trick that breaks the localisation problem down into a classification problem.
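
As a minimal sketch of the idea (the `has_hand` function stands in for whatever classifier scores a single window, and the window size and stride are just illustrative values):

```python
WIN = 40    # window size in pixels; matches the crop size used for training
STEP = 8    # stride; smaller means more precise localisation but more windows

def best_window(frame, has_hand):
    """Slide a WIN x WIN window over `frame` (a numpy image array) and return
    the highest-scoring position as (x, y, score). `has_hand` is any function
    mapping a WIN x WIN patch to a hand probability, e.g. the trained CNN."""
    h, w = frame.shape[:2]
    best = (0, 0, 0.0)
    for y in range(0, h - WIN + 1, STEP):
        for x in range(0, w - WIN + 1, STEP):
            score = has_hand(frame[y:y + WIN, x:x + WIN])
            if score > best[2]:
                best = (x, y, score)
    return best    # a low best score means "no hand in this frame"
```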

I produced several thousand images from video and classified them by hand using a helper tool I built. Some images had a hand in them, others did not. It was important to have images of the hand in a lot of situations, especially different light conditions (during the day, in the evening, with artificial lighting) or against a light source. For every image I saved where the hand is (the middle of the palm as x/y coordinates) or that there is no hand.
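
Getting the raw frames out of the videos was the easy part. A minimal sketch of it; the file names are placeholders:

```python
import cv2
import os

def dump_frames(video_path, out_dir, every_nth=5):
    """Save every n-th frame of a video as a PNG for later labeling."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if count % every_nth == 0:
            cv2.imwrite(os.path.join(out_dir, "img_%05d.png" % saved), frame)
            saved += 1
        count += 1
    cap.release()
    return saved

dump_frames("flight.mp4", "raw_frames")    # placeholder names
```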

Example pictures from the image database (with a hand):

(example frames: img_14.4.11.121, img_14.4.14.134, img_14.4.33.255)

Labeling tool (you just click on the hand or press space if there is none; it’s possible to label about 2 images / sec with it):

(screenshot of the labeling tool)
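
A stripped-down sketch of such a tool with OpenCV, assuming a plain CSV file for the labels (file name plus palm x/y, or empty coordinates for "no hand"); the actual tool differs in the details:

```python
import cv2
import glob

LABEL_FILE = "labels.csv"    # one line per image: path,x,y (x/y empty = no hand)

def label_images(pattern="raw_frames/*.png"):
    clicked = []

    def on_mouse(event, x, y, flags, param):
        if event == cv2.EVENT_LBUTTONDOWN:
            clicked.append((x, y))

    cv2.namedWindow("label")
    cv2.setMouseCallback("label", on_mouse)
    with open(LABEL_FILE, "a") as out:
        for path in sorted(glob.glob(pattern)):
            img = cv2.imread(path)
            if img is None:
                continue
            clicked.clear()
            cv2.imshow("label", img)
            while True:
                key = cv2.waitKey(20) & 0xFF
                if clicked:                       # click = middle of the palm
                    x, y = clicked[0]
                    out.write("%s,%d,%d\n" % (path, x, y))
                    break
                if key == ord(" "):               # space = no hand in this image
                    out.write("%s,,\n" % path)
                    break
    cv2.destroyAllWindows()

label_images()
```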

I resized the pictures to 128 x 72 pixels and cropped 40 x 40 parts out of them, with or without a hand, for training. I started with the FANN library, but I always ran into overfitting, and the library also seemed somewhat outdated. So I switched to TensorFlow, which comes from Google’s Machine Intelligence research organisation. The start was a bit rough, but in the end it worked well: it has a mechanism that adjusts the learning rate according to the validation error and so prevents overfitting.
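
The training patches come straight out of those labels: a 40 x 40 crop around the saved palm position as a positive example, plus crops away from it (or from hand-free images) as negatives. A rough sketch; the helper names and the negative sampling are mine and assume the palm coordinates have already been scaled to the resized frame:

```python
import random
import cv2
import numpy as np

SIZE = 40          # side length of a training patch
W, H = 128, 72     # size the frames are resized to

def crop(img, cx, cy):
    """40 x 40 crop centred on (cx, cy), clamped to the image borders."""
    x = int(np.clip(cx - SIZE // 2, 0, W - SIZE))
    y = int(np.clip(cy - SIZE // 2, 0, H - SIZE))
    return img[y:y + SIZE, x:x + SIZE]

def make_examples(path, palm):
    """Return (patch, label) pairs for one labeled frame.
    `palm` is (x, y) in the resized frame, or None for 'no hand'."""
    img = cv2.resize(cv2.imread(path), (W, H))
    examples = []
    if palm is not None:
        examples.append((crop(img, *palm), 1))             # positive example
    # One negative example: a random crop that does not overlap the palm.
    for _ in range(10):
        cx, cy = random.randrange(W), random.randrange(H)
        if palm is None or abs(cx - palm[0]) > SIZE or abs(cy - palm[1]) > SIZE:
            examples.append((crop(img, cx, cy), 0))
            break
    return examples
```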

Now I still need to take more pictures for gestures and write the flight controller. I have already started on gesture recognition; it works halfway okay so far, but I need many more pictures with different gestures. And then I need some hours in a place where I can test the drone, since it tends to crash badly in my flat.
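
The flight controller itself doesn’t exist yet, but the plan is simple: keep the detected hand near the image centre by turning its offset into steering commands. A sketch of that idea, with a completely hypothetical `drone` object standing in for whatever AR Drone client I end up using:

```python
W, H = 128, 72     # frame size the detector works on
DEADZONE = 0.1     # ignore small offsets so the drone doesn't jitter around

def follow_hand(drone, detection):
    """Turn a detection (x, y, score) into simple steering commands.
    `drone` is a hypothetical client with hover/yaw/altitude calls."""
    if detection is None or detection[2] < 0.5:
        drone.hover()                        # no confident hand: hold position
        return
    x, y, _ = detection
    dx = (x - W / 2) / (W / 2)               # horizontal offset in [-1, 1]
    dy = (y - H / 2) / (H / 2)               # vertical offset in [-1, 1]
    if abs(dx) > DEADZONE:
        drone.yaw(dx)                        # turn towards the hand
    if abs(dy) > DEADZONE:
        drone.change_altitude(-dy)           # hand high in the image: climb
    if abs(dx) <= DEADZONE and abs(dy) <= DEADZONE:
        drone.hover()
```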

I don’t know yet when I’ll have time to get back to this, but I hope to finish and make a video about it.

Here is the GitHub repository with my messy code (but none of the images).


An example of how accurate the algorithm is (you can see what kind of hand gesture it was trained on):