Sapfundament - Experiment with neural network for rapidly capturing human expression
Please find the second article in this series here
My interactive artworks involve capturing human expression and using it to animate something non-human. What is human expression when it’s removed from the human body, when a performer or participant is empowered to express themselves with light, sound, architecture and so on?
This ‘weird mirroring’ is a common theme in many interactive arts practices. In my experience the essential task here is: taking an expressive human as input, analysing the expression, and then applying it to something else.
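Breaking it down, that’s three subtasks: input (capturing the expressive human), mapping (analysing the expression) and output (applying it to something else).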
For my time at the Amsterdam choreographic coding lab I want to take a stab at answering a very specific question about the mapping task: can human expression be extracted by a broadly trained neural network? In particular, a neural network trained and generated (relatively) rapidly, so it would be suitable for deployment in a dance rehearsal process. How does this approach compare to the ‘traditional’ process, e.g. coding interaction by hand (logical description)? If the broad neural approach has merit it could be very useful, as I spend hours trying to get a sense of human expression out of code. Maybe the neural network could do it faster.
We’ll also take a swipe at a related philosophical question: ‘Can machines understand human expression?’ Even though I’d never expect to generate anything as concrete as ‘HumanExpression.dat’, it’s often surprising with machine learning to discover that concepts which are quite abstract to humans can be seen clearly as patterns by machines. People often attribute this success to machines somehow becoming more like us, but I think that’s a misleadingly mystical approach. I think the truth is instead that humans have more pattern and logic to our emotional and artistic behaviour than we realise.
As with all art, we are dealing with subjective results here. Beauty is in the eye of the beholder, etc.
The form of this experiment is definitely influenced by my experience of making art; I’d understand if it doesn’t reflect the approach you would take.
Going back to the subtasks
I reckon there are two approaches to the mapping task.
The first is what I’ll call ‘traditional’.
- Looking at the animation above: to tell a machine that some dots should move when a human moves, I’ve got to logically map the dots to the human. This is coding. I might code that when the body is still the dots should be still, and when the body moves the dots vibrate. Some rules might be simple, some could be very complicated. I might also use machine learning in an isolated, logical way, e.g. I train the machine to recognise that specific gesture X generates specific output Y.
- The logical approach is not just between me and the machine. It also applies to working with the performer or participant. If I can tell the machine that jumping generates a sound, I can also tell that to the performer. We have the OPTION of building choreography based on this logic (or vice versa, intending that certain choreography is interpreted logically by the machine).
- The logical approach to mapping also applies to the audience. The audience can see that jumping generates a sound. Of course, if we choose, we can hide the logic from the audience and make rules so complicated they can’t work them out. The point is that if we’re working with coded logical interaction we have the OPTION of showing those logical relationships to the audience.
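To make the ‘traditional’ approach concrete, here is a minimal sketch of the still-body / vibrating-dots rule described above. It is only an illustration: the function names, the threshold and the scaling factor are all assumptions I have made up for the example, not values from a real piece.

```python
import math

# Assumed value: mean joint travel (in metres per frame) below which
# we treat the body as "still". In practice this gets tuned by eye.
STILL_THRESHOLD = 0.02


def body_speed(prev_joints, joints):
    """Mean distance travelled per joint between two frames of (x, y, z) points."""
    total = sum(math.dist(a, b) for a, b in zip(prev_joints, joints))
    return total / len(joints)


def dot_jitter(prev_joints, joints, max_jitter=10.0):
    """Hand-written rule: still body -> still dots, moving body -> vibrating dots."""
    speed = body_speed(prev_joints, joints)
    if speed < STILL_THRESHOLD:
        return 0.0
    # Scale jitter with speed (assumed factor of 100) and cap it.
    return min(max_jitter, speed * 100.0)
```

Every number in there is a decision somebody had to make by hand, and a real piece might stack up hundreds of rules like this one.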
The approach I want to experiment with is offered by neural networks. I want to call it something like the ‘broad description’ method.
- Neural networks are trained with an input and an output, but we don’t have to code or describe logically what the relationships are in the middle.
- This is why neural networks are suited to tasks that defy normal logical description. E.g. it’s very hard to describe what every cat in the world could look like in a photo, but if you train a neural network with enough photos of cats it will become really good at that job.
- I mentioned above that you can use neural networks as part of a ‘traditional’ logical mapping system. This is the case when we train specific neural networks for highly logical, specific tasks, e.g. when gesture X is recognised we turn the stage green. This is more or less just an enhanced traditional process, and whilst it still offers very exciting possibilities it’s not what this experiment is about.
- Instead, what I want to explore is training a neural network with very minimal logical decisions about how the input and the output should map. I propose to simply set up compelling configurations of the performer and match them with compelling outputs in the interactive environment, i.e. only providing it with very broad logic. Then once the training is complete we see what it generates with new inputs (e.g. we dance in front of it and see what happens).
- The question is can we get a sense of human expression in the output using this method? Or will it seem random and disconnected?
- To be clear, the inputs and outputs can have broad logic: we can say that a subjectively neutral performer position should match a subjectively neutral output position. But we aren’t going to do any fine-grained mapping logic. The point of the exercise isn’t about making or not making the decisions, it’s about not having to spend the TIME making those finely grained logic decisions. We describe what we find compelling to the machine by showing it, not by coding thousands of tiny rules.
- If successful it could prove to be an interesting technique for rapidly creating performing environments in the rehearsal room or for standalone installations.
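As a rough illustration of the ‘broad description’ idea, here is a toy neural network written in plain Python. It is a stand-in I wrote for this explanation, not Wekinator’s implementation, and the two pose features, the two output parameters and all the example values are hypothetical. The point is only that we hand over a few pose/output pairs and never write a rule for what happens in between.

```python
import math
import random


class TinyMLP:
    """One hidden layer of tanh units, linear outputs, trained by plain SGD."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds one unit's weights, with its bias as the last entry.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def _forward(self, x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w1]
        out = [sum(w * h for w, h in zip(row, hidden)) + row[-1]
               for row in self.w2]
        return hidden, out

    def predict(self, x):
        return self._forward(x)[1]

    def train(self, examples, epochs=3000, lr=0.1):
        for _ in range(epochs):
            for x, target in examples:
                hidden, out = self._forward(x)
                err = [o - t for o, t in zip(out, target)]
                # Gradient reaching each hidden unit, computed before w2 changes.
                grad_h = [
                    (1 - h * h) * sum(e * row[j] for e, row in zip(err, self.w2))
                    for j, h in enumerate(hidden)
                ]
                for e, row in zip(err, self.w2):
                    for j, h in enumerate(hidden):
                        row[j] -= lr * e * h
                    row[-1] -= lr * e
                for g, row in zip(grad_h, self.w1):
                    for i, v in enumerate(x):
                        row[i] -= lr * g * v
                    row[-1] -= lr * g


# Hypothetical training pairs: [hand_height, arm_spread] -> [brightness, jitter].
examples = [
    ([0.0, 0.0], [0.0, 0.0]),  # neutral pose -> neutral output
    ([1.0, 0.0], [1.0, 0.2]),  # arms raised  -> bright, calm
    ([0.0, 1.0], [0.3, 1.0]),  # arms wide    -> dim, agitated
]

net = TinyMLP(n_in=2, n_hidden=6, n_out=2)
net.train(examples)

# A pose we never showed it: the network decides what this should look like.
print(net.predict([0.5, 0.5]))
```

Whether the in-between results feel expressive or just arbitrary is exactly the subjective question this experiment is asking.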
For input we will use the 25 skeleton joints from a Kinect2 sensor. This is a reasonably common sensor in the interactive arts world.
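One way to turn a skeleton frame into numbers a network can use is to flatten the 25 joints into a single list of 75 values. The sketch below assumes each joint arrives as an (x, y, z) tuple, and it subtracts a reference joint so the features don’t depend on where the performer is standing. Both the function name and the choice of reference joint (`root_index=0`) are my assumptions for illustration.

```python
JOINT_COUNT = 25  # skeleton joints per tracked body


def joints_to_features(joints, root_index=0):
    """Flatten 25 (x, y, z) joints into 75 floats, relative to a reference joint.

    root_index is a hypothetical choice of reference joint; pick whichever
    joint makes sense as the "centre" of the body in your setup.
    """
    if len(joints) != JOINT_COUNT:
        raise ValueError(f"expected {JOINT_COUNT} joints, got {len(joints)}")
    rx, ry, rz = joints[root_index]
    features = []
    for x, y, z in joints:
        features.extend((x - rx, y - ry, z - rz))
    return features
```

Making the features position-independent is itself a small logical decision, so it may or may not be in the spirit of the experiment. Feeding in the raw coordinates instead is just as valid.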
For mapping we will use Wekinator by Dr. Rebecca Fiebrink.
Wekinator is free and available for Mac, Windows and Linux.
It allows simple OSC IO for working with neural networks.