Systems are becoming more and more human. We are gradually endowing them with the ability to do some of what we do so that we no longer have to. A part of that involves giving them some version of the five senses. (If they get a sixth sense, we’re in trouble.)
Arguably the most complex of those senses is the ability to see. Actually, seeing is no big deal: add a camera and you’re there. It’s making sense out of what you see that’s so dang difficult. Work on vision technology has presumably gone under the radar for years in the spooks-and-spies world (either that or Hollywood has been lying to us, which couldn’t possibly be the case). The computation required has kept it on beefy machines even as it came out of the shadows and into plain view.
But now we want our little devices to be able to act like they have their own visual cortexes (not to be confused with ARM Cortexes, although one might be involved in the implementation of the other). Which means not just computation, but computation with performance and low power. In a small footprint. For a low price. No problem.
The topic of embedded vision is the explicit charter of the recently formed Embedded Vision Alliance, a group that had its first public conference in conjunction with Design East in Boston last month. Various players in this space, all members of the alliance, covered different aspects of the state of the art – discussions that largely laid out challenges rather than ready solutions.
Many technology areas have their seminal moment or quasi-religious touchstone. For semiconductors it’s Moore’s Law. For touch and other sensor technologies (just to mention a few), it’s the iPhone. Embedded vision has its own such moment of revelation: the Kinect from Microsoft.
While the Kinect was preceded by numerous other sophisticated interactive gaming systems, it was the first to deliver interactive gaming on a large scale using only vision – no accelerometers or other motion detectors. Not only did it bring embedded vision into the mainstream, it did so at a reasonable cost. And, once the system was hacked, it became a garage-level platform for experimentation.
So the Kinect is vision’s iPhone (with less salivating and swooning). And, just as chip presentations must reference Moore’s Law, and anything relating to phones has to reference the iPhone, the Kinect is the point of departure for folks in the Embedded Vision Alliance.
It would appear that vision technology can be bifurcated at a particular level of abstraction. Below that point are found relatively well-understood algorithms for such things as face recognition or edge detection. These algorithms are often compute-intensive – or, worse yet, memory-bandwidth-intensive. Not that there’s no more work to do here; presumably people will keep coming up with new ideas, but much of the work is in optimizing the performance of these algorithms on various hardware architectures.
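To make the "compute-intensive, memory-bandwidth-intensive" point concrete, here is a minimal sketch of edge detection in plain Python. It assumes a Sobel operator and a simple threshold, neither of which is named by any of the presenters; the function name `sobel_edges` and the tiny test frame are made up for illustration. Note that every output pixel reads nine input pixels and does eighteen multiply-accumulates, which is where the bandwidth and compute go once you scale this to real frames at video rates.

```python
# Sobel kernels (an assumed choice; no specific operator is named in the article).
KX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
KY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_edges(image, threshold):
    """Return a grid of 0/1 flags marking pixels whose gradient magnitude,
    approximated as |gx| + |gy|, meets the threshold. Border pixels stay 0."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            # Nine reads and eighteen multiply-accumulates per output pixel.
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += KX[dy][dx] * pixel
                    gy += KY[dy][dx] * pixel
            if abs(gx) + abs(gy) >= threshold:
                edges[y][x] = 1
    return edges


# Hypothetical 5x6 grayscale frame: dark on the left, bright on the right.
frame = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edge_map = sobel_edges(frame, threshold=255)
```

In this toy frame the only flagged pixels are the two interior columns that straddle the dark-to-bright boundary. Deciding what that boundary actually is remains a separate problem.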
Above this level, things become a lot foggier. This is the realm of high-level interpretation. You’ve identified lots of edges in a frame: so what? What do they mean? This is where the world becomes much less strictly algorithmic and more heuristic. And it’s where a lot of original research takes place. Those two edges that intersect: are they part of the same structure? Is one in front of the other? Is one or both of them moving? Is the ambient light creating shadows that could be misinterpreted as objects in their own right?
While one could argue as to where this point of separation between the algorithmic and the heuristic is, there’s a de facto dividing line that seems to have settled in a convenient place: the OpenCV library. This is a highish-level API and library of routines that takes care of the algorithmic bits that have reasonably solid implementations. Those routines then become the building blocks for the fuzzier code doing the high-level work.
While OpenCV forms a convenient rallying point, and while it abstracts away a lot of compute-intensive code, it’s no panacea. The library was developed with desktop (or bigger) machines in mind. So, for instance, it requires a C++ compiler – something you’re not likely to see much of in the creation of deeply embedded systems. It’s been developed on the Intel architecture; adapting it to smaller or different embedded architectures will require significant optimization. And many routines rely on floating-point math, a capability missing in many embedded systems.
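For a sense of what living without floating point looks like, here is a minimal sketch of one common workaround: scale the coefficients into integers (fixed point) and shift the result back down. The example converts RGB to grayscale using the widely used 0.299/0.587/0.114 luma weights, picked here as an assumption for illustration; it is not taken from OpenCV's source, and the function names are hypothetical.

```python
# Q8 fixed-point weights: each luma coefficient multiplied by 256 and rounded.
# The coefficients themselves (0.299, 0.587, 0.114) are an assumed example.
W_R, W_G, W_B = 77, 150, 29  # sums to 256, so pure white still maps to 255


def gray_float(r, g, b):
    """Reference version using floating-point math."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def gray_fixed(r, g, b):
    """Integer-only version: multiply, add, shift. No FPU required."""
    # Adding 128 (0.5 in Q8) rounds to nearest before the shift drops 8 bits.
    return (W_R * r + W_G * g + W_B * b + 128) >> 8


# Largest disagreement between the two versions over a coarse sweep of colors.
levels = range(0, 256, 15)
worst_error = max(
    abs(gray_fixed(r, g, b) - gray_float(r, g, b))
    for r in levels for g in levels for b in levels
)
```

Over this sweep the integer version stays within one gray level of the floating-point reference, which is typically good enough for 8-bit pixels. The catch is that someone has to do this kind of conversion, routine by routine, and check the precision each time.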
One of the participating companies, Videantis, has taken the idea of OpenCV as a dividing line one step further: they’ve built hardware acceleration IP that operates at the level of the OpenCV API. This allows them to optimize the implementation of many of the OpenCV routines while letting designers write algorithm code using OpenCV that, in a manner of speaking, requires no porting.
While the guys in white coats work on the intelligent algorithms in one room, guys in greasy dungarees are off in the next room trying to figure out the best hardware for running these things. And numerous presentations pointed to the need for a heterogeneous structure for doing this. That means that work can be distributed between a standard CPU, a highly parallel GPU, a highly efficient DSP, and an FPGA – or some partial combination thereof.
The need to identify such an architecture is reflected in the Heterogeneous System Architecture (HSA) initiative. In fact, in a separate discussion, Imagination Technologies, one of the HSA founding companies, indicated that pure multicore has run its course – it doesn’t scale beyond the kinds of quad-core engines you see now. That doesn’t necessarily square with the many-core efforts underway, but it’s hard to argue that a single core replicated many times is the optimal answer for all problems. And if you take power into account – critical for embedded and mobile applications – then you pretty much have to tailor the engine to the problem for efficiency.
And power isn’t the only constraint. We saw vision subsystems packed into a cube 1” on a side. And then there’s the price question. The Kinect has thrown down the gauntlet here with a rough $80 bill of materials (BOM). That’s a high-volume, consumer-oriented device. The first of its kind. Which means that cost has one way to go: down. These days, systems are getting closer to $50, but the next generation of systems will need to target a BOM of $15-20 in order to achieve extremely high volumes.
All of which brings the promise of lots of machine eyes all around us. Watching our every move. Wait, I’m starting to freak myself out here – I’m sure they’ll be used only for good. And much of that could probably already be deployed by the Qs and James Bonds of the world if they wanted to. This will move the all-seeing capability out of the hands of only the government and into the corporate and consumer mainstream. OK, now I’m really freaking out. [Deep breaths… oooooooooommmmmm – wait, I guess that, by law, that would be ooooooohhhhhhhhmmmmm in Silicon Valley] As usual, promise and peril going hand-in-hand. Ample fodder for further discussion as events warrant.
Do you have plans for embedded vision? What are some of the less-obvious challenges you see?