In this episode I have a conversation with Filip Piękniewski, a researcher working on computer vision and AI at Koh Young Research America. His adventure with AI started in the 90s, and since then a long list of experiences at the intersection of computer science and physics has led him to the conclusion that deep learning might be neither sufficient nor appropriate to solve the problem of intelligence, specifically artificial intelligence. I read some of his publications and got familiar with some of his ideas. Honestly, I was attracted by the fact that Filip does not buy the hype around AI, and deep learning in particular. He doesn’t seem to share the vision of folks like Elon Musk, who claimed that we are going to see an exponential improvement in self-driving cars, among other things (he actually said that before a Tesla drove over a pedestrian).
So, you think deep learning is a hyped technology. When did you start thinking so?
I have a somewhat complex love and hate relationship with deep learning. I was raised as a neural networks person in the pre-deep-learning era and I’ve always been fascinated by this technology. However, one question always bothered me: how much do these models actually have to do with the biological brain? At some point I decided to study neuroscience. Long story short, after reading a sizable load of neuroscience papers and implementing a bunch of biologically detailed models of the cortex, I realized how idealized and somewhat detached from neuroscientific reality our machine learning models are. I also learned to appreciate the incredible capabilities of the biological brain. Another illuminating experience was to study the field of robotics and learn first hand how difficult it is to equip a robot with even the most rudimentary cognitive abilities. When the deep learning revolution started to catch wind in 2012, I was torn between excitement and the realization that what I had learned was in no way addressed in the deep learning models. So I knew fairly quickly that these models would be useful for many things, but at the same time it was clear to me that they are not the solution to autonomy and robotics.
I wonder what your thoughts are about AGI (artificial general intelligence) 🙂
In all seriousness, I think we lack a definition. As embarrassing as it may sound, this field has progressed for years under the flag of artificial intelligence without any satisfactory definition of what it is. The only piece of definition was the Turing test, which essentially states: if you can fool humans into thinking it is intelligent, then it is intelligent. So for over 50 years we have had a contest of fooling people into thinking that some stuff is intelligent, primarily to extract money in the form of grants and investments. But this is more similar to con artistry or magic show engineering than it is to science.
There is actually an emerging field right now in physics which attempts to view intelligence as a purely physical attribute. In that setting intelligence is an attribute of organisms that allows them to extract energy from the environment to support their survival and reproduction. Such a definition (I stated it in an extremely condensed way) is really useful, as it “dehumanises” intelligence and accommodates a variety of intelligence levels shared by the various animals inhabiting this planet (which are unquestionably intelligent, yet all of them would have failed the Turing test). These are still quite early attempts, but this looks very promising, as it grounds intelligence in the broader picture of thermodynamics and the physics of complex systems.
Well, many of the promises of deep learning haven’t happened yet. One of them is definitely the promise of autonomous vehicles. To be fair, we should say that the brain of a self-driving car is not merely deep neural networks. What do you think is the major fault in the field?
I could summarize it very quickly but then it will take a while to explain: the fault is that we assume that we can make a bunch of assumptions about the environment. In other words, we think that driving is like a game: if we code or learn all the rules along with some finite number of special cases, it will be fine. This is not the case. Driving is an open environment, and moreover an open environment full of other intelligent agents, some of whom might be hostile. What compounds the problem is that with driving, human life is at stake, so the bar for safety and liability is set extremely high. The nightmare of every autonomous vehicle engineer is the realization that once the car is fully autonomous, it will eventually reach a set of events completely outside of the domain it has been trained in. And since these “tail events” cannot be anticipated in advance, neither can the car’s behavior or the consequences of such an encounter. Neither the symbolic methods of good old-fashioned AI nor deep learning is immune to this “outside the domain” problem. The fact of the matter is, we don’t know how to build a system that can handle the long tail of complex and unpredictable edge cases of the real world. And that is the central problem of autonomy.
You remind me a lot of Nassim Taleb and his idea of learning from tail events. And yes, a lot of current statistical models do not take tails into account, and neither does deep learning. Speaking more about how all this is perceived by the media, I found that when researchers announced ImageNet was solved, the media confused it with computer vision. When they announced the successes of deep learning and RL on AlphaGo, Atari games and Dota 2, the media confused it with high level intelligence. What do you think will happen next?
It is very hard to predict the future. There could still be some spectacular success stories in deep learning, but I think the majority of people who invested money in AI will now be expecting real practical advancements. So no more games (such as Go or Dota), but rather things like an autonomous coast-to-coast drive and so on. I am skeptical as to whether this “next level” of AI wonders can be delivered. At some point the patience of investors will run out and funding for the field will collapse. I actually think there will be quite a few real winners with deep learning, e.g. in industrial inspection. But these are side bets; the central bet is on autonomy.
Do you think Google and Facebook are rethinking AI and their research in deep learning? Why?
Google and Facebook have so far done an outstanding job making use of deep learning technology, and in particular commercializing it. That said, the use case that they applied it to is not the problematic mission-critical context, but the far more forgiving general improvement of Internet search and user experience (such as targeted advertising). The difference is huge: when one out of 1,000 images that a Google search returns is completely bogus, nothing terrible happens. If a self-driving car makes one flawed decision after driving even 100,000 miles, the consequences could easily be fatal. For Google and Facebook, I think the majority of the low-hanging fruit is now gone and there could be a general confusion as to where to go next.
If we look at how the compute requirements of deep learning architectures have increased in the last five years, we get up to about a 300,000-fold increase in training compute (measured in flop/s-days, a count of floating point operations) with respect to the models of five years earlier. This means that recent neural networks have several orders of magnitude more parameters to train. Are these models several orders of magnitude more powerful than previous ones?
This is an important question. I should start by saying that scalable machine learning is the holy grail that everyone has been waiting for. Pre-deep-learning methods were horribly unscalable, so when deep learning came along promising the ability to train larger and larger neural networks, everyone got justifiably excited. But deep learning did not entirely solve the problem of scalability. The core problem of training a perceptron – the vanishing gradient problem – has been improved upon, but has not been eliminated completely. We cannot just create a 100-layer perceptron (without bypass connections such as those in ResNet) and expect it to converge. We cannot just scale AlexNet or VGG by multiplying the number of feature maps by 100 and expect better results, or even convergence in any reasonable amount of time. There are subtle ways in which we can make impressively large instances, but this is not easy; it takes research and a lot of experience. So we have certainly progressed, but this is not the holy grail of scalability everybody has been waiting for.
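As an illustration of the vanishing gradient problem mentioned in this answer, here is a minimal Python sketch. It is not taken from the interview: the chain of one-neuron sigmoid layers, the fixed weight of 1.0 and the helper name `gradient_scale` are all assumptions chosen to keep the arithmetic visible. It only shows how the chain rule multiplies many small factors together in a plain deep stack without bypass connections.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gradient_scale(depth: int, weight: float = 1.0, x: float = 0.5) -> float:
    """Magnitude of d(output)/d(input) for a chain of `depth` one-neuron sigmoid layers."""
    grad = 1.0
    activation = x
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: each layer multiplies the gradient by w * sigmoid'(z),
        # and sigmoid'(z) = s(z) * (1 - s(z)) is never larger than 0.25.
        grad *= weight * activation * (1.0 - activation)
    return abs(grad)


# The gradient reaching the first layer shrinks geometrically with depth.
for depth in (1, 5, 20, 100):
    print(f"depth={depth:3d}  gradient magnitude={gradient_scale(depth):.3e}")
```

In this toy setting every layer scales the gradient by at most 0.25, so the signal that reaches the first layer of a 100-layer stack is vanishingly small. Real networks are not this simple, but the multiplicative effect is the same one that bypass connections are meant to counter.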
Not to mention that deep learning relies heavily on large datasets to learn complex patterns that might be useful for real use cases. If we think about the classic example of classifying cats and dogs, the brain of a baby doesn’t need a million images. Probably fewer than five are enough to understand that’s a cat and not a dog. What are deep learning folks doing wrong on the matter?
Humans and animals learn mostly unsupervised. They just interact with things, observe them and somehow make sense of how objects work. By the time children begin to verbalize and learn labels for things, they already have very good representations of many attributes of those things. So when a child learns that something is a cat, it already knows it is an animal, furry, with a pair of eyes and spiky ears. A deep net trained on thousands of images of cats never picks up these seemingly obvious attributes. Hence deep nets don’t generalize like humans do. This diagnosis is generally acknowledged by most of the top tier scientists in the field. The 2015 Nature paper on deep learning by LeCun, Bengio and Hinton – the founding fathers of deep learning – essentially finished with a paragraph on that. Yann LeCun has recently been giving talks in which he compares unsupervised learning to the cake, supervised learning to the icing on the cake and reinforcement learning to the cherry. The problem is, we have the cherry and the icing, but we are missing the cake.
One of the most naive yet most tried strategies by deep learning practitioners is to make a neural model more complex. As a matter of fact, by making a network 1000x bigger we don’t get 1000x better predictions. Also, a much bigger model needs a much, much bigger training dataset (which means gigabytes and gigabytes). This brings us to the problem of scalability. Is there a solution to that?
This goes back to unsupervised learning. Labeled data is expensive, but unlabeled data is plentiful and essentially free. If we had a vision system that could just watch YouTube and learn all it needs to learn from there, we’d be much better off. This is currently a hot topic, so-called unsupervised pre-training or semi-supervised learning. However, the models people try to pre-train are the same ones they use for inference of the label. My message is slightly more subtle – data is not just statistics. There are dynamics that lead to the generation of that data. We should have that in mind when we design our models, and make them in such a way that they can learn the dynamics/physics of the process that generated the data. This is what I think is going on in the brain and this is what allows the brain to generalize in “physically meaningful” directions.
In fact, a very common strategy practitioners use is so-called end-to-end training, which treats data as statistical events and which we can summarize as: collect tons of images and their labels, learn the image-label mapping, tune the model. What’s wrong with this approach?
From a purely statistical point of view there is nothing wrong with this approach. However, we have to be careful what kind of association we expect the system to make. The way humans classify e.g. objects in a visual scene is quite complex: it is based on affordances, cultural context, scene context, social context and a myriad of other things. We should ask a question – can any system learn that sort of semantics purely from the samples we provide? A system which is deprived of the knowledge of dynamics, embodiment, any idea about physics and environment? Deep learning shows us that even systems trained in such a deprived reality can actually obtain impressive results, but when pushed to the limit these systems turn out to be brittle and susceptible to failure. We should not fool ourselves that the system we train, even on a set of a million images, can infer from these images all the priors which led the humans who labeled the dataset to label it in a particular way.
Not to forget that all validations of such models have been performed on data that have the same statistical distribution as the training data. As soon as the distribution of the testing data diverges from that of the training data… models stop performing 🙂 When I started with data science (actually with mathematical statistics), my old professor kept saying: get very, very, very high dimensional input data, and you are very, very, very likely to find some funny correlation. Then I found a website (Ref.) that collected ridiculous but effective (spurious) correlations, and I had the proof that many times we can find in data whatever we want to. With this said, I think that many people are confusing spurious correlations with “intelligence”. It seems like you have a different approach to machine learning: an approach that predicts the entire perceptual input along with the labels, in order to make a system that can extract the semantics of the world (rather than spurious correlations). What is it about?
Yes, this property of high dimensional data is called the curse of dimensionality. Unfortunately we cannot visualize what, say, 100 or 10,000 dimensional data looks like, but my hunch is that it resembles a massively convoluted fractal more than the nicely separated Gaussian clouds we often see in statistics lectures. An interesting phenomenon in deep learning, namely susceptibility to so-called adversarial examples, indicates that deep learning models latch onto those spurious correlations rather than true semantic relations. I conjecture that those true semantic relations are encoded in a very complex manifold, while what we get from deep learning is a very crude separation of the space.
The only way I think we could learn this complex manifold is to constrain our decision boundaries a lot more. A label alone is just several bits; with an input of 10,000 dimensions this leaves an astronomical number of ways to make a separation into categories. Prediction of the input (aka a predictive encoder), on the other hand, has the same “label” dimension as the input. This constrains the decision boundary a lot more. It forces the network to choose “dynamically predictive” representations, which in turn forces it to represent something about the generating process, rather than cherry picking things that happen to correlate with the label.
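The point about high dimensional inputs and a label that carries very little information can be sketched with a small Python experiment. This is an illustration under stated assumptions, not something from the interview: every feature is pure noise, the sample count (20), the feature counts (10 and 2000), the seed and the helper names are arbitrary choices.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(features, labels):
    """Largest absolute correlation between any single feature and the labels."""
    return max(abs(pearson(column, labels)) for column in features)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 20
labels = [rng.gauss(0, 1) for _ in range(n_samples)]
# Every feature is independent noise: none of them has anything to do with the labels.
features = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(2000)]

narrow = strongest_spurious_correlation(features[:10], labels)
wide = strongest_spurious_correlation(features, labels)
print(f"best of   10 noise features: |r| = {narrow:.2f}")
print(f"best of 2000 noise features: |r| = {wide:.2f}")
```

With only a handful of samples and thousands of candidate dimensions, some noise feature will almost always line up with the labels by chance. A learner that is constrained only by the label has no way to tell such a feature apart from a meaningful one.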
How different is PVM from a series of autoencoders and lower-dimension autoencoders?
PVM (predictive vision model) is a concept, not so much a single model but an entire family of models based on several core assumptions. In essence it is a set of associative memory units, organized roughly in a hierarchy with feedback, where each unit is tasked with predicting its own input (one or a few steps ahead), compressing the representation of this prediction and sharing it with other units, either as a primary signal for downstream units, or as context for anyone else interested (lateral or upstream). It is currently implemented with a large number of shallow perceptrons, but I also recently implemented a version based on Generalized Hebbian Learning with Neural Gas quantization. I’m not sure which implementation will work best, but I think that if the idea is sound it should generally work independently of how exactly the associative memory is implemented. In that way, PVM is completely different from deep learning, which is almost exclusively limited to a single learning algorithm (backpropagation) and a particular supervised training paradigm.
How does this approach improve scalability?
PVM is a collection of units, each of which is trained independently and locally. There are no global training signals, hence there is no vanishing gradient. The architecture can be scaled both vertically and horizontally, and the range of lateral interactions (context and feedback) can be liberally varied. Either way everything converges very well. This freedom to wire connectivity resembles what is found in the biological cortex. Each unit can receive a massive amount of feedback connectivity, and will automatically select whatever is predictive of its primary signal.
One interpretation of PVM is that it is a massive recurrent neural network with tons of feedback both across time and space, and that it is self-stabilising. It can also be viewed as a deeply nested Simple Recurrent Neural Network (Elman Network). The structure of the network itself resembles a self-similar fractal, which I think is very interesting.
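The idea of units that are trained locally on their own prediction error can be sketched in a few lines of Python. This is a heavily simplified illustration and not the PVM implementation: the `PredictiveUnit` class, the linear delta-rule update, the sine-wave signals and the learning rate are all assumptions, and the sketch leaves out the compression step, the hierarchy and most of the feedback described above.

```python
import math


class PredictiveUnit:
    """Linear unit that predicts its own next input from its input plus optional context."""

    def __init__(self, input_dim, context_dim=0, learning_rate=0.1):
        self.lr = learning_rate
        self.weights = [[0.0] * (input_dim + context_dim) for _ in range(input_dim)]

    def predict(self, signal, context=()):
        features = list(signal) + list(context)
        return [sum(w * f for w, f in zip(row, features)) for row in self.weights]

    def learn(self, signal, next_signal, context=()):
        """One local delta-rule update; returns the squared prediction error."""
        features = list(signal) + list(context)
        prediction = self.predict(signal, context)
        errors = [target - p for target, p in zip(next_signal, prediction)]
        for row, err in zip(self.weights, errors):
            for j, f in enumerate(features):
                row[j] += self.lr * err * f
        return sum(e * e for e in errors)


def signal_at(t, phase=0.0):
    """Toy two-dimensional sensory signal: a point moving around a circle."""
    return [math.sin(0.3 * t + phase), math.cos(0.3 * t + phase)]


unit_a = PredictiveUnit(input_dim=2)
unit_b = PredictiveUnit(input_dim=2, context_dim=2)
errors_a, errors_b = [], []
for t in range(500):
    # unit_a shares its current prediction with unit_b as context.
    context = unit_a.predict(signal_at(t))
    # Each unit is updated only from its own prediction error; nothing is backpropagated between units.
    errors_a.append(unit_a.learn(signal_at(t), signal_at(t + 1)))
    errors_b.append(unit_b.learn(signal_at(t, 1.0), signal_at(t + 1, 1.0), context))

print(f"unit_a error: first step {errors_a[0]:.3f}, last step {errors_a[-1]:.2e}")
print(f"unit_b error: first step {errors_b[0]:.3f}, last step {errors_b[-1]:.2e}")
```

Because every update uses only quantities available inside the unit, adding more units or more context connections does not lengthen any gradient path. That is the property the answer above credits for the absence of a vanishing gradient.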
When you apply such a method to computer vision, how robust is this to adversarial examples?
Adversarial examples are typically derived from the gradient in a classification task. Here we have an architecture that is massively recurrent and does not have any particular category it is looking for (no single readout), but rather tries to predict the values of all its inputs. Hence the typical way of deriving adversarial examples does not apply. That being said, when the system is trained with an additional readout for a task, such as object tracking, it may get fooled by a similar-looking or similarly behaving object, just like humans or animals. I expect that certain adversarial examples that work for humans (optical illusions) could cause similar hallucinations in PVM, but I have not had the resources to study this in detail.
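For readers unfamiliar with how adversarial examples are derived from the gradient of a classifier, here is a minimal Python sketch of one well-known recipe, the fast gradient sign method, applied to a toy logistic classifier. The classifier weights, the input and the value of `epsilon` are hypothetical numbers chosen for illustration; nothing here is specific to PVM or taken from the interview.

```python
import math


def predict_probability(weights, bias, x):
    """Probability of class 1 under a logistic (single-readout) classifier."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm_perturb(weights, bias, x, true_label, epsilon):
    """Move every input coordinate by epsilon in the direction that increases the loss."""
    p = predict_probability(weights, bias, x)
    # For logistic regression with cross-entropy loss, dLoss/dx_i = (p - y) * w_i.
    gradient = [(p - true_label) * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, gradient)]


weights = [0.5] * 100  # hypothetical "trained" classifier with many small weights
bias = 0.0
clean = [0.1] * 100    # clean input, confidently assigned to class 1
adversarial = fgsm_perturb(weights, bias, clean, true_label=1, epsilon=0.2)

print(f"clean input:       P(class 1) = {predict_probability(weights, bias, clean):.3f}")
print(f"adversarial input: P(class 1) = {predict_probability(weights, bias, adversarial):.3f}")
```

Each coordinate moves by only 0.2, yet the small changes add up across 100 dimensions and flip the decision. The attack needs a single readout whose loss gradient can be followed, which is why the answer above notes that it does not carry over directly to a system that predicts all of its inputs.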
This is very interesting. Even if that were the case, it means that PVM is clearly emulating the biological brain better than deep learning. While creating a predictive model of the sensory input seems to work for vision, how does this new approach generalize to language models?
I think language is far down the road; I’m more concerned with early perception. But I think this general approach is promising. OpenAI had a study some time ago where they trained a neural network on character prediction (treating text as a sequence of characters) and later found out that internal representations of that network could be decoded to inform sentence sentiment (https://blog.openai.com/unsupervised-sentiment-neuron/). For me early sensory processing is a lot more interesting, for example how multiple modalities could facilitate cognition. It is known, for example, that we (humans) perceive the consistency of auditory-visual signals. The McGurk effect is one salient example (https://en.wikipedia.org/wiki/McGurk_effect) which comes to mind, but there are others. Roughly speaking, in the McGurk effect one can change auditory percepts by tweaking only the visual part of the content (the way in which the lips of the speaking person move). This is a very impressive illusion; if you want to have a look, just type McGurk effect in YouTube. Anyway, this goes to show how vision and audition strongly interact in the human brain to generate a coherent perceptual model. PVM can relatively easily accommodate such phenomena by cross-modal prediction, or cross-modal feedback. It could probably be incorporated in a deep learning model as well, but this is yet another model for yet another specific application/effect. PVM by its very design seems to accommodate a lot of these things naturally, hence I think I’m onto something.
And I do too. I’d like to close this episode by quoting you on the deep learning hype and the way researchers and practitioners should really behave. You said: “[Gary Marcus] behave[s] like a real scientist, as most so called ‘deep learning stars’ just behave like cheap celebrities.”
I should say that I believe that science progresses one funeral at a time. There is a bit of a conflict in every scientist – on one hand we have to have the persistence to push our agenda even against headwinds because we believe in some idea. On the other hand we should remain very critical of the idea at all times and be ready to put it to rest whenever somebody or something falsifies it. I think that criticism is lacking in many people in this field. There is some amount of false humility, but there is also a huge amount of hubris. I mention Gary Marcus, but there are plenty of other great scientists who stand against this arrogance: Rodney Brooks, founder of iRobot and an incredibly accomplished roboticist and AI researcher, is the next great example. Another is Benjamin Recht from UC Berkeley, who just recently published a paper in which he studies reproducibility on the CIFAR-10 dataset. Judea Pearl, Michael Jordan, Ali Rahimi and Douglas Hofstadter are the next big names that come to mind. There is a growing number of these voices, and all they call for is reason and healthy skepticism. I certainly don’t consider myself the enemy of deep learning. On the contrary, I use it all the time. But I oppose being naive about it and overselling its capabilities by making outrageous statements about it being the solution to general AI.
- AI winter is well on its way https://blog.piekniewski.info/2018/05/28/ai-winter-is-well-on-its-way/
- Predictive vision http://blog.piekniewski.info/2016/11/04/predictive-vision-in-a-nutshell/
- The Moravec paradox https://en.wikipedia.org/wiki/Moravec%27s_paradox
- Spurious correlations http://www.tylervigen.com/spurious-correlations
- Learning from tail events https://medium.com/@nntaleb/strength-training-is-learning-from-tail-events-7aa2c074569d
- Unsupervised Sentiment Neuron https://blog.openai.com/unsupervised-sentiment-neuron/
- McGurk effect https://en.wikipedia.org/wiki/McGurk_effect