16 Comments
Kurt:

Thank you. I bought your book

Alberto Gonzalez:

And this newsletter is like an infinite book :)

Tivadar Danka:

Thanks, it's great to hear! Let me know what you think!

The one 15:

I can honestly say we can only hide math for so long. Trigonometry is the cornerstone of all mathematics: without it, you couldn't navigate space, and that's a fact. The ancients were smart and showed how math was really applied. If we truly engage with trig, we can showcase what math truly is: it is not magic, just imagination and the way we can move geometrically in our minds.

Prithvi Singh:

Hi, is there a way to purchase your book in India? It is available, but at an exorbitant price, probably because the seller is importing it.

Thank you for the post

suman suhag:

The first, and most important, thing to realize about deep learning is that it is not a “deep” subject; it is a very “shallow” topic with almost no theory underlying it. There are no guarantees of convergence (we are, after all, talking about nonlinear optimization in high-dimensional spaces) and no performance guarantees of any kind (compared, say, to what you get with other areas of machine learning, like kernel methods or sparse linear models). It’s essentially like woodworking without physics: mix this type of polish with that kind of wood and you get this sort of effect. The reason there invariably has to be a future beyond deep learning is that one cannot build a solid engineering science of machine learning with bricks made of hay. As Vladimir Vapnik once said, “The most practical thing in the world is good theory,” and that is currently not available in deep learning. If deep learning is the best the machine learning community can do, then, as a card-carrying member of this research community for over 30 years, I’d have to say we are in serious trouble!

Let’s take just one example: the current rage over generative adversarial networks, or GANs. There are well over 500 papers on this topic and almost three dozen variants of GANs, with more appearing every week. However, there are barely any papers that show 1) whether GANs converge reliably when trained (the original GANs do not!), 2) what the sample complexity of GANs is (no one knows), or 3) what GANs can and cannot do. As far as I know, there are only one or two papers that attempt to give a theory of GANs, including a particularly nice paper by Sanjeev Arora and colleagues, which is largely a negative result: it shows that the original GAN model does not converge, but that a modified multiple-generator/multiple-discriminator model might converge, in a very weak sense. Yet this has not dampened any of the excitement about this model; far from it.
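
For readers who want the precise object under discussion: the original GAN is trained on a two-player minimax game between a generator G and a discriminator D, written here in standard notation.

```latex
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

The convergence questions above are about whether alternating gradient updates on this objective reach an equilibrium at all.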

There’s also a collective loss of a sense of reality when people get excited about models like GANs. These models take thousands upon thousands of iterations to converge (when they do, and often they don’t), and each iteration requires many passes through the data. At the end of the day, you burn through millions of CPU cycles, and you have to wonder, after all that energy: is the game worth the candle? Where is all this energy getting us? Is it leading us toward a solid, scientifically grounded theory of unsupervised learning? The vast majority of GAN papers are largely empirical, showing cute pictures of what a variant of GAN can do, but the metrics are often either nonexistent or somewhat artificial.

So, many of us in the field do indeed look forward to a life beyond deep learning, where we can not only build impressive, empirically validated learning systems but also have a solid theory underlying them.

If you want an example of a truly “deep” science, look no further than this year’s Nobel Prize for the design of the LIGO detectors, capping 100 years of effort to detect the gravitational waves predicted by Einstein’s general relativity. We can now detect collisions between black holes two billion light-years away, each releasing more energy in a single collision than all the stars in the observable universe. A very substantial amount of nontrivial mathematics went into building the LIGO detectors and into the advances in general relativity.

That’s what a true “deep” learning theory should look like. I am confident that machine learning will get there one day, but it will take many years of effort, and physicists give us an inspiring example of what can be achieved.

suman suhag:

According to the Global State of Enterprise Analytics report, more than 60% of enterprise organizations are using big data and analytics to drive process and cost efficiency, as well as strategy and change.

Accordingly, leading enterprises are investing in this technology to drive innovation: gathering new data, combining external and internal information, and using big data analytics to outpace competitors. These leaders don’t just adopt analytics solutions and one-off insights; they take them to the next level by embedding analytics in several ways, such as:

Building a quantitative, data-driven culture.

Making analytics part of every role.

Promoting accessibility and data quality.

Using analytics tools effectively and efficiently to drive innovation across the business.

suman suhag:

Provocative question! Having spent at least 25 years studying RL, ever since my first real job at IBM Research, where from 1990–93 I explored using methods like Q-learning to teach robots new tasks, I’ve watched the field through its various phases. In the early 1990s, when I got involved, it was restricted to a small handful of aficionados. In 1995, I organized the first National Science Foundation workshop on RL, to which about 50–60 senior researchers were invited.
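
For readers who haven’t seen it, tabular Q-learning boils down to a single update rule. Here is a minimal sketch; the `env` object and its `reset()`/`step()` interface are hypothetical stand-ins, not anything from the comment itself:

```python
import numpy as np

# Minimal tabular Q-learning sketch. The update rule is
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
# `env` is a hypothetical environment with reset() -> state and
# step(action) -> (next_state, reward, done).

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])   # temporal-difference update
            s = s_next
    return Q
```

The point of showing it is how little the agent starts with: the entire prior knowledge is a table of zeros.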

Gradually, through the early 2000s, the field gained popularity, but it never seemed to become a mainstream research topic within ML. Then, wham! DeepMind combined deep learning with RL, applied it to the visually appealing domain of Atari video games, and (deep) RL’s popularity went through the roof. Now it seems all the rage, and certainly many employers are hiring (in the Bay Area, it’s an area sought after by some of the labs working on autonomous driving). Google paid half a billion euros for DeepMind (supposedly!) on the basis of their deep RL Atari demo. So this looked like a real turning point, and RL came to life!

So, getting back to the question: is RL a “dead end”? In answering this provocative question, one has to clarify one’s point of view. Certainly, from the standpoint of the work going on at DeepMind and elsewhere on using deep RL to play games like Go or chess, or to train against an accurate simulator of the world for a self-driving car, RL is poised to become a well-established technology, and its popularity is only going to increase. RL sessions at major AI and ML conferences are very well attended, and RL submissions are definitely increasing. In all these dimensions, RL is very much not at a “dead end”; in fact, its popularity is only growing.

But, but… you knew there was a but coming!

When you impose on RL the goal of “online learning in real time from the real world,” rather than running millions of simulation steps in which agents can be killed thousands of times with no penalty, I fear RL is very much at a dead end. It is not clear to me that any extension of the au courant deep RL methods is going to lead to success in the real world, in the sense of a physical agent that can learn in real time from a small number of examples.

That is, if your goal is to build a model of how humans learn complex skills, such as driving, then RL is to me a very poor explanation of how such skills are acquired. One has only to look at the comparative results reported in the AAAI 2017 paper by Tsividis et al., comparing random humans on Amazon Mechanical Turk with the best deep RL programs on Atari video games, to see where deep RL simply flounders. Humans learn Atari games like Frostbite about 1000x faster than the fastest deep RL methods.

A typical human learns Frostbite in about a minute, from a few hundred examples at most. DQN and other deep RL programs take days and millions of examples. It’s not even close; the difference in learning speed is like another galaxy. Looking at this paper, I don’t see any way to close such a large gap with the incremental tweaking of deep RL methods reported annually in ICML or NIPS papers (of which I review a bunch each year, hoping against hope to see a new idea emerge, only to be disappointed!).

So, what’s to be done to “rescue RL”? I’m not sure there is really a solution out there. I, for one, have stopped believing that we learn complex skills like driving through something resembling “pure RL” (that is, from rewards alone). Humans learn to drive because they in fact already “know” how to drive before they ever try it once. They have seen their parents, friends, lovers, Uber drivers, and so on drive many, many times, and they have watched thousands of hours of driving behavior in movies. So when they finally get behind the wheel, they instinctively “know” what driving means, even though, of course, they have never actually controlled a physical car before. There remains that all-important “last mile” of actual driving that needs to be learned.

But since the driving program is largely already in place, built up by many thousands of hours of observation, not to mention active instruction by a driving teacher or an anxious parent, what needs to be “learned” are a few control parameters that tell the human brain how much to turn the wheel or press the brake and, more importantly, where to look on the road. This is of course not trivial, which is why humans take a few weeks to get comfortable behind the wheel. But if you look at the actual hours of practice, humans learn to drive in a few hundred hours; for those paying for driving instruction, this is expensive, since you are charged by the hour.

It is also important to remember that when you impose the condition of learning in the real world, there can be no cheating! Unlike the ridiculously simplified 2D world of Atari video games like Enduro, where actions are limited to a few discrete choices, humans must drive in the full 3D real world. They have the huge task of controlling both legs, both hands, the neck, the body, and so on, many hundreds of continuous degrees of freedom, while coping with the immense sensory space of stereo vision and binaural hearing.

The only way humans can learn to drive in a few hundred hours is that we already almost know how to drive: we have a fully working vision system, so we can read signs and recognize cars and pedestrians, and our hearing system recognizes sirens, alerts, horns, and so on. If you look at the immensity of the whole driving task, I would claim that more than 95% of the driving knowledge is already in place, and only the small remaining part has to be acquired through practice. This is the only explanation for how humans learn a skill as complex as driving in a few hundred hours. There is NO magic here.

So, in that sense, pure (deep) RL seems like a dead end. The pure (deep) RL problem formulation no longer holds much interest for me. What is needed in its place is a richer model of how learning happens: combining observation, transfer learning, and many other kinds of behavior cloning from observed demonstrations, and only then taking that knowledge and improving it with some actual trial-and-error RL.

One can generalize this to other modes of learning as well. The late Richard Feynman, arguably the most influential physicist since the Second World War, taught a classic introductory course at Caltech, which led to probably the best-selling college textbook of all time, The Feynman Lectures on Physics (still selling almost 60 years later, in its nth edition). When he looked at how students handled his problem sets, Feynman was ultimately disappointed. He realized that even the extremely bright students at Caltech could not “learn” physics simply by sitting in his class and absorbing his lectures. So he ended the preface to the textbook with a disappointed conclusion, quoting Gibbon (a line I memorized long ago):

“The power of instruction is seldom of much efficacy, except in those happy dispositions where it is almost superfluous”.

I realized the wisdom of this saying after spending two decades or more teaching machine learning to graduate students at several institutions. It seems almost paradoxical, but what Gibbon is saying, and what Feynman and I both discovered, is that teaching only works when the learner “almost already knows” the subject.

But this is precisely what the various theoretical formulations of ML predict must be the case: there is no “free lunch” in learning. DeepMind’s DQN network takes millions and millions of steps to learn an apparently trivial (to humans) task like Frostbite because, initially, DQN knows nothing. Humans, in contrast, learn Frostbite in under a minute because they have spent many, many hours building the background needed to learn it so quickly (e.g., vision, hand-eye coordination, general game-playing strategies).

Unfortunately, the prevailing currents in the field, at venues like NeurIPS (formerly NIPS), ICML, and AAAI, tend to glorify knowledge-free learning, so you end up with hundreds, if not thousands, of (deep) RL papers in which agents take millions of time steps to learn apparently simple tasks. To me, this approach is ultimately a “dead end” if your goal is to develop a computational model of how humans learn.

suman suhag:

The best way, in my view, to understand a field is to understand why the field exists in the first place. Why do we need a field like machine learning? In short, what problems does it solve, and why?

Let’s start with an analogy, something you do practically every morning: you wake up and get ready to go to work. What problems do you need to solve? For one, you need to put on some clothes to protect your body from the weather and your feet against the rough surfaces you might encounter. You need to perhaps cover your head with a hat or a scarf and protect your eyes with sunglasses against the harsh rays of the sun. These are the problems we need to solve in getting dressed.

Algorithms are like clothes and shoes, hats and scarves and sunglasses, continuing the analogy from above. You could wear sneakers, dress shoes, or high heels. You could wear a T-shirt, a dress shirt, a full length skirt and so on. Clothes and shoes are ways to solve the problem of dressing up for work. Which clothes you wear and what shoes you put on may vary, depending on the occasion and the weather. Similarly, which machine learning algorithm you use may depend on the problem, the data, the distribution of instances etc. The lesson from the fashion industry is quite apt and worth remembering. Problems never change (you always need something to cover your feet), but algorithms change often (new styles of clothes and shoes get created every week or month). Don’t waste time learning fashionable solutions when they will become like yesterday’s newspaper. Problems last, algorithms don’t!

There’s a tendency these days, unfortunately, to recommend universal solutions to machine learning (e.g., learn TensorFlow and code up every algorithm as stochastic gradient descent on a deep neural net). To me, this makes about as much sense as wrapping yourself in your bedsheets to go to work. Sure, it covers most of your body and could probably do the job, but it’s a one-size-fits-all approach that shows neither style nor taste, nor any understanding of the machine learning (or dressing) problem.

The machine learning community has spent over four decades trying to understand how to pose the problem of machine learning. Start by understanding a few of these formulations, and resist the temptation to view every machine learning problem through a single simplified lens (like supervised learning, one of dozens of ways of posing ML problems). The major categories include unsupervised learning, the most important, followed by reinforcement learning (learning by trial and error, the most prevalent in children after unsupervised learning), and finally supervised learning (which occurs rather late, because it requires labels and language, which young children mostly lack in their early years). Transfer learning is growing in importance, since labeled data is expensive and hard to collect for every new problem. There’s lifelong learning, online learning, and so on. One of the deepest and most interesting areas of machine learning is the theory of probably approximately correct (PAC) learning. This fascinating area studies how we can guarantee that a machine learning algorithm will work reliably, or will produce a sufficiently accurate answer. Whether you understand PAC learning or not tells me whether you are an ML scientist or an ML engineer.
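
To make the PAC idea concrete, here is one textbook form of such a guarantee, for a finite hypothesis class in the realizable setting: with probability at least 1 − δ, any hypothesis consistent with the training data has true error at most ε, provided the number of training examples m satisfies

```latex
m \;\ge\; \frac{1}{\varepsilon}\left(\ln\lvert\mathcal{H}\rvert + \ln\frac{1}{\delta}\right).
```

Richer versions replace the ln|H| term with a capacity measure such as the VC dimension, but the flavor is the same: accuracy and reliability are bought with data.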

The most basic formulation of machine learning, and the one that gets short shrift in many popular expositions, is learning a “representation.” What does this even mean? Take the number “three.” I could write it using three strokes, III, or as 11, or as 3. These correspond to the unary, binary, and decimal representations. The latter was invented in India more than 2000 years ago. Remarkably, the Greeks, for all their wisdom, never discovered the use of zero and never invented positional decimal notation. Claude Shannon, the famed inventor of information theory, popularized binary representations for computers in a famous master’s thesis at MIT in the 1930s.
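
A tiny sketch of the same number in three different representations (plain Python, nothing more):

```python
n = 3
print("unary:  ", "1" * n)     # one symbol repeated n times, like the three strokes III
print("binary: ", bin(n)[2:])  # '11'
print("decimal:", str(n))      # '3'
```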

What does it mean for a computer to “learn” a representation? Take a selfie and imagine writing a program to identify your face (or your spouse’s, or your pet’s) from the image. The phone stores the image in one representation (usually something like JPEG, which encodes blocks of pixels in a discrete cosine basis, a close relative of the Fourier basis). It turns out this basis is a terrible representation for machine learning. There are many better representations, and new ones get invented all the time. A representation is like the material that makes up your dress: there’s cotton and polyester and wool and nylon, and each has its strengths and weaknesses. Similarly, different representations of the input data have their pros and cons. Resist the temptation to view one representation as superior to all the others.
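
To make “same image, different representation” concrete, here is a minimal sketch using SciPy; the 8x8 random array is a stand-in for a real photo, and the cosine (DCT) basis stands in for the JPEG-style representation mentioned above:

```python
import numpy as np
from scipy.fft import dctn, idctn

# A toy 8x8 grayscale "image" (a stand-in for a real photo). Changing its
# representation does not change its content, only the coordinate system.
rng = np.random.default_rng(0)
image = rng.random((8, 8))

coeffs = dctn(image, norm="ortho")            # pixel basis -> cosine (DCT) basis
reconstructed = idctn(coeffs, norm="ortho")   # and back again

print(np.allclose(image, reconstructed))      # True: same information, different basis
```

Learning a representation means finding a transformation like this, except tuned so that the task (recognizing your face) becomes easy in the new coordinates.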

Humans spend most of their day solving sequential tasks (driving, eating, typing, walking, etc.). All of these require making a sequence of decisions, and learning such tasks involves reinforcement learning. Without RL, we would not get very far. Sadly, all textbooks of ML ignore this most basic and important area, to their discredit. Fortunately, there are excellent specialized books that cover this area.

Let me end with two famous maxims about learning a topic from the legendary physicist Richard Feynman. First: “What I cannot create, I do not understand.” What he meant was that unless you can recreate an idea or an algorithm yourself, you probably haven’t understood it well enough. Second: “Know how to solve every problem that has already been solved.” This second maxim is about making sure you understand what has been done before. For most of us these are hard principles to follow, but to the extent you can follow them, you will find your way to complete mastery over any field, including machine learning. Good luck!

suman suhag:

Hidden Markov Models can be used to generate a language, that is, to list elements from a family of strings. For example, if you have an HMM that models a set of sequences, you can generate members of this family by listing sequences that fall into the group of sequences being modelled.
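
To make the generation point concrete, here is a minimal sketch of sampling sequences from a toy two-state HMM; every probability below is made up for illustration:

```python
import numpy as np

# Toy 2-state HMM over 3 symbols; all matrices are invented for illustration.
start = np.array([0.6, 0.4])            # initial state distribution
trans = np.array([[0.7, 0.3],           # P(next state | current state)
                  [0.2, 0.8]])
emit = np.array([[0.5, 0.4, 0.1],       # P(symbol | state)
                 [0.1, 0.3, 0.6]])

def sample_sequence(length, rng=None):
    """Generate one member of the 'language' this HMM defines."""
    rng = rng or np.random.default_rng(0)
    state = rng.choice(2, p=start)
    symbols = []
    for _ in range(length):
        symbols.append(int(rng.choice(3, p=emit[state])))
        state = rng.choice(2, p=trans[state])
    return symbols

print(sample_sequence(10))
```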

Neural networks take an input from a high-dimensional space and map it to a lower-dimensional space (how a network performs this mapping depends on its training, its topology, and other factors). For example, you might take a 64-pixel image of a digit and map it to a true/false value describing whether the digit is a 1 or a 0.

While both methods can (or can at least try to) discriminate whether an item is a member of a class, neural networks cannot generate a language as described above.

There are alternatives to Hidden Markov Models: for example, you might use a more general Bayesian network, a different topology, or a stochastic context-free grammar (SCFG) if you believe the problem lies in the HMM’s lack of power to model it, that is, if you need an algorithm that can discriminate between more complex hypotheses and/or describe the behaviour of much more complex data.

What is hidden and what is observed: The thing that is hidden in a hidden Markov model is the same as the thing that is hidden in a discrete mixture model, so for clarity, forget about the hidden state’s dynamics and stick with a finite mixture model as an example. The “state” in this model is the identity of the component that caused each observation. In this class of models, such causes are never observed, so “hidden cause” is translated, statistically, into the claim that the observed data have marginal dependencies that are removed when the source component is known; the source components are estimated to be whatever makes this statistical relationship hold. The thing that is hidden in a feedforward multilayer neural network with sigmoid middle units is the states of those units, not the outputs, which are the target of inference. When the output of the network is a classification, i.e., a probability distribution over possible output categories, the values of these hidden units define a space within which the categories are separable. The trick in learning such a model is to construct a hidden space (by adjusting the mapping out of the input units) within which the problem is linear; consequently, nonlinear decision boundaries are possible from the system as a whole.
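
A concrete version of the “hidden space in which the problem is linear” point, using hand-set threshold units on the classic XOR problem (the weights are chosen by hand rather than learned, and step units stand in for the sigmoids mentioned above):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])          # XOR: not linearly separable in the input space

step = lambda z: (z > 0).astype(int)

# Two hand-set hidden units: h1 ~ OR(x1, x2), h2 ~ AND(x1, x2)
H = step(X @ np.array([[1, 1], [1, 1]]) + np.array([-0.5, -1.5]))

# In the hidden space (h1, h2), a single linear unit now separates the classes.
out = step(H @ np.array([1, -1]) - 0.5)
print(out, (out == y).all())        # [0 1 1 0] True
```

Training a real network amounts to discovering a hidden mapping like H automatically instead of setting it by hand.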

Generative versus discriminative: The mixture model (and the HMM) is a model of the data-generating process, sometimes called a likelihood or “forward model.” When coupled with assumptions about the prior probabilities of each state, you can infer a distribution over possible values of the hidden state using Bayes’ theorem (a generative approach). Note that, while called a “prior,” both the prior and the parameters of the likelihood are usually learned from data. In contrast, the neural network learns a posterior distribution over the output categories directly (a discriminative approach). This is possible because the output values were observed during estimation, so there is no need to construct a posterior from a prior and a specific likelihood model such as a mixture. The posterior is learned directly from data, which is more efficient and less model-dependent.
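
A minimal sketch of the generative route for a two-component 1-D Gaussian mixture (all parameters made up): the posterior over the hidden component comes from combining the prior with the forward model via Bayes’ theorem.

```python
import numpy as np
from scipy.stats import norm

# Two-component 1-D Gaussian mixture; all parameters below are made up.
prior = np.array([0.3, 0.7])                       # P(z = k), the "prior" over components
means = np.array([-2.0, 1.0])
stds = np.array([1.0, 0.5])

x = 0.2                                            # one observed data point
likelihood = norm.pdf(x, loc=means, scale=stds)    # forward model: p(x | z = k)
posterior = prior * likelihood
posterior /= posterior.sum()                       # Bayes' theorem: p(z = k | x)

print(posterior)   # inferred distribution over which hidden component generated x
```

A discriminative model would instead learn the mapping from x to this posterior directly, without ever writing down the forward model.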

Mix and match: To make things more confusing, these approaches can be combined, e.g. when the mixture-model (or HMM) state is sometimes actually observed. When that is true (and in some other circumstances not relevant here), it is possible to train discriminatively in an otherwise generative model. Similarly, it is possible to replace the emission model of an HMM with a more flexible forward model, e.g. a neural network.

suman suhag:

It can definitely be OK, but it depends on what you're trying to do and what "reality" is (i.e., what the most correct answer is). Adding variables that aren't needed won't help your model (particularly your estimates), but it also might not matter much (e.g., for predictions). However, removing variables that are real, even if they don't meet significance, can really mess up your model.

Here are a few rules of thumb:

Include the variable if it is of interest beforehand, or if you want a direct estimate of its effect. If your business collaborators say to put it in, put it in. If they're looking for estimates of the holiday effects, put it in (although there might be some debate about whether you should look at each holiday individually).

Include the variable if you have prior knowledge that it should be relevant. This can be misleading, because it invites confirmation bias, but I'd say in most cases it makes sense. Particularly for holiday effects (I assume this is something like sales or energy consumption), which are well known and documented, the small but not-statistically-significant effects are real.

In general practice (i.e., most real-world situations), it's better to have a slightly overspecified model than an underspecified one. This is particularly true for prediction, because predictions of the response Y remain unbiased. This rule is very conditional, but the other bullets that favor overspecification tend to be more common in practice, especially in the business/applied world, which brings it back to the second point about business experience.

If you want a model that generalizes to many cases, favor fewer variables. An overfitted model may work, but only for a narrow inference space (i.e., the one reflected by your sample).

If you need precise (low-variance) estimates, use fewer variables.

Just to re-emphasize: these are rules of thumb, and there are plenty of exceptions. Judging by the limited information you've provided, you probably should include the non-significant "holiday" variable.

I've seen many saturated models (every term included) that perform extremely well. This isn't always the case, but it works because, in a lot of business problems, reality is a complex response (so you should expect many variables to matter), and adding all these variables introduces no statistical bias. Less relevant to this question, but relevant to this answer: "big data" also harnesses the law of large numbers and the central limit theorem.

Variable selection is a long and complicated topic. Look up the drawbacks of underspecification versus overspecification, while remembering that the "right" model would be best, but is unachievable. Determine whether your interest is in the mean or in the variance. There's a lot of focus on variances, especially in teaching and academia, but in practice and in most business settings, people are more interested in the mean! This goes back to why overspecification should probably be favored in most real-world cases.
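
As a small illustration of the overspecification point, here is a hedged simulation sketch with statsmodels: a "holiday" dummy with a real but modest effect often fails to reach significance in a short sample, yet the model that keeps it is the better-specified one. Everything here (effect sizes, variable names, sample size) is invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated example: a real but modest "holiday" effect in a short sample.
rng = np.random.default_rng(0)
n = 120
trend = np.arange(n) / n
holiday = (rng.random(n) < 0.1).astype(float)           # roughly 10% of days are holidays
y = 5 + 2 * trend + 0.4 * holiday + rng.normal(0, 1, n)

X_full = sm.add_constant(np.column_stack([trend, holiday]))
X_reduced = sm.add_constant(trend)

full = sm.OLS(y, X_full).fit()
reduced = sm.OLS(y, X_reduced).fit()

print("holiday coefficient p-value:", round(full.pvalues[2], 3))   # often above 0.05 here
print("AIC, full vs reduced:", round(full.aic, 1), round(reduced.aic, 1))
```

Rerunning with different seeds shows the flavor of the trade-off: the holiday term is frequently "non-significant" even though the data-generating process genuinely contains it.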

Agrim Joshi:

Thank you for sharing this!

PS: I love math even more when it is taught by you and Dr. Mike Cohen.

Alberto:

Thank you for this!
