Well explained 💯🔥

I like the differential equations view. People tend to add a disclaimer that it's the "small step size limit", but it turns out this approximation works quite well even for large step sizes in high dimensions. The step size is bounded by the largest eigenvalue, but if most of the mass comes from the remaining dimensions, the "large" step size is actually quite small for them. I did some visualizations on this a while back: https://machine-learning-etc.ghost.io/gradient-descent-linear-update/
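
A rough sketch of what I mean, assuming a quadratic loss with a made-up diagonal Hessian (one large eigenvalue, 999 small ones). The spectrum, the starting point, and the helper name `gd_vs_flow_losses` are all illustrative choices, not something taken from the article. It compares the loss along gradient descent with the loss along the exact gradient flow solution at the same times.

```python
import numpy as np

def gd_vs_flow_losses(eigs, x0, lr, steps):
    """Loss of f(x) = 0.5 * sum(eigs * x**2) along GD and along gradient flow.

    With a diagonal Hessian both trajectories have closed forms per coordinate:
      gradient descent: x_k  = (1 - lr * eig)**k * x_0
      gradient flow:    x(t) = exp(-eig * t) * x_0, sampled at t = k * lr
    """
    eigs = np.asarray(eigs, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    k = np.arange(steps + 1)[:, None]        # step index as a column for broadcasting
    x_gd = (1.0 - lr * eigs) ** k * x0       # shape (steps + 1, d)
    x_flow = np.exp(-lr * eigs * k) * x0     # shape (steps + 1, d)

    def loss(x):
        return 0.5 * np.sum(eigs * x**2, axis=1)

    return loss(x_gd), loss(x_flow)

# Hypothetical spectrum: one dominant eigenvalue, the rest 100x smaller.
d = 1000
eigs = np.concatenate(([1.0], np.full(d - 1, 0.01)))
x0 = np.ones(d)

# "Large" step: half of the stability limit 2 / lambda_max.
lr = 1.0 / eigs.max()

gd_loss, flow_loss = gd_vs_flow_losses(eigs, x0, lr, steps=50)
rel_gap = np.max(np.abs(gd_loss - flow_loss) / flow_loss)
print(f"largest relative gap between GD and flow loss: {rel_gap:.4f}")
```

Along the top eigendirection the two trajectories disagree badly (gradient descent zeroes it in one step), but that direction carries only a small share of the loss here, so the two loss curves stay close.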

Really cool visualizations, thanks for sharing!

I think there's a typo in the derivative definition: isn't the limit supposed to go to zero?

Thanks, you are correct! Fixing it now.

I didn't get how you arrived at the equations for `monotonicity describes long-term behavior`.

This is not a trivial step; it involves an argument using the Picard-Lindelöf theorem about the existence and uniqueness of solutions to initial value problems (IVPs). The gist is, if x(t) is, say, increasing and bounded, then it must have an asymptotic limit. In turn, the limit must be an equilibrium solution, as guaranteed by the Picard-Lindelöf theorem.

I omitted the technical details, because this is quite an intuitive result, and the Picard-Lindelöf theorem is way out of the article's scope.
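
If a numerical check helps, here is a small sketch on a stand-in example. I am assuming the logistic equation x' = x(1 - x) with x(0) = 0.1, which is not necessarily the equation from the article, and a hand-rolled RK4 integrator (`rk4_trajectory` is just an illustrative helper). The solution is increasing and bounded by 1, and the value it settles at is a point where the right-hand side vanishes, i.e. an equilibrium.

```python
def f(x):
    # Right-hand side of the assumed example ODE x' = x * (1 - x).
    return x * (1.0 - x)

def rk4_trajectory(f, x0, dt, steps):
    """Integrate the autonomous ODE x' = f(x) with the classical RK4 scheme."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        xs.append(x)
    return xs

# Integrate up to t = 30, long enough for the solution to flatten out.
xs = rk4_trajectory(f, x0=0.1, dt=0.01, steps=3000)

increasing = all(b >= a for a, b in zip(xs, xs[1:]))  # monotone along the trajectory
bounded = max(xs) <= 1.0                              # never passes the equilibrium at 1
print("increasing:", increasing)
print("bounded by 1:", bounded)
print("x(30):", xs[-1], " f(x(30)):", f(xs[-1]))
```

This only illustrates the statement on one example; the actual proof still goes through the argument above.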