7 Comments

Well explained 💯🔥

I like the differential equations view. People tend to add a disclaimer that it only holds in the "small step size limit", but it turns out this approximation works quite well even for large step sizes in high dimensions. The step size is bounded by the largest eigenvalue, but if most of the mass comes from the remaining dimensions, the "large" step size is actually quite small relative to them. I did some visualizations on this a while back: https://machine-learning-etc.ghost.io/gradient-descent-linear-update/
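
For anyone who wants to poke at this numerically, here is a minimal sketch, not taken from the linked post. It assumes a toy quadratic loss with a made-up spectrum `lam_i = 1/i` (largest eigenvalue 1) and a start point of all ones; `losses` and `rel_gap` are hypothetical helper names. It compares the loss after `k` gradient descent steps of size `h` with the loss of the continuous gradient flow at time `t = k*h`.

```python
import math

def losses(d, h, k):
    """Loss after k gradient-descent steps vs. gradient flow at time t = k*h.

    Toy quadratic f(x) = 0.5 * sum(lam_i * x_i**2), started at x = (1, ..., 1),
    with an assumed spectrum lam_i = 1/i (so the largest eigenvalue is 1).
    """
    lams = [1.0 / i for i in range(1, d + 1)]
    # Gradient descent acts on each coordinate independently:
    # x_i <- (1 - h * lam_i) * x_i, repeated k times.
    gd = 0.5 * sum(lam * (1.0 - h * lam) ** (2 * k) for lam in lams)
    # Exact gradient-flow solution: x_i(t) = exp(-lam_i * t) with t = k * h.
    flow = 0.5 * sum(lam * math.exp(-2.0 * lam * h * k) for lam in lams)
    return gd, flow

def rel_gap(d, h, k):
    """Relative difference between the discrete and continuous losses."""
    gd, flow = losses(d, h, k)
    return abs(gd - flow) / flow

# Same "large" step size h = 1 / lam_max in low and high dimension.
for d in (2, 1000):
    print(d, rel_gap(d, h=1.0, k=50))
```

Under these assumptions the gap should be large for `d = 2` and small for `d = 1000`, since in the second case most of the remaining loss sits in directions where `h * lam_i` is tiny. A different spectrum could behave differently.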

Really cool visualizations, thanks for sharing!

I think there's a typo in the derivative definition: isn't the limit supposed to go to zero?

Thanks, you are correct! Fixing it now.

I didn't follow how you arrived at the equations in the `monotonicity describes long-term behavior` part.

This is not a trivial step; it involves an argument using the Picard-Lindelöf theorem on the existence and uniqueness of solutions to initial value problems (IVPs). The gist is that if x(t) is, say, increasing and bounded, then it must have an asymptotic limit. In turn, the limit must be an equilibrium solution, as guaranteed by the Picard-Lindelöf theorem.

I omitted the technical details, because this is quite an intuitive result, and the Picard-Lindelöf theorem is way out of the article's scope.
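
A quick numerical sanity check of the idea, as a sketch rather than a proof. It assumes the logistic equation x' = x(1 - x) as a stand-in example (not necessarily the equation from the article) and uses a plain forward Euler integrator; `euler_logistic` is a hypothetical helper name. The solution starting at 0.1 is increasing, stays below 1, and settles at x = 1, which is exactly where the right-hand side vanishes.

```python
def euler_logistic(x0, h, steps):
    """Forward Euler for x' = x * (1 - x), returning the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x + h * x * (1.0 - x))
    return xs

xs = euler_logistic(x0=0.1, h=0.01, steps=2000)  # integrates up to t = 20

increasing = all(a <= b for a, b in zip(xs, xs[1:]))
bounded = all(x <= 1.0 for x in xs)
print(increasing, bounded, xs[-1])
```

The limit 1 is an equilibrium because f(1) = 1 * (1 - 1) = 0. The trajectory never crosses it, which is the uniqueness part of the argument showing up numerically.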
