Why does gradient descent work?
Well explained 💯🔥
I like the differential equations view. People tend to add a disclaimer that it's the "small step size limit", but it turns out this approximation works quite well even for large step sizes in high dimensions. The step size is bounded by the largest eigenvalue, but if most of the mass comes from the remaining dimensions, the "large" step size is effectively quite small for them. I did some visualizations on this a while back: https://machine-learning-etc.ghost.io/gradient-descent-linear-update/
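Here's a rough toy sketch of what I mean (my own minimal NumPy example, not code from the post): gradient descent on a diagonal quadratic with one sharp eigenvalue and many small ones, compared against the exact gradient-flow solution at matched times. With the step size near the 2/λ_max stability bound, the sharp direction decays quickly either way, and the loss along both trajectories is dominated by the small-eigenvalue directions, where η·λ is tiny and the discrete update tracks the flow closely. The eigenvalues, step size, and iteration counts below are all made up for illustration.

```python
import numpy as np

# Toy setup: work directly in the eigenbasis, so A = diag(eigs) and both the
# discrete and continuous trajectories have closed forms (no explicit loop).
rng = np.random.default_rng(0)
d = 500
# One sharp eigenvalue, the rest small: most of the loss "mass" sits in the small directions.
eigs = np.concatenate(([100.0], rng.uniform(0.01, 1.0, d - 1)))
x0 = rng.normal(size=d)

eta = 1.9 / eigs.max()        # "large" step, just under the 2/lambda_max stability bound
ks = np.arange(201)[:, None]  # iteration counts, as a column for broadcasting

def loss(x):
    # f(x) = 0.5 * x^T A x with A = diag(eigs)
    return 0.5 * np.sum(eigs * x**2, axis=-1)

# Discrete gradient descent per direction: x_k = (1 - eta*lambda)^k * x0.
gd = (1.0 - eta * eigs) ** ks * x0
# Continuous gradient flow at matched times t = k*eta: x(t) = exp(-lambda*t) * x0.
flow = np.exp(-eigs * eta * ks) * x0

for k in (10, 50, 100, 200):
    print(f"k={k:3d}  GD loss={loss(gd[k]):.4f}  flow loss={loss(flow[k]):.4f}")
```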
I think there's a typo in the derivative definition: isn't the limit supposed to go to zero?
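For reference (just stating the standard definition, not quoting the post), the increment in the limit should go to zero:

$$
f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}
$$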
I didn't follow how you arrived at the equations in the `monotonicity describes long-term behavior` part.