4 Comments
Tivadar Danka

That's great! Especially the new language one. Although that also got the logical chain wrong.

SorenJ

You really should be using reasoning models for all of your tests here. The tests you are showing are at this point ~6 months out of date and a lot has happened in that time frame.

Tivadar Danka

That's good to know. I'm planning a follow-up and will use reasoning models there.
