o3 can easily solve the reasoning tasks:
https://chatgpt.com/share/680eb4b9-da30-800e-9fb0-2897e460aeec
https://chatgpt.com/share/680eb58e-aab0-800e-9792-d32ac492d68d
And the abstraction tasks:
https://chatgpt.com/share/680eb613-e890-800e-99d0-f99efa522040
https://chatgpt.com/share/680eb653-0928-800e-b735-5ac3335a7ffb
That's great, especially the new language one, although that one also got the logical chain wrong.
You really should be using reasoning models for all of your tests here. The tests you are showing are at this point ~6 months out of date and a lot has happened in that time frame.
That's good to know. I'm planning a follow-up and will use reasoning models there.
Thanks for this interesting post. I think that in the reasoning test the LLM is hindered by the tokenization. I got different results when prompting 'abbbbaa' versus 'a b b b b a a' (Gemini 2.5 Flash).
'abbbbaa': https://g.co/gemini/share/077d1fc5bd42
'a b b b b a a': https://g.co/gemini/share/c96fb99fac62
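For what it's worth, you can see the effect with any BPE tokenizer. A minimal sketch below uses tiktoken's o200k_base encoding as a stand-in (Gemini's actual tokenizer isn't public): 'abbbbaa' collapses into a couple of multi-character tokens, while 'a b b b b a a' keeps each letter as its own token, which makes character-level reasoning much easier for the model.

    import tiktoken

    # Stand-in tokenizer (o200k_base); Gemini's own tokenizer is not public,
    # so this only illustrates the general BPE behavior, not Gemini's exact splits.
    enc = tiktoken.get_encoding("o200k_base")

    for prompt in ["abbbbaa", "a b b b b a a"]:
        tokens = enc.encode(prompt)
        pieces = [enc.decode([t]) for t in tokens]
        print(f"{prompt!r}: {len(tokens)} tokens -> {pieces}")

The spaced-out version costs more tokens, but each character ends up individually addressable, so counting and pattern-matching over the string becomes a much easier task.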