I'm not an expert on this, but my feeling is that we have in some ways reached "Peak LLMs" already. LLMs are inherently limited by their probabilistic nature: they just estimate which word is likely to come next, given what was written before. These systems have no clue whether what they generate is true or complete BS.
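To make the point concrete, here's a minimal sketch of that next-word loop (the vocabulary, the toy `model` function, and its random scores are all made up for illustration, not how any real LLM is implemented):

```python
import numpy as np

# Toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["Paris", "London", "banana", "is", "the", "capital", "of", "France", "."]

def model(context):
    # A real LLM would score every token using billions of learned
    # parameters; here we just return random scores for illustration.
    rng = np.random.default_rng(len(context))
    return rng.normal(size=len(VOCAB))

def sample_next(context, temperature=1.0):
    logits = model(context) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # The next word is drawn from a probability distribution over the
    # vocabulary -- nothing here checks whether the continuation is true.
    return np.random.choice(VOCAB, p=probs)

context = ["the", "capital", "of", "France", "is"]
for _ in range(3):
    context.append(sample_next(context))
print(" ".join(context))
```

The whole pipeline is just "score the candidates, pick one, repeat" -- there is no step where the output is checked against reality.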
It's amazing how far this gets you, but I think these systems need to be augmented by forms of symbolic reasoning to make real progress.
One reason why ChatGPT et al. look so good at mathematical tasks is that most tasks we can think of are not genuinely new, but just variants of tasks for which many solutions exist in the training set. If you ask them something genuinely new, they usually fail.