Researchers develop an automated benchmark for language-base