Discussion about this post

Dmitry:

I looked for prior research: https://chatgpt.com/share/e/6853eb75-9f34-8008-b4aa-029eee48ab33

FuseLLM https://www.superannotate.com/blog/fusellm tried a similar approach, although it is not focused on writing alone. I wonder why it didn't go further and why we still use a single model in most cases.

WETT https://www.typetone.ai/blog/wett-benchmark seems to be the closest attempt at defining 'quality'. I wonder if we could use something like it to show that a multi-model setup beats a single model. Unfortunately, Typetone doesn't seem to publish their dataset or exact assessment formulas, but perhaps there is a similar open benchmark we could use?

The ROUGE metric seems to be the most common in the industry. Perhaps a dataset with source texts and high-quality reference summaries could be used: apply ROUGE and see how close the multi-model summaries come to the human-created references.
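
Roughly, that evaluation could look like this - a minimal sketch assuming the open-source `rouge-score` package, with `multi_model_summarize` as a hypothetical stand-in for whatever multi-model pipeline is under test:

```python
# Sketch: compare generated summaries to human references with ROUGE.
# Assumes `pip install rouge-score`; the summarize function is hypothetical.
from rouge_score import rouge_scorer

def evaluate(pairs, summarize):
    """pairs: list of (source_text, human_reference) tuples."""
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for source, reference in pairs:
        candidate = summarize(source)  # e.g. the multi-model pipeline
        scores = scorer.score(reference, candidate)
        for key in totals:
            totals[key] += scores[key].fmeasure
    # Average F-measure per ROUGE variant over the dataset.
    return {key: value / len(pairs) for key, value in totals.items()}

# Usage: run single-model and multi-model pipelines on the same dataset
# and compare the averages, e.g.:
#   single = evaluate(dataset, single_model_summarize)
#   multi = evaluate(dataset, multi_model_summarize)
```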

https://github.com/lechmazur/writing - this is an interesting approach where seven LLMs grade each story on 16 questions. And it is open source, so we can reproduce it. I wonder how multi-model writing would stack up there.
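
The core grading loop is simple to reproduce in spirit - a rough sketch below, where `query_judge` is a hypothetical wrapper around whichever LLM APIs you have access to (the repo ships its own harness), and the judge list and rubric are placeholders:

```python
# Sketch of an LLM-as-judge loop in the style of lechmazur/writing:
# several judge models score a story against a fixed rubric, and the
# final grade is the mean over all judges and questions.
from statistics import mean

JUDGES = ["model-a", "model-b", "model-c"]  # the repo uses seven judges
RUBRIC = [
    "Is the plot coherent from start to finish?",
    "Are the characters distinct and believable?",
    # ... the repo uses 16 questions in total
]

def query_judge(judge: str, prompt: str) -> float:
    """Hypothetical: send `prompt` to `judge`, parse back a 1-10 score."""
    raise NotImplementedError

def grade_story(story: str) -> float:
    scores = []
    for judge in JUDGES:
        for question in RUBRIC:
            prompt = f"{question}\nAnswer with a score from 1 to 10.\n\nStory:\n{story}"
            scores.append(query_judge(judge, prompt))
    return mean(scores)
```

Running `grade_story` over stories from a single-model and a multi-model pipeline would give directly comparable averages.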

Ermanno Attardo:

A useful snapshot of today's models' capabilities and how to combine them.
