Discussion about this post

Dmitry:

I looked for prior research: https://chatgpt.com/share/e/6853eb75-9f34-8008-b4aa-029eee48ab33

FuseLLM https://www.superannotate.com/blog/fusellm tried a similar approach, although they are not focused on writing alone. I wonder why it didn't go further and why we are still using a single model in most cases.

WETT https://www.typetone.ai/blog/wett-benchmark seems to come closest to defining 'quality'. I wonder if we could use something like it to show that a multi-model setup beats a single model. Unfortunately, Typetone doesn't seem to publish their dataset or exact assessment formulas, but perhaps there is a similar open benchmark we could use?

The ROUGE metric seems to be the most common in the industry. Perhaps a dataset with source texts and high-quality summaries could be used: apply ROUGE and see how close the multi-model summaries come to the human-written references - see the sketch below.
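
A minimal sketch of how that could work, using the open-source rouge-score package (pip install rouge-score). The reference and candidate strings here are placeholders for real dataset entries:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Placeholder texts: in a real run, `reference` is the human-written summary
# from the dataset and `candidate` is the multi-model pipeline's output.
reference = "The committee approved the budget after a short debate."
candidate = "After a brief debate, the committee passed the budget."

# ROUGE-1 measures unigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # (target, prediction)

for name, s in scores.items():
    print(f"{name}: precision={s.precision:.3f} recall={s.recall:.3f} f1={s.fmeasure:.3f}")
```

Comparing the average F1 over such a dataset for single-model vs. multi-model summaries would give at least a crude quantitative comparison.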

https://github.com/lechmazur/writing - this is an interesting approach where seven LLMs grade each story on 16 questions. And it is open source, so we can reproduce it. I wonder how multi-model writing would stack up there.
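
For illustration, a rough sketch of that jury idea (not the repo's actual code): each judge model scores a story against the rubric questions, and the scores are averaged across judges. ask_judge is a hypothetical stand-in for a real LLM API call:

```python
from statistics import mean

RUBRIC = [
    "Is the prose grammatical and fluent?",
    "Does the story have a coherent arc?",
    # ...the real benchmark uses 16 such questions
]

def ask_judge(judge: str, story: str, question: str) -> float:
    """Hypothetical stand-in for an LLM API call.
    A real implementation would prompt the `judge` model with the story
    and the rubric question, then parse a 1-10 score from its reply."""
    return 7.0  # dummy value so the sketch runs end to end

def grade_story(story: str, judges: list[str]) -> float:
    # Average each judge's rubric scores first, then average across judges,
    # so no single model dominates the final grade.
    per_judge = [mean(ask_judge(j, story, q) for q in RUBRIC) for j in judges]
    return mean(per_judge)

judges = ["judge-a", "judge-b", "judge-c"]  # the repo uses seven models
print(grade_story("Once upon a time...", judges))
```

Running multi-model drafts through the same jury as the single-model baselines would make the comparison reproducible.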

Dmitry:

The "quality" is mentioned 29 times, but I don't see a formal definition/formula to measure it. How did you measure?

Was this article also produced using the same framework? What is its quality score?

I think it is below 98%, as it has many issues:

1) I think it repeats unsupported statements like a mantra - it mentions "98%" 15 times. What about 97.99% - is this score not enough? Why?

2) I think it makes claims that are too generic and thus below the standards of a scientific publication.

3) There should be a clean separation between facts and conclusions. Let the readers draw their own conclusions from well-presented facts. Best if the facts are reproducible.

4) I would expect a narrative starting from the problem description and the hypothesis, followed by a description of the test datasets, the exact quality metrics, and the raw results, and then some direct conclusions.

5) Speculation and overhyping of the results should be avoided, as they dilute credibility.

As this article is clearly AI-written, it should be easy enough to redo it properly.
