4 Comments
Dmitry (2d, edited)

I looked for prior research: https://chatgpt.com/share/e/6853eb75-9f34-8008-b4aa-029eee48ab33

FuseLLM (https://www.superannotate.com/blog/fusellm) tried a similar approach, although it is not focused on writing alone. I wonder why it didn't go further and why we are still using a single model in most cases.

WETT (https://www.typetone.ai/blog/wett-benchmark) seems to be the closest attempt to define 'quality'. I wonder if we could use something like it to show that a multi-model setup beats a single model. Unfortunately, Typetone doesn't seem to publish its dataset or exact assessment formulas, but perhaps there is a similar open benchmark we could use?

The ROUGE metric seems to be the most common in the industry. Perhaps a dataset with source texts and high-quality reference summaries could be used: apply ROUGE and see how close the multi-model summaries come to the human-written references.
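
For illustration, a minimal sketch of how such a comparison could be run with the open-source rouge-score package. The reference and model outputs below are made-up placeholders, not results from the article:

```python
# pip install rouge-score
# Sketch only: texts below are hypothetical; plug in an open summarization
# dataset (e.g. one with human-written reference summaries) for a real test.
from rouge_score import rouge_scorer

def score_summary(reference: str, candidate: str) -> dict:
    """Compute ROUGE-1, ROUGE-2, and ROUGE-L F1 for one candidate summary."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {name: s.fmeasure for name, s in scores.items()}

reference = "The committee approved the budget after a two-hour debate."
single_model_summary = "The budget was approved by the committee following debate."
multi_model_summary = "After a two-hour debate, the committee approved the budget."

print("single-model:", score_summary(reference, single_model_summary))
print("multi-model: ", score_summary(reference, multi_model_summary))
```

Averaging these scores over a whole dataset would give one (admittedly crude) number per setup to compare.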

https://github.com/lechmazur/writing is an interesting approach where seven LLMs grade each story on 16 questions. It is open source, so we can reproduce it. I wonder how multi-model writing would rank there.
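
A rough sketch of that "panel of LLM judges grading on a rubric" idea, assuming nothing about the repository's actual code. The judge names, rubric questions, and scoring stub below are placeholders:

```python
from statistics import mean

# Hypothetical judge panel and rubric; the real benchmark uses seven LLMs
# and 16 grading questions (see the repository for the actual setup).
JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]
RUBRIC = [
    "Is the plot coherent?",
    "Is the prose engaging?",
    "Does the story follow the prompt's constraints?",
]

def ask_judge(judge: str, story: str, question: str) -> float:
    """Stand-in for a real LLM call that returns a 1-10 score for one question."""
    # In a real run, this would prompt `judge` with the story and the question
    # and parse a numeric grade from its reply.
    return 7.0  # placeholder so the sketch runs end to end

def grade_story(story: str) -> float:
    """Average every judge's answer to every rubric question into one score."""
    return mean(ask_judge(j, story, q) for j in JUDGES for q in RUBRIC)

if __name__ == "__main__":
    print(grade_story("Once upon a time..."))  # 7.0 with the placeholder stub
```

Running the same rubric over stories from a single-model pipeline and a multi-model pipeline would make the comparison reproducible.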

Dmitry (3d, edited)

The "quality" is mentioned 29 times, but I don't see a formal definition/formula to measure it. How did you measure?

Was this article also produced using the same framework? What is its quality score?

I think it is below 98%, as it has many issues:

1) I think it repeats unsupported statements like a mantra: "98%" is mentioned 15 times. What about 97.99%? Would that score not be enough, and why?

2) I think it makes claims that are too generic and thus fall below the standards of a scientific publication.

3) There should be a clean separation between facts and conclusions. Let readers draw their own conclusions from well-presented facts. Ideally, the facts would be reproducible.

4) I would expect a narrative starting from the problem description and the hypothesis, followed by a description of the test datasets, the exact quality metrics, and the raw results, and ending with direct conclusions.

5) Speculation and overhyping of the results should be avoided, as they dilute credibility.

As this article is clearly AI-written, it should be easy enough to redo it properly.

Stanislav Huseletov

First of all, thank you for your comments; I definitely see where you are coming from. I have addressed some of them by tweaking the article's language and adding a few short sections to clarify the explanations.

You are right about the tone. Although it was my choice and not AI-written, I can see how the article gives that impression. Rest assured, the AI Center of Excellence is focused primarily on high-level, deeply technical articles, as you will see from what comes out this week. As for the provocative statements, they were meant to engage people and attract attention to the methodology, which they are clearly doing.

Of course, this overhyped style is something I experimented with and will be steering away from in the future, but the AI Center of Excellence should address both high-level issues and, sometimes, low-level ideas that could save a lot of time for every worker.

If you want to draw your own conclusions, there is a live demo linked in the article where you can see the approach at work. I do not claim to have invented the approach or any of the models; the article covers the method and serves as a wrapper for presenting this methodology (albeit perhaps too shiny a wrapper).

Ermanno Attardo

A useful snapshot of today's models' capabilities and how to combine them.
