Details
- New Feature
- Resolution: Done
- Major
- 0.2.1
- None
- Unknown
Description
We need an evaluation framework to measure how well the system performs on different types of tasks.
We should build or adopt a framework that supports the following (a rough sketch follows the list):
- List(s) of tasks/prompts for different types of tasks
- Example content to index as context for these tasks
- Storing the LLM's answers
- Automated evaluation of the answers by another LLM
- Manual evaluation by a human
- Storing evaluation results
- Generating visualizations of how well the LLM performed on different tasks
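
As a starting point, here is a minimal sketch of what the core data model and run loop could look like. It assumes Python; the callables `ask_llm` and `judge_with_llm` are placeholders for whatever LLM client and judge model we end up using, and the JSON-on-disk storage is just one possible backend -- none of these names come from this issue.

```python
"""Sketch of the evaluation harness: tasks in, answers and scores stored per task."""
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class Task:
    """A single prompt belonging to a task category (e.g. summarization, Q&A)."""
    task_id: str
    category: str
    prompt: str
    context_docs: list[str] = field(default_factory=list)  # example content to index as context


@dataclass
class EvaluationResult:
    """The LLM's answer plus automated and manual scores."""
    task_id: str
    answer: str
    auto_score: Optional[float] = None   # filled in by the judge LLM
    human_score: Optional[float] = None  # filled in during manual review
    notes: str = ""


def run_evaluation(
    tasks: list[Task],
    ask_llm: Callable[[str, list[str]], str],     # placeholder: (prompt, context) -> answer
    judge_with_llm: Callable[[str, str], float],  # placeholder: (prompt, answer) -> score in [0, 1]
    out_dir: Path,
) -> list[EvaluationResult]:
    """Run every task, score the answers automatically, and persist the results."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results: list[EvaluationResult] = []
    for task in tasks:
        answer = ask_llm(task.prompt, task.context_docs)
        score = judge_with_llm(task.prompt, answer)
        result = EvaluationResult(task_id=task.task_id, answer=answer, auto_score=score)
        # Store each result so a human reviewer can later add human_score and notes,
        # and so results can be aggregated per category for visualization.
        (out_dir / f"{task.task_id}.json").write_text(json.dumps(asdict(result), indent=2))
        results.append(result)
    return results
```

Manual evaluation would then amount to filling in `human_score` on the stored records, and the per-category averages of `auto_score` and `human_score` can be plotted to compare how the LLM performs across task types.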