diff --git a/README.md b/README.md
index 7b7f3d29..eef2ae53 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,171 @@
-# LightRAG
+# LightRAG: Simple and Fast Retrieval-Augmented Generation
+
+
+
+
+This repository hosts the code of LightRAG. The structure of this code is based on [nano-graphrag](https://github.com/gusye1234/nano-graphrag).
+## Install
+
+* Install from source
+
+```
+cd LightRAG
+pip install -e .
+```
+* Install from PyPI
+```
+pip install lightrag-hku
+```
+
+## Quick Start
+
+* Set your OpenAI API key in the environment: `export OPENAI_API_KEY="sk-..."`.
+* Download the demo text "A Christmas Carol" by Charles Dickens:
+```
+curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt
+```
+Use the Python snippet below:
+
+```
+from lightrag import LightRAG, QueryParam
+
+rag = LightRAG(working_dir="./dickens")
+
+with open("./book.txt") as f:
+    rag.insert(f.read())
+
+# Perform naive search
+print(rag.query("What are the top themes in this story?", param=QueryParam(mode="naive")))
+
+# Perform local search
+print(rag.query("What are the top themes in this story?", param=QueryParam(mode="local")))
+
+# Perform global search
+print(rag.query("What are the top themes in this story?", param=QueryParam(mode="global")))
+
+# Perform hybrid search
+print(rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid")))
+```
+Batch Insert
+```
+rag.insert(["TEXT1", "TEXT2",...])
+```
+Incremental Insert
+
+```
+rag = LightRAG(working_dir="./dickens")
+
+with open("./newText.txt") as f:
+    rag.insert(f.read())
+```
+## Evaluation
+### Dataset
+The dataset used in LightRAG can be downloaded from [TommyChien/UltraDomain](https://huggingface.co/datasets/TommyChien/UltraDomain).
+
+### Generate Query
+LightRAG uses the following prompt to generate high-level queries, with the corresponding code located in `examples/generate_query.py`.
+```
+Given the following description of a dataset:
+
+{description}
+
+Please identify 5 potential users who would engage with this dataset. For each user, list 5 tasks they would perform with this dataset. Then, for each (user, task) combination, generate 5 questions that require a high-level understanding of the entire dataset.
+
+Output the results in the following structure:
+- User 1: [user description]
+    - Task 1: [task description]
+        - Question 1:
+        - Question 2:
+        - Question 3:
+        - Question 4:
+        - Question 5:
+    - Task 2: [task description]
+    ...
+    - Task 5: [task description]
+- User 2: [user description]
+    ...
+- User 5: [user description]
+    ...
+```
+
+### Batch Eval
+To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt, with the specific code available in `examples/batch_eval.py`.
+```
+---Role---
+You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
+---Goal---
+You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
+
+- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
+- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
+- **Empowerment**: How well does the answer help the reader understand and make informed judgments about the topic?
+
+For each criterion, choose the better answer (either Answer 1 or Answer 2) and explain why.
+Then, select an overall winner based on these three categories.
+
+Here is the question:
+{query}
+
+Here are the two answers:
+
+**Answer 1:**
+{answer1}
+
+**Answer 2:**
+{answer2}
+
+Evaluate both answers using the three criteria listed above and provide detailed explanations for each criterion.
+
+Output your evaluation in the following JSON format:
+
+{{
+    "Comprehensiveness": {{
+        "Winner": "[Answer 1 or Answer 2]",
+        "Explanation": "[Provide explanation here]"
+    }},
+    "Empowerment": {{
+        "Winner": "[Answer 1 or Answer 2]",
+        "Explanation": "[Provide explanation here]"
+    }},
+    "Overall Winner": {{
+        "Winner": "[Answer 1 or Answer 2]",
+        "Explanation": "[Summarize why this answer is the overall winner based on the three criteria]"
+    }}
+}}
+```
+## Code Structure
+
+```
+.
+├── examples
+│   ├── batch_eval.py
+│   ├── generate_query.py
+│   ├── insert.py
+│   └── query.py
+├── lightrag
+│   ├── __init__.py
+│   ├── base.py
+│   ├── lightrag.py
+│   ├── llm.py
+│   ├── operate.py
+│   ├── prompt.py
+│   ├── storage.py
+│   └── utils.py
+├── LICENSE
+├── README.md
+├── requirements.txt
+└── setup.py
+```
 ## Citation
-## Acknowledgement
-The structure of this code is based on [nano-graphrag](https://github.com/gusye1234/nano-graphrag).
\ No newline at end of file
+```
+@article{guo2024lightrag,
+title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
+author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
+year={2024},
+eprint={},
+archivePrefix={arXiv},
+primaryClass={cs.IR}
+}
+```
+
diff --git a/examples/batch_eval.py b/examples/batch_eval.py
index 753ecb7d..4601d267 100644
--- a/examples/batch_eval.py
+++ b/examples/batch_eval.py
@@ -6,8 +6,8 @@ import jsonlines
 
 from openai import OpenAI
 
-def batch_eval(query_file, result1_file, result2_file, output_file_path, api_key):
-    client = OpenAI(api_key=api_key)
+def batch_eval(query_file, result1_file, result2_file, output_file_path):
+    client = OpenAI()
 
     with open(query_file, 'r') as f:
         data = f.read()
diff --git a/examples/generate_query.py b/examples/generate_query.py
index 4f694f31..0ae82f40 100644
--- a/examples/generate_query.py
+++ b/examples/generate_query.py
@@ -2,7 +2,7 @@ import os
 
 from openai import OpenAI
 
-os.environ["OPENAI_API_KEY"] = ""
+# os.environ["OPENAI_API_KEY"] = ""
 
 def openai_complete_if_cache(
     model="gpt-4o-mini", prompt=None, system_prompt=None, history_messages=[], **kwargs
diff --git a/examples/insert.py b/examples/insert.py
index d0689bae..25c3cdda 100644
--- a/examples/insert.py
+++ b/examples/insert.py
@@ -3,7 +3,7 @@ import sys
 
 from lightrag import LightRAG
 
-os.environ["OPENAI_API_KEY"] = ""
+# os.environ["OPENAI_API_KEY"] = ""
 
 WORKING_DIR = ""
 
diff --git a/examples/query.py b/examples/query.py
index b7de519b..00c902eb 100644
--- a/examples/query.py
+++ b/examples/query.py
@@ -3,7 +3,7 @@ import sys
 
 from lightrag import LightRAG, QueryParam
 
-os.environ["OPENAI_API_KEY"] = ""
+# os.environ["OPENAI_API_KEY"] = ""
 
 WORKING_DIR = ""
 
diff --git a/lightrag/__pycache__/__init__.cpython-310.pyc b/lightrag/__pycache__/__init__.cpython-310.pyc
deleted file mode 100644
index 185bed6d..00000000
Binary files a/lightrag/__pycache__/__init__.cpython-310.pyc and /dev/null differ
diff --git a/lightrag/__pycache__/base.cpython-310.pyc b/lightrag/__pycache__/base.cpython-310.pyc
deleted file mode 100644
index 4e0f8ec9..00000000
Binary files a/lightrag/__pycache__/base.cpython-310.pyc and /dev/null differ
diff --git a/lightrag/__pycache__/lightrag.cpython-310.pyc b/lightrag/__pycache__/lightrag.cpython-310.pyc
deleted file mode 100644
index c378b67e..00000000
Binary files a/lightrag/__pycache__/lightrag.cpython-310.pyc and /dev/null differ
diff --git a/lightrag/__pycache__/llm.cpython-310.pyc b/lightrag/__pycache__/llm.cpython-310.pyc
deleted file mode 100644
index 6d9fc0b2..00000000
Binary files a/lightrag/__pycache__/llm.cpython-310.pyc and /dev/null differ
diff --git a/lightrag/__pycache__/myrag.cpython-310.pyc b/lightrag/__pycache__/myrag.cpython-310.pyc
deleted file mode 100644
index 0b9c34b8..00000000
Binary files a/lightrag/__pycache__/myrag.cpython-310.pyc and /dev/null differ
diff --git a/lightrag/__pycache__/operate.cpython-310.pyc b/lightrag/__pycache__/operate.cpython-310.pyc
deleted file mode 100644
index 2a4942ec..00000000
Binary files a/lightrag/__pycache__/operate.cpython-310.pyc and /dev/null differ
diff --git a/lightrag/__pycache__/prompt.cpython-310.pyc b/lightrag/__pycache__/prompt.cpython-310.pyc
deleted file mode 100644
index 9db6378c..00000000
Binary files a/lightrag/__pycache__/prompt.cpython-310.pyc and /dev/null differ
diff --git a/lightrag/__pycache__/storage.cpython-310.pyc b/lightrag/__pycache__/storage.cpython-310.pyc
deleted file mode 100644
index e7f50365..00000000
Binary files a/lightrag/__pycache__/storage.cpython-310.pyc and /dev/null differ
diff --git a/lightrag/__pycache__/utils.cpython-310.pyc b/lightrag/__pycache__/utils.cpython-310.pyc
deleted file mode 100644
index 248e9add..00000000
Binary files a/lightrag/__pycache__/utils.cpython-310.pyc and /dev/null differ
diff --git a/setup.py b/setup.py
index 0852ea5a..df1c3cf4 100644
--- a/setup.py
+++ b/setup.py
@@ -21,7 +21,7 @@ with open("./requirements.txt") as f:
         deps.append(line.strip())
 
 setuptools.setup(
-    name="lightrag",
+    name="light-rag",
     url=vars2readme["__url__"],
     version=vars2readme["__version__"],
     author=vars2readme["__author__"],
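
With these changes, none of the example scripts hard-code an API key: the `os.environ["OPENAI_API_KEY"] = ""` assignments are commented out, and `batch_eval` now constructs a bare `OpenAI()` client instead of taking an `api_key` argument, so the key is expected to come from the `OPENAI_API_KEY` environment variable. Below is a minimal sketch of calling the updated `batch_eval` under that assumption; the file names are hypothetical and not part of this diff.

```
import os

# The updated examples rely on the OpenAI client reading OPENAI_API_KEY from
# the environment (OpenAI() is created with no api_key argument), so export it first:
#   export OPENAI_API_KEY="sk-..."
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY must be set in the environment"

# Assumes the repository root is on sys.path so the examples folder is importable.
from examples.batch_eval import batch_eval

# Hypothetical file names: the queries and the two systems' answers go in,
# and the evaluation results are written to output_file_path.
batch_eval(
    query_file="queries.txt",
    result1_file="result1.json",
    result2_file="result2.json",
    output_file_path="batch_eval_output.jsonl",
)
```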