Merge pull request #56 from sank8-2/dev

chore: added pre-commit-hooks and ruff formatting for commit-hooks
zrguo, 2024-10-19 20:44:11 +08:00 (committed via GitHub)
26 changed files with 635 additions and 393 deletions


@@ -16,16 +16,16 @@
<a href="https://pypi.org/project/lightrag-hku/"><img src="https://img.shields.io/pypi/v/lightrag-hku.svg"></a>
<a href="https://pepy.tech/project/lightrag-hku"><img src="https://static.pepy.tech/badge/lightrag-hku/month"></a>
</p>
This repository hosts the code of LightRAG. The structure of this code is based on [nano-graphrag](https://github.com/gusye1234/nano-graphrag).
![Please add image description](https://i-blog.csdnimg.cn/direct/b2aaf634151b4706892693ffb43d9093.png)
</div>
## 🎉 News
- [x] [2024.10.18]🎯🎯📢📢We've added a link to a [LightRAG Introduction Video](https://youtu.be/oageL-1I0GE). Thanks to the author!
- [x] [2024.10.17]🎯🎯📢📢We have created a [Discord channel](https://discord.gg/mvsfu2Tg)! Welcome to join for sharing and discussions! 🎉🎉
- [x] [2024.10.16]🎯🎯📢📢LightRAG now supports [Ollama models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
- [x] [2024.10.15]🎯🎯📢📢LightRAG now supports [Hugging Face models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
## Install
@@ -92,7 +92,7 @@ print(rag.query("What are the top themes in this story?", param=QueryParam(mode=
<details>
<summary> Using Open AI-like APIs </summary>
LightRAG also supports Open AI-like chat/embeddings APIs:
```python
async def llm_model_func(
    prompt, system_prompt=None, history_messages=[], **kwargs
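) -> str:
    # The hunk cuts the example off here. A hedged completion, assuming
    # openai_complete_if_cache from lightrag.llm and an OpenAI-compatible
    # endpoint; the model name, base_url, and env var are placeholders
    # (requires `import os`).
    return await openai_complete_if_cache(
        "your-model-name",
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        base_url="https://api.your-provider.com/v1",
        api_key=os.getenv("LLM_API_KEY"),
        **kwargs,
    )
```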
@@ -129,7 +129,7 @@ rag = LightRAG(
<details>
<summary> Using Hugging Face Models </summary>
If you want to use Hugging Face models, you only need to configure LightRAG as follows:
```python
from lightrag.llm import hf_model_complete, hf_embedding
@@ -145,7 +145,7 @@ rag = LightRAG(
        embedding_dim=384,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts,
            tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
            embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        )
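    # Hedged continuation: close the EmbeddingFunc(...) wrapper and the
    # rag = LightRAG(...) call shown opening in the hunk headers above.
    ),
)
```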
@@ -157,7 +157,7 @@ rag = LightRAG(
<details>
<summary> Using Ollama Models </summary>
If you want to use Ollama models, you only need to configure LightRAG as follows:
```python
from lightrag.llm import ollama_model_complete, ollama_embedding
@@ -171,7 +171,7 @@ rag = LightRAG(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
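    # Hedged continuation closing the rag = LightRAG(...) call opened in the
    # hunk context above.
)
```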
@@ -196,14 +196,14 @@ with open("./newText.txt") as f:
```
## Evaluation
### Dataset
The dataset used in LightRAG can be downloaded from [TommyChien/UltraDomain](https://huggingface.co/datasets/TommyChien/UltraDomain).
### Generate Query
LightRAG uses the following prompt to generate high-level queries, with the corresponding code in `example/generate_query.py`.
<details>
<summary> Prompt </summary>
```python
Given the following description of a dataset:
@@ -228,18 +228,18 @@ Output the results in the following structure:
...
```
</details>
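
A minimal sketch of how this prompt can be driven, assuming the official `openai` Python client; the model and function names are illustrative, and `example/generate_query.py` remains the authoritative version:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_queries(dataset_description: str) -> str:
    # Fill the template above with a concrete dataset description and ask
    # the model to emit results in the structure shown above.
    prompt = f"Given the following description of a dataset:\n\n{dataset_description}\n\n..."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```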
### Batch Eval
To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt, with the specific code available in `example/batch_eval.py`.
<details>
<summary> Prompt </summary>
```python
---Role---
You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
---Goal---
You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
@@ -303,7 +303,7 @@ Output your evaluation in the following JSON format:
| **Empowerment** | 36.69% | **63.31%** | 45.09% | **54.91%** | 42.81% | **57.19%** | **52.94%** | 47.06% |
| **Overall** | 43.62% | **56.38%** | 45.98% | **54.02%** | 45.70% | **54.30%** | **51.86%** | 48.14% |
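
The win rates above come from the pairwise judging protocol described in the Batch Eval section. A hedged sketch of such a driver, reusing the `openai` client from the earlier example (the model and function names are assumptions; `example/batch_eval.py` is the authoritative version):

```python
def evaluate_pair(client, query: str, answer1: str, answer2: str) -> str:
    # Ask a judge model to compare two answers on Comprehensiveness,
    # Diversity, and Empowerment, returning its JSON verdict as text.
    system_prompt = "You are an expert tasked with evaluating two answers ..."
    user_prompt = f"Question: {query}\n\nAnswer 1: {answer1}\n\nAnswer 2: {answer2}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```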
## Reproduce
All the code can be found in the `./reproduce` directory.
### Step-0 Extract Unique Contexts
@@ -311,7 +311,7 @@ First, we need to extract unique contexts in the datasets.
<details>
<summary> Code </summary>
```python
def extract_unique_contexts(input_directory, output_directory):
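    # The hunk omits the function body. A hedged sketch of what Step-0 does,
    # assuming JSONL datasets with a "context" field (the field and file
    # naming are assumptions; requires `import os, json` at the top):
    os.makedirs(output_directory, exist_ok=True)
    for filename in os.listdir(input_directory):
        unique_contexts = {}
        with open(os.path.join(input_directory, filename), "r") as f:
            for line in f:
                context = json.loads(line).get("context")
                if context is not None and context not in unique_contexts:
                    unique_contexts[context] = None
        output_path = os.path.join(
            output_directory, filename.replace(".jsonl", "_unique_contexts.json")
        )
        with open(output_path, "w") as out:
            json.dump(list(unique_contexts), out, ensure_ascii=False)
```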
@@ -370,12 +370,12 @@ For the extracted contexts, we insert them into the LightRAG system.
<details>
<summary> Code </summary>
```python
def insert_text(rag, file_path):
    with open(file_path, mode='r') as f:
        unique_contexts = json.load(f)

    retries = 0
    max_retries = 3
    while retries < max_retries:
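        # Hedged completion of the retry loop the hunk cuts off: attempt the
        # insert and back off on failure (requires `import time`; the message
        # wording is an assumption).
        try:
            rag.insert(unique_contexts)
            break
        except Exception as e:
            retries += 1
            print(f"Insertion failed, retrying ({retries}/{max_retries}): {e}")
            time.sleep(10)
    if retries == max_retries:
        print("Insertion failed after exceeding the maximum number of retries")
```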
@@ -393,11 +393,11 @@ def insert_text(rag, file_path):
### Step-2 Generate Queries
We extract tokens from the first and the second half of each context in the dataset, then combine them as dataset descriptions to generate queries.
<details>
<summary> Code </summary>
```python
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
@@ -410,7 +410,7 @@ def get_summary(context, tot_tokens=2000):
    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)
    return summary
```
</details>
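
The hunk above omits the opening lines of `get_summary`; a hedged reconstruction consistent with the tail shown (the exact slicing offsets are an assumption):

```python
def get_summary(context, tot_tokens=2000):
    # Take tokens from the first and the second half of the context and
    # join them into a compact dataset description.
    tokens = tokenizer.tokenize(context)
    half = tot_tokens // 2
    start_tokens = tokens[:half]
    end_tokens = tokens[-half:]
    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)
    return summary
```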
@@ -420,12 +420,12 @@ For the queries generated in Step-2, we will extract them and query LightRAG.
<details>
<summary> Code </summary>
```python
def extract_queries(file_path):
with open(file_path, 'r') as f:
data = f.read()
data = data.replace('**', '')
queries = re.findall(r'- Question \d+: (.+)', data)
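    # Hedged completion: the hunk ends here; the parsed questions are
    # presumably returned (requires `import re` at the top of the script).
    return queries
```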
@@ -479,7 +479,7 @@ def extract_queries(file_path):
```python
@article{guo2024lightrag,
    title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
    author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
    year={2024},
    eprint={2410.05779},