Merge pull request #56 from sank8-2/dev

chore: added pre-commit-hooks and ruff formatting for commit-hooks
zrguo, 2024-10-19 20:44:11 +08:00 (committed via GitHub)
26 changed files with 635 additions and 393 deletions


@@ -16,16 +16,16 @@
<a href="https://pypi.org/project/lightrag-hku/"><img src="https://img.shields.io/pypi/v/lightrag-hku.svg"></a>
<a href="https://pepy.tech/project/lightrag-hku"><img src="https://static.pepy.tech/badge/lightrag-hku/month"></a>
</p>
This repository hosts the code of LightRAG. The structure of this code is based on [nano-graphrag](https://github.com/gusye1234/nano-graphrag).
![Please add image description](https://i-blog.csdnimg.cn/direct/b2aaf634151b4706892693ffb43d9093.png)
</div>
## 🎉 News
- [x] [2024.10.18]🎯🎯📢📢We've added a link to a [LightRAG Introduction Video](https://youtu.be/oageL-1I0GE). Thanks to the author!
- [x] [2024.10.17]🎯🎯📢📢We have created a [Discord channel](https://discord.gg/mvsfu2Tg)! Welcome to join for sharing and discussions! 🎉🎉
- [x] [2024.10.16]🎯🎯📢📢LightRAG now supports [Ollama models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
- [x] [2024.10.15]🎯🎯📢📢LightRAG now supports [Hugging Face models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
## Install
@@ -92,7 +92,7 @@ print(rag.query("What are the top themes in this story?", param=QueryParam(mode=
<details>
<summary> Using Open AI-like APIs </summary>
LightRAG also supports Open AI-like chat/embeddings APIs:
```python
async def llm_model_func(
    prompt, system_prompt=None, history_messages=[], **kwargs
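) -> str:
    # The hunk cuts the example off here. A hedged completion, assuming
    # openai_complete_if_cache from lightrag.llm and an OpenAI-compatible
    # endpoint; the model name, base_url, and env var are placeholders
    # (requires `import os`).
    return await openai_complete_if_cache(
        "your-model-name",
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        base_url="https://api.your-provider.com/v1",
        api_key=os.getenv("LLM_API_KEY"),
        **kwargs,
    )
```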
@@ -129,7 +129,7 @@ rag = LightRAG(
<details>
<summary> Using Hugging Face Models </summary>
If you want to use Hugging Face models, you only need to configure LightRAG as follows:
```python
from lightrag.llm import hf_model_complete, hf_embedding
@@ -145,7 +145,7 @@ rag = LightRAG(
        embedding_dim=384,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts,
            tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
            embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        )
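    # Hedged continuation: close the EmbeddingFunc(...) wrapper and the
    # rag = LightRAG(...) call shown opening in the hunk headers above.
    ),
)
```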
@@ -157,7 +157,7 @@ rag = LightRAG(
<details>
<summary> Using Ollama Models </summary>
If you want to use Ollama models, you only need to configure LightRAG as follows:
```python
from lightrag.llm import ollama_model_complete, ollama_embedding
@@ -171,7 +171,7 @@ rag = LightRAG(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
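    # Hedged continuation closing the rag = LightRAG(...) call opened in the
    # hunk context above.
)
```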
@@ -196,14 +196,14 @@ with open("./newText.txt") as f:
```
## Evaluation
### Dataset
The dataset used in LightRAG can be downloaded from [TommyChien/UltraDomain](https://huggingface.co/datasets/TommyChien/UltraDomain).
### Generate Query
LightRAG uses the following prompt to generate high-level queries, with the corresponding code in `example/generate_query.py`.
<details>
<summary> Prompt </summary>
```python
Given the following description of a dataset:
@@ -228,18 +228,18 @@ Output the results in the following structure:
...
```
</details>
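
A minimal sketch of how this prompt can be driven, assuming the official `openai` Python client; the model and function names are illustrative, and `example/generate_query.py` remains the authoritative version:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_queries(dataset_description: str) -> str:
    # Fill the template above with a concrete dataset description and ask
    # the model to emit results in the structure shown above.
    prompt = f"Given the following description of a dataset:\n\n{dataset_description}\n\n..."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```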
### Batch Eval
To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt, with the specific code available in `example/batch_eval.py`.
<details>
<summary> Prompt </summary>
```python
---Role---
You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
---Goal---
You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
@@ -303,7 +303,7 @@ Output your evaluation in the following JSON format:
| **Empowerment** | 36.69% | **63.31%** | 45.09% | **54.91%** | 42.81% | **57.19%** | **52.94%** | 47.06% |
| **Overall** | 43.62% | **56.38%** | 45.98% | **54.02%** | 45.70% | **54.30%** | **51.86%** | 48.14% |
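
The win rates above come from the pairwise judging protocol described in the Batch Eval section. A hedged sketch of such a driver, reusing the `openai` client from the earlier example (the model and function names are assumptions; `example/batch_eval.py` is the authoritative version):

```python
def evaluate_pair(client, query: str, answer1: str, answer2: str) -> str:
    # Ask a judge model to compare two answers on Comprehensiveness,
    # Diversity, and Empowerment, returning its JSON verdict as text.
    system_prompt = "You are an expert tasked with evaluating two answers ..."
    user_prompt = f"Question: {query}\n\nAnswer 1: {answer1}\n\nAnswer 2: {answer2}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```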
## Reproduce
All the code can be found in the `./reproduce` directory.
### Step-0 Extract Unique Contexts
@@ -311,7 +311,7 @@ First, we need to extract unique contexts in the datasets.
<details>
<summary> Code </summary>
```python
def extract_unique_contexts(input_directory, output_directory):
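    # The hunk omits the function body. A hedged sketch of what Step-0 does,
    # assuming JSONL datasets with a "context" field (the field and file
    # naming are assumptions; requires `import os, json` at the top):
    os.makedirs(output_directory, exist_ok=True)
    for filename in os.listdir(input_directory):
        unique_contexts = {}
        with open(os.path.join(input_directory, filename), "r") as f:
            for line in f:
                context = json.loads(line).get("context")
                if context is not None and context not in unique_contexts:
                    unique_contexts[context] = None
        output_path = os.path.join(
            output_directory, filename.replace(".jsonl", "_unique_contexts.json")
        )
        with open(output_path, "w") as out:
            json.dump(list(unique_contexts), out, ensure_ascii=False)
```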
@@ -370,12 +370,12 @@ For the extracted contexts, we insert them into the LightRAG system.
<details>
<summary> Code </summary>
```python
def insert_text(rag, file_path):
    with open(file_path, mode='r') as f:
        unique_contexts = json.load(f)

    retries = 0
    max_retries = 3
    while retries < max_retries:
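        # Hedged completion of the retry loop the hunk cuts off: attempt the
        # insert and back off on failure (requires `import time`; the message
        # wording is an assumption).
        try:
            rag.insert(unique_contexts)
            break
        except Exception as e:
            retries += 1
            print(f"Insertion failed, retrying ({retries}/{max_retries}): {e}")
            time.sleep(10)
    if retries == max_retries:
        print("Insertion failed after exceeding the maximum number of retries")
```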
@@ -393,11 +393,11 @@ def insert_text(rag, file_path):
### Step-2 Generate Queries
We extract tokens from the first and the second half of each context in the dataset, then combine them as dataset descriptions to generate queries.
<details>
<summary> Code </summary>
```python
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
@@ -410,7 +410,7 @@ def get_summary(context, tot_tokens=2000):
    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)
    return summary
```
</details>
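
The hunk above omits the opening lines of `get_summary`; a hedged reconstruction consistent with the tail shown (the exact slicing offsets are an assumption):

```python
def get_summary(context, tot_tokens=2000):
    # Take tokens from the first and the second half of the context and
    # join them into a compact dataset description.
    tokens = tokenizer.tokenize(context)
    half = tot_tokens // 2
    start_tokens = tokens[:half]
    end_tokens = tokens[-half:]
    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)
    return summary
```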
@@ -420,12 +420,12 @@ For the queries generated in Step-2, we will extract them and query LightRAG.
<details>
<summary> Code </summary>
```python
def extract_queries(file_path):
with open(file_path, 'r') as f:
data = f.read()
data = data.replace('**', '')
queries = re.findall(r'- Question \d+: (.+)', data)
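    # Hedged completion: the hunk ends here; the parsed questions are
    # presumably returned (requires `import re` at the top of the script).
    return queries
```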
@@ -479,7 +479,7 @@ def extract_queries(file_path):
```python
@article{guo2024lightrag,
    title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
    author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
    year={2024},
    eprint={2410.05779},