Use GraphRAG (CLI)
· Install
· Working space
· Indexing
∘ Ingestion
· Configuration
∘ .env
∘ settings.yaml
∘ Chunking file
∘ Example of config
· Run
∘ Global method
∘ Local method
· In Code
Install
pip3 install graphrag
Working space
python3 -m graphrag.index --init --root ./resume-matching
Indexing
Doc: https://microsoft.github.io/graphrag/posts/index/2-cli/
Put your data in
resume-matching/input
For now, I am trying with plain .txt files.
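For example, here is a minimal sketch for dropping a couple of sample documents into the input folder (the file names and contents below are just placeholders; use your real resumes and job descriptions):
from pathlib import Path

# Hypothetical sample documents; replace with your real resumes / job descriptions.
samples = {
    "resume_alice.txt": "Alice Nguyen - Machine Learning Engineer, 5 years of NLP experience ...",
    "job_ai_engineer.txt": "Open position: AI Engineer. Requirements: Python, LLMs, RAG pipelines ...",
}

input_dir = Path("resume-matching") / "input"
input_dir.mkdir(parents=True, exist_ok=True)

for name, text in samples.items():
    (input_dir / name).write_text(text, encoding="utf-8")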
Ingestion
python3 -m graphrag.index --root ./resume-matching
Configuration
Doc: https://microsoft.github.io/graphrag/posts/config/overview/
In project folder /resume-matching
.env
GRAPHRAG_API_KEY=<API_KEY>
For Ollama, replace <API_KEY>
with “ollama”; otherwise, use the API key of your model provider (e.g., OpenAI, Cohere, etc.).
settings.yaml
We focus on two main sections:
llm
embeddings
Actually, we all know what they are, right? Here we configure the LLM and the embedding model: the GRAPHRAG_API_KEY
from the .env
file is used as the api_key, and we can assign different model names for the language model and the embedding model.
🚨 For Ollama,
api_base
should be set to http://127.0.0.1:11434/v1 (Ollama serves plain HTTP by default).
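Before indexing, it is worth checking that the Ollama endpoint is actually reachable. A minimal sketch using Ollama's native /api/tags endpoint, which lists the locally pulled models (adjust the host/port if yours differ):
import json
import urllib.request

OLLAMA_HOST = "http://127.0.0.1:11434"  # assumed default Ollama address

# /api/tags is Ollama's native endpoint listing locally available models.
with urllib.request.urlopen(f"{OLLAMA_HOST}/api/tags", timeout=5) as resp:
    models = [m["name"] for m in json.load(resp).get("models", [])]

print("Available models:", models)
# The model names you reference in settings.yaml should appear in this list.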
Chunking file
You can change the input file type used for chunking (e.g., text or csv) before performing the indexing; see the input section in the example config below.
Example of config
I extend this a bit by defining my own variables in the .env file and referencing them in settings.yaml, instead of using GRAPHRAG_API_KEY.
.env
LLM_API_KEY=ollama
LLM=mistral:latest
#EMBEDDING_API_KEY=sk-l9zasdfasdfasdfasdfasdfasdfasdfUjo29
#EMBEDDING=text-embedding-3-large
EMBEDDING_API_KEY=ollama
EMBEDDING=nomic-embed-text:latest
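The ${...} placeholders in settings.yaml are resolved from environment variables, and the .env file in the project root is loaded when graphrag runs. Roughly, the substitution behaves like this simplified sketch (an illustration of the mechanism, not GraphRAG's actual code):
import os
from pathlib import Path
from string import Template

# Load the .env file into the environment (naive parser: skips comments and blank lines).
for line in Path("resume-matching/.env").read_text().splitlines():
    line = line.strip()
    if line and not line.startswith("#"):
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

# ${VAR} placeholders in settings.yaml are then filled from the environment.
raw = Path("resume-matching/settings.yaml").read_text()
resolved = Template(raw).safe_substitute(os.environ)
print(resolved[:300])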
settings.yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${LLM_API_KEY}
  # type: openai_chat # or azure_openai_chat
  model: ${LLM}
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://127.0.0.1:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${EMBEDDING_API_KEY}
    # type: openai_embedding # or azure_openai_embedding
    model: ${EMBEDDING}
    api_base: http://127.0.0.1:11434/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: csv # or text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.csv$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
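Once the indexer has run with this configuration, the artifacts end up as parquet files under output/<timestamp>/artifacts (per the storage section above). A quick sketch for peeking at what was extracted; note that the exact artifact file names depend on the GraphRAG version, and create_final_entities.parquet is assumed here:
from pathlib import Path

import pandas as pd

# Pick the most recent indexing run under output/ (the ${timestamp} folder from settings.yaml).
runs = sorted(Path("resume-matching/output").iterdir())
artifacts = runs[-1] / "artifacts"

# Assumed artifact name; adjust if your GraphRAG version writes different files.
entities = pd.read_parquet(artifacts / "create_final_entities.parquet")
print(list(entities.columns))
print(entities.head())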
Run
Doc: https://microsoft.github.io/graphrag/posts/query/3-cli/
Global method
The global search method generates answers by searching over all AI-generated community reports in a map-reduce fashion. While resource-intensive, it usually gives good responses to questions that require a holistic understanding of the dataset.
python3 -m graphrag.query \
--root ./resume-matching \
--method global \
"show me some matched job positions in the company."
Local method
The local search method combines data from the AI-extracted knowledge-graph with text chunks of raw documents to generate answers. It is effective for questions needing an understanding of specific entities in the documents.
python3 -m graphrag.query \
--root ./resume-matching \
--method local \
"show the jobs related to AI."
In Code
You can also call the CLI from Python via subprocess, wrapped here in a small helper function:
import subprocess

def graphrag_query(query: str, method: str = "local", root: str = "./resume-matching") -> str:
    # Build the CLI call; the query string is the last positional argument.
    # With an argument list (no shell=True), shlex.quote is not needed and would
    # embed literal quote characters into the query.
    cmd = [
        "python3", "-m", "graphrag.query",
        "--root", root,
        "--method", method,
        query,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        return f"An error occurred: {e.stderr}"
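For example, with the helper above (assuming the index under ./resume-matching has already been built):
answer = graphrag_query("show the jobs related to AI.", method="local")
print(answer)

overview = graphrag_query("show me some matched job positions in the company.", method="global")
print(overview)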