Use GraphRAG (CLI)

TeeTracker
4 min read · Jul 6, 2024


· Install
· Working space
· Indexing
  · Ingestion
· Configuration
  · .env
  · settings.yaml
  · Chunking file
  · Example of config
· Run
  · Global method
  · Local method
· In Code

Install

pip3 install graphrag
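
To confirm the package is actually available before going further, a minimal check is to print the installed version (this only assumes the distribution name is graphrag, i.e. the same name pip installed):

# Quick sanity check that the package is installed and which version you got.
import importlib.metadata

print(importlib.metadata.version("graphrag"))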

Working space

python3 -m graphrag.index --init --root ./resume-matching
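
The --init step scaffolds the workspace; typically you get a .env, a settings.yaml, and a prompts/ folder (exact contents can differ between graphrag versions). A small sketch to see what was generated:

# List what --init generated in the workspace root.
from pathlib import Path

for path in sorted(Path("./resume-matching").rglob("*")):
    print(path)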

Indexing

Doc: https://microsoft.github.io/graphrag/posts/index/2-cli/

Put data in

resume-matching/input

For now, I am trying it with plain .txt files.
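
As a tiny illustration of seeding the input folder (the file name and content below are made up, not real data):

# Write a toy .txt document into the workspace's input folder.
from pathlib import Path

input_dir = Path("./resume-matching/input")
input_dir.mkdir(parents=True, exist_ok=True)
(input_dir / "resume_001.txt").write_text(
    "Jane Doe. Senior ML engineer. Experience: NLP, knowledge graphs, RAG.",
    encoding="utf-8",
)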

Ingestion

python3 -m graphrag.index --root ./resume-matching
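
Indexing writes parquet artifacts under output/<timestamp>/artifacts (per the storage section of settings.yaml below). A rough sketch to peek at them, assuming pandas and pyarrow are available (they normally come in as graphrag dependencies; artifact file names can vary between versions):

# Peek at the parquet artifacts produced by the indexing run.
from pathlib import Path

import pandas as pd

for parquet in sorted(Path("./resume-matching/output").rglob("*.parquet")):
    df = pd.read_parquet(parquet)
    print(parquet.name, df.shape)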

Configuration

Doc: https://microsoft.github.io/graphrag/posts/config/overview/

In project folder /resume-matching

.env

GRAPHRAG_API_KEY=<API_KEY>

For Ollama, replace <API_KEY> with “ollama”; otherwise, use the API key of your model provider (e.g., OpenAI, Cohere, etc.).

settings.yaml

We focus on two main sections of settings.yaml:

  • llm
  • embeddings

We all know what they are, right? They configure the language model and the embedding model.

The GRAPHRAG_API_KEY from the .env file is used as the api_key in both sections, and we can assign different model names to the language model and the embedding model.

🚨 For Ollama:

api_base should be set to http://127.0.0.1:11434/v1 (Ollama exposes its OpenAI-compatible API over plain HTTP on the local machine).
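
A quick way to verify the endpoint is reachable before kicking off a long indexing run; this simply calls Ollama's OpenAI-compatible model listing and assumes a default local Ollama install:

# Ping Ollama's OpenAI-compatible API to confirm api_base is reachable.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:11434/v1/models") as resp:
    models = json.load(resp)

print([m["id"] for m in models.get("data", [])])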

Chunking file

You can change the type of input file that gets chunked before performing the indexing (see the input and chunks sections in settings.yaml).

Example of config

I extend this a bit by defining my own variables in the .env file and referencing them in settings.yaml instead of using GRAPHRAG_API_KEY.

.env

LLM_API_KEY=ollama
LLM=mistral:latest
#EMBEDDING_API_KEY=sk-l9zasdfasdfasdfasdfasdfasdfasdfUjo29
#EMBEDDING=text-embedding-3-large
EMBEDDING_API_KEY=ollama
EMBEDDING=nomic-embed-text:latest
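
GraphRAG loads the .env file and then expands ${...} placeholders in settings.yaml from the environment. A rough sketch of the same idea, only to show how the custom variable names above end up in the config (this is not GraphRAG's actual loading code):

# Illustrate ${VAR} expansion: set the variables (as .env would), then
# substitute them into a settings.yaml fragment.
import os
from string import Template

os.environ.setdefault("LLM_API_KEY", "ollama")
os.environ.setdefault("LLM", "mistral:latest")

fragment = Template("llm:\n  api_key: ${LLM_API_KEY}\n  model: ${LLM}\n")
print(fragment.substitute(os.environ))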

settings.yaml


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${LLM_API_KEY}
  # type: openai_chat # or azure_openai_chat
  model: ${LLM}
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://127.0.0.1:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${EMBEDDING_API_KEY}
    # type: openai_embedding # or azure_openai_embedding
    model: ${EMBEDDING}
    api_base: http://127.0.0.1:11434/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: csv # or text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.csv$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
Run

Doc: https://microsoft.github.io/graphrag/posts/query/3-cli/

Global method

The global search method retrieves answers by searching through all AI-generated community reports in a map-reduce fashion. While resource-intensive, it usually gives good responses for queries that require a holistic understanding of the dataset.

python3 -m graphrag.query \
--root ./resume-matching \
--method global \
"show me some matched job positions in the company."

Local method

The local search method combines data from the AI-extracted knowledge graph with text chunks from the raw documents to generate answers. It is effective for questions that require an understanding of specific entities mentioned in the documents.

python3 -m graphrag.query \
--root ./resume-matching \
--method local \
"show the jobs related to AI."

In Code

A small wrapper around the CLI (the surrounding application code is omitted here):

import subprocess


def run_local_query(query: str) -> str:
    # Build the argument list. subprocess.run() is called without a shell,
    # so the query must not be wrapped in shlex.quote(); pass it as-is.
    cmd = [
        "python3", "-m", "graphrag.query",
        "--root", "./resume-matching",
        "--method", "local",
        query,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        return f"An error occurred: {e.stderr}"
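
Calling the wrapper defined above then looks like this:

answer = run_local_query("show the jobs related to AI.")
print(answer)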
