Merge branch 'microsoft-main' into cutomize
KylinMountain committed Aug 9, 2024
2 parents c265587 + 50454ba commit 2f013df
Showing 39 changed files with 955 additions and 2,892 deletions.
23 changes: 3 additions & 20 deletions .github/workflows/python-ci.yml
Original file line number Diff line number Diff line change
@@ -32,12 +32,11 @@ jobs:
GRAPHRAG_API_VERSION: ${{ secrets.GRAPHRAG_API_VERSION }}
GRAPHRAG_LLM_DEPLOYMENT_NAME: ${{ secrets.GRAPHRAG_LLM_DEPLOYMENT_NAME }}
GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME: ${{ secrets.GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME }}
GRAPHRAG_CACHE_TYPE: "blob"
GRAPHRAG_CACHE_CONNECTION_STRING: ${{ secrets.BLOB_STORAGE_CONNECTION_STRING }}
GRAPHRAG_CACHE_CONTAINER_NAME: "cicache"
GRAPHRAG_CACHE_BASE_DIR: "cache"
GRAPHRAG_LLM_MODEL: gpt-3.5-turbo-16k
GRAPHRAG_EMBEDDING_MODEL: text-embedding-ada-002
GRAPHRAG_LLM_MODEL: ${{ secrets.GRAPHRAG_LLM_MODEL }}
GRAPHRAG_EMBEDDING_MODEL: ${{ secrets.GRAPHRAG_EMBEDDING_MODEL }}
GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL: ${{ secrets.GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL }}
# We have Windows + Linux runners in 3.10 and 3.11, so we need to divide the rate limits by 4
GRAPHRAG_LLM_TPM: 45_000 # 180,000 / 4
GRAPHRAG_LLM_RPM: 270 # 1,080 / 4
@@ -107,19 +106,3 @@ jobs:
- name: Integration Test
run: |
poetry run poe test_integration
# - name: Smoke Test
# if: steps.changes.outputs.python == 'true'
# run: |
# poetry run poe test_smoke

# - uses: actions/upload-artifact@v4
# if: always()
# with:
# name: smoke-test-artifacts-${{ matrix.python-version }}-${{ matrix.poetry-version }}-${{ runner.os }}
# path: tests/fixtures/*/output

# - name: E2E Test
# if: steps.changes.outputs.python == 'true'
# run: |
# ./scripts/e2e-test.sh
108 changes: 108 additions & 0 deletions .github/workflows/python-smoke-tests.yml
@@ -0,0 +1,108 @@
name: Python Smoke Tests
on:
push:
branches: [main]
pull_request:
branches: [main]

permissions:
contents: read
pull-requests: read

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
# Only run for the latest commit
cancel-in-progress: true

env:
POETRY_VERSION: 1.8.3

jobs:
python-ci:
strategy:
matrix:
python-version: ["3.10", "3.11"] # add 3.12 once gensim supports it. TODO: watch this issue - https://github.com/piskvorky/gensim/issues/3510
os: [ubuntu-latest, windows-latest]
env:
DEBUG: 1
GRAPHRAG_LLM_TYPE: "azure_openai_chat"
GRAPHRAG_EMBEDDING_TYPE: "azure_openai_embedding"
GRAPHRAG_API_KEY: ${{ secrets.OPENAI_API_KEY }}
GRAPHRAG_API_BASE: ${{ secrets.GRAPHRAG_API_BASE }}
GRAPHRAG_API_VERSION: ${{ secrets.GRAPHRAG_API_VERSION }}
GRAPHRAG_LLM_DEPLOYMENT_NAME: ${{ secrets.GRAPHRAG_LLM_DEPLOYMENT_NAME }}
GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME: ${{ secrets.GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME }}
GRAPHRAG_CACHE_CONTAINER_NAME: "cicache"
GRAPHRAG_CACHE_BASE_DIR: "cache"
GRAPHRAG_LLM_MODEL: ${{ secrets.GRAPHRAG_LLM_MODEL }}
GRAPHRAG_EMBEDDING_MODEL: ${{ secrets.GRAPHRAG_EMBEDDING_MODEL }}
GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL: ${{ secrets.GRAPHRAG_ENTITY_EXTRACTION_ENCODING_MODEL }}
# We have Windows + Linux runners in 3.10 and 3.11, so we need to divide the rate limits by 4
GRAPHRAG_LLM_TPM: 45_000 # 180,000 / 4
GRAPHRAG_LLM_RPM: 270 # 1,080 / 4
GRAPHRAG_EMBEDDING_TPM: 87_500 # 350,000 / 4
GRAPHRAG_EMBEDDING_RPM: 525 # 2,100 / 4
GRAPHRAG_CHUNK_SIZE: 1200
GRAPHRAG_CHUNK_OVERLAP: 0
# Azure AI Search config
AZURE_AI_SEARCH_URL_ENDPOINT: ${{ secrets.AZURE_AI_SEARCH_URL_ENDPOINT }}
AZURE_AI_SEARCH_API_KEY: ${{ secrets.AZURE_AI_SEARCH_API_KEY }}

runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4

- uses: dorny/paths-filter@v3
id: changes
with:
filters: |
python:
- 'graphrag/**/*'
- 'poetry.lock'
- 'pyproject.toml'
- '**/*.py'
- '**/*.toml'
- '**/*.ipynb'
- '.github/workflows/python*.yml'
- 'tests/smoke/*'
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install Poetry
uses: abatilo/[email protected]
with:
poetry-version: ${{ env.POETRY_VERSION }}

- name: Install dependencies
shell: bash
run: |
poetry self add setuptools wheel
poetry run python -m pip install gensim
poetry install
- name: Build
run: |
poetry build
- name: Install Azurite
id: azuright
uses: potatoqualitee/[email protected]

- name: Smoke Test
if: steps.changes.outputs.python == 'true'
run: |
poetry run poe test_smoke
- uses: actions/upload-artifact@v4
if: always()
with:
name: smoke-test-artifacts-${{ matrix.python-version }}-${{ env.POETRY_VERSION }}-${{ runner.os }}
path: tests/fixtures/*/output

- name: E2E Test
if: steps.changes.outputs.python == 'true'
run: |
./scripts/e2e-test.sh
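The rate-limit values in the workflow env blocks follow the "divide by 4" comments: four matrix jobs (two Python versions times two operating systems) share one deployment's quota. A minimal sketch of that arithmetic, using the account-level totals stated in the workflow comments:

```python
# Four CI jobs (2 Python versions x 2 OSes) share one Azure OpenAI deployment,
# so each job gets a quarter of the account-level quota. The totals below come
# from the comments in the workflow env block above.
ACCOUNT_LIMITS = {
    "llm_tpm": 180_000,
    "llm_rpm": 1_080,
    "embedding_tpm": 350_000,
    "embedding_rpm": 2_100,
}
MATRIX_JOBS = 2 * 2  # python-version x os

per_job = {name: total // MATRIX_JOBS for name, total in ACCOUNT_LIMITS.items()}
print(per_job)
# -> {'llm_tpm': 45000, 'llm_rpm': 270, 'embedding_tpm': 87500, 'embedding_rpm': 525}
```

The results match the `GRAPHRAG_*_TPM` / `GRAPHRAG_*_RPM` values set in both workflow files.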
70 changes: 70 additions & 0 deletions .semversioner/0.2.1.json
@@ -0,0 +1,70 @@
{
"changes": [
{
"description": "Added default columns for vector store at create_pipeline_config. No change for other cases.",
"type": "patch"
},
{
"description": "Change json parsing error in the map step of global search to warning",
"type": "patch"
},
{
"description": "Fix Local Search breaking when loading Embeddings input. Defaulting overwrite to True as in the rest of the vector store config",
"type": "patch"
},
{
"description": "Fix json parsing when LLM returns faulty responses",
"type": "patch"
},
{
"description": "Fix missing community reports and refactor community context builder",
"type": "patch"
},
{
"description": "Fixed a bug that erased the vector database, added a new parameter to specify the config file path, and updated the documentation accordingly.",
"type": "patch"
},
{
"description": "Try parsing json before even repairing",
"type": "patch"
},
{
"description": "Update Prompt Tuning meta prompts with finer examples",
"type": "patch"
},
{
"description": "Update default entity extraction and gleaning prompts to reduce hallucinations",
"type": "patch"
},
{
"description": "add encoding-model to entity/claim extraction config",
"type": "patch"
},
{
"description": "add encoding-model to text chunking config",
"type": "patch"
},
{
"description": "add user prompt to history-tracking llm",
"type": "patch"
},
{
"description": "update config reader to allow for zero gleans",
"type": "patch"
},
{
"description": "update config-reader to allow for empty chunk-by arrays",
"type": "patch"
},
{
"description": "update history-tracking LLM to use 'assistant' instead of 'system' in output history.",
"type": "patch"
},
{
"description": "use history argument in hash key computation; add history input to cache data",
"type": "patch"
}
],
"created_at": "2024-08-06T00:25:52+00:00",
"version": "0.2.1"
}
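Each `.semversioner/<version>.json` release file pairs with a `## <version>` section in CHANGELOG.md. A minimal sketch of that roll-up (the helper function is illustrative, not semversioner's actual API):

```python
import json

# One release file, shaped like .semversioner/0.2.1.json above (abridged).
release = json.loads("""
{
  "changes": [
    {"description": "Fix json parsing when LLM returns faulty responses", "type": "patch"},
    {"description": "Docs updates", "type": "patch"}
  ],
  "created_at": "2024-08-06T00:25:52+00:00",
  "version": "0.2.1"
}
""")

def changelog_section(release):
    """Render a '## <version>' block in the style of CHANGELOG.md."""
    lines = [f"## {release['version']}", ""]
    lines += [f"- {c['type']}: {c['description']}" for c in release["changes"]]
    return "\n".join(lines)

print(changelog_section(release))
```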
22 changes: 22 additions & 0 deletions .semversioner/0.2.2.json
@@ -0,0 +1,22 @@
{
"changes": [
{
"description": "Add a check if there is no community record added in local search context",
"type": "patch"
},
{
"description": "Add separate workflow for Python Tests",
"type": "patch"
},
{
"description": "Docs updates",
"type": "patch"
},
{
"description": "Run smoke tests on 4o",
"type": "patch"
}
],
"created_at": "2024-08-08T22:40:57+00:00",
"version": "0.2.2"
}
4 changes: 0 additions & 4 deletions .semversioner/next-release/patch-20240726142042913643.json

This file was deleted.

4 changes: 0 additions & 4 deletions .semversioner/next-release/patch-20240726143138162263.json

This file was deleted.

4 changes: 0 additions & 4 deletions .semversioner/next-release/patch-20240726154054702667.json

This file was deleted.

4 changes: 0 additions & 4 deletions .semversioner/next-release/patch-20240726181256417715.json

This file was deleted.

4 changes: 0 additions & 4 deletions .semversioner/next-release/patch-20240726200425411495.json

This file was deleted.

4 changes: 0 additions & 4 deletions .semversioner/next-release/patch-20240726205654788488.json

This file was deleted.

4 changes: 0 additions & 4 deletions .semversioner/next-release/patch-20240729213924743880.json

This file was deleted.

4 changes: 0 additions & 4 deletions .semversioner/next-release/patch-20240730182018844333.json

This file was deleted.

26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,32 @@
# Changelog

Note: version releases in the 0.x.y range may introduce breaking changes.

## 0.2.2

- patch: Add a check if there is no community record added in local search context
- patch: Add separate workflow for Python Tests
- patch: Docs updates

## 0.2.1

- patch: Added default columns for vector store at create_pipeline_config. No change for other cases.
- patch: Change json parsing error in the map step of global search to warning
- patch: Fix Local Search breaking when loading Embeddings input. Defaulting overwrite to True as in the rest of the vector store config
- patch: Fix json parsing when LLM returns faulty responses
- patch: Fix missing community reports and refactor community context builder
- patch: Fixed a bug that erased the vector database, added a new parameter to specify the config file path, and updated the documentation accordingly.
- patch: Try parsing json before even repairing
- patch: Update Prompt Tuning meta prompts with finer examples
- patch: Update default entity extraction and gleaning prompts to reduce hallucinations
- patch: add encoding-model to entity/claim extraction config
- patch: add encoding-model to text chunking config
- patch: add user prompt to history-tracking llm
- patch: update config reader to allow for zero gleans
- patch: update config-reader to allow for empty chunk-by arrays
- patch: update history-tracking LLM to use 'assistant' instead of 'system' in output history.
- patch: use history argument in hash key computation; add history input to cache data

## 0.2.0

- minor: Add content-based KNN for selecting prompt tune few shot examples
2 changes: 1 addition & 1 deletion README.md
@@ -88,7 +88,7 @@ poetry run poe index --init --root .

The GraphRAG project is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs.

To learn more about GraphRAG and how it can be used to enhance your LLMs ability to reason about your private data, please visit the <a href="https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/" target="_blank">Microsoft Research Blog Post.</a>
To learn more about GraphRAG and how it can be used to enhance your LLM's ability to reason about your private data, please visit the <a href="https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/" target="_blank">Microsoft Research Blog Post.</a>

## Quickstart

1 change: 0 additions & 1 deletion docsite/posts/config/env_vars.md
@@ -93,7 +93,6 @@ These settings control the text embedding model used by the pipeline. Any settin
| `GRAPHRAG_EMBEDDING_REQUESTS_PER_MINUTE` | | The number of requests per minute to allow for the embedding client. 0 = Bypass | `int` | 0 |
| `GRAPHRAG_EMBEDDING_MAX_RETRIES` | | The maximum number of retries to attempt when a request fails. | `int` | 10 |
| `GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT` | | The maximum number of seconds to wait between retries. | `int` | 10 |
| `GRAPHRAG_EMBEDDING_TARGET` | | The target fields to embed. Either `required` or `all`. | `str` | `required` |
| `GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION` | | Whether to sleep on rate limit recommendation. (Azure Only) | `bool` | `True` |
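The defaults documented in the table above can be read with plain environment lookups; a hedged sketch (the function name is illustrative, not GraphRAG's actual config reader):

```python
import os

def embedding_limits(env=None):
    """Read the embedding retry/rate settings documented above, with defaults."""
    env = os.environ if env is None else env
    return {
        "requests_per_minute": int(env.get("GRAPHRAG_EMBEDDING_REQUESTS_PER_MINUTE", "0")),
        "max_retries": int(env.get("GRAPHRAG_EMBEDDING_MAX_RETRIES", "10")),
        "max_retry_wait": int(env.get("GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT", "10")),
        "sleep_on_rate_limit": env.get(
            "GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION", "True"
        ) == "True",
    }

print(embedding_limits({}))  # empty env -> all documented defaults
```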

## Input Settings
1 change: 0 additions & 1 deletion docsite/posts/config/template.md
@@ -168,7 +168,6 @@ GRAPHRAG_EMBEDDING_API_VERSION="api_version" # For Azure OpenAI Users and if GRA
# GRAPHRAG_ASYNC_MODE=asyncio
# GRAPHRAG_ENCODING_MODEL=cl100k_base
# GRAPHRAG_MAX_CLUSTER_SIZE=10
# GRAPHRAG_ENTITY_RESOLUTION_ENABLED=False
# GRAPHRAG_SKIP_WORKFLOWS=None
# GRAPHRAG_UMAP_ENABLED=False
```
2 changes: 1 addition & 1 deletion docsite/posts/get_started.md
@@ -100,7 +100,7 @@ python -m graphrag.index --root ./ragtest

![pipeline executing from the CLI](/img/pipeline-running.png)

This process will take some time to run. This depends on the size of your input data, what model you're using, and the text chunk size being used (these can be configured in your `.env` file).
This process will take some time to run. This depends on the size of your input data, what model you're using, and the text chunk size being used (these can be configured in your `settings.yml` file).
Once the pipeline is complete, you should see a new folder called `./ragtest/output/<timestamp>/artifacts` with a series of parquet files.

# Using the Query Engine
4 changes: 1 addition & 3 deletions docsite/posts/index/0-architecture.md
@@ -52,10 +52,8 @@ stateDiagram-v2
Chunk --> ExtractGraph
Chunk --> EmbedDocuments
ExtractGraph --> GenerateReports
ExtractGraph --> EmbedEntities
ExtractGraph --> EmbedGraph
EntityResolution --> EmbedGraph
EntityResolution --> GenerateReports
ExtractGraph --> EntityResolution
```

### Dataframe Message Format
15 changes: 4 additions & 11 deletions docsite/posts/index/1-default_dataflow.md
@@ -34,8 +34,7 @@ flowchart TB
subgraph phase2[Phase 2: Graph Extraction]
textUnits --> graph_extract[Entity & Relationship Extraction]
graph_extract --> graph_summarize[Entity & Relationship Summarization]
graph_summarize --> entity_resolve[Entity Resolution]
entity_resolve --> claim_extraction[Claim Extraction]
graph_summarize --> claim_extraction[Claim Extraction]
claim_extraction --> graph_outputs[Graph Tables]
end
subgraph phase3[Phase 3: Graph Augmentation]
@@ -95,7 +94,7 @@ Entities and Relationships are extracted at once in our _entity_extract_ verb, a
title: Graph Extraction
---
flowchart LR
tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization] --> er[Entity Resolution]
tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization]
tu --> ce[Claim Extraction]
```

@@ -109,18 +108,12 @@ These subgraphs are merged together - any entities with the same _name_ and _typ

Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.
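The merge-then-summarize flow described above can be sketched in a few lines. This is a toy illustration, not GraphRAG's actual code: the entity names are made up, and the LLM summarization call is stubbed with a simple join of the distinct descriptions.

```python
from collections import defaultdict

# Toy per-chunk extraction output: (name, type, description) triples. Entities
# sharing the same name and type are merged and their descriptions collected,
# mirroring the subgraph merge step described above.
extracted = [
    ("MICROSOFT", "ORGANIZATION", "A technology company"),
    ("MICROSOFT", "ORGANIZATION", "Publisher of a research blog"),
    ("GRAPHRAG", "PROJECT", "A data pipeline built on LLMs"),
]

merged = defaultdict(list)
for name, etype, description in extracted:
    merged[(name, etype)].append(description)

def summarize(descriptions):
    # Stand-in for the LLM summarization step: the real pipeline asks an LLM
    # for one concise description; here we just join the distinct entries.
    return "; ".join(dict.fromkeys(descriptions))

entities = {key: summarize(descs) for key, descs in merged.items()}
print(entities)
```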

### Entity Resolution (Not Enabled by Default)

The final step of graph extraction is to resolve any entities that represent the same real-world entity but have different names. Since this is done via LLM and we don't want to lose information, we want to take a conservative, non-destructive approach.

Our current implementation of Entity Resolution, however, is destructive. It will provide the LLM with a series of entities and ask it to determine which ones should be merged. Those entities are then merged together into a single entity and their relationships are updated.

We are currently exploring other entity resolution techniques. In the near future, entity resolution will be executed by creating an edge between entity variants indicating that the entities have been resolved by the indexing engine. This will allow for end-users to undo indexing-side resolutions, and add their own non-destructive resolutions using a similar process.

### Claim Extraction & Emission

Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These are emitted as a primary artifact called **Covariates**.

Note: claim extraction is _optional_ and turned off by default. This is because claim extraction generally needs prompt tuning to be useful.

## Phase 3: Graph Augmentation

Now that we have a usable graph of entities and relationships, we want to understand their community structure and augment the graph with additional information. This is done in two steps: _Community Detection_ and _Graph Embedding_. These give us explicit (communities) and implicit (embeddings) ways of understanding the topological structure of our graph.
