algae graphrag thesis

Hybrid GraphRAG system combining Neo4j knowledge graphs with vector search for domain-specific Q&A on algae applications

Domain-Specific Chatbot for Algae

Master's thesis project - MSc Data Science, University of Southern Denmark (Kolding)

Overview

Algae are a sustainable and versatile resource with growing applications in biofuel production, pharmaceuticals, nutraceuticals, fertilizers, and environmental management. This project develops a domain-specific chatbot built on a hybrid GraphRAG approach, which combines knowledge-graph retrieval with vector similarity search to answer user queries about algae-related topics accurately and in real time.

The system uses Neo4j for structured knowledge representation and vector embeddings for semantic search, with LangChain for orchestration. Combining graph-based (symbolic) retrieval with embedding-based (semantic) retrieval is intended to make the chatbot a useful tool for the algae industry and research community.

Research Questions

RQ1: How can a hybrid GraphRAG architecture be effectively designed and implemented to answer domain-specific questions about algae with high accuracy and relevance?
RQ2: Which retrieval strategies and embedding models yield the best performance for retrieving relevant information from a heterogeneous corpus of algae-related documents?
RQ3: How does the integration of knowledge graphs with vector-based retrieval impact the quality and factual grounding of generated responses compared to vector-only approaches?

Expected Deliverables

Data pipeline: Automated ingestion, processing, and indexing of a multi-gigabyte corpus from diverse sources (research papers, Wikipedia, blogs, catalogs)
Vector database: Searchable vector store with document embeddings and metadata
Knowledge graph: Neo4j-based representation capturing entities (species, cultivation methods, applications) and their relationships
GraphRAG chatbot: Functional chatbot implementing hybrid retrieval combining knowledge graph traversal with vector similarity search
Evaluation report: Comparative analysis of GraphRAG vs vector-only RAG with quantitative metrics
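
As an illustration of the processing step in the data pipeline, the sketch below shows fixed-size chunking with overlap, one common baseline for splitting documents before indexing. It is not the project's actual chunker: the function name `chunk_text` and the parameters `chunk_size` and `overlap` are assumptions made for this example, and splitting on whitespace is a simplification.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    (Hypothetical helper; names and defaults are illustrative only.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Larger chunks keep more context per embedding, while smaller chunks make retrieval more targeted; the overlap trades index size for robustness at chunk boundaries.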

System Architecture

The hybrid GraphRAG architecture consists of:

Indexing component: Converts documents into vector embeddings using OpenAI embeddings, Sentence-BERT, or domain-adapted alternatives
Knowledge graph construction: Extracts entities and relationships from the corpus and stores them in Neo4j
Hybrid retrieval: Combines vector similarity search with graph traversal using Cypher queries
Generation component: LLM synthesizes combined context into coherent answers
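
The hybrid retrieval step can be sketched with a toy, in-memory example. This is only an illustration of the idea, not the project's implementation: the chunks, the hand-made 3-dimensional "embeddings", the triples, and all function names are hypothetical, and a real system would query a vector store and Neo4j instead of Python dictionaries and lists.

```python
import math

# Toy chunks with hand-made 3-dimensional "embeddings" (illustrative data only).
CHUNKS = {
    "c1": ("Spirulina is cultivated in open raceway ponds.", [0.9, 0.1, 0.0]),
    "c2": ("Chlorella is used in nutraceutical supplements.", [0.1, 0.9, 0.0]),
    "c3": ("Algal lipids can be converted into biodiesel.", [0.0, 0.2, 0.9]),
}

# Toy knowledge graph as (subject, relation, object) triples (hypothetical schema).
TRIPLES = [
    ("Spirulina", "CULTIVATED_IN", "Raceway pond"),
    ("Spirulina", "USED_FOR", "Nutraceuticals"),
    ("Chlorella", "USED_FOR", "Nutraceuticals"),
    ("Nannochloropsis", "USED_FOR", "Biofuel"),
]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        CHUNKS.values(),
        key=lambda chunk: cosine(query_vec, chunk[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def graph_facts(entity: str) -> list[str]:
    """Return one-hop facts in which the entity is subject or object."""
    return [f"{s} {r} {o}" for s, r, o in TRIPLES if entity in (s, o)]


def hybrid_context(query_vec: list[float], entities: list[str], k: int = 2) -> dict:
    """Combine vector-search passages with graph facts for the given entities."""
    facts = [fact for entity in entities for fact in graph_facts(entity)]
    return {
        "passages": vector_search(query_vec, k),
        "facts": list(dict.fromkeys(facts)),  # de-duplicate, keep order
    }


if __name__ == "__main__":
    print(hybrid_context([1.0, 0.0, 0.0], ["Spirulina"], k=1))
```

The returned dictionary is the combined context that would be placed in the LLM prompt. Keeping passages and graph facts separate makes it easier to compare the hybrid setup against a vector-only baseline by simply dropping the `facts` entry.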

Project Structure

project/
├── data/
│   ├── raw/          # Original PDFs and downloads (unmodified)
│   └── processed/    # Chunked texts and cleaned data
├── src/
│   ├── ingestion/    # PDF loading, chunking
│   ├── retrieval/    # Vector search, graph queries
│   └── generation/   # LLM calls, prompts
├── notebooks/        # Experiments and exploration
├── tests/            # Unit tests
├── outputs/          # Generated results, evaluation reports
├── .env              # API keys (not tracked)
├── .gitignore
├── requirements.txt
└── README.md

Tech Stack

Language: Python
RAG orchestration: LangChain / LlamaIndex
Vector storage: ChromaDB / Pinecone / FAISS
Knowledge graph: Neo4j (Cypher queries, GraphCypherQAChain)
LLMs: OpenAI API and/or open-source alternatives
Entity extraction: LLM-based pipelines
User interface: Streamlit / Flask
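
To give a sense of what a Cypher query against the knowledge graph might look like, the sketch below builds a parameterized query string in Python. The schema is an assumption made for this example: the labels `Species` and `Application`, the relationship type `USED_FOR`, the `name` property, and the helper `species_for_application_query` are hypothetical and not taken from the project's actual graph.

```python
def species_for_application_query(application: str, limit: int = 10) -> tuple[str, dict]:
    """Build a parameterized Cypher query and its parameters.

    Assumes a hypothetical schema:
        (:Species {name})-[:USED_FOR]->(:Application {name})
    Values are passed as parameters ($application, $limit) rather than
    interpolated into the string, which avoids Cypher injection.
    """
    query = (
        "MATCH (s:Species)-[:USED_FOR]->(a:Application {name: $application}) "
        "RETURN s.name AS species "
        "ORDER BY species "
        "LIMIT $limit"
    )
    params = {"application": application, "limit": limit}
    return query, params


if __name__ == "__main__":
    query, params = species_for_application_query("Biofuel", limit=5)
    print(query)
    print(params)
```

The resulting `query` and `params` could then be handed to a Neo4j session, for example via `session.run(query, params)` in the official Python driver.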

Evaluation Metrics

HOPE
Retrieval quality: Precision@K, Recall@K, MRR
Answer quality: Faithfulness, relevance, groundedness
Response latency
Baseline comparison: GraphRAG vs vector-only RAG
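
The retrieval-quality metrics listed above have standard definitions, which the following sketch implements for ranked lists of document IDs. The function names and the example IDs are illustrative, and the code assumes each ranked list contains no duplicate IDs.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


if __name__ == "__main__":
    retrieved = ["d3", "d1", "d7"]   # ranked results for one query
    relevant = {"d1", "d2"}          # ground-truth relevant documents
    print(precision_at_k(retrieved, relevant, k=2))
    print(recall_at_k(retrieved, relevant, k=2))
    print(mean_reciprocal_rank([(retrieved, relevant), (["d2"], {"d2"})]))
```

Computing the same functions over the results of both pipelines, with identical queries and relevance labels, gives a like-for-like comparison between GraphRAG and vector-only RAG.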

Timeline

| Phase | Period | Focus |
|-------|--------|-------|
| 1 | Feb - Mar | Preprocessing pipeline, knowledge graph schema design, initial corpus indexing |
| 2 | Mar - Apr | Knowledge graph construction in Neo4j, RAG implementation, embedding model comparison |
| 3 | Apr - May | GraphRAG integration, chatbot interface, evaluation, thesis writing |
| 4 | May - Jun | Final testing, documentation, thesis writing, defense preparation |

Status

Currently comparing chunking strategies and assessing the quality of PDF summarization and information retrieval.

Author

Filip Nový - finov24@student.sdu.dk

Supervisor: Tariq Youssef
Department: Mathematics and Computer Science, University of Southern Denmark
