When your team needs to own the code
Some clients want a knowledge assistant they can log into and maintain through a UI. Others want full control — the source code, the infrastructure, the ability to extend it themselves. This project was built for the second type of client.
Material Specs KB is a RAG (Retrieval-Augmented Generation) pipeline built entirely in Python. It ingests 33 manufacturer PDFs from Google Drive, indexes them in a Pinecone vector database, and serves precise technical answers through a Flask web interface. Engineers and procurement teams can query adhesive specifications, application guidelines, and performance data in plain language — and get answers with exact values and source citations.
A companion no-code version of this same system was also built using Flowise. The Python version exists to demonstrate a different delivery track — one where the client owns every line of code.

The problem
Industrial material specifications live in PDF datasheets — dozens of them, spread across multiple manufacturers, each with different formatting and terminology. Finding a specific value (cure time, tensile strength, temperature resistance) means knowing which product to look in, finding the right PDF, and scanning the right table.
For teams sourcing adhesives, sealants, or specialty coatings across multiple vendors, this is a recurring time cost. A knowledge assistant that can answer “What is the overlap shear strength of 3M 468MP?” in seconds — with a source citation — is a practical productivity tool.
System architecture
Google Drive (33 PDFs)
|
v
fetch.py — Google Drive API + PyMuPDF
|
v
fetched_docs.json (raw extracted text)
|
v
index_all.py — OpenAI text-embedding-3-small
|
v
Pinecone (material-specs-python index, 83 vectors)
|
v
app.py — Flask + GPT-4o-mini
|
v
Chat UI (render.com)
What was built
fetch.py connects to Google Drive via a service account, recursively scans a shared folder, downloads each PDF, and extracts its plain text with PyMuPDF. All 33 PDFs processed, zero skipped.
index_all.py splits the extracted text into 1,000-word chunks with a 200-word overlap, embeds each chunk with OpenAI's text-embedding-3-small model, and upserts the vectors to a Pinecone serverless index. 83 vectors total.
query.py provides a CLI test interface for validating retrieval before building the web layer.
app.py is a Flask application that embeds incoming questions, retrieves the top 5 matching chunks from Pinecone, passes them as context to GPT-4o-mini, and returns the answer with source document names.
Results
The system retrieves precise technical values from the correct source documents. A query for 3M Adhesive Transfer Tape 468MP returns peel adhesion values, shear strength, temperature resistance, and dielectric properties — with sources cited. A query for pricing returns a clean refusal: “I don’t have that information in the loaded specifications.”
Retrieval is fast. The chat UI is clean and usable by non-technical staff. The full codebase is on GitHub and deployable to any Python-compatible host.
Why Python over no-code
The Flowise version of this project does the same job with less setup — no coding required, Google Drive OAuth built in, visual canvas for the pipeline. For an SME client who wants a working system fast and doesn’t have developers on staff, that’s the right choice.
The Python version is the right choice when:
- The client wants to own and extend the codebase
- The system needs to integrate with internal APIs or databases
- The team has developers who will maintain it
- Monthly SaaS tool costs need to be eliminated long-term
- The project is a foundation for a larger AI system
Both tracks deliver the same outcome. The decision depends on the client’s team, not the complexity of the problem.
Stack
- Ingestion: Python, Google Drive API, PyMuPDF
- Embeddings: OpenAI text-embedding-3-small
- Vector database: Pinecone (serverless, AWS us-east-1)
- LLM: GPT-4o-mini
- Web framework: Flask
- Deployment: Render
- Version control: GitHub
Live demo
material-specs-python.onrender.com