FME-SQL-LLM Report Generator
Type: Template workspace (Form)
Purpose: Turn a plain-language question about a table into a planned set of SQL queries, execute them in DuckDB, and produce a Markdown → HTML report using three lightweight LLM roles (Planner, SQL Generator, Analyst).
What it does
Standardizes input: reads any tabular file (Excel/CSV/Parquet), normalizes headers, and writes a temporary Parquet dataset.
Profiles data: extracts schema, summary stats, and sample rows with DuckDB (compact “metadata” for prompts).
Plans the analysis (Planner LLM): converts the goal + metadata into formal sub-questions and SQL intents.
Drafts SQL (SQL Generator LLM): returns executable SQL per sub-question (minimal JSON: just sql).
Executes safely in DuckDB: runs each query with guardrails (e.g., COPY (SELECT ...), semicolon stripping, optional LIMIT cap).
Analyzes results (Analyst LLM): synthesizes named result sets into a Markdown narrative answering the original question.
Publishes: renders Markdown to HTML with a simple template and writes the final report.
Why it’s useful
Scales expertise: lets non-SQL users ask meaningful questions; the system plans and writes SQL for them.
Keeps LLMs in their lane: models see only metadata + small samples, while DuckDB does the heavy lifting.
Repeatable & portable: drop in any table, update the research question, and re-run—same pipeline, new insight.
Inputs & outputs
Inputs
Any tabular file (e.g., .xlsx, .csv, .parquet).
A natural-language question (e.g., “Which locations have the most cyclists, and when are the peaks?”).
Outputs
report.html — final HTML report (from generated Markdown).
Intermediates (optional to keep):
temp/*.parquet (standardized table)
schema.sql (DDL), summary.json (stats), sample.csv (few rows)
questions.json (planner output), queries.jsonl (SQL list)
results/*.csv (per-query results, capped)
Parameters (exposed)
Research question (string).
Result row cap (int; default 30) for safety/preview.
Keep intermediates (yes/no).
Model choices
Planner model (name / endpoint / temperature)
SQL Generator model (name / endpoint / temperature)
Analyst model (name / endpoint / temperature)
DuckDB work folder (temp path override).
Markdown → HTML template (file/path or inline).
How it works (pipeline)
1) Convert → Parquet
Normalize headers (e.g., replace non-alphanumeric characters with underscores) and write a temporary Parquet file.
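The header normalization in step 1 can be sketched in a few lines of Python (the exact regex and lowercasing are illustrative; the workspace's own rules may differ):

```python
import re

def normalize_header(name: str) -> str:
    """Collapse non-alphanumeric runs to underscores and lowercase the result,
    so downstream SQL can reference columns without quoting."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# Headers as they might arrive from an Excel export:
raw = ["Bike Count (AM)", "Location Name", "% Change 2023/2024"]
print([normalize_header(h) for h in raw])
# → ['bike_count_am', 'location_name', 'change_2023_2024']
```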
2) Profile in DuckDB
Emit schema.sql, summary.json (via SUMMARIZE/aggregates), and sample.csv (tiny SELECT).
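The three profiling outputs map onto three small DuckDB statements; a sketch that only builds the SQL (the file names match the intermediates listed above, the sample size is an assumption):

```python
def profiling_statements(parquet_path: str, sample_rows: int = 5) -> dict:
    """Return the DuckDB SQL used to derive schema, summary stats, and a sample."""
    src = f"read_parquet('{parquet_path}')"
    return {
        # Column names/types -> source material for schema.sql
        "schema": f"DESCRIBE SELECT * FROM {src}",
        # Per-column stats (min/max/avg/null %) -> source material for summary.json
        "summary": f"SUMMARIZE SELECT * FROM {src}",
        # A few concrete rows the LLMs are allowed to see -> sample.csv
        "sample": f"COPY (SELECT * FROM {src} LIMIT {sample_rows}) TO 'sample.csv' (HEADER)",
    }

stmts = profiling_statements("temp/table.parquet")
```

Note the sample export already follows the COPY (SELECT ...) convention used later for query results.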
3) Plan (Planner LLM)
Input: research question + metadata → Output: sub-questions + SQL intents (strict JSON).
4) Draft SQL (SQL Generator LLM)
For each sub-question → return a strict JSON object containing only sql; flatten it to feature attributes.
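Flattening the generator's minimal reply is a small parsing step; a sketch (the defensive fence-stripping is an assumption, for models that ignore the "no code fences" instruction):

```python
import json

def extract_sql(reply: str) -> str:
    """Parse {"sql": "..."} from an LLM reply, tolerating stray code fences."""
    text = reply.strip()
    if text.startswith("```"):
        # Drop an opening fence like ```json plus the closing ```
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    obj = json.loads(text)
    if not isinstance(obj.get("sql"), str) or not obj["sql"].strip():
        raise ValueError("generator reply missing a non-empty 'sql' string")
    return obj["sql"].strip()
```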
5) Execute in DuckDB
- Remove trailing ;
- Wrap each query so COPY (SELECT ...) TO ... always receives a SELECT
- Apply optional LIMIT {cap} to guard output size
- Save per-query result files
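The guardrails above amount to a few string transforms before the query reaches DuckDB; a minimal sketch (the subquery-alias wrapping pattern is one way to apply the cap; the default of 30 mirrors the exposed parameter):

```python
def guarded_copy(sql, out_csv, cap=30):
    """Normalize a generated query and wrap it so DuckDB's COPY always
    receives a single SELECT, optionally capped with LIMIT."""
    q = sql.strip().rstrip(";").strip()  # strip trailing semicolons
    if cap is not None:
        q = f"SELECT * FROM ({q}) AS sub LIMIT {cap}"  # safety cap
    return f"COPY ({q}) TO '{out_csv}' (HEADER)"

wrapped = guarded_copy("SELECT location, count(*) AS n FROM t GROUP BY 1;",
                       "results/q1.csv")
```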
6) Analyze (Analyst LLM)
Provide original task + metadata + named result sets → Markdown narrative with findings/limitations.
7) Publish
Inject Markdown into HTML template and write report.html.
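The publish step is a straightforward substitution once the Markdown has been rendered to HTML (by FME or a converter library); a sketch assuming the template marks the insertion point with a {{content}} placeholder (the placeholder name is an assumption):

```python
def publish(html_body: str, template: str, out_path: str = "report.html") -> str:
    """Inject already-rendered HTML into the page template and write report.html."""
    page = template.replace("{{content}}", html_body)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(page)
    return page

template = "<html><body>{{content}}</body></html>"
page = publish("<h1>Findings</h1>", template)
```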
Guardrails & conventions
No raw table → LLM: only schema, summary, sample rows.
Deterministic output shape: use COPY (SELECT ...) for stable CSV schemas.
Safety caps: configurable LIMIT on SELECT queries.
Strict JSON for planner/generator outputs; validate before execution.
Repro hints: log model name, temperature, and prompts with a run ID.
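Validating the planner's strict JSON before execution might look like this (the field names sub_questions / question / intent are illustrative; match them to the actual planner prompt):

```python
import json

def validate_plan(raw: str) -> list:
    """Fail fast if the planner reply is not the expected strict JSON shape."""
    plan = json.loads(raw)  # raises on non-JSON replies
    items = plan.get("sub_questions")
    if not isinstance(items, list) or not items:
        raise ValueError("plan must contain a non-empty 'sub_questions' list")
    for i, item in enumerate(items):
        for key in ("question", "intent"):
            if not isinstance(item.get(key), str) or not item[key].strip():
                raise ValueError(f"sub_questions[{i}] missing '{key}'")
    return items
```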
Example prompts (minimal)
Planner (system, abridged):
You are a data analysis planner. Given a goal and compact metadata (schema, summary, sample),
return STRICT JSON with sub-questions and brief SQL intents. No SQL execution.
SQL Generator (system, abridged):
You return ONLY a JSON object: {"sql": "<single SQL statement>"} executable in DuckDB.
No explanation, no comments, no code fences.
Analyst (system, abridged):
Synthesize named result tables into a concise Markdown report answering the original question.
Reference table names; note assumptions and limitations. No code.
Requirements
FME Form 2025.1+
DuckDB (included/embedded or accessible from FME run environment)
LLM endpoints (OpenAI-compatible or local gateway) for the three roles
Extending
Add validation checks (row count, coverage, sum/avg bounds) before the Analyst step.
Swap Markdown → PDF or add small charts from result CSVs.
Cache metadata for repeated runs on the same dataset.
Localize labels / dictionary for non-English reports.
Version & compatibility
v1.0 — initial release (three-role pipeline, HTML reporting).
Tested with FME Form 2025.1 and DuckDB 1.x.
Tags
FME · DuckDB · LLM · OpenAI-compatible · SQL · Analytics · Automation · Parquet · Data Profiling · Markdown · HTML Report
