Building an AI SQL Query Generator for Your CRE Data Lake

Why commercial real estate leaders are moving from dashboards to self-service intelligence - and how to do it without creating new risks.
Your company's most valuable intelligence sits in the same place it has for decades: a SQL database storing lease rolls, occupancy trends, tenant financials, and operating fundamentals. The data is clean, curated, and comprehensive. The problem is that most of the people who need it cannot write a SQL query.
This is not a data problem; it is an access-layer problem. Your leases, your rent rolls, your environmental compliance records, and your investment performance metrics are all inaccessible to the average employee without help from someone who can write SQL.
The actual SQL query - the technical instruction that tells the database what to retrieve - might take fifteen minutes to write if you speak the “language”. But the business user does not, so they wait - days, sometimes weeks - to get assistance from a data scientist.
Here, the cost is not just time; it is the gradual erosion of data-driven decision-making. When asking a question becomes expensive, people stop asking.
1. Recent Tech Innovations Have Enabled Accurate Text-to-SQL
In the last eighteen months, AI pipelines for text-to-SQL applications have undergone a significant transformation. The capability to convert natural language into executable database queries has transitioned from an experimental concept to a production-ready feature. This increased performance is coupled with a drastic reduction in cost: building a text-to-SQL pipeline today costs only one-tenth of what it did just three years ago.
The evolution happened in two distinct phases since the launch of ChatGPT in 2022. In the first phase, text-to-SQL pipelines required fine-tuning large language models on thousands of proprietary SQL examples. A prominent example was JLL GPT, JLL's proprietary LLM fine-tuned for real estate use cases.
However, fine-tuning is an enormous undertaking - these projects were expensive, rigid, and brittle. It also carries unforeseen pitfalls: when the database schema shifts (a new table added, a column renamed), a fine-tuned model tends to fail and requires expensive retraining.
Today, the leading methodologies require no fine-tuning at all. We can use the most popular commercial AI models, such as OpenAI’s GPT or Anthropic’s Claude Sonnet. Instead of fine-tuning these models, we enrich these “base” models with detailed database metadata: descriptions of tables, columns, value distributions, and historical query patterns. These modern systems achieve state-of-the-art accuracy on complex database schemas (Automatic Metadata Extraction for Text-to-SQL).
Industry benchmarks show these approaches handling multi-table joins and aggregate calculations with over 80 percent reliability. The implication is profound: you no longer need a machine learning team to deploy natural language access to your data warehouse. You need a software engineering approach that treats the model as a capable, but context-hungry, reasoning engine.
2. What a Modern Text-to-SQL Pipeline Actually Does
Sections 2 and 3 cover the technical implementation of a text-to-SQL pipeline. If you’d rather skip the implementation details, jump to section 4.
The core insight of current-generation text-to-SQL systems is simple: publicly available LLMs are already good at writing SQL. What they lack is context about your specific business. A modern pipeline therefore focuses on briefing the model exhaustively rather than retraining it.
The process begins with few-shot learning. We work with subject matter experts in your company to create training examples for your AI pipeline, drawing on the queries your existing analytics systems already produce. These examples teach the model how your organization actually talks about its data.
Next, we need to understand the metadata of your database. We conduct an exploratory analysis of your data lake, and generate rich metadata about every table and column - not just names, but value distributions, foreign key relationships, and business descriptions mined from existing documentation. We analyze historical query logs to identify common join paths and implicit business logic that veteran analysts know but which never appears in schema definitions. All of this data is stored inside of a vector database, which allows us to create a search engine for our generated metadata and historical queries.
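To make the retrieval step concrete, here is a minimal sketch of indexing metadata and searching it by question. A production system would use an embedding model and a vector database; plain word overlap stands in for vector similarity here so the example stays self-contained, and the table names and descriptions are invented for illustration.

```python
# Sketch: indexing table metadata for retrieval. Word overlap stands in
# for the embedding-based similarity a real vector database would use.
import re
from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    name: str          # table (or few-shot example) identifier
    description: str   # business description mined from docs and query logs
    words: set = field(init=False)

    def __post_init__(self):
        self.words = set(re.findall(r"\w+", self.description.lower()))

def retrieve(index, question, top_k=2):
    """Return the top_k entries whose descriptions best match the question."""
    q_words = set(re.findall(r"\w+", question.lower()))
    scored = sorted(index, key=lambda e: len(e.words & q_words), reverse=True)
    return [e.name for e in scored[:top_k]]

index = [
    MetadataEntry("leases", "lease roll with tenant name, start date, expiry date, base rent"),
    MetadataEntry("occupancy", "monthly occupancy rate by property and submarket"),
    MetadataEntry("opex", "operating expenses by property, category, and period"),
]

print(retrieve(index, "Which leases expire next year and what is the base rent?"))
# → ['leases', 'occupancy']
```

The retrieved names (and their descriptions) are what later gets injected into the LLM's prompt, so the quality of this metadata directly drives query accuracy.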
At this stage, we possess sufficient metadata to successfully query your database using an LLM. Let’s put our process engineering hats on - using the data we generated in the previous steps, our AI-powered text-to-SQL pipeline will follow these steps.
1. The pipeline clarifies your question, asking follow-ups when the user's query is ambiguous.
2. The pipeline gathers relevant context: it retrieves the most relevant few-shot examples and metadata from the vector database built in the previous steps.
3. The context fetched in step 2 is injected into a prompt alongside the user's question, and we ask an LLM to generate SQL.
4. We run the generated SQL against the data warehouse and return the results to the user.
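Under illustrative assumptions - a SQLite warehouse and stub functions standing in for the LLM call and vector search (the function names and schema are invented for this sketch, not a vendor API) - the steps above look roughly like this:

```python
# Sketch of the pipeline steps, with stubs where production systems
# would call a vector database and an LLM.
import sqlite3

def retrieve_context(question):
    # Step 2: in production, search the vector database built earlier;
    # here we return canned metadata and one few-shot example.
    return (
        "Table leases(tenant TEXT, base_rent REAL, expiry TEXT)\n"
        "Example: 'total rent' -> SELECT SUM(base_rent) FROM leases;"
    )

def generate_sql(question, context):
    # Step 3: in production, send context + question to an LLM
    # (e.g. via the OpenAI or Anthropic API) and return its SQL.
    return "SELECT SUM(base_rent) FROM leases;"

def answer(question, conn):
    # Step 1 (clarification) is omitted; assume the question is unambiguous.
    context = retrieve_context(question)   # step 2: fetch relevant context
    sql = generate_sql(question, context)  # step 3: draft the SQL
    return conn.execute(sql).fetchall()    # step 4: run it, return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leases (tenant TEXT, base_rent REAL, expiry TEXT)")
conn.executemany("INSERT INTO leases VALUES (?, ?, ?)",
                 [("Acme", 10000.0, "2026-01-31"), ("Globex", 15000.0, "2027-06-30")])

print(answer("What is our total base rent?", conn))  # → [(25000.0,)]
```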
If you’re struggling to understand this pipeline, please see our interactive artifact to walk through a real-world example.
This "write first, validate later" methodology is more dependable than asking the model to pre-classify which tables are relevant, because it capitalizes on the model's primary strength (reasoning about data relationships) rather than its weaker ability (judging relevance). The result is a system where metadata, carefully injected into the model's context window, consistently outperforms the querying accuracy of fine-tuned models.
For a commercial real estate firm, this means the system learns the nuances of your data schema and business rules - all without custom model training. Retrieving context and generating SQL - assembling the prompt and producing the query - form the core LLM-centric part of the text-to-SQL process.
For text-to-SQL benchmarking we use the BIRD Benchmark, an industry standard for text-to-SQL generation. Our methodology is based on the #1-ranked method, AskData + GPT-4o by AT&T CDO - DSAIR, which is the state of the art for text-to-SQL.
3. Data Sovereignty Without Sacrifice: Modern Deployment Options
A significant breakthrough in Text-to-SQL is the newfound ability to achieve high accuracy without sacrificing control over your data. Organizations are no longer forced to send sensitive information to public APIs for optimal performance.
Four distinct deployment models are now available, offering leadership teams the flexibility to balance speed, cost, and security based on their specific risk tolerance and budget. The following deployment models are listed from easiest to hardest for a software engineering team:
- API-Based Access (easiest): This model offers the fastest path to proof-of-concept and the lowest infrastructure cost. While your data leaves your environment, providers are contractually prohibited from using it for training. This option requires the highest trust in the provider's security and contractual commitments. (OpenAI API, Anthropic API)
- Managed Service with VPC: The most common choice for enterprise deployments, this model uses managed LLM inference services such as AWS Bedrock and Azure OpenAI within your own Virtual Private Cloud. Your data stays within your network perimeter, while the infrastructure provider manages patching, scaling, and availability. It provides a security posture comparable to established SaaS financial systems without demanding new operational expertise.
- Bare Metal Server in a VPC: Designed for organizations with stringent compliance needs or those managing highly sensitive financial data, this option eliminates shared hypervisors and multi-tenant hardware. You rent a bare metal GPU server (such as an NVIDIA H100) from your cloud provider. You gain guaranteed hardware-level isolation, but you must manage the inference service yourself (typically with technologies like Ollama or vLLM). You gain maximum control, but the responsibility for infrastructure operations falls to your team.
- Bare Metal Server On-Premises (hardest): This represents the highest degree of data sovereignty, completely eliminating external network calls. It typically relies on open-source models, which may involve a modest trade-off in accuracy compared to leading commercial systems. Here, your team manages both the inference service (Ollama or vLLM) and the procurement, provisioning, and hosting of the server. While the investment is substantial, it is the necessary choice for heavily regulated environments or those with classified holdings. To run state-of-the-art open models, you will likely need eight NVIDIA H100 GPUs or more.
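One practical consequence of this menu is that the pipeline itself can stay the same across deployment models - only the inference endpoint changes. A minimal sketch, with invented placeholder endpoints rather than real infrastructure details:

```python
# Sketch: parameterizing one pipeline over the four deployment models.
# Endpoints below are illustrative placeholders, not real systems.
from dataclasses import dataclass

@dataclass(frozen=True)
class Deployment:
    name: str
    endpoint: str          # where inference requests are sent
    data_leaves_vpc: bool  # does prompt data exit your network perimeter?

DEPLOYMENTS = {
    "api":      Deployment("API-based",        "https://api.openai.com/v1", True),
    "managed":  Deployment("Managed VPC",      "https://bedrock.internal",  False),
    "bare_vpc": Deployment("Bare metal (VPC)", "http://gpu-node:11434",     False),
    "on_prem":  Deployment("On-premises",      "http://localhost:11434",    False),
}

def pick(allow_external_apis: bool) -> Deployment:
    """Choose the easiest deployment compatible with the firm's risk posture."""
    for key in ("api", "managed", "bare_vpc", "on_prem"):  # easiest first
        d = DEPLOYMENTS[key]
        if allow_external_apis or not d.data_leaves_vpc:
            return d
    raise ValueError("no deployment matches")

print(pick(False).name)  # → Managed VPC
print(pick(True).name)   # → API-based
```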
The critical takeaway is that organizations can now match their deployment model to their specific compliance needs and risk profile without compromising the user experience or the AI's capability. When we deploy a text-to-SQL system for a client, it typically mixes two or more of these approaches.
4. Evaluations: How We Create Auditable AI
Deploying AI in production introduces risks that traditional software does not. Language models are non-deterministic: the same question asked twice might yield two different answers.
The solution is not to avoid AI but to engineer around its variability through rigorous evaluation frameworks. In production environments, every change - whether to prompts, model versions, or database metadata - must pass through gated validation before release.
For text-to-SQL specifically, verification must go beyond simple string matching. A robust system checks that the generated query parses and executes, that it includes mandatory filters (such as date ranges or portfolio exclusions), that it follows approved join paths, and that aggregate groupings align with the business intent. It validates results against seeded test data for critical questions.
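A minimal sketch of such pre-execution guardrails, assuming a SQLite warehouse for illustration; the specific policy (one required filter, a small approved-table list, a naive token scan) is invented for the example, and a real system would walk a parsed AST instead:

```python
# Sketch: guardrails that run before any generated SQL touches the warehouse.
import re
import sqlite3

APPROVED_TABLES = {"leases", "occupancy"}
REQUIRED_FILTER = "expiry"  # e.g. every lease query must constrain dates

def validate(sql, conn):
    problems = []
    # 1. Does the query even compile? EXPLAIN catches syntax errors and
    #    unknown tables without running the full query.
    try:
        conn.execute("EXPLAIN " + sql)
    except sqlite3.Error as e:
        problems.append(f"does not parse: {e}")
    # 2. Only approved tables (naive token scan; a real system would
    #    inspect the parsed AST, e.g. with a SQL parser library).
    for groups in re.findall(r"\bFROM\s+(\w+)|\bJOIN\s+(\w+)", sql, re.I):
        for table in filter(None, groups):
            if table.lower() not in APPROVED_TABLES:
                problems.append(f"unapproved table: {table}")
    # 3. Mandatory filter present.
    if REQUIRED_FILTER not in sql.lower():
        problems.append(f"missing required filter on '{REQUIRED_FILTER}'")
    return problems

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leases (tenant TEXT, base_rent REAL, expiry TEXT)")

print(validate("SELECT tenant FROM leases WHERE expiry < '2026-01-01'", conn))  # → []
print(validate("SELECT * FROM payroll", conn))  # flags all three problems
```

A query that fails any check is blocked (or routed back to the model with the failure as feedback) rather than executed.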
Leading implementations avoid single "accuracy percentage" metrics. Instead, they report per-category pass rates with confidence intervals, tracking performance across query types (single-table lookups versus multi-step aggregations) and business domains (property management versus investment analysis). You can think of this as a new type of “production monitoring”, where we are tracking our AI pipeline’s performance against a known set of test cases.
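As a concrete illustration of per-category reporting, here is a short sketch computing pass rates with Wilson 95% confidence intervals; the counts are made-up figures, not benchmark results.

```python
# Sketch: per-category pass rates with Wilson score confidence intervals,
# instead of one global "accuracy percentage".
from math import sqrt

def wilson(passed, total, z=1.96):
    """95% Wilson score interval for a pass rate of passed/total."""
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

results = {  # category -> (passed, total); illustrative numbers only
    "single-table lookups":    (92, 100),
    "multi-step aggregations": (71, 100),
}

for category, (passed, total) in results.items():
    lo, hi = wilson(passed, total)
    print(f"{category}: {passed/total:.0%} pass rate (95% CI {lo:.0%}-{hi:.0%})")
```

The interval matters: a 92% pass rate over 100 test cases is compatible with a true rate well below 90%, which is exactly the nuance a single headline number hides.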
We focus on the failure modes that erode trust: execution timeouts, unusual table access patterns, missing filters when the question clearly implies constraints, and questions the system should decline to answer. Each failure feeds back into a growing regression suite, ensuring the system strengthens with use rather than degrading.
Even with these advancements, human judgment remains essential for confidence in AI systems. The reality is that even the best-performing text-to-SQL systems top out at roughly 70% to 80% accuracy, so any query that informs a business-critical decision must still be verified by a subject matter expert within your organization. Nonetheless, putting the capability of an entry-level data scientist behind a chat interface is a significant improvement over how most CRE firms analyze data today.
5. Conclusion
The firms gaining separation in CRE are not distinguished by the volume of their data, but by the velocity of insight they extract from it. When a leasing director can test a market hunch in seconds rather than weeks, the compound effect across a portfolio is transformative.
Jacob Coccari