Methodology & Data Sources
How we collect, clean, secure, and query Dubai real-estate data for the IRETP demo.
Data Sources
Official Dubai Land Department transaction feed (sales, mortgages, grants, lease-to-own). We loaded a 100,000-row stratified sample for the demo - enough to cover every transaction type, area, and year while keeping query latency under ~1s on the demo Postgres tier and AI Gateway token costs predictable. The same pipeline scales to the full 1.5M+ rows; it's a config flag, not a code change.
View sourceHistorical DLD dataset spanning 1969–2023. Loaded into kaggle_dld_transactions_historic (250K-row sample), kaggle_dld_valuations (73K rows, full), and kaggle_dld_rents (100K-row Ejari sample).
View sourceMulti-year bilingual (Arabic + English) DLD transactions dump. Loaded in full into kaggle_dre_transactions via Kaggle API + psql COPY.
View sourceUAE rental listings scrape (Bayut/Dubizzle-style). Loaded in full into kaggle_uae_rental_listings - powers rental-yield and price-vs-rent queries alongside the transaction data.
View sourceIngestion Pipeline
Download
Dubai Pulse Parquet is fetched directly from Hugging Face. Kaggle sources are listed and await credential-based download.
Sample & Filter
DuckDB is used to read Parquet/CSV, apply stratified sampling (100k rows), and filter out malformed or incomplete records.
Normalize
Columns are renamed to a consistent schema: price fields in AED, areas in sqft, dates as ISO 8601, and categorical fields lower-cased and trimmed.
Load to Postgres
Cleaned data is written to Postgres via bulk COPY. Indexes are built on date, area, property type, and procedure for fast query performance.
Transaction Schema
| Column | Type | Description |
|---|---|---|
| id | BIGSERIAL | Unique row identifier |
| procedure_name | TEXT | Transaction type (Sale, Mortgage, Grant, etc.) |
| transaction_date | DATE | Date the transaction was recorded |
| property_type | TEXT | High-level category (Villa, Unit, Building, Land, etc.) |
| property_sub_type | TEXT | Sub-category (Flat, Townhouse, etc.) |
| property_usage | TEXT | Residential, commercial, industrial, etc. |
| area_name | TEXT | Dubai community or district |
| building_name | TEXT | Named building or tower |
| master_project | TEXT | Larger development project name |
| nearest_landmark | TEXT | Nearby landmark (Burj Khalifa, etc.) |
| nearest_metro | TEXT | Closest Dubai Metro station |
| nearest_mall | TEXT | Closest major shopping mall |
| bedrooms | TEXT | Bedroom count (Studio, 1, 2, 3, etc.) |
| parkings | INTEGER | Number of parking spaces |
| built_up_area_sqft | NUMERIC | Property size in square feet |
| selling_price_aed | NUMERIC | Total transaction value in AED |
| sale_price_per_sqft | NUMERIC | Derived AED per sqft for sales |
| rent_value_aed | NUMERIC | Rental value where applicable |
| rent_price_per_sqft | NUMERIC | Derived AED per sqft for rentals |
Security & AI Query Model
Read-Only SQL Guard
The AI does not write to the database. All queries pass through a run_sql PostgreSQL function with three layers of protection:
- Regex gate - queries must start with
SELECTorWITH. - Keyword blacklist -
INSERT,UPDATE,DELETE,DROP,ALTER, and similar are blocked. - Row cap - results are aggregated through
jsonb_aggwhich caps output at 500 rows per query.
AI-to-SQL Flow
When you ask a question in natural language, the following happens:
- The AI receives the database schema and your question.
- It drafts a PostgreSQL SELECT query tailored to the schema.
- The query is executed via the read-only
run_sqlfunction. - If the result is numeric or time-series, the AI emits a
chartJSON block and the UI renders a Recharts visualization inline. - Otherwise, the raw rows are shown in a sortable data table.
Use Cases Covered
- Market Transparency Q&A - average price per sqft by community, top 10 areas by volume, year-over-year trends.
- Investor Insights - median sale prices, rental yield proxies, hotspot detection by transaction volume.
- Regulator Signals - outlier price spikes, unusual transfer patterns, area-level activity dashboards.