Back to demo

Methodology & Data Sources

How we collect, clean, secure, and query Dubai real-estate data for the IRETP demo.

Data Sources

Dubai Pulse Transactions
Hugging Face - viewit-ai/full-dubai-pulse
100,000 rows

Official Dubai Land Department transaction feed (sales, mortgages, grants, lease-to-own). We loaded a 100,000-row stratified sample for the demo - enough to cover every transaction type, area, and year while keeping query latency under ~1s on the demo Postgres tier and AI Gateway token costs predictable. The same pipeline scales to the full 1.5M+ rows; it's a config flag, not a code change.

View source
Dubai Real Estates 1969–2023
Kaggle - zainabhejairi
423,677 rows

Historical DLD dataset spanning 1969–2023. Loaded into kaggle_dld_transactions_historic (250K-row sample), kaggle_dld_valuations (73K rows, full), and kaggle_dld_rents (100K-row Ejari sample).

View source
Dubai RE Transactions Dataset
Kaggle - alexefimik
1,047,965 rows

Multi-year bilingual (Arabic + English) DLD transactions dump. Loaded in full into kaggle_dre_transactions via Kaggle API + psql COPY.

View source
UAE Rental Market Listings
Kaggle - azharsaleem
73,742 rows

UAE rental listings scrape (Bayut/Dubizzle-style). Loaded in full into kaggle_uae_rental_listings - powers rental-yield and price-vs-rent queries alongside the transaction data.

View source

Ingestion Pipeline

1

Download

Dubai Pulse Parquet is fetched directly from Hugging Face. Kaggle sources are listed and await credential-based download.

2

Sample & Filter

DuckDB is used to read Parquet/CSV, apply stratified sampling (100k rows), and filter out malformed or incomplete records.

3

Normalize

Columns are renamed to a consistent schema: price fields in AED, areas in sqft, dates as ISO 8601, and categorical fields lower-cased and trimmed.

4

Load to Postgres

Cleaned data is written to Postgres via bulk COPY. Indexes are built on date, area, property type, and procedure for fast query performance.

Transaction Schema

ColumnTypeDescription
idBIGSERIALUnique row identifier
procedure_nameTEXTTransaction type (Sale, Mortgage, Grant, etc.)
transaction_dateDATEDate the transaction was recorded
property_typeTEXTHigh-level category (Villa, Unit, Building, Land, etc.)
property_sub_typeTEXTSub-category (Flat, Townhouse, etc.)
property_usageTEXTResidential, commercial, industrial, etc.
area_nameTEXTDubai community or district
building_nameTEXTNamed building or tower
master_projectTEXTLarger development project name
nearest_landmarkTEXTNearby landmark (Burj Khalifa, etc.)
nearest_metroTEXTClosest Dubai Metro station
nearest_mallTEXTClosest major shopping mall
bedroomsTEXTBedroom count (Studio, 1, 2, 3, etc.)
parkingsINTEGERNumber of parking spaces
built_up_area_sqftNUMERICProperty size in square feet
selling_price_aedNUMERICTotal transaction value in AED
sale_price_per_sqftNUMERICDerived AED per sqft for sales
rent_value_aedNUMERICRental value where applicable
rent_price_per_sqftNUMERICDerived AED per sqft for rentals

Security & AI Query Model

Read-Only SQL Guard

The AI does not write to the database. All queries pass through a run_sql PostgreSQL function with three layers of protection:

  1. Regex gate - queries must start with SELECT or WITH.
  2. Keyword blacklist - INSERT, UPDATE, DELETE, DROP, ALTER, and similar are blocked.
  3. Row cap - results are aggregated through jsonb_agg which caps output at 500 rows per query.

AI-to-SQL Flow

When you ask a question in natural language, the following happens:

  1. The AI receives the database schema and your question.
  2. It drafts a PostgreSQL SELECT query tailored to the schema.
  3. The query is executed via the read-only run_sql function.
  4. If the result is numeric or time-series, the AI emits a chart JSON block and the UI renders a Recharts visualization inline.
  5. Otherwise, the raw rows are shown in a sortable data table.

Use Cases Covered

  • Market Transparency Q&A - average price per sqft by community, top 10 areas by volume, year-over-year trends.
  • Investor Insights - median sale prices, rental yield proxies, hotspot detection by transaction volume.
  • Regulator Signals - outlier price spikes, unusual transfer patterns, area-level activity dashboards.
Built for the VALUS IRETP proposal - Dubai Land Department real-estate transparency initiative.