
Data Platform Architecture

The analytics universe is structured as a five-layer pipeline. Data flows in one direction — from transactional stores into the lake and upward to visualization — and never writes back.

Layer 1: Transactional DBs (PostgreSQL, MongoDB, MSSQL, MySQL)
Layer 2: Integration Layer (non-blocking background writes)
Layer 3: Data Lake — ADLS (raw, append-only Parquet files)
Layer 4: Consumption — Dremio (Models → Explores)
Layer 5: Visualization — Superset (dashboards, embedded, RLS-filtered)

Layer 1: Transactional Databases

The transactional databases are the source of all data. Each microservice owns its database and is the sole writer to it. See the Data Platform Overview for the full list of transactional stores.


Layer 2: Integration Layer

The integration layer is the bridge that moves data from transactional stores into the Data Lake. This layer is intentionally kept thin.

Mechanism | When to use
In-process async write | Low-latency, simple payloads; acceptable if data lake ingestion lags slightly behind transactional commit
CDC (Change Data Capture) | High-volume tables; need to capture all changes including deletes
Scheduled jobs / ETL | Batch workloads; external systems that don’t support CDC
Event streams | Event-driven services that already publish domain events

Current implementation: Services write directly to ADLS as a non-blocking side effect of the transactional operation. The transactional response is returned to the caller immediately; the ADLS write happens in a parallel background thread. The caller is never blocked waiting for the data lake ingestion to complete.
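
The non-blocking write described above can be sketched roughly as follows. This is a minimal illustration, not the actual service code: `create_profile`, `write_to_lake`, and the pool size are hypothetical names and values, and the transactional commit is stubbed out.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor

logger = logging.getLogger("lake_writer")

# Small shared pool so lake writes never run on the request thread.
# The pool size is an arbitrary choice for this sketch.
_lake_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="lake-writer")


def write_to_lake(record: dict) -> None:
    """Hypothetical placeholder for the real ADLS upload."""
    logger.info("would upload %s", record)


def _log_failure(future: Future) -> None:
    # A failed lake write must never surface to the caller; it is only logged.
    if future.cancelled():
        return
    exc = future.exception()
    if exc is not None:
        logger.error("data lake write failed: %s", exc)


def create_profile(payload: dict, lake_writer=write_to_lake) -> tuple[dict, Future]:
    # 1. Transactional commit (stubbed here as a plain copy of the payload).
    saved = dict(payload)
    # 2. Schedule the lake write in the background and return immediately.
    future = _lake_pool.submit(lake_writer, saved)
    future.add_done_callback(_log_failure)
    return saved, future
```

The future is returned only so callers and tests can observe the background write; a request handler would normally ignore it. One consequence of this design is that a process crash between the commit and the background write loses that lake record, which is consistent with treating this mechanism as acceptable only when some ingestion lag is tolerable.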


Layer 3: Data Lake (ADLS)

Tool: Azure Data Lake Storage (ADLS)

All data entering the analytics universe lands here first. ADLS holds raw, append-only copies of transactional data — exactly as produced by the source service, with no transformation applied.

  • File format: Parquet files or Apache Iceberg tables only. No JSON, CSV, or other formats.
  • Path pattern (mandatory):
    {namespace}/{table_name}/tenant={uuid}/year={YYYY}/month={MM}/day={DD}/{filename}.parquet
  • Namespace: the source application name (e.g. workforce-management, safety-compliance, worker-monitoring). This is the top-level grouping in the data lake and keeps tables from different services from colliding.
  • File naming: flexible. When writing one record per file, a UUID is recommended to prevent collisions. When grouping multiple records into a single batch file, any unique, descriptive name is acceptable (e.g. batch_2026-03-16T00:00:00Z.parquet).
  • Immutability: never update or delete files. Corrections are appended as new files; deduplication is handled downstream.
  • Metadata fields: every record must carry ingested_at (timestamp of ADLS write) and source_service alongside the business payload.
  • Tenant field: the tenant is encoded in the path and must also be present as a column in every record, so Dremio can filter on it independently of the partition.
adls://angelis-datalake/
├── workforce-management/
│   ├── companies/
│   │   └── tenant=<uuid>/year=2026/month=03/day=16/
│   │       └── 7f3a1c2d-....parquet
│   ├── profiles/
│   │   └── tenant=<uuid>/year=2026/month=03/day=16/
│   │       └── b2e94f01-....parquet
│   └── assignments/
│       └── tenant=<uuid>/year=2026/month=03/day=16/
│           └── 1a8d3c55-....parquet
└── safety-compliance/
    └── form_submissions/
        └── tenant=<uuid>/year=2026/month=03/day=16/
            └── 9c0f2b44-....parquet
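
A small helper can make the path pattern and the mandatory metadata fields harder to get wrong. The sketch below is illustrative only: the function names are hypothetical, the tenant column is assumed to be called `tenant`, and partitioning by the UTC date of the write is an assumption the rules above do not state.

```python
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID, uuid4


def build_lake_path(namespace: str, table_name: str, tenant: UUID,
                    written_at: datetime, filename: Optional[str] = None) -> str:
    """Return the relative ADLS path for one Parquet file.

    written_at must be timezone-aware; it is converted to UTC (an assumption).
    """
    ts = written_at.astimezone(timezone.utc)
    # A random UUID avoids collisions when each record gets its own file.
    name = filename or str(uuid4())
    return (
        f"{namespace}/{table_name}/tenant={tenant}"
        f"/year={ts:%Y}/month={ts:%m}/day={ts:%d}/{name}.parquet"
    )


def with_lake_metadata(payload: dict, tenant: UUID, source_service: str,
                       written_at: datetime) -> dict:
    """Attach the tenant column and mandatory metadata to a business payload."""
    return {
        **payload,
        "tenant": str(tenant),  # column name is an assumption
        "source_service": source_service,
        "ingested_at": written_at.astimezone(timezone.utc).isoformat(),
    }
```

For example, `build_lake_path("workforce-management", "profiles", tenant_id, datetime.now(timezone.utc))` yields a path under `workforce-management/profiles/tenant=<uuid>/` with a generated UUID file name.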

Layer 4: Consumption Layer (Dremio)

Tool: Dremio

Dremio sits on top of ADLS and exposes a SQL interface for all downstream consumers (BI tools, dashboards, ad-hoc queries). It defines two sub-layers: Models and Explores.

Once data is loaded into ADLS, making it queryable in Dremio is purely configuration — no code required.

  1. Navigate to the folder in the Dremio data lake browser.
  2. Right-click the folder and select “Convert to Table”.
  3. (Optional) Enable a Reflection on the table to cache it and speed up downstream queries.

Models

Models are raw SQL mappings over the Data Lake tables. A model is a 1:1 structured view of a data lake entity: same columns, explicit types, no joins. It replaces opaque file paths with named, queryable tables.

-- Example model: wfm_profiles
SELECT
    id,
    company_id,
    rut,
    first_name,
    last_name,
    email,
    CAST(created_at AS TIMESTAMP) AS created_at,
    CAST(deleted_at AS TIMESTAMP) AS deleted_at,
    ingested_at
FROM datalake.workforce_management.profiles
WHERE deleted_at IS NULL

Rules for models:

  • One model per source entity.
  • Apply only structural transformations: casting, null handling, renaming to consistent naming conventions.
  • No business logic, no joins.
  • Always filter soft-deleted records (deleted_at IS NULL).

Explores

Explores are star-schema views built on top of models. An explore combines a central fact model with all relevant dimension models through pre-defined joins, producing a single flat, fully-joined table ready for visualization.

explore_worker_compliance
├── fact: wfm_assignments (central model)
├── dim: wfm_profiles (worker details)
├── dim: wfm_companies (company details)
├── dim: wfm_projects (project details)
└── dim: saferapp_test_results (test outcomes)

Rules for explores:

  • One explore per business question / domain area.
  • Always include company_id so Superset can enforce tenant filtering.
  • Explores are the only layer Superset queries. Never connect Superset directly to a model or raw ADLS path.
  • Dremio can cache both models and explores. Enable Reflections on explores that power frequently used dashboards.
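
To make the shape of an explore concrete, the sketch below renders the SQL for a flat view from one fact model and a list of dimension models. It is only an illustration: `Dimension`, `build_explore_sql`, the join keys (such as `profile_id`), and the use of LEFT JOIN are assumptions, since the real join definitions are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    model: str                 # dimension model name, e.g. "wfm_profiles"
    alias: str                 # short alias, also used as a column prefix
    fact_key: str              # join column on the fact model (hypothetical)
    dim_key: str               # join column on the dimension model (hypothetical)
    columns: tuple[str, ...]   # dimension columns to expose in the explore


def build_explore_sql(fact: str, fact_columns: list[str],
                      dimensions: list[Dimension]) -> str:
    """Render a flat, fully joined SELECT over a fact model and its dimensions."""
    if "company_id" not in fact_columns:
        # Superset needs company_id on every explore for tenant filtering.
        raise ValueError("explores must expose company_id")
    select = [f"f.{col}" for col in fact_columns]
    joins = []
    for dim in dimensions:
        # Prefix dimension columns with the alias to keep names unique.
        select += [f"{dim.alias}.{col} AS {dim.alias}_{col}" for col in dim.columns]
        joins.append(
            f"LEFT JOIN {dim.model} AS {dim.alias} "
            f"ON f.{dim.fact_key} = {dim.alias}.{dim.dim_key}"
        )
    lines = ["SELECT", "  " + ",\n  ".join(select), f"FROM {fact} AS f", *joins]
    return "\n".join(lines)
```

LEFT JOIN is used here so that fact rows without a matching dimension row still appear; whether that is the right choice depends on the business question the explore answers.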

Layer 5 — Visualization Layer (Superset)

Tool: Apache Superset

Superset connects to Dremio explores and is the end-user interface for dashboards and charts.

Tenant isolation via Row-Level Security (RLS)

Every Superset dataset maps to a Dremio explore that contains company_id. An RLS filter is applied per user session so a user can only query rows where company_id matches their tenant. This is enforced in Superset, not in the application.

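RLS rule example:
dataset: explore_worker_compliance
clause: company_id = '{{ current_user_attribute("company_id") }}'

Note: current_user_attribute is not one of Superset's built-in Jinja macros (such as current_user_id() or current_username()). It would need to be registered as a custom macro, for example through JINJA_CONTEXT_ADDONS in the Superset configuration.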

Superset supports embedding dashboards directly into the Angelis frontend applications. Embedded dashboards carry the authenticated user’s context, which drives the RLS filter above. Users never see raw data from other tenants.
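
One way embedded dashboards can carry the tenant context is Superset's guest-token flow, where the backend requests a token from `/api/v1/security/guest_token/` and attaches an RLS clause to it. Whether Angelis uses this flow is not stated here, so treat the sketch below as an assumption-laden illustration: the function name is hypothetical, and tenant identifiers are assumed to be UUIDs.

```python
import json
from uuid import UUID


def build_guest_token_request(dashboard_id: str, username: str,
                              company_id: str) -> dict:
    """Build the JSON body for a Superset guest-token request."""
    # Parsing as a UUID rejects anything that could alter the SQL clause
    # (assumes tenant identifiers are UUIDs).
    tenant = str(UUID(company_id))
    return {
        "user": {"username": username},
        "resources": [{"type": "dashboard", "id": dashboard_id}],
        # The clause restricts every query made with this token to one tenant.
        "rls": [{"clause": f"company_id = '{tenant}'"}],
    }


if __name__ == "__main__":
    body = build_guest_token_request(
        dashboard_id="hypothetical-dashboard-id",
        username="embedded-user",
        company_id="12345678-1234-5678-1234-567812345678",
    )
    print(json.dumps(body, indent=2))
```

The backend would send this body with its own service credentials and hand only the resulting token to the browser, so the tenant filter cannot be changed by the end user.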

Superset’s theme is configurable. Dashboards should match the Angelis design system — primary colors, typography, and layout consistent with AdminCenter and Worker Web.