Building a CRM Analytics Dashboard with ClickHouse: From Schema to Real-Time Insights
Practical guide to model CRM events, stream data into ClickHouse, and build real-time lifecycle dashboards with materialized views and deduplication.
Stop drowning in raw CRM logs — get real-time lifecycle insight with ClickHouse
If your team struggles to join sales records, product events, and engagement signals into a single, performant view of customers, you're not alone. In 2026 the shift toward event-driven CRM stacks and real-time OLAP is no longer optional — it's how revenue and retention teams get ahead. This guide shows a practical, end-to-end approach to model CRM events, ingest sales and engagement streams into ClickHouse, and build real-time dashboards for customer lifecycle metrics that scale.
Why ClickHouse for CRM analytics in 2026?
ClickHouse's OLAP engine is optimized for high-cardinality, high-throughput event workloads. Since its big funding round in late 2025 and continued product updates into 2026, adoption has accelerated for real-time analytics tied to customer data. Key reasons to choose ClickHouse for CRM analytics:
- Real-time ingestion via Kafka, HTTP, or native connectors — low-latency lookups for dashboards and alerts.
- Fast aggregation with MergeTree-family engines (AggregatingMergeTree, SummingMergeTree, ReplacingMergeTree) designed for pre-aggregation and rollups.
- Cost-effective storage and compression for event history at scale.
- Developer-friendly SQL plus materialized views and table engines to enforce deduplication and pre-aggregation.
What you'll accomplish (quick summary)
- Design an event schema optimized for CRM use-cases and cardinality control.
- Ingest sales, email, product, and support events (CDC + streams) into ClickHouse.
- Implement deduplication and late-arrival handling with ReplacingMergeTree and version columns.
- Create real-time materialized views and rollups for lifecycle metrics: activation, retention, churn, MRR.
- Expose dashboards and embedded visualizations for product and GTM teams.
1) Model your CRM event schema — build for query patterns
Start with the queries you want to run. Typical CRM lifecycle metrics include activation funnels, time-to-first-purchase, weekly active users per cohort, churn rate, and MRR by cohort. Design a single events table that supports those queries efficiently.
Core schema (recommended)
Use a denormalized events table as the raw landing zone, then create materialized views for analytics rollups.
CREATE TABLE crm_events_raw (
event_id String,
event_type LowCardinality(String), -- e.g. "signup", "purchase", "email_open"
user_id String,
account_id String,
event_time DateTime64(3),
ingestion_time DateTime64(3) DEFAULT now64(3),
properties String, -- JSON blob or use Nested if you need typed fields
source LowCardinality(String), -- e.g. "web", "mobile", "support"
version UInt64 -- for deduplication/versioning
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time, event_type)
SETTINGS index_granularity = 8192;
Notes: use LowCardinality for repeating small sets (event_type, source) to reduce memory. Keep the raw JSON in properties for flexibility; extract and promote hot fields into columns for frequent filters (e.g., plan_id, revenue_amount).
Promote hot fields
Identify frequent fields (amount, plan_id, is_trial) and promote them to native columns. Use typed columns for numeric fields (Float64/Decimal) because aggregation over JSON is slower.
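A minimal sketch of that promotion on the raw table above; the property names (amount, plan_id, is_trial) are placeholders for whatever your producers actually emit:
ALTER TABLE crm_events_raw
ADD COLUMN revenue_amount Float64 MATERIALIZED JSONExtractFloat(properties, 'amount'),
ADD COLUMN plan_id LowCardinality(String) MATERIALIZED JSONExtractString(properties, 'plan_id'),
ADD COLUMN is_trial UInt8 MATERIALIZED JSONExtractBool(properties, 'is_trial');
-- New inserts store these columns; run ALTER TABLE ... MATERIALIZE COLUMN to backfill existing parts.
Filters and aggregations on revenue_amount or plan_id then skip JSON parsing at query time.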
2) Ingesting streams: put streaming ingestion and CDC in place
Most enterprise CRM stacks in 2026 are multi-source: Postgres for transactional sales, a product event pipeline (Segment/Matomo), email provider webhooks, and support systems. Build a robust stream pipeline using proven components:
- Source DB CDC → Debezium → Kafka
- Event SDKs or webhooks → Kafka (or HTTP to ClickHouse for small bursts)
- ClickHouse Kafka engine reader → materialized views → analytic tables
Kafka engine ingestion example
Create a Kafka table that reads messages and a materialized view that transforms and inserts into the MergeTree landing table.
CREATE TABLE kafka_crm_events (
value String -- each Kafka message lands here as a raw JSON string and is parsed in the materialized view below
) ENGINE = Kafka
SETTINGS
kafka_broker_list = 'kafka:9092',
kafka_topic_list = 'crm-events',
kafka_group_name = 'clickhouse-crm-consumers',
kafka_format = 'JSONAsString';
CREATE MATERIALIZED VIEW mv_kafka_to_crm_events TO crm_events_raw AS
SELECT
JSONExtractString(value, 'event_id') AS event_id,
JSONExtractString(value, 'event_type') AS event_type,
JSONExtractString(value, 'user_id') AS user_id,
JSONExtractString(value, 'account_id') AS account_id,
parseDateTime64BestEffort(JSONExtractString(value, 'event_time'), 3) AS event_time,
now64(3) AS ingestion_time,
JSONExtractRaw(value, 'properties') AS properties, -- keeps the nested JSON blob intact
JSONExtractString(value, 'source') AS source,
JSONExtract(value, 'version', 'UInt64') AS version -- returns 0 when version is missing
FROM kafka_crm_events;
This pattern gives you durable, scalable ingestion and isolates transformation logic in ClickHouse. For integrators and teams adopting real-time patterns, see the Real-time Collaboration APIs Integrator Playbook for common connector topologies.
3) Deduplication and late arrivals
Two persistent problems: duplicate events and late-arriving events. Use ReplacingMergeTree with a version column (or timestamp) to keep the latest row per event_id or (user_id, event_type, event_time) key. For transactional CDC, consider a sign column to support row-level deletes using CollapsingMergeTree.
CREATE TABLE crm_events_deduped (
event_id String,
event_type LowCardinality(String),
user_id String,
account_id String,
event_time DateTime64(3),
properties String,
source LowCardinality(String),
version UInt64
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_id);
-- Materialized view to populate deduped table from raw (columns listed explicitly; the deduped table has no ingestion_time)
CREATE MATERIALIZED VIEW mv_raw_to_deduped TO crm_events_deduped AS
SELECT
event_id, event_type, user_id, account_id,
event_time, properties, source, version
FROM crm_events_raw;
With ReplacingMergeTree you can safely re-ingest corrected events; the merge process keeps the row with the highest version.
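For CDC feeds that also carry updates and deletes, the sign-column pattern mentioned above looks roughly like this; the table is a sketch mirroring the deduped layout, not a drop-in definition:
CREATE TABLE crm_events_cdc (
event_id String,
event_type LowCardinality(String),
user_id String,
account_id String,
event_time DateTime64(3),
properties String,
sign Int8 -- +1 inserts a row state, -1 cancels a previously inserted state
) ENGINE = CollapsingMergeTree(sign)
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_id);
-- Debezium-style updates write a -1 row for the old state and a +1 row for the new one;
-- query with sum(sign) > 0 or SELECT ... FINAL to see the collapsed view.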
4) Pre-aggregate with materialized views for real-time dashboards
Materialized views let you compute near-real-time aggregates as events arrive, keeping dashboards fast and reducing ad-hoc query cost. Use AggregatingMergeTree or SummingMergeTree depending on whether you need complex aggregate states or simple sums.
Daily active users (DAU) and weekly retention example
CREATE TABLE dau_daily (
day Date,
account_id String,
dau UInt32
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (account_id, day);
CREATE MATERIALIZED VIEW mv_dau TO dau_daily AS
SELECT
toDate(event_time) AS day,
account_id,
uniqExact(user_id) AS dau
FROM crm_events_deduped
WHERE event_type IN ('login', 'app_open', 'purchase')
GROUP BY day, account_id;
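One caveat: the SummingMergeTree rollup adds up per-insert distinct counts, so a user who appears in several insert blocks on the same day is counted more than once. When exact counts matter, an AggregatingMergeTree variant with uniqExactState / uniqExactMerge avoids this; a sketch:
CREATE TABLE dau_daily_exact (
day Date,
account_id String,
users AggregateFunction(uniqExact, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (account_id, day);
CREATE MATERIALIZED VIEW mv_dau_exact TO dau_daily_exact AS
SELECT
toDate(event_time) AS day,
account_id,
uniqExactState(user_id) AS users
FROM crm_events_deduped
WHERE event_type IN ('login', 'app_open', 'purchase')
GROUP BY day, account_id;
-- Dashboards merge the partial states at read time:
SELECT day, account_id, uniqExactMerge(users) AS dau
FROM dau_daily_exact
GROUP BY day, account_id;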
For retention cohorts, precompute first-touch dates and then join. A common pattern is storing user_first_seen and then computing retention via joins or precomputed cohort tables.
Cohort table (first seen + activity)
CREATE TABLE user_first_seen (
user_id String,
account_id String,
first_seen Date,
version UInt64
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(first_seen)
ORDER BY (user_id);
CREATE MATERIALIZED VIEW mv_first_seen TO user_first_seen AS
SELECT
user_id,
account_id,
toDate(min(event_time)) AS first_seen,
max(version) AS version
FROM crm_events_deduped
GROUP BY user_id, account_id;
-- Daily cohort activity rollup
CREATE TABLE cohort_daily_activity (
cohort_date Date,
activity_date Date,
account_id String,
active_users UInt32
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(activity_date)
ORDER BY (account_id, cohort_date, activity_date);
CREATE MATERIALIZED VIEW mv_cohort_activity TO cohort_daily_activity AS
SELECT
u.first_seen AS cohort_date,
toDate(e.event_time) AS activity_date,
e.account_id AS account_id,
uniqExact(e.user_id) AS active_users
FROM crm_events_deduped AS e
INNER JOIN user_first_seen AS u USING (user_id)
GROUP BY cohort_date, activity_date, account_id;
These pre-aggregated tables make retention visualizations instant for dashboard users.
5) Customer revenue and MRR pipelines
Revenue metrics need deterministic handling for refunds, plan changes, and credit notes. Use a dedicated billing_events stream with strong event typing and amounts typed as Decimal(18,2).
CREATE TABLE billing_events (
event_id String,
event_time DateTime64(3),
user_id String,
account_id String,
amount Decimal(18,2),
revenue_type LowCardinality(String), -- "charge", "refund", "discount"
currency String,
version UInt64
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(event_time)
ORDER BY (account_id, event_time);
-- Monthly MRR rollup
CREATE TABLE mrr_monthly (
month Date,
account_id String,
mrr Decimal(18,2)
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(month)
ORDER BY (account_id, month);
CREATE MATERIALIZED VIEW mv_mrr_monthly TO mrr_monthly AS
SELECT
toStartOfMonth(event_time) AS month,
account_id,
sumIf(amount, revenue_type = 'charge') - sumIf(amount, revenue_type = 'refund') AS mrr
FROM billing_events
GROUP BY month, account_id;
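Because SummingMergeTree collapses rows only when parts merge, dashboards should still aggregate when reading the rollup rather than assume one row per key, for example:
SELECT
month,
account_id,
sum(mrr) AS mrr
FROM mrr_monthly
GROUP BY month, account_id
ORDER BY month DESC;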
6) Performance tuning — practical knobs for CRM workloads
Performance is a combination of schema choices, storage engine, and query design. Tweak these areas for dramatic gains:
- Partitioning: partition by month (toYYYYMM(event_time)) for time-series workloads. For a handful of very large accounts, a (toYYYYMM(event_time), account_id) partition key can help isolate them, but keep the total partition count modest to avoid a parts explosion — also see hybrid hosting strategies for very large, geo-distributed clusters (Hybrid Edge–Regional Hosting Strategies).
- ORDER BY: choose an ORDER BY that supports your primary queries — user_id then event_time is common for user-centric analytics.
- LowCardinality and Enum: convert high-repetition string columns to LowCardinality(String) or Enum to reduce memory and improve GROUP BY speed.
- AggregatingMergeTree: use it for complex aggregate states (running aggregates) and SummingMergeTree for numeric rollups.
- Compression codec: use zstd or LZ4 with tailored levels for column types (a sketch follows this list). ClickHouse in 2026 continues to improve codecs — test with representative slices.
- Index granularity: default 8192 is fine for many workloads; lower for highly selective range queries.
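A starting point for codec experiments against the raw table from section 1; the levels here are arbitrary and worth benchmarking on a representative slice of your own data:
ALTER TABLE crm_events_raw
MODIFY COLUMN properties String CODEC(ZSTD(3)), -- large JSON blobs compress well under zstd
MODIFY COLUMN event_time DateTime64(3) CODEC(Delta, ZSTD(1)); -- delta-encode mostly increasing timestamps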
7) Handling schema evolution and GDPR/CCPA deletion requests
CRM systems change: new event types, additional properties, and regulatory deletion requests. Recommended approach:
- Keep a stable set of typed columns for the queries you rely on; new properties go into JSON properties. For live changes and zero-downtime updates, consider patterns described in Live Schema Updates and Zero-Downtime Migrations.
- For deletions, use CollapsingMergeTree or insert a tombstone event and run periodic merges/mutations. Note that ClickHouse handles row-level deletes asynchronously (ALTER TABLE ... DELETE mutations, lightweight deletes, or TTL) rather than in place. Also align deletion processes with privacy engineering guidance such as Privacy by Design for TypeScript APIs and platform compliance playbooks (Regulation & Compliance for Specialty Platforms).
- Use TTL for data retention policies (e.g., delete raw events older than X years):
ALTER TABLE crm_events_deduped MODIFY TTL event_time + INTERVAL 3 YEAR;
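For per-subject erasure requests, the usual tool is an asynchronous delete mutation; the user id below is a placeholder:
ALTER TABLE crm_events_deduped DELETE WHERE user_id = 'user-to-erase';
-- The mutation rewrites affected parts in the background; progress is visible in system.mutations.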
8) Dashboarding: tools and embedding patterns
Choose a dashboard tool that supports:
- Native ClickHouse connectors or a fast SQL layer
- Real-time refresh / polling and incremental queries
- Embedding and granular permissions for internal teams — if you're building embeddable micro-UIs, marketplaces for UI components can accelerate delivery (see component marketplace examples).
Popular choices in 2026:
- Superset — flexible SQL + extensive visualization plugins.
- Grafana — great for time-series and metric-focused dashboards; new ClickHouse data source plugins support SQL and Loki-like queries.
- Dataviewer.cloud — embeddable explorers and live dashboards designed for product integrations and internal portals.
Dashboard design tips for lifecycle metrics
- Use pre-aggregated tables (dau_daily, cohort_daily_activity, mrr_monthly) as the source to keep dashboards snappy.
- Expose drill-downs: from cohort-level to user-level transactions via keyed lookups (user_id → recent events; see the lookup sketch after this list).
- Materialize expensive time-window queries (e.g., 7-day rolling retention) into separate tables refreshed by materialized views or scheduled merges.
- For near-real-time UIs, implement websocket or server-sent events that trigger client refreshes when materialized views update or when Kafka topics receive new events — for real-time integrations patterns see real-time integrator guidance.
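The drill-down lookup mentioned above can be a plain keyed query; the columns and limit here are illustrative:
SELECT event_time, event_type, source, properties
FROM crm_events_raw -- its ORDER BY starts with user_id, so this lookup uses the primary index
WHERE user_id = 'some-user-id'
ORDER BY event_time DESC
LIMIT 50;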
9) Real-world example: compute 7-day retention in near real-time
Below is a practical query pattern that uses precomputed cohort and activity tables. It assumes you have user_first_seen and cohort_daily_activity tables described above.
SELECT
cohort_date,
activity_date,
account_id,
active_users,
round(active_users / cohort_size, 4) AS retention_rate
FROM (
SELECT
c.cohort_date AS cohort_date,
c.activity_date AS activity_date,
c.account_id AS account_id,
sum(c.active_users) AS active_users,
any(s.cohort_size) AS cohort_size
FROM cohort_daily_activity AS c
-- cohort sizes are joined in: correlated subqueries are not reliably supported in ClickHouse
INNER JOIN (
SELECT account_id, first_seen AS cohort_date, count() AS cohort_size
FROM user_first_seen
GROUP BY account_id, first_seen
) AS s ON s.account_id = c.account_id AND s.cohort_date = c.cohort_date
WHERE c.activity_date <= today()
GROUP BY cohort_date, activity_date, account_id
) ORDER BY cohort_date DESC, activity_date;
Using cohort_daily_activity means this query returns instantly for dashboards — the heavy computation was moved into materialized views.
10) Monitoring and observability
Instrument your ClickHouse cluster and ingestion pipeline; a few starter queries follow below. Track:
- Kafka consumer lag for ClickHouse Kafka tables
- Merge rates and mutation queue lengths
- Query latency percentiles and hot partitions
- Disk utilization and compressed bytes per table
Tip: in 2026 many teams combine ClickHouse metrics with OpenTelemetry traces to correlate ingestion delays with downstream dashboard freshness. For tooling and platform reviews that help with this instrumentation, see our roundup of top monitoring platforms for reliability engineering.
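Most of these signals are exposed through standard system tables; a few starting-point queries (thresholds and time windows are up to you):
-- In-flight merges and their progress
SELECT database, table, elapsed, progress FROM system.merges;
-- Pending mutations (TTL rewrites, GDPR deletes)
SELECT database, table, mutation_id, create_time FROM system.mutations WHERE NOT is_done;
-- p95 query latency over the last hour (requires the query_log, which is enabled by default)
SELECT quantile(0.95)(query_duration_ms) AS p95_ms
FROM system.query_log
WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR;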
Advanced strategies and future-proofing
For large-scale enterprise CRM analytics, consider these advanced patterns:
- Hybrid storage: cold store raw events in object storage and keep hot rollups in ClickHouse; use external dictionaries for user metadata (see the dictionary sketch after this list). See hybrid edge and regional hosting strategies for guidance on balancing locality and cost.
- Adaptive rollups: maintain daily/hourly rollups and promote hourly for active accounts only to save compute — this aligns with broader edge-first operational patterns.
- Vectorization and embeddings: in 2026 many teams enrich CRM profiles with semantic embeddings for intent scoring — store embeddings in a vector store, and join top-N matches into ClickHouse dashboards via precomputed lookup tables.
- Data contracts: adopt schema contracts and contract tests for event producers to avoid breaking changes and ingestion failures.
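For the external-dictionary pattern referenced in the first bullet, a minimal sketch; the user_metadata source table and its columns are hypothetical, and the source may need HOST/PORT/USER settings if it lives on another server:
CREATE DICTIONARY user_metadata_dict (
user_id String,
plan String,
segment String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(DB 'default' TABLE 'user_metadata'))
LAYOUT(COMPLEX_KEY_HASHED()) -- string keys need a complex-key layout
LIFETIME(MIN 300 MAX 600);
-- Enrich rollup queries without a join; tuple() is required for complex (string) keys
SELECT user_id, dictGet('user_metadata_dict', 'plan', tuple(user_id)) AS plan
FROM crm_events_deduped
LIMIT 10;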
Common pitfalls and how to avoid them
- Anti-pattern — storing everything as JSON only: JSON is flexible but slow for aggregation. Promote hot fields to typed columns.
- Anti-pattern — no dedupe/versioning: leads to inflated metrics. Use version columns and ReplacingMergeTree or CollapsingMergeTree.
- Anti-pattern — heavy joins on raw tables: precompute frequently needed joins into rollups or materialized views.
Actionable checklist (next 30 days)
- Instrument your event producers to include event_id, event_time, user_id, account_id, and version/hash.
- Deploy a Kafka topic for CRM events and set up Debezium for DB CDC if needed.
- Create a ClickHouse Kafka engine table + materialized view to populate a typed raw table.
- Implement ReplacingMergeTree for deduplication and materialized views for DAU, cohort, and MRR rollups.
- Connect a dashboard tool (Superset/Grafana/Dataviewer.cloud) and point it at the rollup tables for fast visualizations — if you need a quick environment, follow a cloud migration checklist to spin up a test cluster safely.
2026 trends that matter to CRM analytics
As of early 2026, three trends should shape your architecture:
- Real-time OLAP is mainstream: teams expect near-instant dashboards on event data — ClickHouse and managed offerings have focused on lower-latency ingestion and faster merges.
- Serverless connectors and managed Kafka: reduce ops overhead — many vendors provide turnkey Kafka-to-ClickHouse connectors.
- Data contracts and privacy-awareness: event schema governance and built-in deletion/TTL features are essential for compliance.
Final notes
ClickHouse gives a powerful foundation for CRM analytics when you combine a thought-out event schema, robust streaming ingestion, deduplication, and pre-aggregated materialized views. The real wins come from shifting heavy computation out of dashboards and into streaming transforms and rollups so product and GTM teams get reliable, fast answers.
Call to action
Ready to ship a production CRM analytics pipeline? Start with one KPI (e.g., 7-day retention) and implement the streaming-to-materialized-view pattern above. If you'd like a head start, spin up a ClickHouse test cluster and connect it to dataviewer.cloud for instant embeddable dashboards and prebuilt CRM visualizations. Book a free walkthrough with our engineering team to map this architecture to your stack and get a tailored onboarding plan.
Related Reading
- Feature Deep Dive: Live Schema Updates and Zero-Downtime Migrations
- Review: Top Monitoring Platforms for Reliability Engineering (2026)
- Real‑time Collaboration APIs Expand Automation Use Cases — An Integrator Playbook (2026)
- Cloud Migration Checklist: 15 Steps for a Safer Lift‑and‑Shift (2026 Update)
- Martech Sprint vs Marathon: How Creators Should Choose Their Tech Bets