Related docs:
Tenant data synchronization: Reltio
Databricks Delta Live Tables trigger API (trigger_pipeline): Reltio
Problem
After cloning a Reltio tenant that is integrated with Databricks (Delta Lake), customers may see:
Databricks does not automatically receive the cloned data.
Data appears in Databricks only when new CRUD operations are performed on the cloned tenant.
After running syncToDataPipeline, Databricks shows:
More total rows than the number of entities shown in Reltio, and/or
Fewer distinct entities than Reltio, and/or
Very low relationship and activity counts compared to Reltio.
Customers commonly ask:
Is this expected behavior after a clone?
Do we need to manually trigger something for Databricks?
Is there a way to perform a full sync (hundreds of millions of records)?
How should old data in S3 be cleaned up?
This article explains why this happens and how to fix it.
Root Cause (High‑Level)
Tenant clone does not re‑publish all data to Databricks
As documented in the tenant synchronization docs:
A clone operation (for example, PROD → TEST) copies data and configuration from one Reltio tenant to another.
A clone does not automatically re‑publish all cloned data to Data Pipeline Hub or Databricks.
To stream the full tenant dataset downstream, you must explicitly run syncToDataPipeline.
Databricks tables are not automatically cleared
In Delta Lake mode with DPH:
Databricks tables are treated as the downstream store; they are not automatically truncated during clones or re‑syncs.
If you run syncToDataPipeline without resetting Databricks tables, Databricks ends up with:
Old data from the pre‑clone tenant state, plus
New data from the post‑clone tenant state.
This can produce:
Inflated overall row counts (e.g., 1.4B rows for ~600M entities).
Lower distinct entity counts than Reltio (missing coverage for some entities).
S3 lifecycle and overlapping history
DPH writes JSON files to an S3 staging path.
If S3 has a multi‑week lifecycle (e.g., 45 days), then during/after a clone, there may be:
Old pre‑clone events,
Clone/post‑clone events, and
Re‑sync events, all co‑existing.
Databricks Auto Loader / DLT treats S3 as append‑only. Without a full table refresh, it will:
Continue ingesting any visible files,
Potentially re‑process some files more than once,
Never “know” that the tenant was cloned.
Non‑entity pipelines not running (relations/activities)
Many Databricks setups use two pipelines:
Entity pipeline (entity tables)
Non‑entity pipeline (relations, activities, links, merges)
If the non‑entity pipeline is disabled, misconfigured, or failing:
syncToDataPipeline may still publish relation/activity events into S3.
But the Databricks relation and activity tables remain empty or severely underpopulated.
Symptoms:
Relationships in Reltio: thousands/millions; in Databricks: tens or hundreds.
Activities in Reltio: hundreds of millions; in Databricks: a small fraction.
Expected Behavior (Per Documentation)
The documented behavior is:
A clone does not cause a full re‑publish of cloned data to Data Pipeline Hub or Databricks.
To stream all tenant data to DPH, you must run syncToDataPipeline on the tenant.
For Databricks in Triggered mode, you must call trigger_pipeline on the Databricks adapter.
When Databricks tables contain stale/mixed content, the supported way to rebuild them is:
trigger_pipeline with "fullRefresh": true
This deletes existing table content for that adapter and recreates it from S3.
Diagnosis Steps
1. Confirm Integration Type
Check tenant configuration (from Jira, internal tools, or customer):
dataPipelineConfig.enabled = true
Adapter with "type": "deltalake" (not datashare)
Example adapter snippet:
{
  "type": "deltalake",
  "enabled": true,
  "name": "Databricks<Something>Outbound",
  "cloudProvider": "AWS",
  "stagingBucket": "<customer-outbound-bucket>/<path>",
  "databricksConfig": {
    "databricksHost": "https://<customer>.cloud.databricks.com/",
    "isContinuous": false,
    "catalog": "<catalog-name>"
  }
}
If type = datalake or dataShareEnabled = true, refer to the appropriate Datashare article instead.
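The integration-type check can be sketched in a few lines of Python. This is only an illustration: it assumes you have already extracted the list of adapter objects from the tenant configuration (the exact nesting of that list is not shown in this article), and the helper name find_deltalake_adapters and the sample adapter names are hypothetical.

```python
def find_deltalake_adapters(adapters):
    """Return the names of enabled adapters whose type is 'deltalake'.

    `adapters` is assumed to be a list of adapter dicts shaped like the
    example snippet above; where that list lives in the tenant config is
    not covered here.
    """
    return [
        adapter.get("name", "<unnamed>")
        for adapter in adapters
        if adapter.get("type") == "deltalake" and adapter.get("enabled", False)
    ]


# Hypothetical sample data for illustration only.
sample_adapters = [
    {"type": "deltalake", "enabled": True, "name": "DatabricksExampleOutbound"},
    {"type": "datashare", "enabled": True, "name": "ExampleShare"},
    {"type": "deltalake", "enabled": False, "name": "DatabricksDisabledOutbound"},
]

print(find_deltalake_adapters(sample_adapters))
```

If the returned list is empty, this article probably does not apply to the tenant.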
2. Confirm the Clone & Sync Timeline
Clone date: When was the tenant cloned (source → target)?
Re‑sync date(s): When did they run syncToDataPipeline (if at all)?
Databricks observation date: When did they run their Databricks counts?
Look for patterns like:
Clone on Date A,
syncToDataPipeline run on Date B (weeks later),
Complaints about Databricks count on Date C.
This helps explain why a “pure snapshot” view may no longer exist.
3. Check Databricks Counts vs Reltio
Have the customer run simple counts in Databricks and provide Reltio counts for comparison.
Examples (Databricks SQL):
-- Entities
SELECT COUNT(*) AS total_entities
FROM <catalog>.<schema>.entity_organization;

-- Relationships
SELECT COUNT(*) AS total_rels
FROM <catalog>.<schema>.relations_<type>;

-- Activities
SELECT COUNT(*) AS total_activities
FROM <catalog>.<schema>.activities;
Compare with:
Reltio UI or API entity count
Reltio relations and activities (or at least sample checks)
Flag if you see:
Databricks total rows are much higher than Reltio's entity count.
Databricks distinct entity counts are lower than Reltio's.
Databricks relation/activity counts are dramatically lower than Reltio.
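As a rough aid, the three flags above can be expressed as a small Python check. This is a sketch under stated assumptions: the function name, the thresholds (inflation_ratio, low_ratio), and the sample numbers other than the 1.4B / ~600M figures quoted earlier are illustrative, not documented values.

```python
def flag_count_symptoms(reltio_entities, dbx_total_rows, dbx_distinct_entities,
                        reltio_relations=None, dbx_relations=None,
                        inflation_ratio=1.1, low_ratio=0.5):
    """Return a list of symptom labels based on simple count comparisons.

    inflation_ratio and low_ratio are arbitrary illustrative thresholds.
    """
    flags = []
    # More rows than entities suggests old and new data are mixed.
    if dbx_total_rows > reltio_entities * inflation_ratio:
        flags.append("inflated_rows")
    # Fewer distinct entities suggests incomplete coverage.
    if dbx_distinct_entities < reltio_entities:
        flags.append("missing_entities")
    # Very low relation counts suggest the non-entity pipeline is not running.
    if reltio_relations and dbx_relations is not None:
        if dbx_relations / reltio_relations < low_ratio:
            flags.append("low_relations")
    return flags


# Hypothetical numbers: 1.4B rows for ~600M entities, with gaps in coverage.
print(flag_count_symptoms(600_000_000, 1_400_000_000, 550_000_000,
                          reltio_relations=2_000_000, dbx_relations=300))
```

Any non-empty result is a reason to continue with the pipeline and S3 checks in the next steps.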
4. Verify Databricks Pipelines
There should be one pipeline for entities and one for relations/activities/links/merges (or equivalents).
Each pipeline should exist, be enabled, and have successful runs after the last syncToDataPipeline.
If the non‑entity pipeline is disabled or failing:
That explains low relation/activity counts even if entities look mostly correct.
5. Check S3 Lifecycle and Cleanup
What is the retention period for the S3 staging bucket (e.g., 14 days, 45 days)?
Did they perform any bulk S3 cleanup (e.g., from X TB down to Y TB) before/after the re‑sync?
If old files were still present during syncToDataPipeline:
Databricks may have processed a combination of:
Pre‑clone events,
Clone/post‑clone events,
Re‑sync events.
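To reason about which of these event "eras" could still be present in S3 on a given day, a simplified model can help. The sketch below assumes that files expire exactly retention_days after they are written and that events were written continuously in each era; the function name and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def visible_eras(clone_date, sync_date, check_date, retention_days):
    """List which event eras may still have files in S3 on check_date.

    Simplified model: a file written on day D is visible on check_date
    only if D is later than check_date - retention_days.
    """
    oldest_visible = check_date - timedelta(days=retention_days)
    eras = []
    # Pre-clone files were written before clone_date.
    if oldest_visible < clone_date:
        eras.append("pre-clone")
    # Clone/post-clone files were written between clone_date and sync_date.
    if clone_date < sync_date and oldest_visible < sync_date:
        eras.append("clone/post-clone")
    # Re-sync files were written on or after sync_date.
    if sync_date <= check_date:
        eras.append("re-sync")
    return eras


# Hypothetical timeline: clone on Jan 10, re-sync on Feb 1, check on Feb 5.
print(visible_eras(date(2024, 1, 10), date(2024, 2, 1), date(2024, 2, 5), 45))
print(visible_eras(date(2024, 1, 10), date(2024, 2, 1), date(2024, 2, 5), 14))
```

With a 45-day retention all three eras overlap in this example, while a 14-day retention has already aged out the pre-clone files.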
Resolution
Recommended Remediation Pattern
When Databricks contains a mixed or incomplete view after a tenant clone, the minimal, supported fix is:
Ensure a good syncToDataPipeline run (if needed).
Trigger a full Databricks refresh via trigger_pipeline with "fullRefresh": true.
Confirm all relevant pipelines (entity + non‑entity) are enabled.
Validate counts and spot‑check records.
Step 1 – (Optional) Run syncToDataPipeline for a Fresh Snapshot
If the last full sync was long ago or the customer wants to re‑take a snapshot:
POST https://<env>.reltio.com/reltio/api/<tenantId>/syncToDataPipeline?distributed=true&taskPartsCount=<N>
distributed=true and an appropriate taskPartsCount (e.g., 32–64) are recommended for large tenants (hundreds of millions of records).
Advise scheduling during a quiet window.
Throughput and time expectations should be discussed (6–10 hours for ~600M records is typical, but depends on env).
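For scripting, the request URL above can be assembled as follows. This is a minimal sketch that only builds the URL string from the pattern shown in this step; it does not send the request, and authentication (which is not covered in this article) is left out. The helper name and the sample environment and tenant values are hypothetical.

```python
from urllib.parse import urlencode


def build_sync_url(env, tenant_id, task_parts_count=32):
    """Build the syncToDataPipeline URL for a distributed run.

    The default of 32 parts is taken from the 32-64 range mentioned above
    and should be tuned per tenant.
    """
    query = urlencode({"distributed": "true", "taskPartsCount": task_parts_count})
    return f"https://{env}.reltio.com/reltio/api/{tenant_id}/syncToDataPipeline?{query}"


# Hypothetical environment and tenant ID.
print(build_sync_url("myenv", "myTenantId", task_parts_count=64))
```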
Step 2 – Trigger Full Refresh in Databricks
Once S3 contains the desired event history:
Use the Data Pipeline Hub adapter action:
POST https://<env>-data-pipeline-hub.reltio.com/api/tenants/<tenantId>/adapters/<adapterName>/actions/trigger_pipeline
Content-Type: application/json

{ "fullRefresh": true }
<adapterName> is typically the name field from the adapter config (e.g., DatabricksMCOutbound).
"fullRefresh": true runs the pipeline in FullRefreshAll mode:
Deletes existing data in the Databricks tables for this adapter.
Rebuilds those tables from the current S3 files.
This is the key step that removes “old era” data and ensures a clean, consistent dataset.
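The same call can be prepared from Python's standard library. The sketch below builds the request object from the URL pattern and body shown in this step but does not send it; authentication headers are not covered in this article and are omitted, and the helper name and sample values are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_full_refresh_request(env, tenant_id, adapter_name):
    """Prepare (but do not send) a trigger_pipeline request with fullRefresh."""
    url = (
        f"https://{env}-data-pipeline-hub.reltio.com/api/tenants/"
        f"{quote(tenant_id, safe='')}/adapters/{quote(adapter_name, safe='')}"
        "/actions/trigger_pipeline"
    )
    body = json.dumps({"fullRefresh": True}).encode("utf-8")
    # Add whatever authentication your environment requires before sending.
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


# Hypothetical environment and tenant ID; adapter name from the example above.
req = build_full_refresh_request("myenv", "myTenantId", "DatabricksMCOutbound")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Because a full refresh deletes existing table content for the adapter, it is worth printing and reviewing the prepared request before sending it.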
Step 3 – Ensure All Pipelines Participate
Confirm that the entity pipeline ran successfully in this full refresh.
Confirm that the relations/activities/links/merges pipeline also ran successfully.
If the non‑entity pipeline remains off/failing:
Relationships and activities will not be complete, even after fullRefresh.
Step 4 – Post‑Refresh Validation
After the pipelines finish:
Compare:
Total entities in Databricks vs Reltio (by entity type if possible).
Relationship counts by type vs Reltio.
Activity counts vs Reltio (if applicable).
Perform spot checks:
Pick a few URIs and compare:
Entity attributes,
Relationships,
Activities, in both Reltio and Databricks.
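The count comparison can be recorded in a consistent form with a small helper. This is an illustrative sketch: it assumes you have already collected per-type counts from Reltio and Databricks into dictionaries, and the function name, the tolerance parameter, and the sample figures are hypothetical.

```python
def reconcile_counts(reltio_counts, databricks_counts, tolerance=0.0):
    """Compare per-type counts and report the difference for each type.

    tolerance is the allowed relative difference (0.01 means 1 percent).
    """
    report = {}
    for name, expected in reltio_counts.items():
        actual = databricks_counts.get(name, 0)
        diff = actual - expected
        report[name] = {
            "reltio": expected,
            "databricks": actual,
            "diff": diff,
            "ok": abs(diff) <= expected * tolerance,
        }
    return report


# Hypothetical counts gathered after the full refresh.
reltio = {"Organization": 1000, "HasAddress": 500}
databricks = {"Organization": 1000, "HasAddress": 480}
for name, row in reconcile_counts(reltio, databricks, tolerance=0.01).items():
    print(name, row)
```

Types reported with "ok": False are good candidates for the URI-level spot checks.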
Document successful reconciliation in the Zendesk ticket.
Prevention / Best Practices
For future clones with Databricks integration:
Treat clone + Databricks as a controlled operation
Plan clone, S3 cleanup, syncToDataPipeline, and Databricks full refresh as a single flow.
Always plan a Databricks full refresh after a major clone
Especially when using Triggered pipelines and long S3 retention.
Verify pipeline health after changes
Ensure both entity and non‑entity pipelines are active and healthy.
Reconciliation as a standard step
After clone + full refresh, always run basic counts and spot checks before UAT or go‑live.