Data Not Syncing from Reltio to Databricks After Tenant Clone (Delta Lake / Data Pipeline Hub)

 

Related docs:

  • Tenant data synchronization (Reltio documentation)

  • Databricks Delta Live Tables trigger API (trigger_pipeline) (Reltio documentation)

 

Problem

After cloning a Reltio tenant that is integrated with Databricks (Delta Lake), customers may see:

  • Databricks does not automatically receive the cloned data.

  • Data appears in Databricks only when new CRUD operations are performed on the cloned tenant.

  • After running syncToDataPipeline, Databricks shows:

    • More total rows than the number of entities shown in Reltio, and/or

    • Fewer distinct entities than Reltio, and/or

    • Very low relationship and activity counts compared to Reltio.

Customers commonly ask:

  • Is this expected behavior after a clone?

  • Do we need to manually trigger something for Databricks?

  • Is there a way to perform a full sync (hundreds of millions of records)?

  • How should old data in S3 be cleaned up?

This article explains why this happens and how to fix it.

 

Root Cause (High‑Level)

  1. Tenant clone does not re‑publish all data to Databricks

    As documented in the tenant synchronization docs:

    • A clone operation (for example, PROD → TEST) copies data and configuration from one Reltio tenant to another.

    • A clone does not automatically re‑publish all cloned data to Data Pipeline Hub or Databricks.

    • To stream the full tenant dataset downstream, you must explicitly run syncToDataPipeline.

  2. Databricks tables are not automatically cleared

    In Delta Lake mode with DPH:

    • Databricks tables are treated as the downstream store; they are not automatically truncated during clones or re‑syncs.

    • If you run syncToDataPipeline without resetting Databricks tables, Databricks ends up with:

      • Old data from the pre‑clone tenant state, plus

      • New data from the post‑clone tenant state.

    This can produce:

    • Inflated overall row counts (e.g., 1.4B rows for ~600M entities).

    • Lower distinct entity counts than Reltio (missing coverage for some entities).

  3. S3 lifecycle and overlapping history

    • DPH writes JSON files to an S3 staging path.

    • If S3 has a multi‑week lifecycle (e.g., 45 days), then during/after a clone, there may be:

      • Old pre‑clone events,

      • Clone/post‑clone events, and

      • Re‑sync events, all co‑existing.

    • Databricks Auto Loader / DLT treats S3 as append‑only. Without a full table refresh, it will:

      • Continue ingesting any visible files,

      • Potentially re‑process some files more than once,

      • Never “know” that the tenant was cloned.

  4. Non‑entity pipelines not running (relations/activities)

    Many Databricks setups use two pipelines:

    • Entity pipeline (entity tables)

    • Non‑entity pipeline (relations, activities, links, merges)

    If the non‑entity pipeline is disabled, misconfigured, or failing:

    • syncToDataPipeline may still publish relation/activity events into S3.

    • But the Databricks relation and activity tables remain empty or severely underpopulated.

    Symptoms:

    • Relationships in Reltio: thousands/millions; in Databricks: tens or hundreds.

    • Activities in Reltio: hundreds of millions; in Databricks: a small fraction.
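The count symptoms from causes 2 and 3 can be reproduced with a toy model. The sketch below is illustrative only: the URIs and event lists are made up, and it assumes that part of the re-sync output has not yet been ingested. It shows how an append-only table can hold more rows than Reltio has entities while still covering fewer distinct entities.

```python
# Toy model of an append-only downstream table. All data here is made up.
tenant_uris = [f"entities/{i}" for i in range(10)]  # what the cloned tenant holds

# Rows left over from before the clone (never truncated).
pre_clone_rows = [{"uri": f"entities/{i}", "era": "pre-clone"} for i in range(6)]

# Rows from the re-sync; entities/9 is assumed not to have been ingested yet.
resync_rows = [{"uri": f"entities/{i}", "era": "re-sync"} for i in range(2, 9)]

# Without a full refresh, both eras sit side by side in the same table.
table_rows = pre_clone_rows + resync_rows

reltio_entities = len(tenant_uris)
total_rows = len(table_rows)
distinct_uris = len({row["uri"] for row in table_rows})

print(f"Reltio entities: {reltio_entities}")
print(f"Databricks total rows: {total_rows}")
print(f"Databricks distinct entities: {distinct_uris}")
```

In this model the table has more rows than Reltio has entities and fewer distinct entities, which matches the first two symptoms listed in the Problem section.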

 

Expected Behavior (Per Documentation)

The documented behavior is:

  • A clone does not cause a full re‑publish of cloned data to Data Pipeline Hub or Databricks.

  • To stream all tenant data to DPH, you must run syncToDataPipeline on the tenant.

  • For Databricks in Triggered mode, you must call trigger_pipeline on the Databricks adapter.

  • When Databricks tables contain stale/mixed content, the supported way to rebuild them is:

    • trigger_pipeline with "fullRefresh": true

    • This deletes existing table content for that adapter and recreates it from S3.


Diagnosis Steps

1. Confirm Integration Type

Check tenant configuration (from Jira, internal tools, or customer):

  • dataPipelineConfig.enabled = true

  • Adapter with "type": "deltalake" (not datashare)

  • Example adapter snippet:

    {
      "type": "deltalake",
      "enabled": true,
      "name": "Databricks<Something>Outbound",
      "cloudProvider": "AWS",
      "stagingBucket": "<customer-outbound-bucket>/<path>",
      "databricksConfig": {
        "databricksHost": "https://<customer>.cloud.databricks.com/",
        "isContinuous": false,
        "catalog": "<catalog-name>"
      }
    }

If type = datalake or dataShareEnabled = true, refer to the appropriate Datashare article instead.
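If the adapter configuration is available as JSON, the check can be scripted. The sketch below is a minimal example that reads only the fields shown in the adapter snippet. The function name classify_adapter and its return labels are hypothetical, and mapping "isContinuous": false to Triggered mode is an assumption.

```python
def classify_adapter(adapter: dict) -> str:
    """Label an adapter using only the fields shown in the example snippet."""
    if not adapter.get("enabled", False):
        return "disabled"
    if adapter.get("type") != "deltalake":
        return "not-deltalake"
    # Assumption: isContinuous=false corresponds to a Triggered pipeline.
    continuous = adapter.get("databricksConfig", {}).get("isContinuous", False)
    return "deltalake-continuous" if continuous else "deltalake-triggered"


example = {
    "type": "deltalake",
    "enabled": True,
    "name": "DatabricksExampleOutbound",  # placeholder name
    "databricksConfig": {"isContinuous": False},
}
print(classify_adapter(example))
```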

2. Confirm the Clone & Sync Timeline

  • Clone date: When was the tenant cloned (source → target)?

  • Re‑sync date(s): When did they run syncToDataPipeline (if at all)?

  • Databricks observation date: When did they run their Databricks counts?

Look for patterns like:

  • Clone on Date A,

  • syncToDataPipeline run on Date B (weeks later),

  • Complaints about Databricks count on Date C.

This helps explain why a “pure snapshot” view may no longer exist.

3. Check Databricks Counts vs Reltio

Have the customer run simple counts in Databricks and provide Reltio counts for comparison.

Examples (Databricks SQL):

-- Entities
SELECT COUNT(*) AS total_entities
FROM <catalog>.<schema>.entity_organization;

-- Relationships
SELECT COUNT(*) AS total_rels
FROM <catalog>.<schema>.relations_<type>;

-- Activities
SELECT COUNT(*) AS total_activities
FROM <catalog>.<schema>.activities;

Compare with:

  • Reltio UI or API entity count

  • Reltio relations and activities (or at least sample checks)

Flag if you see:

  • Databricks total rows are much higher than Reltio's entity count.

  • Databricks distinct entities are lower than Reltio's.

  • Databricks relation/activity counts are dramatically lower than Reltio.
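Once both sets of counts are collected, the three flags can be applied mechanically. The sketch below is one possible way to do that. The function name, the dictionary keys, and the low_ratio threshold are hypothetical, and the relation and activity numbers in the example are made up.

```python
def flag_count_mismatches(reltio: dict, databricks: dict, low_ratio: float = 0.5) -> list:
    """Return a list of flags describing how Databricks counts deviate from Reltio."""
    flags = []
    if databricks["total_rows"] > reltio["entities"]:
        flags.append("total rows exceed Reltio entity count")
    if databricks["distinct_entities"] < reltio["entities"]:
        flags.append("distinct entities below Reltio entity count")
    for kind in ("relations", "activities"):
        # low_ratio is an arbitrary illustrative threshold, not a documented value.
        if databricks[kind] < low_ratio * reltio[kind]:
            flags.append(f"{kind} count far below Reltio")
    return flags


reltio_counts = {"entities": 600_000_000, "relations": 2_000_000, "activities": 300_000_000}
databricks_counts = {
    "total_rows": 1_400_000_000,
    "distinct_entities": 590_000_000,
    "relations": 150,
    "activities": 1_000_000,
}
for flag in flag_count_mismatches(reltio_counts, databricks_counts):
    print(flag)
```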

4. Verify Databricks Pipelines

  • There should be one pipeline for entities and one for relations/activities/links/merges (or equivalents).

  • Each pipeline should exist, be enabled, and have successful runs after the last syncToDataPipeline.

If the non‑entity pipeline is disabled or failing:

  • That explains low relation/activity counts even if entities look mostly correct.
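If the customer can export recent pipeline run history, the "successful run after the last sync" check can be expressed as a small function. The sketch below assumes a hypothetical record shape (pipeline, state, finished_at) and a hypothetical "COMPLETED" state value. Neither is taken from the Databricks API, so adapt them to whatever the export actually contains.

```python
from datetime import datetime


def pipelines_without_success(runs: list, expected: list, last_sync: datetime) -> list:
    """Return the expected pipelines that have no successful run after last_sync."""
    healthy = {
        run["pipeline"]
        for run in runs
        if run["state"] == "COMPLETED" and run["finished_at"] > last_sync
    }
    return sorted(set(expected) - healthy)


# Made-up run history: the non-entity pipeline last succeeded before the sync.
last_sync = datetime(2024, 3, 10, 8, 0)
runs = [
    {"pipeline": "entity", "state": "COMPLETED", "finished_at": datetime(2024, 3, 11, 2, 0)},
    {"pipeline": "non-entity", "state": "COMPLETED", "finished_at": datetime(2024, 3, 1, 2, 0)},
    {"pipeline": "non-entity", "state": "FAILED", "finished_at": datetime(2024, 3, 11, 2, 5)},
]
print(pipelines_without_success(runs, ["entity", "non-entity"], last_sync))
```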

5. Check S3 Lifecycle and Cleanup

  • What is the retention period for the S3 staging bucket (e.g., 14 days, 45 days)?

  • Did they perform any bulk S3 cleanup (e.g., from X TB down to Y TB) before/after the re‑sync?

If old files were still present during syncToDataPipeline:

  • Databricks may have processed a combination of:

    • Pre‑clone events,

    • Clone/post‑clone events,

    • Re‑sync events.
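Whether pre-clone files were still visible at re-sync time follows from the dates and the retention period. The sketch below is a simple date calculation. It assumes that the lifecycle rule expires each file a fixed number of days after it was written, and the function name is hypothetical.

```python
from datetime import date, timedelta


def pre_clone_files_visible(clone_date: date, sync_date: date, retention_days: int) -> bool:
    """True if files written just before the clone are still retained on sync_date."""
    # Assumption: a file written on clone_date expires retention_days later.
    expiry = clone_date + timedelta(days=retention_days)
    return sync_date < expiry


clone = date(2024, 1, 1)
sync = date(2024, 1, 20)
print(pre_clone_files_visible(clone, sync, retention_days=45))  # long retention
print(pre_clone_files_visible(clone, sync, retention_days=14))  # short retention
```

With a 45-day retention the pre-clone files overlap with the re-sync output; with a 14-day retention they have already expired by the sync date.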

 

Resolution

When Databricks contains a mixed or incomplete view after a tenant clone, the minimal, supported fix is:

  1. Ensure a good syncToDataPipeline run (if needed).

  2. Trigger a full Databricks refresh via trigger_pipeline with "fullRefresh": true.

  3. Confirm all relevant pipelines (entity + non‑entity) are enabled.

  4. Validate counts and spot‑check records.

Step 1 – (Optional) Run syncToDataPipeline for a Fresh Snapshot

If the last full sync was long ago or the customer wants to re‑take a snapshot:

POST https://<env>.reltio.com/reltio/api/<tenantId>/syncToDataPipeline?distributed=true&taskPartsCount=<N>

  • distributed=true and an appropriate taskPartsCount (e.g., 32–64) are recommended for large tenants (hundreds of millions of records).

  • Advise scheduling during a quiet window.

  • Discuss throughput and duration expectations with the customer (6–10 hours for ~600M records is typical, but this depends on the environment).
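To avoid typos in the query string, the request URL can be assembled programmatically. The sketch below only builds the URL from the pattern shown in this step and does not send anything. The function name is hypothetical, and the environment and tenant ID in the example are placeholders.

```python
from urllib.parse import urlencode


def build_sync_url(env: str, tenant_id: str, task_parts_count: int) -> str:
    """Build the syncToDataPipeline URL following the pattern shown in this step."""
    query = urlencode({"distributed": "true", "taskPartsCount": task_parts_count})
    return f"https://{env}.reltio.com/reltio/api/{tenant_id}/syncToDataPipeline?{query}"


# Placeholder values; substitute the real environment and tenant ID.
print(build_sync_url("test", "abc123", 32))
```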

Step 2 – Trigger Full Refresh in Databricks

Once S3 contains the desired event history:

Use the Data Pipeline Hub adapter action:

POST https://<env>-data-pipeline-hub.reltio.com/api/tenants/<tenantId>/adapters/<adapterName>/actions/trigger_pipeline
Content-Type: application/json

{ "fullRefresh": true }

  • <adapterName> is typically the name field from the adapter config (e.g., DatabricksMCOutbound).

  • "fullRefresh": true runs the pipeline in FullRefreshAll mode:

    • Deletes existing data in the Databricks tables for this adapter.

    • Rebuilds those tables from the current S3 files.

This is the key step that removes “old era” data and ensures a clean, consistent dataset.
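The trigger_pipeline call can also be prepared in a script. The sketch below builds the request object with the Python standard library but does not send it. Authentication is left out because this article does not cover it, so assume the appropriate authorization header has to be added before sending. The function name and the example tenant ID are hypothetical.

```python
import json
from urllib.request import Request


def build_trigger_request(env: str, tenant_id: str, adapter_name: str,
                          full_refresh: bool = True) -> Request:
    """Prepare (but do not send) the trigger_pipeline POST request."""
    url = (
        f"https://{env}-data-pipeline-hub.reltio.com/api/tenants/{tenant_id}"
        f"/adapters/{adapter_name}/actions/trigger_pipeline"
    )
    body = json.dumps({"fullRefresh": full_refresh}).encode("utf-8")
    # Assumption: an authorization header must be added before sending.
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


req = build_trigger_request("test", "abc123", "DatabricksMCOutbound")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the prepared request to urllib.request.urlopen would send it; keeping the build step separate makes it easy to review the URL and body first.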

Step 3 – Ensure All Pipelines Participate

  • Confirm that the entity pipeline ran successfully in this full refresh.

  • Confirm that the relations/activities/links/merges pipeline also ran successfully.

If the non‑entity pipeline remains off/failing:

  • Relationships and activities will not be complete, even after fullRefresh.

Step 4 – Post‑Refresh Validation

After the pipelines finish:

  • Compare:

    • Total entities in Databricks vs Reltio (by entity type if possible).

    • Relationship counts by type vs Reltio.

    • Activity counts vs Reltio (if applicable).

  • Perform spot checks:

    • Pick a few URIs and compare:

      • Entity attributes,

      • Relationships,

      • Activities, in both Reltio and Databricks.
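For spot checks, a field-by-field comparison makes differences easy to see. The sketch below compares two flat dictionaries for the same URI. The function name and the record shape are hypothetical, and real Reltio and Databricks records would likely need to be flattened into this form first.

```python
def diff_records(reltio_record: dict, databricks_record: dict) -> dict:
    """Return {field: (reltio_value, databricks_value)} for every differing field."""
    keys = set(reltio_record) | set(databricks_record)
    return {
        key: (reltio_record.get(key), databricks_record.get(key))
        for key in sorted(keys)
        if reltio_record.get(key) != databricks_record.get(key)
    }


# Made-up records for one URI.
reltio_rec = {"uri": "entities/abc", "Name": "Acme Corp", "Country": "US"}
databricks_rec = {"uri": "entities/abc", "Name": "Acme Corp", "Country": "CA"}
print(diff_records(reltio_rec, databricks_rec))
```

An empty result means the two records agree on every field that was compared.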

Document successful reconciliation in the Zendesk ticket.

Prevention / Best Practices

For future clones with Databricks integration:

  1. Treat clone + Databricks as a controlled operation

    • Plan clone, S3 cleanup, syncToDataPipeline, and Databricks full refresh as a single flow.

  2. Always plan a Databricks full refresh after a major clone

    • Especially when using Triggered pipelines and long S3 retention.

  3. Verify pipeline health after changes

    • Ensure both entity and non‑entity pipelines are active and healthy.

  4. Reconciliation as a standard step

    • After clone + full refresh, always run basic counts and spot checks before UAT or go‑live.


 
