How to Identify Merged and Unmerged (Split/Unmerge) Profiles in DPH / Databricks?

Overview

When profiles are merged in Reltio, the lineage is captured downstream in DPH/Databricks via merges and links tables. When profiles are unmerged (split), the event is captured in the activities table (via an items array containing an event type such as ENTITIES_SPLITTED).

Because DPH streams events (CREATE/UPDATE/MERGE/DELETE, etc.), you don’t typically get a single “this entity is unmerged” flag. Instead, you detect “unmerged/split activity” by querying activities for split events and then cross-checking lineage in merges/links.

What you can and can’t “see” in Databricks

Merged entities

You can identify merges directly from the merges dataset (winner/loser, event timestamp, ingest time, etc.).

“Unmerged entities”

You typically can’t list “unmerged entities” as a direct event state in DPH, because “unmerged” is not a persistent marker—DPH is event-driven. In practice:

  • An “unmerged entity” is usually interpreted as:

    • not deleted, and

    • not currently represented as a loser in the merge lineage, and/or

    • appears in split/unmerge activity events (if you’re specifically tracking split operations)

So you detect split/unmerge via activities, then validate lineage via links/merges.


Datasets involved

  • activities: Contains an items array; each item has eventType, objectUri, and data (payload).

  • merges: Merge lineage and timestamps (winnerId, loserId, timestamp, insertedTime, and possibly deleted).

  • links: Canonical mapping of loser → winner across time; includes deleted state for link records.

  • entities: Entity lifecycle state, including deleted and endDate (soft delete time).


Query 1 — Check merge lineage and ingest lag (merged entities)

Use this when you have entity IDs and want to inspect merge events and how quickly they arrived in Databricks.

WITH merged AS (
  SELECT
    winnerId,
    loserId,
    to_timestamp(timestamp/1000) AS merge_event_ts,
    to_timestamp(insertedTime/1000) AS merge_ingest_ts
  FROM <catalog>.<schema>.merges
  WHERE winnerId IN ('entityId1', 'entityId2')
     OR loserId  IN ('entityId1', 'entityId2')
)
SELECT
  CASE
    WHEN winnerId IN ('entityId1', 'entityId2') THEN winnerId
    ELSE loserId
  END AS entity_uri,
  merge_event_ts,
  merge_ingest_ts,
  (merge_ingest_ts - merge_event_ts) AS merge_ingest_lag
FROM merged
ORDER BY merge_ingest_ts;

Replace:

  • <catalog>.<schema> with your Databricks catalog/schema

  • entityId1, entityId2 with real entity IDs

Query 2 — List split/unmerge events directly from Activities

This is the primary way to identify split/unmerge activity.

SELECT
  a.activityTimestamp,
  a.activityUser,
  i.eventType,
  i.objectUri          AS affectedObjectUri,
  i.data               AS eventPayload
FROM   <catalog>.<schema>.activities a
LATERAL VIEW explode(a.items) AS i
WHERE  i.eventType = 'ENTITIES_SPLITTED'
ORDER BY a.activityTimestamp DESC;

What this gives you

  • Who performed the action (activityUser)

  • When it occurred (activityTimestamp)

  • Which object(s) were affected (objectUri)

  • Payload details (data)

Query 3 — See which entity URIs were involved in recent splits

Useful when you want a compact list of split events with the raw object references.

WITH splits AS (
  SELECT a.activityTimestamp, i.objectUri, i.data
  FROM   <catalog>.<schema>.activities a
  LATERAL VIEW explode(a.items) AS i
  WHERE  i.eventType = 'ENTITIES_SPLITTED'
)
SELECT *
FROM splits
ORDER BY activityTimestamp DESC;

Query 4 — Cross-check lineage around split timestamps using merges

Once you have split-related objectUris, cross-check whether those entities appear in merge history.

SELECT m.timestamp, m.winnerId, m.loserId, m.type, m.deleted
FROM   <catalog>.<schema>.merges m
WHERE  m.winnerId IN (
        SELECT regexp_extract(i.objectUri, 'entities/([^"]+)', 1)
        FROM   <catalog>.<schema>.activities a
        LATERAL VIEW explode(a.items) AS i
        WHERE  i.eventType = 'ENTITIES_SPLITTED'
      )
  OR  m.loserId IN (
        SELECT regexp_extract(i.objectUri, 'entities/([^"]+)', 1)
        FROM   <catalog>.<schema>.activities a
        LATERAL VIEW explode(a.items) AS i
        WHERE  i.eventType = 'ENTITIES_SPLITTED'
      )
ORDER BY m.timestamp DESC;

Validation queries for confirming “split state” and distinguishing from deletes

Inspect the LINKS history during a suspected unmerge window

What to look for: a prior loser→winner mapping later marked as deleted = true.

SELECT
  winnerId, loserId, deleted, version, timestamp, insertedTime
FROM <catalog>.<schema>.links
WHERE (winnerId IN ('<known_winner_id>') OR loserId IN ('<known_loser_id>'))
ORDER BY timestamp;

Inspect MERGES rows around the same entities

What to look for: whether historical merge rows later show deleted = true or whether new rows appear.

SELECT *
FROM <catalog>.<schema>.merges
WHERE winnerId = '<known_winner_id>' OR loserId = '<known_loser_id>'
ORDER BY timestamp;

Confirm entity soft-delete vs unmerge (Entities table)

Rule of thumb: unmerge should not soft-delete entities.
Soft delete shows deleted=true and an endDate.

SELECT uri, deleted, startDate, endDate, commitTime
FROM <catalog>.<schema>.entities
WHERE uri IN ('entities/<id_A>', 'entities/<id_B>');

Troubleshooting checklist

  • No split events found? Confirm the correct eventType value used in your pipeline (ENTITIES_SPLITTED vs other naming variants).

  • Downstream tables missing deletes/loser updates? Ensure your pipeline is configured to propagate removed records (commonly reindexDeleted=true in relevant connector settings).

  • Conflicting lineage results? Treat links as the canonical mapping history for winner resolution, and join the downstream relationship datasets to links to resolve the current winners.

Field semantics to know

Entities table

  • deleted (boolean): indicates if the entity is deleted

  • endDate: time the entity was soft deleted (if deleted)

Links table

  • deleted (boolean): indicates whether the lineage mapping row is deleted

  • Links represent the canonical loser → winner mapping history

Merges table

  • deleted (boolean): indicates whether the entity is deleted (as represented in merge dataset semantics)

 

 

 

 

 

 

Was this article helpful?
0 out of 0 found this helpful

Comments

0 comments

Please sign in to leave a comment.