What does reindexDeleted=true do in the syncToDataPipeline task?

Question

When you add the parameter reindexDeleted=true to the syncToDataPipeline endpoint, it changes the behavior of the sync operation as follows:

  • Purpose: It ensures that events for removed (deleted or loser) entities and relations are also sent to the Data Pipeline Hub (and thus to downstream adapters like BigQuery, Snowflake, etc.).

  • Default Behavior: By default, deleted entities and relations are NOT included in the sync.

  • With reindexDeleted=true: The sync will include both active and deleted (or loser) records, helping to maintain consistency in downstream systems, especially after merges or deletions.

 

Example Usage

POST https://{env}.reltio.com/reltio/api/{tenantID}/syncToDataPipeline?reindexDeleted=true
  • This will trigger the sync of all data, including deleted/loser records.
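
The request above can be sketched in Python with only the standard library. The helper names (build_sync_url, trigger_sync) are hypothetical, and the Bearer-token header is an assumption about how your tenant authenticates; adjust both to your setup.

```python
import urllib.parse
import urllib.request


def build_sync_url(env: str, tenant_id: str, reindex_deleted: bool = False) -> str:
    """Build the syncToDataPipeline URL (hypothetical helper, not a Reltio SDK call)."""
    base = f"https://{env}.reltio.com/reltio/api/{tenant_id}/syncToDataPipeline"
    if not reindex_deleted:
        # Default behavior: deleted/loser records are not included in the sync.
        return base
    return base + "?" + urllib.parse.urlencode({"reindexDeleted": "true"})


def trigger_sync(url: str, access_token: str) -> int:
    """POST to the endpoint and return the HTTP status code.

    Assumption: the API accepts an OAuth Bearer token in the Authorization header.
    """
    request = urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling build_sync_url("dev", "myTenant", reindex_deleted=True) yields the URL shown in the example, with placeholder values substituted for {env} and {tenantID}.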

Important Notes

  • If you specify an exact dataType (e.g., dataTypes=entities), the reindexDeleted parameter will not have any effect unless you also include the deleted type (e.g., dataTypes=entities,deleted_entities).

  • This is particularly useful for keeping downstream data stores in sync with the true state of your Reltio tenant, including records that have been removed due to merges or deletions.
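
The dataTypes caveat above is easy to miss, so a small pre-flight check can help. This is a minimal sketch covering only the entities/deleted_entities pair named in the note; the function name is hypothetical.

```python
def missing_deleted_types(data_types, reindex_deleted):
    """Return the deleted types to add so that reindexDeleted=true has an effect.

    Only the entities -> deleted_entities pairing from the note above is checked.
    """
    if not reindex_deleted or not data_types:
        # Nothing to check: the flag is off, or no explicit dataTypes filter is set.
        return []
    missing = []
    if "entities" in data_types and "deleted_entities" not in data_types:
        missing.append("deleted_entities")
    return missing
```

For example, missing_deleted_types(["entities"], True) reports that deleted_entities should be added to the dataTypes list.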

The primary use cases for reindexDeleted=true with the syncToDataPipeline API in Reltio are summarized below and described in the sections that follow.

 

Summary Table:

Use Case | Why Use reindexDeleted=true?
--- | ---
Downstream sync of deletions | Data consistency
Post-merge clean-up | Remove “loser” records
Bulk deletions | Compliance, accuracy
Full sync/reconciliation | Complete data reflection
Audit/data lineage | Traceability
Reporting consistency | Prevent stale data
Incremental CDC | Capture all changes
Disaster recovery | Restore correct state

 


Synchronizing Deletions to Downstream Systems

  • Scenario: You have downstream systems (like BigQuery, Snowflake, or other analytics/data lakes) that must reflect not only active data but also deletions (including merged-away “loser” records and deleted relations/entities).

  • Why: By default, deletions are not sent to the pipeline. Using reindexDeleted=true ensures that downstream systems can accurately mirror the current state of your Reltio tenant, including removals.

 

Post-Merge Clean-Up

  • Scenario: After a large merge operation, you want to ensure that all “loser” entities (the records that were merged into a survivor and are now logically deleted) are also removed or flagged in downstream data stores.

  • Why: Without this, downstream systems may retain obsolete/merged-away records, leading to data quality issues.
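
To illustrate what flagging “loser” records downstream might look like, here is a minimal sketch against an in-memory SQLite table. The event shape ({"uri": ..., "removed": ...}), the table layout, and the function name are all assumptions for illustration; they are not the format the Data Pipeline Hub or its adapters actually emit.

```python
import sqlite3


def apply_events(conn, events):
    """Upsert each record, marking removed (deleted or loser) ones as inactive.

    Each event is a hypothetical dict: {"uri": str, "removed": bool}.
    """
    for event in events:
        conn.execute(
            "INSERT OR REPLACE INTO entities (uri, is_active) VALUES (?, ?)",
            (event["uri"], 0 if event["removed"] else 1),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (uri TEXT PRIMARY KEY, is_active INTEGER NOT NULL)")

# First sync: both records are active.
apply_events(conn, [
    {"uri": "entities/winner", "removed": False},
    {"uri": "entities/loser", "removed": False},
])
# After a merge, a sync that includes removed records flags the loser.
apply_events(conn, [{"uri": "entities/loser", "removed": True}])

active = [row[0] for row in conn.execute(
    "SELECT uri FROM entities WHERE is_active = 1 ORDER BY uri")]
```

Without the second batch of events, the merged-away record would remain active in the downstream table.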

 

Bulk Deletion Events

  • Scenario: You have performed bulk deletions (e.g., data clean-up, GDPR/CCPA compliance, or periodic purges) and need to propagate these deletions to all connected data pipelines.

  • Why: Ensures regulatory compliance and data consistency across all systems.

 

Initial Full Sync or Data Reconciliation

  • Scenario: You are performing a full re-sync or reconciliation between Reltio and a downstream system, and you want to guarantee that all deletions (historical and recent) are included in the sync.

  • Why: This is critical for initial loads or after a major data correction, so the downstream system is a true reflection of the Reltio tenant.
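
One way to picture the reconciliation problem: any record that exists downstream but is no longer active in the tenant is a candidate stale record that a sync without deletions would leave behind. A minimal sketch, with a hypothetical helper name and made-up identifiers:

```python
def find_stale_records(downstream_uris, active_uris):
    """Return URIs present downstream that are no longer active in the tenant."""
    return sorted(set(downstream_uris) - set(active_uris))


stale = find_stale_records(
    ["entities/a", "entities/b", "entities/c"],  # what the downstream store holds
    ["entities/a", "entities/c"],                # what is still active in the tenant
)
```

Here stale contains only "entities/b", the record that was removed at the source but never removed downstream.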

 

Audit and Data Lineage Requirements

  • Scenario: Your organization requires a complete audit trail or lineage, including when and which records were deleted or merged away.

  • Why: Including deleted records in the sync supports auditability and traceability in downstream analytics or compliance systems.

 

Downstream Data Warehouse/Reporting Consistency

  • Scenario: Your reporting or analytics relies on the downstream data warehouse being an exact replica of Reltio, including all deletes and merges.

  • Why: Prevents reporting on stale or incorrect data due to missing deletions.

 

Incremental Loads with Change Data Capture

  • Scenario: You are running incremental syncs and want to ensure that deletions are captured as part of the change data capture (CDC) process.

  • Why: Ensures that all changes, including removals, are reflected in the target system.

 

Data Recovery or Disaster Recovery Scenarios

  • Scenario: After a recovery event, you need to re-sync all data, including deletions, to restore downstream systems to the correct state.

  • Why: Guarantees that no deleted or merged-away records are inadvertently restored or left behind.


Here are scenarios where it is not a good idea to use reindexDeleted=true with the syncToDataPipeline endpoint in Reltio:

 

Performance-Sensitive or Large-Scale Syncs

  • Scenario: You are syncing a huge tenant or dataset and want to minimize the volume of data processed and transferred.

  • Why: Including deleted/loser records can significantly increase the amount of data to be processed, potentially impacting performance, sync duration, and resource usage.

 

Downstream System Does Not Track Deletions

  • Scenario: Your downstream system (e.g., reporting database, analytics platform) is designed to only store active records and does not require or support tracking deletions or merged-away records.

  • Why: Sending deleted records may unnecessarily complicate downstream data models or processes.

 

One-Time Sync for Active Data Only

  • Scenario: You are performing a one-time sync to populate a downstream system with only the current, active state of your data (e.g., for a new analytics dashboard that only needs active entities/relations).

  • Why: Including deleted records is unnecessary and may clutter the target system.

 

Testing or Development Environments

  • Scenario: You are running syncs in a test or development environment where you only care about current data for validation or prototyping.

  • Why: Including deleted records may slow down testing and is usually not needed for development purposes.

 

Cost-Sensitive Data Pipelines

  • Scenario: Your data pipeline or downstream storage incurs costs based on data volume (e.g., cloud storage, data transfer).

  • Why: Including deleted records increases data volume and may lead to unnecessary costs if deletions are not required downstream.

 

Downstream Consumers Not Ready for Deletion Events

  • Scenario: Your downstream consumers (ETL jobs, dashboards, applications) are not yet designed to handle deletion or merge events.

  • Why: Sending deleted records could cause errors, confusion, or data integrity issues in those systems.

 

Incremental Syncs Where Deletions Are Already Handled

  • Scenario: You have a separate process or mechanism for handling deletions in downstream systems.

  • Why: Using reindexDeleted=true would be redundant and could lead to duplicate or conflicting deletion events.
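
If two mechanisms can deliver the same deletion, making the downstream handler idempotent keeps duplicates harmless. The following is a sketch only; the dict-based store and the function name are assumptions for illustration.

```python
def apply_deletions(store: dict, deleted_uris) -> int:
    """Remove the given URIs from the store and return how many were actually removed.

    URIs that are already gone are skipped, so replaying an event is a no-op.
    """
    removed = 0
    for uri in deleted_uris:
        if uri in store:
            del store[uri]
            removed += 1
    return removed


store = {"entities/a": {"name": "A"}, "entities/b": {"name": "B"}}
first = apply_deletions(store, ["entities/b"])   # deletion from the existing process
second = apply_deletions(store, ["entities/b"])  # same deletion arriving again via the sync
```

The second call removes nothing, which is the behavior you want when the same deletion arrives from both sources.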

 

Regulatory or Privacy Restrictions

  • Scenario: There are regulatory or privacy requirements that restrict the propagation of deleted data (even as a deletion event) to certain downstream systems.

  • Why: Syncing deleted records could violate compliance policies.


The following breaks down whether using ?dataTypes=merges,relations&reindexDeleted=true with the syncToDataPipeline endpoint is a good idea, based on Reltio documentation and best practices.

What Does This Do?

  • dataTypes=merges,relations: Syncs only merge events and relation data.

  • reindexDeleted=true: Instructs the sync to also include deleted/loser records for the specified data types.
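
When building this query string in code, note that a default URL encoder will percent-encode the comma in dataTypes. The sketch below keeps the comma literal, matching the form shown above; the function name is hypothetical.

```python
from urllib.parse import urlencode


def sync_query(data_types, reindex_deleted):
    """Build the query string for syncToDataPipeline (hypothetical helper)."""
    params = {}
    if data_types:
        params["dataTypes"] = ",".join(data_types)
    if reindex_deleted:
        params["reindexDeleted"] = "true"
    # safe="," keeps the comma between data types unescaped.
    return urlencode(params, safe=",")


query = sync_query(["merges", "relations"], True)
```

The resulting query is dataTypes=merges,relations&reindexDeleted=true, ready to append after the ? in the endpoint URL.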

 

It’s a Good Idea If:

  1. You Need Downstream Consistency for Merges and Relations

    • If your downstream systems (like BigQuery, Snowflake, etc.) must reflect not only active merges and relations but also those that have been deleted or merged away, this combination ensures full fidelity.

    • Example: You want to track the full lifecycle of relationships and merges, including those that have been removed.

  2. You’re Auditing or Reconciling Data

    • If you need a complete audit trail or are reconciling data between Reltio and downstream systems, including deletions is essential.

  3. You’re Cleaning Up After Bulk Merges or Deletions

    • After major merge operations, this ensures that “loser” records and deleted relations are also removed or flagged downstream.


It’s NOT a Good Idea If:

  1. Downstream Systems Only Need Active Data

    • If your downstream consumers only care about current, active merges and relations, including deleted ones adds unnecessary complexity and data volume.

  2. Performance or Cost is a Concern

    • Including deleted records increases the data processed and transferred, which can impact performance and cost, especially for large tenants.

  3. Downstream Consumers Can’t Handle Deletion Events

    • If your downstream ETL, analytics, or reporting tools aren’t set up to process deletions or “loser” records, this could cause confusion or errors.


Special Note from Reltio Documentation

If you specify an exact dataType (for example, dataTypes=entities), the reindexDeleted parameter will not have any effect unless you also include the deleted type (for example, dataTypes=entities,deleted_entities).
However, for merges and relations, reindexDeleted=true will ensure that deleted/loser records for those types are included in the sync.


 
