Is there Compaction Support for Activities Tables for any DPH connectors?

Overview

This article explains why compaction is not supported for the raw_activities table in Reltio's Google BigQuery (GBQ) connector and what this means for duplicate record management.


What is Compaction?

Compaction is an automated maintenance process in Reltio's data pipelines that:

  • Identifies and removes duplicate records from BigQuery tables
  • Retains only the most recent version of each record
  • Reduces storage costs and improves query performance
  • Runs automatically based on configured thresholds and schedules

For most Reltio tables (entities, relations, matches, merges, etc.), compaction executes when the duplicate percentage exceeds the compactionThreshold (typically 50%) and after a configured time window (hoursToCompaction).

Compaction is Not Supported for Activities Tables

Technical Architecture Limitation

The raw_activities table has a fundamentally different structure and purpose compared to other Reltio tables:

Activities Table:

  • Contains audit log entries of all system actions
  • Uses activityId as an identifier, but may have legitimate multiple entries
  • Contains nested RECORD structures (the items array field)
  • Represents historical event streams rather than the current state
  • No built-in compaction logic exists for this table type

Schema Complexity

As shown in your schema, the raw_activities table contains:

  • Top-level activity metadata (activityUser, activityMethod, activityUrl, etc.)
  • A REPEATED RECORD field called items containing:
    • Object references (objectUri, objectType, objectLabel)
    • Start and end object information
    • Event details (eventId, eventType)
    • JSON data and delta fields

This nested structure makes automatic deduplication complex because:

  1. Simple row-level deduplication may lose granular event details
  2. The items array may contain multiple events within a single activity
  3. Determining which record is "latest" or "correct" requires business logic not implemented in the standard compaction process.
 
Total Records: 3,186,225
Distinct Activities: 219,971
Duplication Rate: ~93.1%
Compaction Threshold: 50% (exceeded but not triggered)
```

**Why Compaction Hasn't Run:**
Despite exceeding the 50% threshold, compaction has not executed because:
1. No compaction logic is implemented for `raw_activities`
2. The compaction script specifically excludes this table type
3. Manual API calls to trigger compaction will fail for activities tables

---

## Error Encountered in Previous Attempts

Error while invoking the script for tables compaction
BigQueryException: Caller does not have required permission to use project testbigqry

While this specific error relates to permissions, even with correct permissions, the compaction would not proceed for raw_activities because the underlying compaction logic does not exist for this table type.


Available Solutions

Option 1: Custom Deduplication Script (Development Required)

Description: Create a custom SQL script (raw_activities_compact.sql) to manually deduplicate records.

Example Approach:

 
sql
CREATE OR REPLACE TABLE `project.dataset.raw_activities_cleaned` AS
SELECT * EXCEPT(row_num)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY activityId 
      ORDER BY activityTimestamp DESC
    ) as row_num
  FROM `project.dataset.raw_activities`
)
WHERE row_num = 1;

Considerations:

  • Requires custom development and testing
  • Must account for the nested items structure
  • Need to define business rules for which record to keep
  • Ongoing maintenance required if sync patterns change
  • Should be scheduled to run periodically

Timeline: Custom development and validation required


Option 2: Table Rebuild

Description: Drop and recreate the activities table with a fresh sync.

Process:

  1. Backup current data (if historical analysis is needed)
  2. Drop the table:
 
sql
   DROP TABLE `project.dataset.raw_activities`;
  1. Re-trigger the GBQ sync for activities data only
  2. Monitor the initial sync to ensure a clean data load

Advantages:

  • Immediate reduction in duplicates
  • Clean starting point
  • No custom code development required

Considerations:

  • Temporary loss of historical activity data (unless backed up)
  • Downtime during rebuild
  • Does not prevent future duplicates from accumulating
  • May need to be repeated periodically

Data Quality

  • Reporting Accuracy: COUNT queries return inflated numbers.  The query workaround is to apply the SELECT COUNT using the DISTINCT function for activityID.
  • Analysis Complexity: Requires additional deduplication logic in all queries

     


Best Practices Going Forward

  1. Monitor Sync Jobs: Review sync execution logs to identify overlapping jobs
  2. Use Deduplication Views: Create views for reporting that handle deduplication
  3. Regular Maintenance: Schedule periodic cleanup (manual or automated)
  4. Cost Monitoring: Track BigQuery storage and query costs to measure improvement
  5. Documentation: Document which tables require manual deduplication

Summary

  • Compaction is not supported for raw_activities tables due to architectural limitations
  • Other tables compact successfully because they have simpler structures and implemented logic
  • Duplicates are expected without manual intervention or custom solutions
  • Two primary options exist: custom deduplication scripts or periodic table rebuilds
  • The recommended approach combines immediate cleanup with a long-term maintenance strategy

Related Configuration

Example of DPH pipeline configuration:

 
json
{
  "compactionThreshold": 0.5,
  "hoursToCompaction": 168,
  "splitTable": true
}

These settings apply to supported tables but do not affect raw_activities.


Note: This is a known limitation of the current GBQ connector architecture. Feature requests for automated activities compaction can be submitted through your Reltio account team.

 
 
 

 

Was this article helpful?
0 out of 0 found this helpful

Comments

0 comments

Please sign in to leave a comment.