Overview
This article explains why compaction is not supported for the raw_activities table in Reltio's Google BigQuery (GBQ) connector and what this means for duplicate record management.
What is Compaction?
Compaction is an automated maintenance process in Reltio's data pipelines that:
- Identifies and removes duplicate records from BigQuery tables
- Retains only the most recent version of each record
- Reduces storage costs and improves query performance
- Runs automatically based on configured thresholds and schedules
For most Reltio tables (entities, relations, matches, merges, etc.), compaction executes when the duplicate percentage exceeds the compactionThreshold (typically 50%) and after a configured time window (hoursToCompaction).
Compaction is Not Supported for Activities Tables
Technical Architecture Limitation
The raw_activities table has a fundamentally different structure and purpose compared to other Reltio tables:
Activities Table:
- Contains audit log entries of all system actions
- Uses
activityIdas an identifier, but may have legitimate multiple entries - Contains nested RECORD structures (the
itemsarray field) - Represents historical event streams rather than the current state
- No built-in compaction logic exists for this table type
Schema Complexity
As shown in your schema, the raw_activities table contains:
- Top-level activity metadata (activityUser, activityMethod, activityUrl, etc.)
- A REPEATED RECORD field called
itemscontaining:- Object references (objectUri, objectType, objectLabel)
- Start and end object information
- Event details (eventId, eventType)
- JSON data and delta fields
This nested structure makes automatic deduplication complex because:
- Simple row-level deduplication may lose granular event details
- The
itemsarray may contain multiple events within a single activity - Determining which record is "latest" or "correct" requires business logic not implemented in the standard compaction process.
Total Records: 3,186,225 Distinct Activities: 219,971 Duplication Rate: ~93.1% Compaction Threshold: 50% (exceeded but not triggered) ``` **Why Compaction Hasn't Run:** Despite exceeding the 50% threshold, compaction has not executed because: 1. No compaction logic is implemented for `raw_activities` 2. The compaction script specifically excludes this table type 3. Manual API calls to trigger compaction will fail for activities tables --- ## Error Encountered in Previous Attempts Error while invoking the script for tables compaction BigQueryException: Caller does not have required permission to use project testbigqry
While this specific error relates to permissions, even with correct permissions, the compaction would not proceed for raw_activities because the underlying compaction logic does not exist for this table type.
Available Solutions
Option 1: Custom Deduplication Script (Development Required)
Description: Create a custom SQL script (raw_activities_compact.sql) to manually deduplicate records.
Example Approach:
CREATE OR REPLACE TABLE `project.dataset.raw_activities_cleaned` AS
SELECT * EXCEPT(row_num)
FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY activityId
ORDER BY activityTimestamp DESC
) as row_num
FROM `project.dataset.raw_activities`
)
WHERE row_num = 1;Considerations:
- Requires custom development and testing
- Must account for the nested
itemsstructure - Need to define business rules for which record to keep
- Ongoing maintenance required if sync patterns change
- Should be scheduled to run periodically
Timeline: Custom development and validation required
Option 2: Table Rebuild
Description: Drop and recreate the activities table with a fresh sync.
Process:
- Backup current data (if historical analysis is needed)
- Drop the table:
DROP TABLE `project.dataset.raw_activities`;
- Re-trigger the GBQ sync for activities data only
- Monitor the initial sync to ensure a clean data load
Advantages:
- Immediate reduction in duplicates
- Clean starting point
- No custom code development required
Considerations:
- Temporary loss of historical activity data (unless backed up)
- Downtime during rebuild
- Does not prevent future duplicates from accumulating
- May need to be repeated periodically
Data Quality
- Reporting Accuracy: COUNT queries return inflated numbers. The query workaround is to apply the SELECT COUNT using the DISTINCT function for activityID.
Analysis Complexity: Requires additional deduplication logic in all queries
Best Practices Going Forward
- Monitor Sync Jobs: Review sync execution logs to identify overlapping jobs
- Use Deduplication Views: Create views for reporting that handle deduplication
- Regular Maintenance: Schedule periodic cleanup (manual or automated)
- Cost Monitoring: Track BigQuery storage and query costs to measure improvement
- Documentation: Document which tables require manual deduplication
Summary
- Compaction is not supported for
raw_activitiestables due to architectural limitations - Other tables compact successfully because they have simpler structures and implemented logic
- Duplicates are expected without manual intervention or custom solutions
- Two primary options exist: custom deduplication scripts or periodic table rebuilds
- The recommended approach combines immediate cleanup with a long-term maintenance strategy
Related Configuration
Example of DPH pipeline configuration:
{
"compactionThreshold": 0.5,
"hoursToCompaction": 168,
"splitTable": true
}These settings apply to supported tables but do not affect raw_activities.
Note: This is a known limitation of the current GBQ connector architecture. Feature requests for automated activities compaction can be submitted through your Reltio account team.
Comments
Please sign in to leave a comment.