How can I aggregate the output into a single file in Spark-based export (v2)?


How can I minimize the number of files created by Spark export v2, either to make the output easier to work with or to match the default behavior of export v1?


Spark-based export has a Skip Postprocessing parameter, which is set to true by default. Setting it to false aggregates the multiple output files into a single file.

To provide high throughput, an export request is distributed across multiple tasks (native) or nodes (Spark), which results in multiple output files. The purpose of the post-processing option is to aggregate multiple output files into a single file, which occurs when you set the skipPostprocessing parameter to false.

The export service performs an export in two stages. First, it exports temporary partial files to the Reltio bucket with an auto-generated path. Next, it combines these files into a single file, or into several files if partSize is specified, and stores the result in the specified Reltio or customer storage.
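
The two stages described above can be pictured with a small, purely conceptual Python sketch. It is not the export service's implementation: the function name combine_partials is hypothetical, partial files are modeled as in-memory lists of records, and part_size is assumed to mean "maximum records per output file", which this article does not state.

```python
from typing import List, Optional


def combine_partials(
    partials: List[List[str]], part_size: Optional[int] = None
) -> List[List[str]]:
    """Conceptual model of the second (post-processing) stage.

    partials  -- one list of records per partial file (hypothetical model)
    part_size -- ASSUMPTION: maximum records per output file; the real
                 unit of partSize is not described in this article
    """
    # Concatenate every partial output, preserving order.
    merged = [record for partial in partials for record in partial]

    # Without part_size, the result is a single output file.
    if part_size is None:
        return [merged]

    if part_size <= 0:
        raise ValueError("part_size must be positive")

    # With part_size, split the merged output into several files.
    return [merged[i:i + part_size] for i in range(0, len(merged), part_size)]
```

Skipping this stage corresponds to leaving the partial outputs as they are, which is why more files are produced in that case.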

You can skip the second stage by passing the parameter skipPostprocessing=true to any Spark or native export endpoint. In this case, the partial files are stored, depending on the parameters, in one of the following locations:

  • Reltio bucket with an auto-generated path
  • Reltio bucket with custom path
  • customer bucket and path

The differences between setting skipPostprocessing to either false or true are summarized below.

  • skipPostprocessing=false
    • the default for v1 export
    • single output file (multiple if partSize is specified)
    • significantly lower performance
  • skipPostprocessing=true
    • the default for Smart Export (v2)
    • multiple output files (the partSize parameter is ignored)
    • higher performance


Example request that aggregates the output into a single file:

POST <Export Service URL>/v2/export/<tenant>/entities?skipPostprocessing=false
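
As a rough illustration of the request line above, the following Python sketch assembles the export URL with skipPostprocessing passed as a query parameter. The helper name build_export_url and the example host are hypothetical placeholders. Authentication and the request body are not covered in this article, so the sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode


def build_export_url(
    base_url: str, tenant: str, skip_postprocessing: bool = False
) -> str:
    """Build the v2 entities export URL (hypothetical helper).

    base_url -- the Export Service URL for your environment
    tenant   -- the tenant ID
    skip_postprocessing -- False requests aggregated output
    """
    # Render the boolean as the lowercase string used in the article.
    query = urlencode({"skipPostprocessing": str(skip_postprocessing).lower()})
    return f"{base_url.rstrip('/')}/v2/export/{tenant}/entities?{query}"


if __name__ == "__main__":
    # Placeholder host and tenant, for illustration only.
    print(build_export_url("https://export.example.invalid", "myTenant"))
```

The resulting string can be used as the target of a POST request in whichever HTTP client you already use.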



