How to minimize the number of files created by Spark export v2, either to make the output easier to work with or to match the default behavior of export v1.
Spark-based export has a Skip Postprocessing parameter (skipPostprocessing), which is set to true by default. Setting it to false aggregates the multiple output files into a single file.
To provide high throughput, an export request is distributed across multiple tasks (native) or nodes (Spark), which results in multiple output files. The post-processing step aggregates these output files into a single file, and it runs when you set the skipPostprocessing parameter to false.
The export service performs export in two stages. First, it exports temporary partial files to the Reltio bucket with an auto-generated path. Next, it combines these files into one file, or several files if partSize is specified, and stores the result in the specified Reltio or customer storage.
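The second stage can be pictured with a minimal sketch. This is not the export service's implementation: the function name `combine_partial_files` is hypothetical, partial files are modeled as in-memory lists of records, and `part_size` is assumed here to mean a maximum number of records per output file (the text above does not state the unit of partSize).

```python
from typing import List, Optional, Sequence


def combine_partial_files(
    partial_files: Sequence[Sequence[str]],
    part_size: Optional[int] = None,
) -> List[List[str]]:
    """Illustrative stand-in for the post-processing (second) stage."""
    # Flatten the records of all partial files, preserving their order.
    records = [record for part in partial_files for record in part]
    if part_size is None:
        # No part size given: everything goes into one output file.
        return [records]
    if part_size <= 0:
        raise ValueError("part_size must be a positive integer")
    # Assumption: part_size caps the number of records per output file.
    return [records[i:i + part_size] for i in range(0, len(records), part_size)]


# Three partial files, as produced by three tasks or nodes.
partials = [["e1", "e2"], ["e3"], ["e4", "e5"]]
single = combine_partial_files(partials)               # one combined file
split = combine_partial_files(partials, part_size=2)   # several files
```

With the sample data, `single` holds one list with all five records, while `split` holds three lists of at most two records each.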
You can skip the second stage by passing skipPostprocessing=true in any endpoint of Spark or native export. In this case, the partial files are stored, depending on the request parameters, in one of the following locations:
- Reltio bucket with an auto-generated path
- Reltio bucket with custom path
- customer bucket and path
The differences between setting skipPostprocessing to false or true are summarized below.
skipPostprocessing=false:
- the default for v1 export
- single output file (multiple if partSize is specified)
- significantly lower performance
skipPostprocessing=true:
- the default for Smart Export (v2)
- multiple output files (the partSize parameter is ignored)
- higher performance
POST <Export Service URL>/v2/export/<tenant>/entities&skipPostprocessing
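As a rough sketch, the request URL for a single-file export could be assembled as follows. The helper `build_export_url`, the host `https://export.example.com`, and the tenant `myTenant` are placeholders, and passing skipPostprocessing=false as a query-string parameter is an assumption based on the endpoint line above. The code only builds the URL; it does not send the request or handle authentication.

```python
from urllib.parse import quote, urlencode


def build_export_url(export_service_url: str, tenant: str, skip_postprocessing: bool) -> str:
    """Build the v2 entities export URL with an explicit skipPostprocessing value."""
    # Assumption: the flag is passed as a lowercase boolean in the query string.
    query = urlencode({"skipPostprocessing": str(skip_postprocessing).lower()})
    base = export_service_url.rstrip("/")
    return f"{base}/v2/export/{quote(tenant, safe='')}/entities?{query}"


# Placeholder host and tenant; false requests aggregation into a single file.
url = build_export_url("https://export.example.com", "myTenant", skip_postprocessing=False)
print(url)
```

Setting the value explicitly, rather than relying on the default, makes the intended output layout visible in the request itself.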