Question
How can I minimize the number of files created by Spark export v2, either to make the output easier to work with or to match the default behavior of export v1?
Answer
Spark-based export has a Skip Postprocessing parameter, which is set to true by default. Setting it to false aggregates the multiple output files into a single file.
To provide high throughput, an export request is distributed across multiple tasks (native) or nodes (Spark), which results in multiple output files. The purpose of the post-processing option is to aggregate multiple output files into a single file, which occurs when you set the skipPostprocessing parameter to false.
The export service performs export in two stages. First, it exports temporary partial files to the Reltio bucket with an auto-generated path. Next, it combines these files into one or several (if partSize is specified) files and stores the result in the specified Reltio or customer storage.
You can skip the second stage by passing the parameter skipPostprocessing=true in all endpoints of Spark and native export. In this case, the partial files are stored, depending on the parameters, in one of the following locations:
- Reltio bucket with an auto-generated path
- Reltio bucket with custom path
- customer bucket and path
The differences between setting skipPostprocessing to false or true are summarized below.
skipPostprocessing=false
- the default for v1 export
- single output file (multiple if partSize is specified)
- significantly lower performance
skipPostprocessing=true
- the default for Smart Export (v2)
- multiple output files (the partSize parameter is ignored)
- higher performance
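If you keep skipPostprocessing=true for the higher performance, the partial files can also be combined on the client side after download. The following is a minimal sketch only, assuming the parts have already been downloaded to a local directory and contain newline-delimited records. The function name merge_parts, the "*.part" file pattern, and the file layout are hypothetical and are not something the export service defines.

```python
from pathlib import Path


def merge_parts(parts_dir: str, output_path: str, pattern: str = "*.part") -> int:
    """Concatenate partial export files into one file.

    Returns the number of partial files merged. The "*.part" pattern is an
    assumption for illustration; use whatever names your downloaded files have.
    The output file should live outside parts_dir or not match the pattern.
    """
    parts = sorted(Path(parts_dir).glob(pattern))  # sorted for a stable order
    with open(output_path, "wb") as out:
        for part in parts:
            # Reads each part fully into memory; stream in chunks for very large parts.
            data = part.read_bytes()
            out.write(data)
            # Keep records from adjacent parts on separate lines.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return len(parts)
```

For example, merge_parts("downloads/export-job", "entities.json") would write all parts found in that directory into entities.json. Files with a header row (such as CSV) would need extra handling that this sketch does not include.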
Request:
POST <Export Service URL>/v2/export/<tenant>/entities?skipPostprocessing=false
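As an illustration, the request URL with the parameter set explicitly can be assembled as in the sketch below. The host https://export.example.com and the tenant name myTenant are placeholders, and build_export_url is a hypothetical helper. Authentication headers and the request body are not shown here, so check the Reltio documentation for those.

```python
from urllib.parse import quote, urlencode


def build_export_url(base_url: str, tenant: str, skip_postprocessing: bool = False) -> str:
    """Return the entities export URL with skipPostprocessing set explicitly."""
    # The service expects lowercase "true"/"false" in the query string.
    query = urlencode({"skipPostprocessing": str(skip_postprocessing).lower()})
    return f"{base_url.rstrip('/')}/v2/export/{quote(tenant, safe='')}/entities?{query}"


# Placeholder host and tenant, for illustration only.
url = build_export_url("https://export.example.com", "myTenant")
print(url)
```

Passing skip_postprocessing=False requests the aggregated single-file output. Passing True keeps the faster multi-file behavior.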
References
https://documentation.reltio.com/exportapi/skippostprocessing.html?hl=skippostprocessing