Azure Data Factory pipeline writing a compressed Parquet file: “java.lang.OutOfMemoryError:Java heap space”
-
I have run into the following problem. I have a pipeline that reads data from an MS SQL Server and stores it in a file in a Blob container in Azure Storage. The file is in Parquet (Apache Parquet) format.

When the “sink” (output) file is written compressed (snappy or gzip, it does not matter) AND the file is large enough (more than 50 MB), the pipeline fails with the following message:

    "errorCode": "2200",
    "message": "Failure happened on 'Sink' side. ErrorCode=UserErrorJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space
    total entry:11
    java.util.ArrayDeque.doubleCapacity(Unknown Source)
    java.util.ArrayDeque.addFirst(Unknown Source)
    java.util.ArrayDeque.push(Unknown Source)
    org.apache.parquet.io.ValidatingRecordConsumer.endField(ValidatingRecordConsumer.java:108)
    org.apache.parquet.example.data.GroupWriter.writeGroup(GroupWriter.java:58)
    org.apache.parquet.example.data.GroupWriter.write(GroupWriter.java:37)
    org.apache.parquet.hadoop.example.GroupWriteSupport.write(GroupWriteSupport.java:87)
    org.apache.parquet.hadoop.example.GroupWriteSupport.write(GroupWriteSupport.java:37)
    org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
    org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:292)
    com.microsoft.datatransfer.bridge.parquet.ParquetBatchWriter.addRows(ParquetBatchWriter.java:60)
    ,Source=Microsoft.DataTransfer.Common,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
    "failureType": "UserError",
    "target": "Work_Work"
    }

"Work_Work" is the name of a Copy Data activity in the pipeline. If I turn compression off (the generated blob file is uncompressed), the error does not happen.

Is this the error described in https://docs.microsoft.com/en-us/azure/data-factory/format-parquet: “…If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable _JAVA_OPTIONS in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline….”?

If it is, have I understood correctly that I have to do the following: To go to a s
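If that is the fix, I assume it means setting something like _JAVA_OPTIONS = -Xms256m -Xmx8g on the machine that hosts the Self-hosted IR and then restarting the Integration Runtime service (the heap sizes here are just my guess for a box with enough RAM, not values from the docs). As a sanity check that a JVM on that machine actually picks the variable up, I suppose a tiny Java program like the sketch below would do; the class name and the printed text are my own, not anything from the documentation:

    // HeapCheck.java - minimal sketch to confirm what maximum heap a newly
    // started JVM gets once _JAVA_OPTIONS has been set on the IR machine.
    // HotSpot JVMs normally also print "Picked up _JAVA_OPTIONS: ..." on stderr.
    public class HeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("Max heap available to this JVM: %d MB%n",
                    maxBytes / (1024 * 1024));
        }
    }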