OutputFormat to write to a Parquet file
It requires a WriteSupport to convert the actual records to the underlying format.
It requires the schema of the incoming records (provided by the WriteSupport).
It allows storing extra metadata in the footer (for example, for schema compatibility purposes when converting from a different schema language).
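As an illustration, the following is a minimal sketch of a write support, assuming the org.apache.parquet.hadoop.api.WriteSupport API (older releases use the parquet. package prefix); the StringPairWriteSupport class, its pair schema and the example.meta footer entry are hypothetical. It declares the schema of the incoming records, stores extra metadata in the footer, and turns each record into the events accepted by the record consumer.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical write support for records represented as String[2] pairs
public class StringPairWriteSupport extends WriteSupport<String[]> {

  // The schema of the incoming records, declared by the write support
  private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
      "message pair { required binary left (UTF8); required binary right (UTF8); }");

  private RecordConsumer recordConsumer;

  @Override
  public WriteContext init(Configuration configuration) {
    // Extra metadata stored in the footer (illustrative key/value)
    Map<String, String> extraMetaData = new HashMap<>();
    extraMetaData.put("example.meta", "value");
    return new WriteContext(SCHEMA, extraMetaData);
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    this.recordConsumer = recordConsumer;
  }

  @Override
  public void write(String[] record) {
    // Convert one record into record consumer events
    recordConsumer.startMessage();
    recordConsumer.startField("left", 0);
    recordConsumer.addBinary(Binary.fromString(record[0]));
    recordConsumer.endField("left", 0);
    recordConsumer.startField("right", 1);
    recordConsumer.addBinary(Binary.fromString(record[1]));
    recordConsumer.endField("right", 1);
    recordConsumer.endMessage();
  }
}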
The format is configured via the following settings in the job configuration (a programmatic example follows the list):
# The block size is the size of a row group being buffered in memory;
# this limits the memory usage when writing.
# Larger values will improve the I/O when reading but consume more memory when writing
parquet.block.size=134217728 # in bytes, default = 128 * 1024 * 1024
# The page size is for compression. When reading, each page can be decompressed independently.
# A block is composed of pages. The page is the smallest unit that must be read fully to access a single record.
# If this value is too small, the compression will deteriorate
parquet.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
# There is one dictionary page per column per row group when dictionary encoding is used.
# The dictionary page size works like the page size, but for the dictionary pages
parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
# The compression algorithm used to compress pages
parquet.compression=UNCOMPRESSED # one of: UNCOMPRESSED, SNAPPY, GZIP, LZO. Default: UNCOMPRESSED. Supersedes mapred.output.compress
# The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer
# Usually provided by a specific ParquetOutputFormat subclass
parquet.write.support.class= # fully qualified name
# To enable/disable dictionary encoding
parquet.enable.dictionary=true # false to disable dictionary encoding
# To enable/disable summary metadata aggregation at the end of a MR job
# The default is true (enabled)
parquet.enable.summary-metadata=true # false to disable summary aggregation
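The same settings can also be applied programmatically. The sketch below assumes the org.apache.parquet packages and the static setters on ParquetOutputFormat, reuses the hypothetical StringPairWriteSupport from the earlier sketch, and simply restates the default sizes listed above (compression aside); mapper/reducer setup is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetJobSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "parquet-write-example");
    job.setOutputFormatClass(ParquetOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    // The write support class converting records into record consumer events
    ParquetOutputFormat.setWriteSupportClass(job, StringPairWriteSupport.class);

    // These setters populate the parquet.* properties listed above
    ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);             // parquet.block.size
    ParquetOutputFormat.setPageSize(job, 1024 * 1024);                    // parquet.page.size
    ParquetOutputFormat.setDictionaryPageSize(job, 1024 * 1024);          // parquet.dictionary.page.size
    ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY); // parquet.compression
    ParquetOutputFormat.setEnableDictionary(job, true);                   // parquet.enable.dictionary
    job.getConfiguration().setBoolean("parquet.enable.summary-metadata", true);

    // Mapper/reducer configuration omitted in this sketch
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}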
If parquet.compression is not set, the following properties are checked (FileOutputFormat behavior).
Note that custom codecs are explicitly disallowed.
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SomeCodec # the codec must be one of Snappy, GZip or LZO
If none of these is set, the data is uncompressed.
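A short sketch of that fallback, setting the FileOutputFormat-style properties directly on the job configuration; the helper class and method name are hypothetical.

import org.apache.hadoop.mapreduce.Job;

public class FallbackCompressionSetup {
  // Only consulted when parquet.compression is not set in the configuration
  static void enableGzipFallback(Job job) {
    job.getConfiguration().setBoolean("mapred.output.compress", true);
    // The codec must be one of Snappy, GZip or LZO
    job.getConfiguration().set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.GzipCodec");
  }
}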