OutputFormat to write to a Parquet file
It requires a WriteSupport to convert the actual records to the underlying format.
It requires the schema of the incoming records (provided by the WriteSupport).
It allows storing extra metadata in the footer (for example, for schema compatibility purposes when converting from a different schema language).
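As an illustration, the following is a minimal sketch of a write support, assuming the org.apache.parquet.hadoop.api.WriteSupport API (older releases use the parquet. package prefix); the StringPairWriteSupport class, its pair schema and the example.meta footer entry are hypothetical. It declares the schema of the incoming records, stores extra metadata in the footer, and turns each record into the events accepted by the record consumer.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical write support for records represented as String[2] pairs
public class StringPairWriteSupport extends WriteSupport<String[]> {

  // The schema of the incoming records, declared by the write support
  private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
      "message pair { required binary left (UTF8); required binary right (UTF8); }");

  private RecordConsumer recordConsumer;

  @Override
  public WriteContext init(Configuration configuration) {
    // Extra metadata stored in the footer (illustrative key/value)
    Map<String, String> extraMetaData = new HashMap<>();
    extraMetaData.put("example.meta", "value");
    return new WriteContext(SCHEMA, extraMetaData);
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    this.recordConsumer = recordConsumer;
  }

  @Override
  public void write(String[] record) {
    // Convert one record into record consumer events
    recordConsumer.startMessage();
    recordConsumer.startField("left", 0);
    recordConsumer.addBinary(Binary.fromString(record[0]));
    recordConsumer.endField("left", 0);
    recordConsumer.startField("right", 1);
    recordConsumer.addBinary(Binary.fromString(record[1]));
    recordConsumer.endField("right", 1);
    recordConsumer.endMessage();
  }
}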
The format is configured via the following settings in the job configuration (a programmatic example follows the list):
# The block size is the size of a row group being buffered in memory;
# this limits the memory usage when writing.
# Larger values will improve the I/O when reading but consume more memory when writing
parquet.block.size=134217728 # in bytes, default = 128 * 1024 * 1024
# The page size is for compression. When reading, each page can be decompressed independently.
# A block is composed of pages. The page is the smallest unit that must be read fully to access a single record.
# If this value is too small, the compression will deteriorate
parquet.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
# There is one dictionary page per column per row group when dictionary encoding is used.
# The dictionary page size works like the page size, but for the dictionary pages
parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024
# The compression algorithm used to compress pages
parquet.compression=UNCOMPRESSED # one of: UNCOMPRESSED, SNAPPY, GZIP, LZO. Default: UNCOMPRESSED. Supersedes mapred.output.compress
# The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer
# Usually provided by a specific ParquetOutputFormat subclass
parquet.write.support.class= # fully qualified name
# To enable/disable dictionary encoding
parquet.enable.dictionary=true # false to disable dictionary encoding
# To enable/disable summary metadata aggregation at the end of a MR job
# The default is true (enabled)
parquet.enable.summary-metadata=true # false to disable summary aggregation
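The same settings can also be applied programmatically. The sketch below assumes the org.apache.parquet packages and the static setters on ParquetOutputFormat, reuses the hypothetical StringPairWriteSupport from the earlier sketch, and simply restates the default sizes listed above (compression aside); mapper/reducer setup is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetJobSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "parquet-write-example");
    job.setOutputFormatClass(ParquetOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    // The write support class converting records into record consumer events
    ParquetOutputFormat.setWriteSupportClass(job, StringPairWriteSupport.class);

    // These setters populate the parquet.* properties listed above
    ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);             // parquet.block.size
    ParquetOutputFormat.setPageSize(job, 1024 * 1024);                    // parquet.page.size
    ParquetOutputFormat.setDictionaryPageSize(job, 1024 * 1024);          // parquet.dictionary.page.size
    ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY); // parquet.compression
    ParquetOutputFormat.setEnableDictionary(job, true);                   // parquet.enable.dictionary
    job.getConfiguration().setBoolean("parquet.enable.summary-metadata", true);

    // Mapper/reducer configuration omitted in this sketch
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}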
If parquet.compression is not set, the following properties are checked (FileOutputFormat behavior).
Note that custom codecs are explicitly disallowed.
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SomeCodec # the codec must be one of Snappy, GZip or LZO
If none of these is set, the data is uncompressed.
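A short sketch of that fallback, setting the FileOutputFormat-style properties directly on the job configuration; the helper class and method name are hypothetical.

import org.apache.hadoop.mapreduce.Job;

public class FallbackCompressionSetup {
  // Only consulted when parquet.compression is not set in the configuration
  static void enableGzipFallback(Job job) {
    job.getConfiguration().setBoolean("mapred.output.compress", true);
    // The codec must be one of Snappy, GZip or LZO
    job.getConfiguration().set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.GzipCodec");
  }
}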