A load function that parses a line of input into fields using a character delimiter.
The default delimiter is a tab. You can specify any character as a literal ("a"),
a known escape character ("\\t"), or a decimal or hex value ("\\u001", "\\x0A").
An optional second constructor argument allows you to customize advanced behaviors.
The available options are:
-schema      Reads/stores the schema of the relation using a hidden JSON file.
-noschema    Ignores a stored schema during loading.
-tagFile     Prepends the input source file name to each tuple.
-tagPath     Prepends the input source file path to each tuple.
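For example, the constructor forms described above might be used as follows (the file names here are illustrative):

```pig
-- Comma as a literal delimiter.
A = LOAD 'data.csv' USING PigStorage(',');

-- Tab via its escape form (this is also the default,
-- so PigStorage() with no arguments behaves the same way).
B = LOAD 'data.tsv' USING PigStorage('\t');
```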
Schemas
If -schema is specified, a hidden ".pig_schema" file is created in the output directory
when storing data. PigStorage (with or without -schema) uses this file during loading to
determine the field names and types of the data, so the user does not need to provide the
schema explicitly in an as clause, unless -noschema is specified. No attempt is made to
merge conflicting schemas during loading; the first schema encountered during a file
system scan is used. If the '-schema' option is used during loading but the schema file
is not present, loading fails with an error.
In addition, using -schema drops a ".pig_headers" file in the output directory.
This file simply lists the delimited aliases, which makes it easier to export to tools
that can read files with header lines (just cat the header onto your data).
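As a sketch, a store-then-load round trip with -schema might look like this (the paths and field names are hypothetical):

```pig
-- Store with a schema file; this creates .pig_schema (and .pig_headers)
-- in '/out' alongside the data.
A = LOAD 'input' AS (name:chararray, age:int);
STORE A INTO '/out' USING PigStorage('\t', '-schema');

-- Later, load without an as clause; the field names and types
-- come from the hidden .pig_schema file.
B = LOAD '/out' USING PigStorage('\t', '-schema');
C = FOREACH B GENERATE name, age;
```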
Source tagging
If -tagFile is specified, PigStorage will prepend the input split name to each tuple/row.
Usage: A = LOAD 'input' using PigStorage(',','-tagFile'); B = foreach A generate $0;
The first field (index 0) in each tuple will contain the input file name.
If -tagPath is specified, PigStorage will prepend the input split path to each tuple/row.
Usage: A = LOAD 'input' using PigStorage(',','-tagPath'); B = foreach A generate $0;
The first field (index 0) in each tuple will contain the input file path.
Note that regardless of whether you store the schema, you must always specify the correct
delimiter to read your data. If you store using the delimiter "#" and then load using
the default delimiter, your data will not be parsed correctly.
Compression
Storing to a directory whose name ends in ".bz2", ".gz", or ".lzo" (if you have installed
support for LZO compression in Hadoop) will automatically use the corresponding compression
codec. The output.compression.enabled and output.compression.codec job properties also work.
Loading from directories ending in .bz2 or .bz works automatically; other compression formats are not
auto-detected on loading.
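As a sketch, compressed output can be requested either through the directory suffix or through the job properties (the paths are illustrative, and the exact quoting accepted by SET may vary by Pig version):

```pig
A = LOAD 'input' USING PigStorage(',');

-- The directory suffix selects the codec automatically.
STORE A INTO 'output.bz2' USING PigStorage(',');

-- Alternatively, set the compression job properties explicitly.
SET output.compression.enabled true;
SET output.compression.codec 'org.apache.hadoop.io.compress.GzipCodec';
STORE A INTO 'output' USING PigStorage(',');
```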