A load function that parses a line of input into fields using a character delimiter.
The default delimiter is a tab. You can specify any character as a literal ("a"),
a known escape character ("\t"), or a dec or hex value ("\u001", "\x0A").
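For example, each delimiter form can be passed directly to the constructor. This is a sketch; 'data.csv' and the aliases are placeholders, and the exact escaping of the delimiter string may vary with how your script is quoted:

```pig
-- literal character delimiter: comma-separated data
A = LOAD 'data.csv' USING PigStorage(',');

-- known escape character: tab-separated data (same as the default)
B = LOAD 'data.tsv' USING PigStorage('\t');

-- hex value: a control character such as linefeed (0x0A)
C = LOAD 'data.ctrl' USING PigStorage('\x0A');
```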
An optional second constructor argument is provided that allows one to customize
advanced behaviors. A list of available options is below:
-schema    Reads/stores the schema of the relation using a hidden JSON file.
-noschema  Ignores a stored schema during loading.
-tagFile   Appends the input source file name to the beginning of each tuple.
-tagPath   Appends the input source file path to the beginning of each tuple.
If '-schema' is specified, a hidden ".pig_schema" file is created in the output directory
when storing data. It is used by PigStorage (with or without '-schema') during loading to determine the
field names and types of the data without the need for a user to explicitly provide the schema in an
AS clause, unless '-noschema' is specified. No attempt to merge conflicting
schemas is made during loading; the first schema encountered during a file system scan is used.
If the '-schema' option is used during loading but the schema file is not present,
loading fails with an error.
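A store-then-load round trip with a stored schema might look like this. This is a sketch; the 'out' path and the aliases are placeholders:

```pig
-- store with '-schema': writes the data plus a hidden .pig_schema file
STORE A INTO 'out' USING PigStorage('\t', '-schema');

-- later, load without an AS clause; field names and types
-- are taken from the stored .pig_schema file
B = LOAD 'out' USING PigStorage('\t');

-- load the same data while ignoring the stored schema
C = LOAD 'out' USING PigStorage('\t', '-noschema');
```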
In addition, using '-schema' writes a ".pig_headers" file in the output directory.
This file simply lists the delimited field aliases. It is intended to ease export to tools that can read
files with header lines (just cat the header onto your data).
If '-tagFile' is specified, PigStorage will prepend the input split's file name to each Tuple/row.
Usage: A = LOAD 'input' USING PigStorage(',', '-tagFile'); B = FOREACH A GENERATE $0;
The first field (index 0) in each Tuple will contain the input file name.
If '-tagPath' is specified, PigStorage will prepend the input split's path to each Tuple/row.
Usage: A = LOAD 'input' USING PigStorage(',', '-tagPath'); B = FOREACH A GENERATE $0;
The first field (index 0) in each Tuple will contain the input file path.
Note that regardless of whether or not you store the schema, you always need to specify
the correct delimiter to read your data. If you store data using the delimiter "#" and then load it using
the default delimiter, your data will not be parsed correctly.
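For instance, data stored with "#" must also be loaded with "#". This is a sketch; 'out' and the aliases are placeholders:

```pig
-- store using '#' as the field delimiter
STORE A INTO 'out' USING PigStorage('#');

-- wrong: default tab delimiter; each line comes back as one field
B = LOAD 'out' USING PigStorage();

-- right: same delimiter that was used for storing
C = LOAD 'out' USING PigStorage('#');
```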
Storing to a directory whose name ends in ".bz2" or ".gz" or ".lzo" (if you have installed support
for LZO compression in Hadoop) will automatically use the corresponding compression codec.
Loading from directories ending in .bz2 or .bz works automatically; other compression formats are not
auto-detected on loading.
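For example, compression can be selected purely through the output path. This is a sketch; the paths are placeholders, and the ".gz" and ".lzo" variants behave analogously when the corresponding codec is available:

```pig
-- directory name ending in .bz2 selects the bzip2 codec for the part files
STORE A INTO 'out.bz2' USING PigStorage();

-- bzip2 input is detected and decompressed automatically on load
B = LOAD 'out.bz2' USING PigStorage();
```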