RegexSerDe uses a regular expression (regex) to serialize/deserialize.
It deserializes data by matching each row against the regex and extracting
the capture groups as columns. It can also serialize a row object using a
format string.
In the deserialization stage, if a row does not match the regex, then all
columns in the row will be NULL. If a row matches the regex but has fewer
groups than expected, the missing groups will be NULL. If a row matches the
regex but has more groups than expected, the additional groups are ignored.
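The three deserialization cases above can be sketched with plain
java.util.regex. This is a minimal illustration, not the RegexSerDe source:
the class and the deserialize helper are hypothetical names, a Java null
stands in for a Hive NULL, and the sketch assumes the whole row has to match
the regex.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDeserializeSketch {
    // Hypothetical helper (not part of Hive): maps capture groups to columns.
    static String[] deserialize(Pattern pattern, String row, int numColumns) {
        // A new String[] is filled with null, which stands in for NULL here.
        String[] columns = new String[numColumns];
        Matcher m = pattern.matcher(row);
        // Assumption: the entire row must match, so matches() is used, not find().
        if (!m.matches()) {
            return columns; // no match: every column stays NULL
        }
        // Fewer groups than columns: the tail stays NULL.
        // More groups than columns: the extra groups are ignored.
        int usable = Math.min(m.groupCount(), numColumns);
        for (int i = 0; i < usable; i++) {
            columns[i] = m.group(i + 1); // group 0 is the whole match
        }
        return columns;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\S+) (\\S+)");
        System.out.println(Arrays.toString(deserialize(p, "GET /index.html", 3)));
        System.out.println(Arrays.toString(deserialize(p, "no-match", 2)));
        System.out.println(Arrays.toString(deserialize(p, "GET /index.html", 1)));
    }
}
```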
In the serialization stage, it uses the Java string formatter to format the
columns into a row. If the output type of a column in a query is not a
string, it will be automatically converted to String by Hive.
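As a rough illustration of the serialization step, the sketch below formats
column values into a row with String.format. The class and the serialize
helper are hypothetical names, the format string is only an example, and the
toString() call merely emulates the conversion to String that Hive performs
before the SerDe sees the values.

```java
class RegexSerializeSketch {
    // Hypothetical helper (not part of Hive): formats column values into one row.
    static String serialize(String format, Object... columns) {
        // Emulate Hive's automatic conversion of non-string columns to String.
        String[] asStrings = new String[columns.length];
        for (int i = 0; i < columns.length; i++) {
            asStrings[i] = (columns[i] == null) ? null : columns[i].toString();
        }
        // java.util.Formatter syntax: %1$s is the first column, %2$s the second.
        return String.format(format, (Object[]) asStrings);
    }

    public static void main(String[] args) {
        // The integer column is turned into a string before formatting.
        System.out.println(serialize("%1$s %2$s", "GET", 200));
    }
}
```

Positional specifiers such as %1$s let the output row reorder or repeat
columns independently of their order in the table.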
For the format of the format string, please refer to
http
NOTE: All columns have to be strings. Users can use
"CAST(a AS INT)" to convert columns to other types.
NOTE: This implementation uses String and javaStringObjectInspector. A
more efficient implementation would use UTF-8 encoded Text and
writableStringObjectInspector. We should switch to that when a UTF-8
based regex library is available.