Querying large datasets with Hive is trivial. However, there are times when Hive’s built in functions are insufficient. For example, I recently needed to extract the ordering of elements in an array.
The common advice for such problems is to write a custom Java UDF (User Defined Function). However, since I do not know Java, that would take a lot of time. Luckily, it’s easy to create a script in another language and feed it into a hive query using the function TRANSFORM.
For example, the query below runs a python script on an array of ids seen in search and several other columns.
The python script below takes in a set of lines in which table columns are delimited by tabs. It splits the lines and transforms the first column to include the search rank as a subfield. Note, the script must also handle the the other fields and return them in the proper order. Getting the formatting correct is important because Hive is fickle and not very verbose when returning errors.