Chapter 3. Grunt

Grunt[4] is Pig’s interactive shell. It enables users to enter Pig Latin interactively and provides a shell for users to interact with HDFS.

To enter Grunt, invoke Pig with no script or command to run. Typing:

pig -x local

will result in the prompt:

grunt>

This gives you a Grunt shell to interact with your local filesystem. If you omit the -x local and have a cluster configuration set in PIG_CLASSPATH, this will put you in a Grunt shell that will interact with HDFS on your cluster.

As you would expect with a shell, Grunt provides command-line history and editing, as well as Tab completion for commands. It does not complete filenames via the Tab key. That is, if you type kil and then press the Tab key, it will complete the command as kill. But if you have a file foo in your local directory and you type ls fo and then press Tab, it will not complete it as ls foo. This is because the time HDFS takes to connect and determine whether the file exists is too long for completion to be useful.

Although Grunt is a useful shell, it is not a full-featured one. It lacks a number of features found in standard Unix shells, such as pipes, redirection, and background execution.

To exit Grunt, type quit or press Ctrl-D.

Entering Pig Latin Scripts in Grunt

One of the main uses of Grunt is to enter Pig Latin in an interactive session. This can be particularly useful for quickly sampling your data and for prototyping new Pig Latin scripts.

You can enter Pig Latin directly into Grunt. Pig will not start executing the Pig Latin you enter until it sees either a store or dump. However, it will do basic syntax and semantic checking to help you catch errors quickly. If you do make a mistake while entering a line of Pig Latin in Grunt, you can reenter the line using the same alias, and Pig will take the last instance of the line you enter. For example:

pig -x local
grunt> dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grunt> symbols = foreach dividends generate symbl;
...Error during parsing. Invalid alias: symbl ...
grunt> symbols = foreach dividends generate symbol;
...
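
As a minimal sketch of this deferred execution, the session below continues the example above. It assumes the NYSE_dividends file is present in the directory where Grunt was started; the alias first10 is introduced here purely for illustration. The first three statements are only parsed and checked, and nothing runs until the dump on the last line.

```pig
-- Define the relations; Pig checks syntax and schema but runs nothing yet.
grunt> dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grunt> symbols = foreach dividends generate symbol;
-- Keep the sample small so the result is quick to inspect.
grunt> first10 = limit symbols 10;
-- dump triggers execution and prints the records to the screen.
grunt> dump first10;
```

Using limit before dump is a convenient habit when sampling, because it keeps the amount of output printed to the terminal manageable.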

HDFS Commands in Grunt

Besides entering Pig Latin interactively, Grunt’s other major use is to act as a shell for HDFS. In Pig 0.5 and later, all hadoop fs shell commands are available. They are accessed using the keyword fs. The dash (-) that precedes each command in hadoop fs is also required:

grunt> fs -ls

You can see a complete guide to the available commands at http://hadoop.apache.org/common/docs/r0.20.2/hdfs_shell.html. A number of the commands come directly from Unix shells and will operate in ways that are familiar: chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, and stat. A few of them either look like Unix commands you are used to but behave slightly differently or are unfamiliar, including:

cat filename

Print the contents of a file to stdout. You can apply this command to a directory and it will apply itself in turn to each file in the directory.

copyFromLocal localfile hdfsfile

Copy a file from your local disk to HDFS. This is done serially, not in parallel.

copyToLocal hdfsfile localfile

Copy a file from HDFS to your local disk. This is done serially, not in parallel.

rmr filename

Remove files recursively. This is equivalent to rm -r in Unix. Use this with caution.
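
A short Grunt session that exercises these commands might look like the following sketch. The directory name data and the local path /tmp/NYSE_dividends.copy are hypothetical, chosen only for illustration, and the session assumes a local file named NYSE_dividends exists in the directory where Grunt was started.

```pig
-- Create a directory in HDFS and copy a local file into it.
grunt> fs -mkdir data
grunt> fs -copyFromLocal NYSE_dividends data/NYSE_dividends
-- List the directory, then print the file's contents to stdout.
grunt> fs -ls data
grunt> fs -cat data/NYSE_dividends
-- Copy the file back out of HDFS to the local disk.
grunt> fs -copyToLocal data/NYSE_dividends /tmp/NYSE_dividends.copy
-- Remove the directory and everything under it. Use with caution.
grunt> fs -rmr data
```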

In versions of Pig before 0.5, hadoop fs commands were not available. Instead, Grunt had its own implementation of some of these commands: cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm (which acted like Hadoop’s rmr, not Hadoop’s rm), and rmf. As of Pig 0.8, all of these commands are still available. However, with the exception of cd and pwd, these commands are deprecated in favor of using hadoop fs, and they might be removed at some point in the future.

In version 0.8, a new command was added to Grunt: sh. This command gives you access to the local shell, just as fs gives you access to HDFS. Simple shell commands that do not involve pipes or redirects can be executed. It is better to work with absolute paths, as sh does not always properly track the current working directory.
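
For example, in Pig 0.8 or later, simple local commands can be run as in the sketch below. The path /tmp is only an illustration; an absolute path is used because, as noted above, sh may not track the current working directory reliably.

```pig
-- List a local directory by absolute path.
grunt> sh ls /tmp
-- Run a simple command that takes no arguments.
grunt> sh date
```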

Controlling Pig from Grunt

Grunt also provides commands for controlling Pig and MapReduce:

kill jobid

Kill the MapReduce job associated with jobid. The output of the pig command that spawned the job will list the ID of each job it spawns. You can also find the job’s ID by looking at Hadoop’s JobTracker GUI, which lists all jobs currently running on the cluster. Note that this command kills a particular MapReduce job. If your Pig job contains other MapReduce jobs that do not depend on the killed MapReduce job, those jobs will continue to run. If you want to kill all of the MapReduce jobs associated with a particular Pig job, it is best to terminate the process running Pig, and then use this command to kill any MapReduce jobs that are still running. Make sure to terminate the Pig process with a Ctrl-C or a Unix kill, not a Unix kill -9. The latter does not give Pig the chance to clean up temporary files it is using, which can leave garbage in your cluster.

exec [[-param param_name = param_value]] [[-param_file filename]] script

Execute the Pig Latin script script. Aliases defined in script are not imported into Grunt. This command is useful for testing your Pig Latin scripts while inside a Grunt session. For information on the -param and -param_file options, see “Parameter Substitution”.

run [[-param param_name = param_value]] [[-param_file filename]] script

Execute the Pig Latin script script in the current Grunt shell. Thus all aliases referenced in script are available to Grunt, and the commands in script are accessible via the shell history. This is another option for testing Pig Latin scripts while inside a Grunt session. For information on the -param and -param_file options, see “Parameter Substitution”.
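
The difference between exec and run can be seen in a sketch like the one below. The script name load_dividends.pig and the parameter name input are hypothetical; the script is assumed to contain the single load statement shown in the comment, and NYSE_dividends is assumed to exist in the working directory.

```pig
-- load_dividends.pig (hypothetical script) contains one line:
--   dividends = load '$input' as (exchange, symbol, date, dividend);

-- exec runs the script in isolation, so its aliases do not reach Grunt.
grunt> exec -param input=NYSE_dividends load_dividends.pig
-- This describe fails because dividends is not defined in the session.
grunt> describe dividends;

-- run executes the script as if its lines had been typed into Grunt.
grunt> run -param input=NYSE_dividends load_dividends.pig
-- Now dividends is defined, so describe reports its schema.
grunt> describe dividends;
```

As a rule of thumb, exec suits testing a finished script without disturbing the aliases already defined in the session, while run suits loading a set of definitions to keep working with interactively.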



[4] According to Ben Reed, one of the researchers at Yahoo! who helped start Pig, they named the shell Grunt because they felt the initial implementation was so limited that it was not worthy even of the name “oink.”