Index

Symbols

!= inequality operator, Filter
# dereference operator for maps, Map
$ macro parameter, Macros
$ parameter substitution target, Parameter Substitution
% modulo operator, Expressions in foreach
() tuple parentheses, Dump
* all fields, Expressions in foreach
* multiplication operator, Expressions in foreach
* zero or more characters glob, Load
+ addition operator, Expressions in foreach
- subtraction operator, Expressions in foreach
- unary negative operator, Expressions in foreach
-- single-line comment operator, Comments
.. range of fields, Expressions in foreach
/ division operator, Expressions in foreach
/* */ multiline comment operator, Comments
< inequality operator, Filter
<= inequality operator, Filter
== equality operator, Filter
> inequality operator, Filter
>= inequality operator, Filter
? any character glob, Load
? bincond operator, Expressions in foreach
[] map brackets, Dump
\ escape character, Load
{} bag braces, Dump
{} macro operator, Macros

A

ABS function, Built-in math UDFs
accumulator interface, Accumulator Interface
ACID, NoSQL Databases
ACOS function, Built-in math UDFs
AddForEach optimization, Debugging Tips
algebraic calculations, Group, Algebraic Interface
algebraic interface, Algebraic Interface
aliases, Preliminary Matters, define and UDFs
Amazon Elastic MapReduce (EMR), Pig’s History, Running Pig in the Cloud
Apache HBase, HBase
Apache HCatalog, Metadata in Hadoop
Apache Hive, Pig and Hive
Apache open source, What Is Pig?, Downloading the Pig Package from Apache
arithmetic operators, Expressions in foreach
as clause (load function), Load, Naming fields in foreach
as clause (stream command), stream
ASIN function, Built-in math UDFs
ATAN function, Built-in math UDFs
AVG functions, Built-in aggregate UDFs

B

bad records, handling, Bad Record Handling
bag data type, Bag, Schemas, Interacting with Pig values, Memory Issues in Eval Funcs, Python UDFs
bag DIFF function, Built-in complex type UDFs
bag projection, Expressions in foreach
bag TOBAG function, Built-in complex type UDFs
bag TOP function, Built-in complex type UDFs
BagFactory class, Interacting with Pig values
baseball examples
    base on balls and IBBs, Schemas
    batting average, Expressions in foreach
    data set, Code Examples in This Book, flatten
    players by position and team, Nonlinear Data Flows
    slugging percentage, Registering Python UDFs
behavior prediction models, What Is Pig Useful For?
binary condition operator, Expressions in foreach
bind call, Bind
bindings, multiple, Binding Multiple Sets of Variables, Running Multiple Bindings
boolean IsEmpty functions, Built-in filter functions
Boolean operators, Filter
bottlenecks, Making Pig Fly
built-in aggregate UDFs, Built-in aggregate UDFs
built-in chararray and bytearray UDFs, Built-in chararray and bytearray UDFs
built-in complex type UDFs, Built-in complex type UDFs
built-in filter functions, Built-in filter functions
built-in load and store functions, Built-in Load and Store Functions
built-in math UDFs, Built-in math UDFs
bytearray CONCAT functions, Built-in chararray and bytearray UDFs
bytearray type, Scalar Types, Schemas, Choose the Right Data Type, Python UDFs, Casting bytearrays

C

cache clause (define statement), stream
caching option (HBase), HBase
Cascading, Cascading
case sensitivity
    Pig Latin, Case Sensitivity
    UDF names, User Defined Functions, Writing an Evaluation Function in Java
Cassandra, Apache, Cassandra
Cassandra: The Definitive Guide (Hewitt), Cassandra
caster option (HBase), HBase
casts, Casts, Getting the casting functions, Casting bytearrays
cat command, HDFS Commands in Grunt, Order by
CBRT function, Built-in math UDFs
CEIL function, Built-in math UDFs
chararray functions
    CONCAT, Built-in chararray and bytearray UDFs
    LCFIRST, Built-in chararray and bytearray UDFs
    LOWER, Built-in chararray and bytearray UDFs
    MAX, Built-in aggregate UDFs
    MIN, Built-in aggregate UDFs
    REGEX_EXTRACT, Built-in chararray and bytearray UDFs
    REGEX_EXTRACT_ALL, Built-in chararray and bytearray UDFs
    REPLACE, Built-in chararray and bytearray UDFs
    STRSPLIT, Built-in chararray and bytearray UDFs
    SUBSTRING, Built-in chararray and bytearray UDFs
    TOKENIZE, Built-in chararray and bytearray UDFs
    TRIM, Built-in chararray and bytearray UDFs
    UCFIRST, Built-in chararray and bytearray UDFs
    UPPER, Built-in chararray and bytearray UDFs
chararray type, Scalar Types, Schemas, Filter, Python UDFs
checking syntax, Syntax Highlighting and Checking
cloud computing, Running Pig in the Cloud
Cloudera, downloading Pig from, Downloading Pig from Cloudera
cluster
    running Pig on your, Running Pig on Your Hadoop Cluster
    setting up LZO on your, Using Compression in Intermediate Results
cogroup operator, Parallel, cogroup, Nonlinear Data Flows, Setting the Partitioner, explain, Filter Early and Often
columnMapKeyPrune optimization, Debugging Tips
combiner phase, Group, Algebraic Interface, Combiner Phase
combiner, turning off, Debugging Tips
command tab completion, Grunt
command-line options, Command-Line and Configuration Options
comment operators (Pig Latin), Comments
compile method, Compile
complex data types, Complex Types, Nulls, Evaluation Function Basics, Input and Output Schemas, Built-in Evaluation and Filter Functions, Built-in complex type UDFs
compression, using in intermediate results, Using Compression in Intermediate Results
CONCAT functions, Built-in chararray and bytearray UDFs
constructors, Constructors and Passing Data from Frontend to Backend, UDFContext
controlling execution, Controlling Execution
copyFromLocal command, HDFS Commands in Grunt
copyToLocal command, HDFS Commands in Grunt
COR function, Built-in complex type UDFs
corrupted data, handling, Bad Record Handling
COS function, Built-in math UDFs
COSH function, Built-in math UDFs
COUNT function, Evaluation Function Basics, Algebraic Interface, Accumulator Interface, Built-in aggregate UDFs
COUNT_STAR function, Built-in aggregate UDFs
COV function, Built-in complex type UDFs
cross operator, Parallel, cross, Nonlinear Data Flows, Setting the Partitioner, Filter Early and Often

D

-D passing properties, Command-Line and Configuration Options
DAG (directed acyclic graph), Pig Latin, a Parallel Dataflow Language, Nonlinear Data Flows
data
    layout optimization, Data Layout Optimization
    passing, Constructors and Passing Data from Frontend to Backend
    pipelines, What Is Pig Useful For?, Debugging Tips, Pig and Hive, Metadata in Hadoop
    types, Types, Nulls, Choose the Right Data Type
    writing, Writing Data, Writing records
data sets, example, Code Examples in This Book
dataflow languages, Pig Latin, a Parallel Dataflow Language, Embedding Pig Latin in Python
DataNodes, Loading the distributed cache, Distributed Cache, Hadoop Distributed File System
debugging, Debugging Tips
%declare, Parameter Substitution
declaring
    a filename, Constructors and Passing Data from Frontend to Backend
    a macro, Macros
    a schema, Schemas, Input and Output Schemas
    a type, Nonlinear Data Flows, Choose the Right Data Type
%default, Parameter Substitution
define statement, Registering UDFs, define and UDFs, stream, Macros, Constructors and Passing Data from Frontend to Backend
define utility method, Utility Methods
describe operator, describe
development tools, Development Tools, Debugging Tips
DeWitt, David J., Joining skewed data
DIFF function, Built-in complex type UDFs
directed acyclic graph (DAG), Pig Latin, a Parallel Dataflow Language, Nonlinear Data Flows
distinct operator, Distinct, Parallel, Nested foreach, Setting the Partitioner, Filter Early and Often
distributed cache, Joining small to large data, stream, Loading the distributed cache, Distributed Cache
distributive calculations, Group, Algebraic Interface
double functions
    ABS, Built-in math UDFs
    ACOS, Built-in math UDFs
    ASIN, Built-in math UDFs
    ATAN, Built-in math UDFs
    AVG, Built-in aggregate UDFs
    CBRT, Built-in math UDFs
    CEIL, Built-in math UDFs
    COS, Built-in math UDFs
    COSH, Built-in math UDFs
    EXP, Built-in math UDFs
    FLOOR, Built-in math UDFs
    LOG, Built-in math UDFs
    LOG10, Built-in math UDFs
    MAX, Built-in aggregate UDFs
    MIN, Built-in aggregate UDFs
    RANDOM, Miscellaneous built-in UDF
    SIN, Built-in math UDFs
    SINH, Built-in math UDFs
    SQRT, Built-in math UDFs
    SUM, Built-in aggregate UDFs
    TAN, Built-in math UDFs
    TANH, Built-in math UDFs
double type, Scalar Types, Schemas, Python UDFs
-dryrun command-line option, Macros, Syntax Highlighting and Checking
dump statement, Dump

E

Eclipse syntax highlighting, Syntax Highlighting and Checking
Elastic MapReduce (EMR), Running Pig in the Cloud
Emacs syntax highlighting, Syntax Highlighting and Checking
embedding Pig Latin in Python, Embedding Pig Latin in Python, Utility Methods
EMR (Elastic MapReduce), Amazon, Running Pig in the Cloud
equality operators, Filter
errors
    checking in Grunt, Entering Pig Latin Scripts in Grunt
    debugging with explain, explain
    in evaluation functions, Error Handling and Progress Reporting
    failure cleanup, Failure Cleanup, Handling Failure
    getErrorMessage function, Run
    parse, Reading records
    in Pig Latin scripts, How Pig differs from MapReduce
    runtime exceptions, Input and Output Schemas
    schema, Schemas, union
    sorting by maps, tuples, bags, Order by
escape characters (Unix shell command line), Load
ETL (extract, transform, load) data pipelines, What Is Pig Useful For?
evaluation functions
    basics, UDFs in foreach, Evaluation Function Basics
    built-in, Built-in Evaluation and Filter Functions, Miscellaneous built-in UDF
    error handling and progress reporting, Error Handling and Progress Reporting
    input and output schemas, Input and Output Schemas
    memory issues in, Memory Issues in Eval Funcs
    where your UDF will run, Where Your UDF Will Run
    writing in Java, Writing an Evaluation Function in Java
examples, MapReduce’s hello world, Expressions in foreach
    (see also baseball examples)
    (see also NYSE examples)
    blacklisting URLs, stream, mapreduce
    calculating page rank from web crawl, Code Examples in This Book, stream, mapreduce, Embedding Pig Latin in Python, Utility Methods
    determining metropolitan area, cross
    finding the top five URLs, How Pig differs from MapReduce
    group then join in SQL and Pig Latin, Comparing query and dataflow languages
    HBase table, HBase
    “hello world”, MapReduce’s hello world
    JsonLoader, Writing Load and Store Functions
    JsonStorage, Writing Load and Store Functions
    MetroResolver, Constructors and Passing Data from Frontend to Backend, Loading the distributed cache
    running Pig in local mode, Running Pig Locally on Your Machine
    running Pig on your cluster, Running Pig on Your Hadoop Cluster
    store function, Store Functions, Store Functions and UDFContext, Storing Metadata
    user distribution by city, Joining skewed data, cross
    word count, MapReduce’s hello world
    ZIP code lookup, Joining small to large data
exec command, Controlling Pig from Grunt
-execute (-e) command-line option, Command-Line and Configuration Options
EXP function, Built-in math UDFs
explain operator, explain
explicit splits, Nonlinear Data Flows

G

gateway machine, Running Pig on Your Hadoop Cluster
Gaussian distribution, Group
getAllErrorMessages method, Run
getBytesWritten method, Run
getDuration method, Run
getErrorMessage method, Run
getNumberBytes method, Run
getNumberJobs method, Run
getNumberRecords method, Run
getOutputFormat method, Determining OutputFormat
getOutputLocations, getOutputNames methods, Run
getRecordWritten method, Run
getReturnCode method, Run
getUDFContext method, UDFContext
Global Rearrange operator, explain
globs, Load
GNU General Public License (GPL) for LZO, Using Compression in Intermediate Results
group by clause, Group
group by operator, How Pig differs from MapReduce
group operator, Group, Parallel, Nonlinear Data Flows, Setting the Partitioner, Filter Early and Often, Evaluation Function Basics
“Group then join in SQL and Pig Latin” example, Comparing query and dataflow languages
Grunt, Grunt
    controlling Pig from, Controlling Pig from Grunt
    entering Pig Latin scripts in, Entering Pig Latin Scripts in Grunt
    explain Pig Latin script in, explain
    HDFS commands in, HDFS Commands in Grunt
gt option (HBase), HBase
gte option (HBase), HBase
gzip compression type, Using Compression in Intermediate Results

H

-h properties command-line option, Command-Line and Configuration Options
Hadoop
    fs shell commands, HDFS Commands in Grunt
    HDFS (Hadoop Distributed File System), Pig on Hadoop, HDFS Commands in Grunt, Constructors and Passing Data from Frontend to Backend, Loading the distributed cache, Writing Load and Store Functions, Determining the location, Hadoop Distributed File System
    Java properties used, Command-Line and Configuration Options
    metadata in, Metadata in Hadoop
    overview, Overview of Hadoop, Hadoop Distributed File System
    running Pig on your cluster, Running Pig on Your Hadoop Cluster
    tarball, Using Compression in Intermediate Results
    tuning, Tune Pig and Hadoop for Your Job
hadoop-site.xml file, Running Pig on Your Hadoop Cluster
Hadoop: The Definitive Guide (White), Tune Pig and Hadoop for Your Job, Overview of Hadoop
handling failure, Handling Failure
hashCode function, Shuffle Phase
HashPartitioner, Shuffle Phase
HBase, Apache, HBase
HBaseStorage function, Getting the casting functions, HBase, Built-in Load and Store Functions
HCatalog, Apache, Metadata in Hadoop
HCatLoader, Using partitions, Pushing down projections
heap size, Joining skewed data, Tune Pig and Hadoop for Your Job, Memory Issues in Eval Funcs
hello world example, MapReduce’s hello world
-help (-h) command-line option, Command-Line and Configuration Options
Hewitt, Eben, Cassandra
highlighting syntax, Syntax Highlighting and Checking
Hive, Apache, Pig and Hive

I

illustrate operator, illustrate
implicit splits, Nonlinear Data Flows
import command, Including Other Pig Latin Scripts
including other Pig Latin scripts, Including Other Pig Latin Scripts
INDEXOF function, Built-in chararray and bytearray UDFs
inner joins, Join, Joining sorted data
input clause (define command), stream
input schemas, Input and Output Schemas
input size, Making Pig Fly
InputFormat, determining, Determining InputFormat
int AVG function, Built-in aggregate UDFs
int functions
    INDEXOF, Built-in chararray and bytearray UDFs
    LAST_INDEX_OF, Built-in chararray and bytearray UDFs
    MAX, Built-in aggregate UDFs
    MIN, Built-in aggregate UDFs
int type, Scalar Types, Schemas, Python UDFs
intermediate results size, Making Pig Fly
invoker methods, Calling Static Java Functions
isSuccessful method, Run
iterative processing, What Is Pig Useful For?, Embedding Pig Latin in Python, Binding Multiple Sets of Variables

J

Jackson JSON library, Writing Load and Store Functions
JAR files
    downloading, Downloading Pig Artifacts from Maven
    Jackson, Writing Load and Store Functions
    Jython, Registering Python UDFs
    Piggybank, Registering UDFs, Piggybank
    pigunit, Testing Your Scripts with PigUnit
    registering, Utility Methods, Python UDFs
Java
    and Cascading data flows, Cascading
    casting and HBase, HBase
    compared with Python, Python UDFs
    data types used by Pig, Scalar Types, Nulls, Input and Output Schemas
    embedding interface, Embedding Pig Latin in Python
    evaluation functions in, Writing an Evaluation Function in Java, Memory Issues in Eval Funcs, Built-in Evaluation and Filter Functions
    integration with Pig, Pig Philosophy, Downloading the Pig Package from Apache
    Iterable, Interacting with Pig values
    JUnit, Testing Your Scripts with PigUnit
    and MapReduce, Map Phase
    memory requirements of, Bag, Joining small to large data
    multiple inheritance workaround, Casting bytearrays, Store Functions
    passing arguments to, mapreduce
    properties used by Pig and Hadoop, Command-Line and Configuration Options, set
    reflection, Calling Static Java Functions, Input and Output Schemas
    regular expressions, Filter
    setting JAVA_HOME, Downloading the Pig Package from Apache
    setting the Partitioner, Setting the Partitioner
    static functions, Calling Static Java Functions
    UDFs and, User Defined Functions, define and UDFs, Input and Output Schemas, Loading the distributed cache, Overloading UDFs
JobTracker, Running Pig on Your Hadoop Cluster, MapReduce Job Status, Error Handling and Progress Reporting, MapReduce
join operator, Parallel
joining small to large data, Joining small to large data, Distributed Cache
joining sorted data, Joining sorted data
joins
    default behavior, Join
    and filter pushing, Filter Early and Often
    how to update every five minutes, What Is Pig Useful For?
    inner, Join, Joining sorted data
    input path overwritten, Determining the location
    no multiquery for, Nonlinear Data Flows
    other implementations, Using Different Join Implementations, cross, Set Up Your Joins Properly
    outer, Join, Joining small to large data
    parallel clause and, Parallel
    partition clause and, Setting the Partitioner
    in Pig Latin versus MapReduce, How Pig differs from MapReduce
    in Pig Latin versus SQL, Comparing query and dataflow languages
    and sample records, illustrate
    sort-merge, Joining sorted data
JSON, Schemas
JsonLoader example, Interacting with Pig values, Writing Load and Store Functions, Loading metadata
JsonStorage example, Determining OutputFormat, Storing Metadata
JUnit, Testing Your Scripts with PigUnit
Jython, User Defined Functions, Registering Python UDFs, Python UDFs

L

LAST_INDEX_OF function, Built-in chararray and bytearray UDFs
LCFIRST function, Built-in chararray and bytearray UDFs
Le Dem, Julien, Embedding Pig Latin in Python
licensing, What Is Pig?, Using Compression in Intermediate Results
limit operator, Limit, Parallel, Nested foreach
limit option (HBase), HBase
LimitOptimizer optimization, Debugging Tips
linear data flows, Nonlinear Data Flows
load clause (mapreduce statement), mapreduce
load function (PigStorage), Choose the Right Data Type
load functions (Pig), Load Functions, Pushing down projections
    additional interfaces, Additional Load Function Interfaces, Pushing down projections
    backend data reading, Backend Data Reading, Reading records
    built-in, Built-in Load and Store Functions
    frontend planning functions, Frontend Planning Functions, Passing Information from the Frontend to the Backend
    loading metadata, Loading metadata
    passing info frontend to backend, Passing Information from the Frontend to the Backend
load operator, Load, explain, Filter Early and Often
loadKey option (HBase), HBase
local mode, Running Pig Locally on Your Machine
Local Rearrange operator, explain
LOG function, Built-in math UDFs
LOG10 function, Built-in math UDFs
logical optimizer, Debugging Tips
logical plan, explain, Debugging Tips
LogicalExpressionsSimplifier optimization, Debugging Tips
logs, MapReduce Job Status, Error Handling and Progress Reporting
long AVG function, Built-in aggregate UDFs
long functions
    COUNT, Built-in aggregate UDFs
    COUNT_STAR, Built-in aggregate UDFs
    MAX, Built-in aggregate UDFs
    MIN, Built-in aggregate UDFs
    ROUND, Built-in math UDFs
    SIZE, Built-in chararray and bytearray UDFs, Built-in complex type UDFs
    SUM, Built-in aggregate UDFs
long type, Scalar Types, Schemas, Python UDFs
lookup table, constructing, Constructors and Passing Data from Frontend to Backend
LOWER function, Built-in chararray and bytearray UDFs
lt option (HBase), HBase
lte option (HBase), HBase
LZO compression type, Using Compression in Intermediate Results

M

macros, Macros
map data type, Map, Schemas, Python UDFs
map-only jobs, Reduce Phase
map parallelism, Parallel
map phase, Pig on Hadoop, Map Phase
map projection operator (#), Expressions in foreach
map TOMAP function, Built-in complex type UDFs
MapReduce, Pig on Hadoop, MapReduce
    how Pig differs from, How Pig differs from MapReduce
    integrating with Pig, mapreduce
    job status, MapReduce Job Status
    performance tuning properties, Tune Pig and Hadoop for Your Job
mapreduce operator, mapreduce, Filter Early and Often
“Mary had a Little Lamb” example, MapReduce’s hello world
Maven, downloading Pig from, Downloading Pig Artifacts from Maven
MAX functions, Built-in aggregate UDFs
memory
    buffer size, Tune Pig and Hadoop for Your Job
    requirements for Pig data types, Bag
    size, Making Pig Fly
merge join, Joining sorted data, Set Up Your Joins Properly
MergeFilter optimization, Debugging Tips
MergeForEach optimization, Debugging Tips
metadata
    in Hadoop, Metadata in Hadoop
    loading, Loading metadata
    storing, Storing Metadata
metropolitan name example, Constructors and Passing Data from Frontend to Backend, Loading the distributed cache
MIN functions, Overloading UDFs, Built-in aggregate UDFs
multiple bindings, running, Running Multiple Bindings
multiple joins, Join
multiple keys, grouping on, Group
multiquery, Nonlinear Data Flows, Use Multiquery When Possible
multiway joins, Joining skewed data

N

NameNode, Running Pig on Your Hadoop Cluster, Joining small to large data, Data Layout Optimization, Loading the distributed cache, Distributed Cache, Hadoop Distributed File System
namespaces, Registering Python UDFs
nested foreach, Nested foreach
noise words, Join
nonlinear data flows, Nonlinear Data Flows
NoSQL databases, NoSQL Databases
null, Nulls, Expressions in foreach, Filter, Join, Error Handling and Progress Reporting
NYSE examples
    average dividends, Running Pig Locally on Your Machine
    buy/sell analyzer, UDFContext
    daily sorted dividends, Joining sorted data
    data set, Code Examples in This Book
    dividends increased between two dates, Join
    filter out low-dividend stocks, stream
    find list of ticker symbols, Distinct
    number of unique stock symbols, Nested foreach
    stock-price changes on dividend days, Macros
    top three dividends, Nested foreach
    total trade estimate, Casts
    tracking a stock over time, Nested foreach

O

Olston, Christopher, Pig’s History
optimizations, turning off, Debugging Tips
optimizing scripts, Making Pig Fly, Bad Record Handling
order by operator, How Pig differs from MapReduce, Order by
order operator, Order by, Parallel, Nested foreach, Setting the Partitioner
outer joins, Join, Joining small to large data
output clause (define command), stream
output location, Setting the output location
output phase, Output Phase
output schemas, Input and Output Schemas
output size, Making Pig Fly
OutputFormat, Store Functions, Output Phase
overloading, Calling Static Java Functions, Overloading UDFs

P

Package operator, explain
page rank, calculating from web crawl, Embedding Pig Latin in Python, Utility Methods
parallel clause, Parallel
parallel dataflow language, Pig Latin, a Parallel Dataflow Language
parallelism, Select the Right Level of Parallelism, Where Your UDF Will Run, Writing Load and Store Functions
parameter substitution, Parameter Substitution
partition clause, Setting the Partitioner
Partitioner class, Setting the Partitioner, Shuffle Phase
partitions, using, Using partitions
performance tuning properties (MapReduce), Tune Pig and Hadoop for Your Job
philosophy of Pig, Pig Philosophy
physical plan, explain
Pig
    downloading and installing, Downloading and Installing Pig, Downloading the Source
    fs method, Utility Methods
    history, Pig’s History
    integrating with legacy code and MapReduce, Integrating Pig with Legacy Code and MapReduce, mapreduce
    issue-tracking system, Downloading the Source
    performance tuning, Tune Pig and Hadoop for Your Job
    philosophy, Pig Philosophy
    portability, Downloading the Pig Package from Apache
    release page, Downloading the Pig Package from Apache
    running, Running Pig, Command-Line and Configuration Options
    strength of typing, Casts
    translation to Python types, Python UDFs
    version control page, Downloading the Source
“Pig counts Mary and her lamb” example, MapReduce’s hello world
Pig Latin, What Is Pig?
    best use cases for, What Is Pig Useful For?
    case sensitivity, Case Sensitivity
    comment operators, Comments
    developing and testing scripts, Developing and Testing Pig Latin Scripts, Testing Your Scripts with PigUnit
    embedding in Python, Embedding Pig Latin in Python, Utility Methods
    fields, Preliminary Matters
    input and output, Input and Output, Dump
    preprocessor, Pig Latin Preprocessor, Including Other Pig Latin Scripts
    relational operations, Relational Operations, Parallel
    relations, Preliminary Matters
    syntax highlighting packages, Syntax Highlighting and Checking
“Pig Latin: A Not-So-Foreign Language for Data Processing” (Olston), Pig’s History
Piggybank, User Defined Functions, Piggybank
PigStats methods, Run
PigStorage function, Store, Getting the casting functions, Built-in Load and Store Functions
PigUnit, Testing Your Scripts with PigUnit
pipelines, data, What Is Pig Useful For?, Debugging Tips, Pig and Hive, Metadata in Hadoop
POSIX, Pig on Hadoop, Hadoop Distributed File System
power law distribution, Group
“Practical Skew Handling in Parallel Joins” (DeWitt et al.), Joining skewed data
prepareToRead method, Getting ready to read
prepareToWrite method, Preparing to write
prereduce merge, Combiner Phase
projections, pushing down, Pushing down projections
-propertyFile (-P) command-line option, Command-Line and Configuration Options
PushDownForeachFlatten feature, Debugging Tips
PushUpFilter optimization, Debugging Tips
Pygmalion project, Cassandra
Python
    embedding Pig Latin in, Embedding Pig Latin in Python, Utility Methods
    UDFs, User Defined Functions, Registering Python UDFs, Python UDFs

R

RANDOM functions, Miscellaneous built-in UDF
raw data, What Is Pig Useful For?, Pig and Hive
RDBMS versus Hadoop environments, Comparing query and dataflow languages, Using Different Join Implementations
RecordWriter class, Preparing to write, Output Phase
reduce phase, Pig on Hadoop, Reduce Phase
reducers, How Pig differs from MapReduce, Group, Order by, Joining skewed data, Select the Right Level of Parallelism, Combiner Phase
reflection, Calling Static Java Functions, Input and Output Schemas
REGEX_EXTRACT function, Built-in chararray and bytearray UDFs
REGEX_EXTRACT_ALL function, Built-in chararray and bytearray UDFs
register command, Registering UDFs
registerJar utility method, Utility Methods
registerUDF utility method, Utility Methods
regular expressions, Filter
relational operations, Relational Operations, Parallel, Advanced Features of foreach, cross
relations, Preliminary Matters
REPLACE function, Built-in chararray and bytearray UDFs
result method, Run
return codes, Return Codes, Run
returns clause (define statement), Macros
rmr command, HDFS Commands in Grunt
ROUND function, Built-in math UDFs
run command, Controlling Pig from Grunt
running multiple bindings, Running Multiple Bindings
“Running Pig in Local Mode” example, Running Pig Locally on Your Machine
“Running Pig On Your Cluster” example, Running Pig on Your Hadoop Cluster
runSingle command, Run
runtime declaration (schemas), Schemas
runtime exceptions, Input and Output Schemas

S

sampling
    illustrate tool, illustrate
    sample operator, Sample
scalar types, Scalar Types
schemas, Schemas, Casts, Input and Output Schemas, Python UDFs, Loading metadata, Checking the schema
scripts
    optimizing, Making Pig Fly, Bad Record Handling
    testing with PigUnit, Testing Your Scripts with PigUnit
self joins, Join
semi-join, cogroup
set command, set
set utility method, Utility Methods
setLocation method, Determining the location
setOutputPath utility function, Setting the output location
setStoreLocation function, Setting the output location
setting the Partitioner, Setting the Partitioner
ship clause, stream
shuffle phase, Pig on Hadoop, Shuffle Phase
shuffle size, Making Pig Fly
SIN function, Built-in math UDFs
SINH function, Built-in math UDFs
SIZE functions, Built-in chararray and bytearray UDFs, Built-in complex type UDFs
skew joins, Joining skewed data, Setting the Partitioner, Set Up Your Joins Properly, Tune Pig and Hadoop for Your Job
skew, handling of, How Pig differs from MapReduce, Group, Select the Right Level of Parallelism
    Hadoop combiner, Group, Algebraic Interface, Combiner Phase
    order by operator, Order by
    skew joins, Joining skewed data, Setting the Partitioner, Set Up Your Joins Properly, Tune Pig and Hadoop for Your Job
sort command, Filter Early and Often
sort-merge join, Joining sorted data
source code, Downloading the Source
speculative execution, Select the Right Level of Parallelism, Handling Failure
spill files, number of, Tune Pig and Hadoop for Your Job
spilling to disk, Memory Issues in Eval Funcs
split operator, Nonlinear Data Flows, Filter Early and Often
SplitCombination optimization, Debugging Tips
SplitFilter optimization, Debugging Tips
SQL compared/contrasted with Pig
    Apache Hive, Pig and Hive
    constraints on data, Bag
    dataflow and query languages, Comparing query and dataflow languages
    group operator, Group
    long COUNT, Built-in aggregate UDFs
    noise words, Join
    nulls, Filter, Join
    optimizers, Using Different Join Implementations
    trinary logic, Filter
    tuples, Tuple
    union, union
    use of distinct statement, Distinct
SQL layer (Apache Hive), Pig and Hive
SQRT function, Built-in math UDFs
static Java functions, Calling Static Java Functions
statistics summary, Pig Statistics
stats command, Pig Statistics
stock analyzer example, UDFContext
store clause (mapreduce statement), mapreduce
store functions
    built-in, Built-in Load and Store Functions
    writing, Writing Load and Store Functions, Store Functions, Storing Metadata
store operator, Store, explain, Filter Early and Often
StoreFunc class, Store Functions
storing metadata, Storing Metadata
stream operator, stream, Filter Early and Often
streams, number of, Tune Pig and Hadoop for Your Job
STRSPLIT functions, Built-in chararray and bytearray UDFs
subqueries, Pig alternative to, Comparing query and dataflow languages
SUBSTRING functions, Built-in chararray and bytearray UDFs
SUM functions, Algebraic Interface, Built-in aggregate UDFs
svn version control, Downloading the Source
syntax highlighting and checking, Syntax Highlighting and Checking
synthetic join, cross

T

tab-delimited files, Choose the Right Data Type
TAN function, Built-in math UDFs
TANH function, Built-in math UDFs
tarball, Hadoop, Downloading the Pig Package from Apache, Using Compression in Intermediate Results
TaskTracker, MapReduce, Hadoop Distributed File System
testing scripts with PigUnit, Testing Your Scripts with PigUnit
TextLoader function, Built-in Load and Store Functions
TextMate syntax highlighting, Syntax Highlighting and Checking
theta joins, cross
threshold usage, Tune Pig and Hadoop for Your Job
TOBAG function, Built-in complex type UDFs
TOKENIZE function, Built-in chararray and bytearray UDFs
TOMAP function, Built-in complex type UDFs
TOP function, Built-in complex type UDFs
TOTUPLE function, Built-in complex type UDFs
TRIM function, Built-in chararray and bytearray UDFs
trinary logic, Filter
tuning Pig and Hadoop, Tune Pig and Hadoop for Your Job
tuple data type, Tuple, Schemas, Interacting with Pig values, Python UDFs
tuple projection operator (.), Expressions in foreach
tuple TOTUPLE function, Built-in complex type UDFs
TupleFactory class, Interacting with Pig values
Turing Complete Pig, Embedding Pig Latin in Python
turning off features, Debugging Tips
typechecking, Input and Output Schemas, Overloading UDFs
types, data, Types, Nulls, Python UDFs

U

UCFIRST function, Built-in chararray and bytearray UDFs
UDFContext class, UDFContext, Store Functions and UDFContext
UDFs (User Defined Functions), Code Examples in This Book, User Defined Functions
    built-in, Built-in UDFs, Miscellaneous built-in UDF
    define and, define and UDFs
    error handling, Error Handling and Progress Reporting
    in foreach, UDFs in foreach
    naming, Writing an Evaluation Function in Java
    optimizing, Writing Your UDF to Perform
    overloading, Overloading UDFs
    registering, Registering UDFs, Registering Python UDFs
    where your UDF will run, Where Your UDF Will Run
union operator, How Pig differs from MapReduce, union, Nonlinear Data Flows, Filter Early and Often, Determining the location
UPPER function, Built-in chararray and bytearray UDFs
User Defined Functions (see UDFs)
using clause (load function), Load
using clause (store function), Store
Utf8StorageConverter, Casting bytearrays
utility methods, Utility Methods

V

variables, binding multiple sets of, Binding Multiple Sets of Variables
-version command-line option, Command-Line and Configuration Options
version control with git, Downloading the Source
version differences in Hadoop
    file locations, Running Pig on Your Hadoop Cluster
    globs, Load
version differences in Pig
    .. field range, Expressions in foreach
    built-in eval and filter functions, Built-in Evaluation and Filter Functions, Miscellaneous built-in UDF
    bytesToMap methods, Casting bytearrays
    column families, HBase
    data layout optimization, Data Layout Optimization
    dependencies inside Python scripts, Registering Python UDFs
    dump output, Dump
    EvalFunc, Loading the distributed cache
    flatten schema bug, flatten
    globs accepted by register, Registering UDFs
    Grunt command sh, HDFS Commands in Grunt
    hadoop fs shell commands, Running Pig on Your Hadoop Cluster, HDFS Commands in Grunt
    Hadoop requirements, Downloading the Pig Package from Apache
    handling of Java properties, Command-Line and Configuration Options
    HDFS paths for register, Registering UDFs
    illustrate, illustrate
    invoker methods, Calling Static Java Functions
    Java eval funcs, Writing Evaluation and Filter Functions
    joins, Joining skewed data, Joining sorted data
    load and store functions, Writing Load and Store Functions
    local mode execution, Running Pig Locally on Your Machine
    logical optimizer and plan, Debugging Tips, Project Early and Often
    macros, Macros
    map declared values, Map
    map schemas, Input and Output Schemas
    mapreduce command, mapreduce
    non-Java UDFs, User Defined Functions
    number of output records in a bag, cross
    parallel level, Parallel
    PigUnit, Testing Your Scripts with PigUnit
    preprocessor actions, Pig Latin Preprocessor, Including Other Pig Latin Scripts
    Python, Embedding Pig Latin in Python, Writing Evaluation and Filter Functions, Python UDFs
    runtime adaption code, Schemas
    setting the Partitioner, Setting the Partitioner
    summary statistics, Pig Statistics
    truncation and null padding, Schemas
    UDFContext class, UDFContext
    UDF languages, User Defined Functions
Vim syntax highlighting, Syntax Highlighting and Checking

W

warn method, Error Handling and Progress Reporting
web crawl
    calculating page rank from, Embedding Pig Latin in Python, Utility Methods
    data set, Embedding Pig Latin in Python, Utility Methods
White, Tom, Tune Pig and Hadoop for Your Job, Overview of Hadoop
word count example, MapReduce’s hello world
writing MapReduce in Java, compared to Pig Latin, How Pig differs from MapReduce
writing records, Writing records

Y

Yahoo!, Pig’s History