Chapter 10. Working with Text

If you’ve been reading this book sequentially, you’ve read all about the core Java language constructs, including the object-oriented aspects of the language and the use of threads. Now it’s time to shift gears and start talking about the Java Application Programming Interface (API), the collection of classes that compose the standard Java packages and come with every Java implementation. Java’s core packages are one of its most distinguishing features. Many other object-oriented languages have similar features, but none has as extensive a set of standardized APIs and tools as Java does. This is both a reflection of and a reason for Java’s success. Table 10-1 lists some of the important packages in the API and their corresponding chapters in this book.

Table 10-1. Java API packages

Package                                      Contents                                             Chapter
java.lang                                    Basic language classes                               4–9
java.lang.reflect                            Reflection                                           7
java.util.concurrent                         Thread utilities                                     9
java.text, java.util.regex                   International text classes and regular expressions   10
java.util                                    Utilities and collections classes                    10–12
java.io                                      Input and output                                     12
java.nio                                     Input and output                                     12
java.net                                     Networking and Remote Method Invocation classes      13–14
java.rmi                                     Remote Method Invocation classes                     13
javax.servlet                                Web applications                                     15
javax.swing, java.awt                        Swing GUI and 2D graphics                            16–20
java.awt.image, javax.imageio, javax.media   Images, sound, and video                             21
java.beans                                   JavaBeans API                                        22
java.applet                                  The Applet API                                       23
javax.xml                                    The XML API                                          24

As you can see in Table 10-1, we have examined some classes in java.lang in earlier chapters while looking at the core language constructs. Starting with this chapter, we throw open the Java toolbox and begin examining the rest of the API classes, starting with text-related utilities, because they are fundamental to all kinds of applications.

Text-Related APIs

In this chapter, we cover most of the special-purpose, text-related APIs in Java, from simple classes for parsing words and numbers to advanced text formatting, internationalization, and regular expressions. But because so much of what we do with computers is oriented around text, classifying APIs as strictly text-related can be somewhat arbitrary. Some of the text-related packages we cover in the next chapter include the Java Calendar API, the Properties and User Preferences APIs, and the Logging API. But some of the most important tools in the text arena are those for working with the Extensible Markup Language, XML. In Chapter 24, we cover XML in detail, along with the XSL/XSLT stylesheet language. Together they provide a powerful framework for rendering documents.

Strings

We’ll start by taking a closer look at the Java String class (or, more specifically, java.lang.String). Because working with Strings is so fundamental, it’s important to understand how they are implemented and what you can do with them. A String object encapsulates a sequence of Unicode characters. Internally, these characters are stored in a regular Java array, but the String object guards this array jealously and gives you access to it only through its own API. This is to support the idea that Strings are immutable; once you create a String object, you can’t change its value. Lots of operations on a String object appear to change the characters or length of a string, but what they really do is return a new String object that copies or internally references the needed characters of the original. Java implementations make an effort to consolidate identical strings used in the same class into a shared-string pool and to share parts of Strings where possible.

The original motivation for all of this was performance. Immutable Strings can save memory and be optimized for speed by the Java VM. The flip side is that a programmer should have a basic understanding of the String class in order to avoid creating an excessive number of String objects in places where performance is an issue. That was especially true in the past, when VMs were slow and handled memory poorly. Nowadays, string usage is not usually an issue in the overall performance of a real application.[29]
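As a minimal sketch of what immutability means in practice, the following example calls a method that appears to change a string and then shows that the original is untouched. The class and method names (ImmutableStringDemo, shout) are made up for illustration.

```java
class ImmutableStringDemo {
    // Returns an upper-cased copy; the argument itself is never modified.
    static String shout(String s) {
        return s.toUpperCase();
    }

    public static void main(String[] args) {
        String original = "to be or not to be";
        String loud = shout(original);

        System.out.println(original); // the original characters are unchanged
        System.out.println(loud);     // a separate String object holds the result
    }
}
```

Because the original can never change, it is safe to hand the same String object to any number of methods or threads.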

Constructing Strings

Literal strings, defined in your source code, are declared with double quotes and can be assigned to a String variable:

    String quote = "To be or not to be";

Java automatically converts the literal string into a String object and assigns it to the variable.

Strings keep track of their own length, so String objects in Java don’t require special terminators. You can get the length of a String with the length() method. You can also test for a zero length string by using isEmpty():

    int length = quote.length();
    boolean empty = quote.isEmpty();

Strings can take advantage of the only overloaded operator in Java, the + operator, for string concatenation. The following code produces equivalent strings:

    String name1 = "John " + "Smith";
    String name2 = "John ".concat("Smith");

Literal strings can’t span lines in Java source files, but we can concatenate lines to produce the same effect:

    String poem =
        "'Twas brillig, and the slithy toves\n" +
        "   Did gyre and gimble in the wabe:\n" +
        "All mimsy were the borogoves,\n" +
        "   And the mome raths outgrabe.\n";

Embedding lengthy text in source code is not normally something you want to do. In this and the following chapter, we’ll talk about ways to load Strings from files, special packages called resource bundles, and URLs. Technologies like Java Server Pages and template engines also provide a way to factor out large amounts of text from your code. For example, in Chapter 14, we’ll see how to load our poem from a web server by opening a URL like this:

    InputStream poem = new URL(
        "http://myserver/~dodgson/jabberwocky.txt").openStream();

In addition to making strings from literal expressions, you can construct a String directly from an array of characters:

    char [] data = new char [] { 'L', 'e', 'm', 'm', 'i', 'n', 'g' };
    String lemming = new String( data );

You can also construct a String from an array of bytes:

    byte [] data = new byte [] { (byte)97, (byte)98, (byte)99 };
    String abc = new String(data, "ISO8859_1");

In this case, the second argument to the String constructor is the name of a character-encoding scheme. The String constructor uses it to convert the raw bytes in the specified encoding to the internally used standard 2-byte Unicode characters. If you don’t specify a character encoding, the default encoding scheme on your system is used. We’ll discuss character encodings more when we talk about the Charset class and I/O in Chapter 12.[30]
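The sketch below shows the same idea with the predefined constants in java.nio.charset.StandardCharsets (available since Java 7), which avoid both misspelled encoding names and the checked exception thrown by the name-based constructor. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class BytesToStringDemo {
    // Decodes bytes that are assumed to be ISO-8859-1 (Latin-1) text.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 'a', 'b', 'c', then 0xE9, which is e-acute in ISO-8859-1
        byte[] raw = { 97, 98, 99, (byte) 0xE9 };
        String text = decodeLatin1(raw);

        // Going the other way: the same text encoded as UTF-8
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(text.length()); // 4 characters
        System.out.println(utf8.length);   // 5 bytes: e-acute needs two bytes in UTF-8
    }
}
```

The same four characters occupy a different number of bytes depending on the encoding, which is why the encoding should be stated explicitly whenever bytes and strings are converted.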

Conversely, the charAt() method of the String class lets you access the characters of a String in an array-like fashion:

    String s = "Newton";
    for ( int i = 0; i < s.length(); i++ )
        System.out.println( s.charAt( i ) );

This code prints the characters of the string one at a time. Alternatively, we can get the characters all at once with toCharArray(). Here’s a way to save typing a bunch of single quotes and get an array holding the alphabet:

    char [] abcs = "abcdefghijklmnopqrstuvwxyz".toCharArray();

The notion that a String is a sequence of characters is also codified by the String class implementing the interface java.lang.CharSequence, which prescribes the methods length() and charAt() as well as a way to get a subset of the characters.
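One practical consequence is that a method written against CharSequence accepts a String, a StringBuilder, or any other implementation. Here is a small sketch; countVowels is a hypothetical helper, not part of the Java API.

```java
class CharSequenceDemo {
    // Works for any CharSequence because it uses only length() and charAt().
    static int countVowels(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if ("aeiouAEIOU".indexOf(text.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countVowels("Newton"));                         // 2
        System.out.println(countVowels(new StringBuilder("Jabberwocky"))); // 3
    }
}
```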

Strings from Things

Objects and primitive types in Java can be turned into a default textual representation as a String. For primitive types like numbers, the string should be fairly obvious; for object types, it is under the control of the object itself. We can get the string representation of an item with the static String.valueOf() method. Various overloaded versions of this method accept each of the primitive types:

    String one = String.valueOf( 1 ); // integer, "1"
    String two = String.valueOf( 2.384f );  // float, "2.384"
    String notTrue = String.valueOf( false ); // boolean, "false"

All objects in Java have a toString() method that is inherited from the Object class. For many objects, this method returns a useful result that displays the contents of the object. For example, a java.util.Date object’s toString() method returns the date it represents formatted as a string. For objects that do not provide a representation, the string result is just a unique identifier that can be used for debugging. The String.valueOf() method, when called for an object, invokes the object’s toString() method and returns the result. The only real difference in using this method is that if you pass it a null object reference, it returns the String “null” for you, instead of producing a NullPointerException:

    Date date = new Date();
    // Equivalent, e.g., "Fri Dec 19 05:45:34 CST 1969"
    String d1 = String.valueOf( date );
    String d2 = date.toString();

    date = null;
    d1 = String.valueOf( date );  // "null"
    d2 = date.toString();  // NullPointerException!

String concatenation uses the valueOf() method internally, so if you “add” an object or primitive using the plus operator (+), you get a String:

    String today = "Today's date is: " + date;

You’ll sometimes see people use the empty string and the plus operator (+) as shorthand to get the string value of an object. For example:

    String two = "" + 2.384f;
    String today = "" + new Date();
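To make the role of toString() concrete, here is a minimal sketch of a class that supplies its own string representation and a helper that relies on concatenation to convert its argument. GridPoint and describe are names invented for this example.

```java
class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Without this override, the inherited Object.toString() would be used.
    @Override
    public String toString() {
        return "GridPoint(" + x + ", " + y + ")";
    }
}

class ToStringDemo {
    // Concatenation converts the argument to a string, including null.
    static String describe(Object value) {
        return "Value: " + value;
    }

    public static void main(String[] args) {
        System.out.println(describe(new GridPoint(1, 2))); // Value: GridPoint(1, 2)
        System.out.println(describe(null));                // Value: null
    }
}
```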

Comparing Strings

The standard equals() method compares strings for equality; two strings are equal if they contain exactly the same characters in the same order. You can use a different method, equalsIgnoreCase(), to check the equivalence of strings in a case-insensitive way:

    String one = "FOO";
    String two = "foo";

    one.equals( two );             // false
    one.equalsIgnoreCase( two );   // true

A common mistake for novice programmers in Java is to compare strings with the == operator when they intend to use the equals() method. Remember that strings are objects in Java, and == tests for object identity; that is, whether the two arguments being tested are the same object. In Java, it’s easy to make two strings that have the same characters but are not the same string object. For example:

    String foo1 = "foo";
    String foo2 = String.valueOf( new char [] { 'f', 'o', 'o' }  );

    foo1 == foo2         // false!
    foo1.equals( foo2 )  // true

This mistake is particularly dangerous because it often works for the common case in which you are comparing literal strings (strings declared with double quotes right in the code). The reason for this is that Java tries to manage strings efficiently by combining them. At compile time, Java finds all the identical strings within a given class and makes only one object for them. This is safe because strings are immutable and cannot change. You can coalesce strings yourself in this way at runtime using the String intern() method. Interning a string returns an equivalent string reference that is unique across the VM.
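The following sketch puts the three comparisons side by side: identity, equality, and identity after interning. The class and method names are hypothetical.

```java
class StringIdentityDemo {
    // Builds a new String object at runtime; it is not taken from the literal pool.
    static String build() {
        return String.valueOf(new char[] { 'f', 'o', 'o' });
    }

    public static void main(String[] args) {
        String literal = "foo";
        String built = build();

        System.out.println(literal == built);          // false: two distinct objects
        System.out.println(literal.equals(built));     // true: same characters
        System.out.println(literal == built.intern()); // true: intern() returns the pooled instance
    }
}
```

Even though interning makes == work, equals() remains the right tool for comparing string contents.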

The compareTo() method compares the lexical value of the String to another String, determining whether it sorts alphabetically earlier than, the same as, or later than the target string. It returns an integer that is less than, equal to, or greater than zero:

    String abc = "abc";
    String def = "def";
    String num = "123";

    if ( abc.compareTo( def ) < 0 )         // true
    if ( abc.compareTo( abc ) == 0 )        // true
    if ( abc.compareTo( num ) > 0 )         // true

The compareTo() method compares strings strictly by their characters’ positions in the Unicode specification. This works for simple text but does not handle all language variations well. The Collator class, discussed next, can be used for more sophisticated comparisons.

The Collator class

The java.text package provides a sophisticated set of classes for comparing strings in specific languages. German, for example, has vowels with umlauts and another character that resembles the Greek letter beta and represents a double “s.” How should we sort these? Although the rules for sorting such characters are precisely defined, you can’t assume that the lexical comparison we used earlier has the correct meaning for languages other than English. Fortunately, the Collator class takes care of these complex sorting problems.

In the following example, we use a Collator designed to compare German strings. You can obtain a default Collator by calling the Collator.getInstance() method with no arguments. Once you have an appropriate Collator instance, you can use its compare() method, which returns values just like String’s compareTo() method. The following code creates two strings for the German translations of “fun” and “later,” using Unicode constants for these two special characters. It then compares them, using a Collator for the German locale. (Locales help you deal with issues relevant to particular languages and cultures; we’ll talk about them in detail later in this chapter.) The result in this case is that “fun” (Spaß) sorts before “later” (später):

    String fun = "Spa\u00df";
    String later = "sp\u00e4ter";

    Collator german = Collator.getInstance(Locale.GERMAN);
    if (german.compare(fun, later) < 0) // true

Using collators is essential if you’re working with languages other than English. In traditional Spanish alphabetization, for example, “ll” and “ch” are treated as unique characters and alphabetized separately. A collator with the appropriate rules handles cases like these automatically.
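Because Collator implements java.util.Comparator, it can be passed directly to the sorting methods of the collections API. The sketch below contrasts a plain Unicode sort with a locale-aware one; sortWith is a hypothetical helper, and the exact ordering of accented words depends on the collation rules shipped with your Java runtime.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class CollatorSortDemo {
    // Returns a sorted copy of the words, using the supplied ordering.
    static List<String> sortWith(Comparator<? super String> order, String... words) {
        List<String> copy = new ArrayList<>(Arrays.asList(words));
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "sp\u00e4ter", "Zebra", "apfel", "Spa\u00df" };

        // Unicode order: every uppercase ASCII letter sorts before any lowercase one
        System.out.println(sortWith(Comparator.<String>naturalOrder(), words));

        // Locale-aware order: letter case is only a minor difference
        System.out.println(sortWith(Collator.getInstance(Locale.GERMAN), words));
    }
}
```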

Searching

The String class provides several simple methods for finding fixed substrings within a string. The startsWith() and endsWith() methods compare an argument string with the beginning and end of the String, respectively:

    String url = "http://foo.bar.com/";
    if ( url.startsWith("http:") )  // true

The indexOf() method searches for the first occurrence of a character or substring and returns the starting character position, or -1 if the substring is not found:

    String abcs = "abcdefghijklmnopqrstuvwxyz";
    int p = abcs.indexOf( 'p' );         // 15
    int def = abcs.indexOf( "def" );     // 3
    int fang = abcs.indexOf( "Fang" );   // -1

Similarly, lastIndexOf() searches backward through the string for the last occurrence of a character or substring.
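Both methods also have overloads that take a starting position, which makes it straightforward to walk through every occurrence of a substring. Here is a minimal sketch; allIndexesOf is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfDemo {
    // Collects the starting index of every occurrence of target in text.
    static List<Integer> allIndexesOf(String text, String target) {
        List<Integer> found = new ArrayList<>();
        if (target.isEmpty()) {
            return found; // treat an empty target as "no matches"
        }
        int at;
        int from = 0;
        while ((at = text.indexOf(target, from)) != -1) {
            found.add(at);
            from = at + 1; // resume just past the start of the last match
        }
        return found;
    }

    public static void main(String[] args) {
        String url = "http://foo.bar.com/";
        System.out.println(allIndexesOf(url, ".")); // [10, 14]
        System.out.println(url.lastIndexOf('.'));   // 14
    }
}
```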

The contains() method handles the very common task of checking to see whether a given substring is contained in the target string:

    String log = "There is an emergency in sector 7!";
    if  ( log.contains("emergency") ) pageSomeone();

    // equivalent to
    if ( log.indexOf("emergency") != -1 ) ...

For more complex searching, you can use the Regular Expression API, which allows you to look for and parse complex patterns. We’ll talk about regular expressions later in this chapter.

Editing

A number of methods operate on the String and return a new String as a result. While this is useful, you should be aware that creating lots of strings in this manner can affect performance. If you need to modify a string often or build a complex string from components, you should use the StringBuilder class, as we’ll discuss shortly.

trim() is a useful method that removes leading and trailing whitespace (e.g., space, carriage return, newline, and tab) from the String:

    String str = "   abc   ";
    str = str.trim();  // "abc"

In this example, we threw away the original String (with excess whitespace), and it will be garbage-collected.

The toUpperCase() and toLowerCase() methods return a new String of the appropriate case:

    String down = "FOO".toLowerCase();      // "foo"
    String up   = down.toUpperCase();       // "FOO"

substring() returns a specified range of characters. The starting index is inclusive; the ending is exclusive:

    String abcs = "abcdefghijklmnopqrstuvwxyz";
    String cde = abcs.substring( 2, 5 ); // "cde"

The replace() method provides simple, literal string substitution. One or more occurrences of the target string are replaced with the replacement string, moving from beginning to end. For example:

    String message = "Hello NAME, how are you?".replace( "NAME", "Penny" );
    // "Hello Penny, how are you?"
    String xy = "xxooxxxoo".replace( "xx", "X" );
    // "XooXxoo"

The String class also has two methods that allow you to do more complex pattern substitution: replaceAll() and replaceFirst(). Unlike the simple replace() method, these methods use regular expressions (a special pattern syntax) to describe the text to be replaced; we’ll cover regular expressions later in this chapter.
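As a quick preview, the sketch below contrasts the three methods. The pattern [0-9]+ means "one or more digits"; the class and method names are made up for this example.

```java
class ReplaceDemo {
    // Replaces every run of digits with a single '#'.
    static String maskDigits(String s) {
        return s.replaceAll("[0-9]+", "#");
    }

    // Replaces only the first run of digits.
    static String maskFirstDigits(String s) {
        return s.replaceFirst("[0-9]+", "#");
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("a1b22c333"));      // a#b#c#
        System.out.println(maskFirstDigits("a1b22c333")); // a#b22c333

        // replace() treats its arguments literally, so '+' is just a plus sign here
        System.out.println("1+1".replace("+", " plus ")); // 1 plus 1
    }
}
```

When the text to replace contains characters that are special in regular expressions, the literal replace() method is usually the safer choice.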

String Method Summary

Table 10-2 summarizes the methods provided by the String class.

Table 10-2. String methods

Method               Functionality
charAt()             Gets a particular character in the string
compareTo()          Compares the string with another string
concat()             Concatenates the string with another string
contains()           Checks whether the string contains another string
copyValueOf()        Returns a string equivalent to the specified character array
endsWith()           Checks whether the string ends with a specified suffix
equals()             Compares the string with another string
equalsIgnoreCase()   Compares the string with another string, ignoring case
getBytes()           Copies characters from the string into a byte array
getChars()           Copies characters from the string into a character array
hashCode()           Returns a hashcode for the string
indexOf()            Searches for the first occurrence of a character or substring in the string
intern()             Fetches a unique instance of the string from a global shared-string pool
isEmpty()            Returns true if the string is zero length
lastIndexOf()        Searches for the last occurrence of a character or substring in a string
length()             Returns the length of the string
matches()            Determines if the whole string matches a regular expression pattern
regionMatches()      Checks whether a region of the string matches the specified region of another string
replace()            Replaces all occurrences of a character in the string with another character
replaceAll()         Replaces all occurrences of a regular expression pattern with a pattern
replaceFirst()       Replaces the first occurrence of a regular expression pattern with a pattern
split()              Splits the string into an array of strings using a regular expression pattern as a delimiter
startsWith()         Checks whether the string starts with a specified prefix
substring()          Returns a substring from the string
toCharArray()        Returns the array of characters from the string
toLowerCase()        Converts the string to lowercase
toString()           Returns the string value of an object
toUpperCase()        Converts the string to uppercase
trim()               Removes leading and trailing whitespace from the string
valueOf()            Returns a string representation of a value

StringBuilder and StringBuffer

In contrast to the immutable string, the java.lang.StringBuilder class is a modifiable and expandable buffer for characters. You can use it to create a big string efficiently. StringBuilder and StringBuffer are twins; they have exactly the same API. StringBuilder was added in Java 5.0 as a drop-in, unsynchronized replacement for StringBuffer. We’ll come back to that in a bit.

First, let’s look at some examples of String construction:

    // Could be better
    String ball = "Hello";
    ball = ball + " there.";
    ball = ball + " How are you?";

This example creates an unnecessary String object each time we use the concatenation operator (+). Whether this is significant depends on how often this code is run and how big the string actually gets. Here’s a more extreme example:

    // Bad use of + ...
    while( (line = readLine()) != EOF )
        text += line;

This example repeatedly produces new String objects. The character array must be copied over and over, which can adversely affect performance. The solution is to use a StringBuilder object and its append() method:

    StringBuilder sb = new StringBuilder("Hello");
    sb.append(" there.");
    sb.append(" How are you?");

    StringBuilder text = new StringBuilder();
    while( (line = readLine()) != EOF )
        text.append( line );

Here, the StringBuilder efficiently handles expanding the array as necessary. We can get a String back from the StringBuilder with its toString() method:

    String message = sb.toString();

You can also retrieve part of a StringBuilder as a String by using one of the substring() methods.

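You might be interested to know that when you concatenate strings whose values are not all known at compile time, compilers have traditionally generated code that uses a StringBuilder behind the scenes. (An expression made up entirely of literals, such as "To " + "be " + "or", is simply folded into a single constant by the compiler.) For example:

    String verb = "be ";
    String foo = "To " + verb + "or";

With such a compiler, this is roughly equivalent to:

    String foo = new
      StringBuilder().append("To ").append(verb).append("or").toString();

Either way, the compiler knows what you are trying to do and takes care of it for you. Newer compilers (Java 9 and later) use a different internal mechanism, but the effect is the same.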

The StringBuilder class provides a number of overloaded append() methods for adding any type of data to the buffer. StringBuilder also provides a number of overloaded insert() methods for inserting various types of data at a particular location in the string buffer. Furthermore, you can remove a single character or a range of characters with the deleteCharAt() and delete() methods. Finally, you can replace part of the StringBuilder with the contents of a String using the replace() method. Some early implementations let a buffer and the String created from it share character data to avoid a copy; in current JDKs, toString() copies the characters, so later changes to the StringBuilder never affect the String.
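The editing methods just described can be sketched in a few lines. Each comment shows the buffer contents after the call; the class and method names are hypothetical.

```java
class StringBuilderEditDemo {
    // Applies a series of in-place edits and returns the final text.
    static String edit() {
        StringBuilder sb = new StringBuilder("Hello world");
        sb.insert(5, ",");          // "Hello, world"
        sb.replace(7, 12, "Java");  // "Hello, Java"
        sb.append('!');             // "Hello, Java!"
        sb.deleteCharAt(5);         // "Hello Java!"
        sb.delete(0, 6);            // "Java!"
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(edit()); // Java!
    }
}
```

As with substring(), the start index of delete() and replace() is inclusive and the end index is exclusive.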

You should use a StringBuilder instead of a String any time you need to keep adding characters to a string; it’s designed to handle such modifications efficiently. You can convert the StringBuilder to a String when you need it, or simply concatenate or print it anywhere you’d use a String.

As we said earlier, StringBuilder was added in Java 5.0 as a replacement for StringBuffer. The only real difference between the two is that the methods of StringBuffer are synchronized and the methods of StringBuilder are not. This means that if you wish to use StringBuilder from multiple threads concurrently, you must synchronize the access yourself (which is easily accomplished). The reason for the change is that most simple usage does not require any synchronization and shouldn’t have to pay the associated penalty (slight as it is).

Internationalization

The Java VM lets us write code that executes in the same way on any Java platform. But in a global marketplace, that is only half the battle. A big question remains: will the application content and data be understandable to end users worldwide? Must users know English to use your application? The answer is that Java provides thorough support for localizing the text of your application for most modern languages and dialects. In this section, we’ll talk about the concepts of internationalization (often abbreviated “I18N”) and the classes that support them.

The java.util.Locale Class

Internationalization programming revolves around the Locale class. The class itself is very simple; it encapsulates a country code, a language code, and a rarely used variant code. Commonly used languages and countries are defined as constants in the Locale class. (Maybe it’s ironic that these names are all in English.) You can retrieve the codes or readable names, as follows:

    Locale l = Locale.ITALY;
    System.out.println(l.getCountry());            // IT
    System.out.println(l.getDisplayCountry());     // Italy
    System.out.println(l.getLanguage());           // it
    System.out.println(l.getDisplayLanguage());    // Italian

The country codes comply with ISO 3166. You will find a complete list of country codes at the RIPE Network Coordination Centre. The language codes comply with ISO 639. A complete list of language codes is online at the US government website. There is no official set of variant codes; they are designated as vendor-specific or platform-specific. You can get an array of all supported Locales with the static getAvailableLocales() method (which you might use to let your users choose). Or you can retrieve the default Locale for the location where your code is running with the static Locale.getDefault() method and let the system decide for you.

Many classes throughout the Java API use a Locale to decide how to represent text. We ran into one earlier when talking about sorting text with the Collator class. We’ll see more later in this chapter used to format numbers and currency strings, and again in the next chapter with the DateFormat class, which uses Locales to determine how to format and parse dates and times. Without getting into the details yet, here is a quick example:

    System.out.printf( Locale.ITALIAN, "%.2f%n", 3.14 ); // "3,14"

The preceding statement uses the Italian Locale to indicate that the decimal number 3.14 should be formatted as it would be in Italian, using a comma instead of a decimal point. (The .2 in the format string limits the output to two decimal places; a plain %f would print 3,140000.) We’ll talk more about formatting text later in this chapter.
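Here is a small sketch that formats the same value for two Locales and also shows the difference between a language-only constant such as Locale.ITALIAN and a language-plus-country constant such as Locale.ITALY. The twoDecimals helper is hypothetical.

```java
import java.util.Locale;

class LocaleFormatDemo {
    // Formats a number with two decimal places using the conventions of the given Locale.
    static String twoDecimals(Locale locale, double value) {
        return String.format(locale, "%.2f", value);
    }

    public static void main(String[] args) {
        System.out.println(twoDecimals(Locale.ITALY, 3.14)); // 3,14
        System.out.println(twoDecimals(Locale.US, 3.14));    // 3.14

        // Locale.ITALIAN names only a language, so its country code is empty
        System.out.println(Locale.ITALIAN.getCountry().isEmpty()); // true
        System.out.println(Locale.ITALY.getCountry());             // IT
    }
}
```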

Resource Bundles

Before we move on to the details of formatting messages and values, we might take a step back and ask a bigger question: what about the messages themselves? How can we write and manage applications that are truly multilingual in their user interfaces and in all the messages they display to the user? We can discover our locale, but how do we manage all of the application text in our code? The ResourceBundle class offers a clean, flexible solution for factoring out the text and resources of your application into language-specific classes or text files.

A ResourceBundle is a collection of objects that your application can access by name. It acts much like the Hashtable or Map collections we’ll discuss in Chapter 11, looking up objects based on Strings that serve as keys. A ResourceBundle of a given name may be defined for many different Locales. To get a particular ResourceBundle, call the factory method ResourceBundle.getBundle(), which accepts the name of the ResourceBundle and a Locale. The following example gets the ResourceBundle named “Message” for two Locales; from each bundle, it retrieves the message whose key is “HelloMessage” and prints the message:

    import java.util.*;

    public class Hello {
      public static void main(String[] args) {
        ResourceBundle bun;
        bun = ResourceBundle.getBundle("Message", Locale.ITALY);
        System.out.println(bun.getString("HelloMessage"));
        bun = ResourceBundle.getBundle("Message", Locale.US);
        System.out.println(bun.getString("HelloMessage"));
      }
    }

The getBundle() method throws the runtime exception MissingResourceException if an appropriate ResourceBundle cannot be located.

You can provide ResourceBundles in two ways: either as compiled Java classes (hard-coded Java) or as simple property files. Resource bundles implemented as classes are either subclasses of ListResourceBundle or direct implementations of ResourceBundle. Resource bundles backed by a property file are represented at runtime by a PropertyResourceBundle object. ResourceBundle.getBundle() returns either a matching class or an instance of PropertyResourceBundle corresponding to a matching property file. The algorithm used by getBundle() is based on appending the language and country codes of the requested Locale to the name of the resource. If nothing matches the requested Locale, it tries the default Locale in the same way and finally falls back to the bare name. Specifically, it searches for resources in this order:

    name_language_country_variant
    name_language_country
    name_language
    name_default-language_default-country_default-variant
    name_default-language_default-country
    name_default-language
    name

In this example, when we try to get the ResourceBundle named Message, specific to Locale.ITALY, it searches for the following names (no variant codes are in the Locales we are using, and we assume the default Locale is Locale.US):

    Message_it_IT
    Message_it
    Message_en_US
    Message_en
    Message

Let’s define the Message_it_IT ResourceBundle as a hardcoded class, a subclass of ListResourceBundle:

    import java.util.*;

    public class Message_it_IT extends ListResourceBundle {
      public Object[][] getContents() {
        return contents;
      }

      static final Object[][] contents = {
        {"HelloMessage", "Buon giorno, world!"},
        {"OtherMessage", "Ciao."},
      };
    }

ListResourceBundle makes it easy to define a ResourceBundle class; all we have to do is override the getContents() method. This method simply returns a two-dimensional array containing the names and values of its resources. In this example, contents[1][0] is the second key (OtherMessage), and contents[1][1] is the corresponding message (Ciao.).

Let’s define a ResourceBundle for Locale.US. This time, we’ll take the easy way and make a property file. Save the following data in a file called Message_en_US.properties:

    HelloMessage=Hello, world!
    OtherMessage=Bye.

So what happens if somebody runs your program in Locale.FRANCE and no ResourceBundle is defined for that Locale? To avoid a runtime MissingResourceException, it’s a good idea to define a default ResourceBundle. In our example, you can change the name of the property file to Message.properties. That way, if a language- or country-specific ResourceBundle cannot be found, your application can still run (by falling back to this English representation).
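To experiment with the property-file format without creating files on the classpath, you can build a PropertyResourceBundle directly from text. This is only a sketch for exploration, not how getBundle() locates bundles; the parse and lookup helpers are hypothetical, and lookup shows one way to supply a fallback when a key is missing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.MissingResourceException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class InMemoryBundleDemo {
    // Parses text in property-file format into a ResourceBundle.
    static ResourceBundle parse(String properties) {
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the message for the key, or the fallback if the key is not defined.
    static String lookup(ResourceBundle bundle, String key, String fallback) {
        try {
            return bundle.getString(key);
        } catch (MissingResourceException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        ResourceBundle bundle = parse("HelloMessage=Hello, world!\nOtherMessage=Bye.\n");
        System.out.println(lookup(bundle, "HelloMessage", "?")); // Hello, world!
        System.out.println(lookup(bundle, "NoSuchKey", "?"));    // ?
    }
}
```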

Parsing and Formatting Text

Parsing and formatting text is a large, open-ended topic. So far in this chapter, we’ve looked at only primitive operations on strings—creation, basic editing, searching, and turning simple values into strings. Now we’d like to move on to more structured forms of text. Java has a rich set of APIs for parsing and printing formatted strings, including numbers, dates, times, and currency values. We’ll cover most of these topics in this chapter, but we’ll wait to discuss date and time formatting until Chapter 11.

We’ll start with parsing—reading primitive numbers and values as strings and chopping long strings into tokens. Then we’ll go the other way and look at formatting strings and the java.text package. We’ll revisit the topic of internationalization to see how Java can localize parsing and formatting of text, numbers, and dates for particular locales. Finally, we’ll take a detailed look at regular expressions, the most powerful text-parsing tool Java offers. Regular expressions let you define your own patterns of arbitrary complexity, search for them, and parse them from text.

We should mention that you’re going to see a great deal of overlap between the new formatting and parsing APIs (printf and Scanner) introduced in Java 5.0 and the older APIs of the java.text package. The new APIs effectively replace much of the old ones and in some ways are easier to use. Nonetheless, it’s good to know about both because so much existing code uses the older APIs.

Parsing Primitive Numbers

In Java, numbers and Booleans are primitive types—not objects. But for each primitive type, Java also defines a primitive wrapper class. Specifically, the java.lang package includes the following classes: Byte, Short, Integer, Long, Float, Double, and Boolean. We talked about these in Chapter 1, but we bring them up now because these classes hold static utility methods that know how to parse their respective types from strings. Each of these primitive wrapper classes has a static “parse” method that reads a String and returns the corresponding primitive type. For example:

    byte b = Byte.parseByte( "16" );
    int n = Integer.parseInt( "42" );
    long l = Long.parseLong( "99999999999" );
    float f = Float.parseFloat( "4.2" );
    double d = Double.parseDouble( "99.99999999" );
    boolean bool = Boolean.parseBoolean( "true" );
    // Prior to Java 5.0 use:
    boolean oldBool = new Boolean( "true" ).booleanValue();

Alternatively, the java.util.Scanner class provides a single API for not only parsing individual primitive types from strings, but also reading them from a stream of tokens. This example shows how to use it in place of the preceding wrapper classes:

    byte b = new Scanner("16").nextByte();
    int n = new Scanner("42").nextInt();
    long l = new Scanner("99999999999").nextLong();
    float f = new Scanner("4.2").nextFloat();
    double d = new Scanner("99.99999999").nextDouble();
    boolean flag = new Scanner("true").nextBoolean();

We’ll see Scanner used to parse multiple values from a String or stream when we discuss tokenizing text later in this chapter.

Working with alternate bases

It’s easy to parse integer type numbers (byte, short, int, long) in alternate numeric bases. You can use the parse methods of the primitive wrapper classes by simply specifying the base as a second parameter:

    long l = Long.parseLong( "CAFEBABE", 16 );  // l = 3405691582
    byte b = Byte.parseByte ( "12", 8 ); // b = 10

The integer-parsing methods of the Scanner class described earlier (nextByte(), nextShort(), nextInt(), and nextLong()) also accept a base as an optional argument:

    long l = new Scanner( "CAFEBABE" ).nextLong( 16 );  // l = 3405691582
    byte b = new Scanner( "12" ).nextByte( 8 ); // b = 10

You can go the other way and convert a long or integer value to a string value in a specified base using special static toString() methods of the Integer and Long classes:

    String s = Long.toString( 3405691582L, 16 );  // s = "cafebabe"

For convenience, each class also has a static toHexString() method for working with base 16:

    String s = Integer.toHexString( 255 ).toUpperCase();  // s = "FF";
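
As a small illustration of going both ways, the following sketch parses a string in a given base and renders the value back in the same base. The class name RadixRoundTrip and the helper roundTrip() are hypothetical names chosen for this example; only the Long and Integer methods come from the standard library.

```java
class RadixRoundTrip {
    // Parses the digits in the given base, then renders the value back in that base.
    // Long.toString() produces lowercase letters for bases above 10.
    static String roundTrip(String digits, int radix) {
        long value = Long.parseLong(digits, radix);
        return Long.toString(value, radix);
    }

    public static void main(String[] args) {
        System.out.println(Long.parseLong("CAFEBABE", 16)); // 3405691582
        System.out.println(roundTrip("CAFEBABE", 16));       // cafebabe
        System.out.println(Integer.toBinaryString(10));      // 1010
        System.out.println(Integer.toOctalString(8));        // 10
    }
}
```

Note that the round trip normalizes case: uppercase input comes back in lowercase unless you call toUpperCase() on the result.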

Number formats

The preceding wrapper class parser methods handle the case of numbers formatted using only the simplest English conventions with no frills. If these parse methods do not understand the string, either because it’s simply not a valid number or because the number is formatted in the convention of another language, they throw a NumberFormatException:

    // Italian formatting
    double d = Double.parseDouble("1.234,56");  // NumberFormatException

The Scanner API is smarter and can use Locales to parse numbers in specific languages with more elaborate conventions. For example, the Scanner can handle comma-formatted numbers:

    int n = new Scanner("99,999,999").nextInt();

You can specify a Locale other than the default with the useLocale() method. Let’s parse that value in Italian now:

    double d = new Scanner("1.234,56").useLocale( Locale.ITALIAN ).nextDouble();

If the Scanner cannot parse a string, it throws a runtime InputMismatchException:

    double d = new Scanner("garbage").nextDouble(); // InputMismatchException
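
Putting these pieces together, here is a minimal sketch of locale-aware parsing with a fallback value. The class LocalizedScan and the method parseOrDefault() are hypothetical names for this example; it assumes that a single token is being parsed and that returning a default is an acceptable way to handle bad input.

```java
import java.util.InputMismatchException;
import java.util.Locale;
import java.util.Scanner;

class LocalizedScan {
    // Returns the number parsed under the given locale's conventions,
    // or the fallback if the next token is not a number in that locale.
    static double parseOrDefault(String text, Locale locale, double fallback) {
        try (Scanner scanner = new Scanner(text).useLocale(locale)) {
            return scanner.nextDouble();
        } catch (InputMismatchException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("1.234,56", Locale.ITALIAN, -1)); // 1234.56
        System.out.println(parseOrDefault("1,234.56", Locale.US, -1));      // 1234.56
        System.out.println(parseOrDefault("garbage", Locale.US, -1));       // -1.0
    }
}
```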

Prior to Java 5.0, this kind of parsing was accomplished using the java.text package with the NumberFormat class. The classes of the java.text package also allow you to parse additional types, such as dates, times, and localized currency values, that aren’t handled by the Scanner. We’ll look at these later in this chapter.

Tokenizing Text

A common programming task involves parsing a string of text into words or “tokens” that are separated by some set of delimiter characters, such as spaces or commas. The first example contains words separated by single spaces. The second, more realistic problem involves comma-delimited fields.

    Now is the time for all good men (and women)...

    Check Number, Description,      Amount
    4231,         Java Programming, 1000.00

Java has several (unfortunately overlapping) APIs for handling situations like this. The most powerful and useful are the String split() and Scanner APIs. Both utilize regular expressions to allow you to break the string on arbitrary patterns. We haven’t talked about regular expressions yet, but in order to show you how this works we’ll just give you the necessary magic and explain in detail later in this chapter. We’ll also mention a legacy utility, java.util.StringTokenizer, which uses simple character sets to split a string. StringTokenizer is not as powerful, but doesn’t require an understanding of regular expressions.

The String split() method accepts a regular expression that describes a delimiter and uses it to chop the string into an array of Strings:

    String text = "Now is the time for all good men";
    String [] words = text.split("\\s");
    // words = "Now", "is", "the", "time", ...

    String text = "4231,         Java Programming, 1000.00";
    String [] fields = text.split("\\s*,\\s*");
    // fields = "4231", "Java Programming", "1000.00"

In the first example, we used the regular expression \\s, which matches a single whitespace character (space, tab, or carriage return). The split() method returned an array of eight strings. In the second example, we used a more complicated regular expression, \\s*,\\s*, which matches a comma surrounded by any number of contiguous spaces (possibly zero). This reduced our text to three nice, tidy fields.
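
One detail worth seeing in action: because \\s matches exactly one whitespace character, two spaces in a row yield an empty string between them. The sketch below contrasts that with \\s+, which treats a run of whitespace as one delimiter. The class SplitDemo and its helper methods are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on runs of whitespace, ignoring leading and trailing whitespace.
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    // Splits comma-delimited fields, discarding whitespace around each comma.
    static String[] fields(String line) {
        return line.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String text = "Now  is the time"; // note the double space
        System.out.println(Arrays.toString(text.split("\\s"))); // [Now, , is, the, time]
        System.out.println(Arrays.toString(words(text)));       // [Now, is, the, time]
        System.out.println(Arrays.toString(
            fields("4231,         Java Programming, 1000.00")));
        // [4231, Java Programming, 1000.00]
    }
}
```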

With the new Scanner API, we could go a step further and parse the numbers of our second example as we extract them:

    String text = "4231,         Java Programming, 1000.00";
    Scanner scanner = new Scanner( text ).useDelimiter("\\s*,\\s*");
    int checkNumber = scanner.nextInt(); // 4231
    String description = scanner.next(); // "Java Programming"
    float amount = scanner.nextFloat();  // 1000.00

Here, we’ve told the Scanner to use our regular expression as the delimiter and then called it repeatedly to parse each field as its corresponding type. The Scanner is convenient because it can read not only from Strings but directly from stream sources, such as InputStreams, Files, and Channels:

    Scanner fileScanner = new Scanner( new File("spreadsheet.csv") );
    fileScanner.useDelimiter( "\\s*,\\s*" );
    // ...

Another thing that you can do with the Scanner is to look ahead with the “hasNext” methods to see if another item is coming:

    while( scanner.hasNextInt() ) {
      int n = scanner.nextInt();
      ...
    }
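
To make the look-ahead loop concrete, here is a small self-contained sketch that adds up integers until it reaches a token that is not an integer. The class SumInts and the method sumLeadingInts() are hypothetical names; the locale is fixed to Locale.US only so that the result does not depend on the machine's default.

```java
import java.util.Locale;
import java.util.Scanner;

class SumInts {
    // Sums integer tokens from the start of the text, stopping at the
    // first token that cannot be read as an int.
    static int sumLeadingInts(String text) {
        int total = 0;
        try (Scanner scanner = new Scanner(text).useLocale(Locale.US)) {
            while (scanner.hasNextInt()) {
                total += scanner.nextInt();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumLeadingInts("10 20 30 stop 40")); // 60
    }
}
```

The hasNextInt() call does not consume the token, so the word "stop" is still available to a later next() call if you want to inspect it.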

StringTokenizer

Even though the StringTokenizer class that we mentioned is now a legacy item, it’s good to know that it’s there because it’s been around since the beginning of Java and is used in a lot of code. StringTokenizer allows you to specify a delimiter as a set of characters and matches any number or combination of those characters as a delimiter between tokens. The following snippet reads the words of our first example:

    String text = "Now is the time for all good men (and women)...";
    StringTokenizer st = new StringTokenizer( text );

    while ( st.hasMoreTokens() )  {
        String word = st.nextToken();
        ...
    }

We invoke the hasMoreTokens() and nextToken() methods to loop over the words of the text. By default, the StringTokenizer class uses standard whitespace characters (space, tab, newline, carriage return, and form feed) as delimiters. You can also specify your own set of delimiter characters in the StringTokenizer constructor. Any contiguous combination of the specified characters that appears in the target string is skipped between tokens:

    String text = "4231,     Java Programming, 1000.00";
    StringTokenizer st = new StringTokenizer( text, "," );

    while ( st.hasMoreTokens() )  {
       String word = st.nextToken();
       // word = "4231", "     Java Programming", "1000.00"
    }

This isn’t as clean as our regular expression example. Here we used a comma as the delimiter so we get extra leading whitespace in our description field. If we had added space to our delimiter string, the StringTokenizer would have broken our description into two words, “Java” and “Programming,” which is not what we wanted. A solution here would be to use trim() to remove the leading and trailing space on each element.
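
The trim() approach might look like the following sketch. The class TrimmedTokens and the method fields() are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TrimmedTokens {
    // Splits on commas only, then trims whitespace from each token.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            result.add(st.nextToken().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("4231,     Java Programming, 1000.00"));
        // [4231, Java Programming, 1000.00]
    }
}
```

Keep in mind that because StringTokenizer skips contiguous delimiters, an empty field such as the one in "a,,b" disappears entirely, whereas split(",") would report it as an empty string.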

Printf-Style Formatting

A standard feature that Java adopted from the C language is printf-style string formatting. printf-style formatting utilizes special format strings embedded into text to tell the formatting engine where to place arguments and give detailed specification about conversions, layout, and alignment. The printf formatting methods also make use of variable-length argument lists, which makes working with them much easier. Here is a quick example of printf-formatted output:

    System.out.printf( "My name is %s and I am %d years old\n", name, age );

The printf formatting draws its name from the C language printf() function, so if you’ve done any C programming, this will look familiar. Java has extended the concept, adding some additional type safety and convenience features. Although Java has had some text formatting capabilities in the past (we’ll discuss the java.text package and MessageFormat later), printf formatting was not really feasible until variable-length argument lists and autoboxing of primitive types were added in Java 5.0. (We mention this to explain why these similar APIs both exist in Java.)

Formatter

The primary new tool in our text formatting arsenal is the java.util.Formatter class and its format() method. Several convenience methods can hide the Formatter object from you and you may not need to create a Formatter directly. First, the static String.format() method can be used to format a String with arguments (like the C language sprintf() method):

    String message =
        String.format("My name is %s and I am %d years old.", name, age );

Next, the java.io.PrintStream and java.io.PrintWriter classes, which are used for writing text to streams, have their own format() method. We discuss streams in Chapter 12, but this simply means that you can use this same printf-style formatting for writing strings to any kind of stream, whether it be to System.out standard console output, to a file, or to a network connection.

In addition to the format() method, PrintStream and PrintWriter also have a version of the format method that is actually called printf(). The printf() method is identical to and, in fact, simply delegates to the format() method. It’s there solely as a shout-out to the C programmers and ex-C programmers in the audience.
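
If you do want to work with a Formatter directly, a minimal sketch looks like this. The class FormatterDemo and the method describe() are hypothetical names; the example assumes a StringBuilder destination and a fixed Locale.US so that the output is predictable.

```java
import java.util.Formatter;
import java.util.Locale;

class FormatterDemo {
    // Formats into a StringBuilder through an explicit Formatter.
    static String describe(String name, int age) {
        StringBuilder sb = new StringBuilder();
        try (Formatter formatter = new Formatter(sb, Locale.US)) {
            formatter.format("My name is %s and I am %d years old.", name, age);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both lines print the same text.
        System.out.println(describe("Pat", 33));
        System.out.println(
            String.format(Locale.US, "My name is %s and I am %d years old.", "Pat", 33));
    }
}
```

An explicit Formatter is mainly useful when you want to append many formatted pieces to one destination; for a single string, String.format() is simpler.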

The Format String

The syntax of the format string is compact and a bit cryptic at first, but not bad once you get used to it. The simplest format string is just a percent sign (%) followed by a conversion character. For example, the following text has two embedded format strings:

    "My name is %s and I am %d years old."

The first conversion character is s, the most general format, which represents a string value; and the second is d, which represents an integer value. There are about a dozen basic conversion characters corresponding to different types and primitives and there are a couple of dozen more that are specifically used for formatting dates and times. We cover the basics here and return to date and time formatting in Chapter 11.

At first glance, some of the conversion characters may not seem to do much. For example, the %s general string conversion in our previous example would actually have handled the job of displaying the numeric age argument just as well as %d. However, these specialized conversion characters accomplish three things. First, they add a level of type safety. By specifying %d, we ensure that only an integer type is formatted at that location. If we make a mistake in the arguments, we get a runtime IllegalFormatConversionException instead of garbage in our string (and your IDE may flag it as well). Second, the format method is Locale-sensitive and capable of displaying numbers, percentages, dates, and times in many different languages just by specifying a Locale as an argument. By telling the Formatter the type of argument with type-specific conversion characters, printf can take into account language-specific localizations. Third, additional flags and fields can be used to govern layout with different meanings for different types of arguments. For example, with floating-point numbers, you can specify a precision in the format string.

The general layout of the embedded format string is as follows:

    %[argument_index$][flags][width][.precision]conversion_type

Following the literal % are a number of optional items before the conversion type character. We’ll discuss these as they come up, but here’s the rundown. The argument index can be used to reorder or reuse individual arguments in the variable-length argument list by referring to them by number. The flags field holds one or more special flag characters governing the format. The width and precision fields control the size of the output for text and the number of digits displayed for floating-point numbers.

String Conversions

The conversion character s represents the general string conversion type. Ultimately, all of the conversion types produce a String. What we mean is that the general string conversion takes the easy route to turning its argument into a string. Normally, this simply means calling toString() on the object. Since all of the arguments in the variable argument list are autoboxed, they are all Objects. Any primitives are represented by the results of calling toString() on their wrapper classes, which generally return the value as you’d expect. If the argument is null, the result is the String “null.”

More interesting are objects that implement the java.util.Formattable interface. For these, the argument’s formatTo() method is invoked, passing it the Formatter along with the flags, width, and precision information and allowing it to write its own representation to the output. In this way, objects can control their own printf string representation, just as an object can do so using toString().
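
Here is a minimal sketch of a Formattable object. The Temperature class is hypothetical and invented for this example; it chooses to interpret the precision as a number of decimal places, which is one possible design and not something the interface requires. A width or precision of -1 means that none was specified.

```java
import java.util.Formattable;
import java.util.FormattableFlags;
import java.util.Formatter;
import java.util.Locale;

class Temperature implements Formattable {
    private final double celsius;

    Temperature(double celsius) {
        this.celsius = celsius;
    }

    @Override
    public void formatTo(Formatter formatter, int flags, int width, int precision) {
        // Assumption for this sketch: precision selects the decimal places (default 1).
        int digits = (precision == -1) ? 1 : precision;
        String text = String.format(Locale.ROOT, "%." + digits + "f C", celsius);

        // Honor the width and the '-' flag by delegating to a plain %s conversion.
        boolean left = (flags & FormattableFlags.LEFT_JUSTIFY) != 0;
        String spec = (width == -1) ? "%s" : "%" + (left ? "-" : "") + width + "s";
        formatter.format(spec, text);
    }

    public static void main(String[] args) {
        Temperature t = new Temperature(21.5);
        System.out.println(String.format("'%s'", t));    // '21.5 C'
        System.out.println(String.format("'%.2s'", t));  // '21.50 C'
        System.out.println(String.format("'%10s'", t));  // '    21.5 C'
        System.out.println(String.format("'%-10s'", t)); // '21.5 C    '
    }
}
```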

Width, precision, and justification

For simple text arguments, you can think of the width and precision as a minimum and maximum number of characters to be output. As we’ll see later, for floating-point numeric types, the precision changes meaning slightly and controls the number of digits displayed after the decimal point. We can see the effect on a simple string here:

    System.out.printf("String is '%5s'\n", "A");
    // String is '    A'
    System.out.printf("String is '%.5s'\n", "Happy Birthday!");
    // String is 'Happy'

In the first case, we specified a width of five characters, resulting in spaces being added to pad our argument. In the second example, we used the literal . followed by the precision value of 5 characters to limit the length of the string displayed, so our “Happy Birthday” string is truncated after the first five characters.

When our string was padded, it was right-justified (leading spaces added). You can control this with the flag character literal minus (-). Reversing our example:

    System.out.printf("String is '%-5s'\n", "A");
    // String is 'A    '

And, of course, we can combine all three, specifying a justification flag and a minimum and maximum width. Here is an example that prints words of varying lengths in two columns:

    String [] words =
       new String [] { "abalone", "ape", "antidisestablishmentarianism" };
    System.out.printf( "%-10s %s\n", "Word", "Length" );
    for ( String word : words )
       System.out.printf( "%-10.10s %s\n", word, word.length() );

    // output
    Word       Length
    abalone    7
    ape        3
    antidisest 28

Uppercase

The s conversion’s big brother S indicates that the output of the conversion should be forced to uppercase. Several other primitive and numeric conversion characters follow this pattern, as we’ll see later. For example:

    String word = "abalone";
    System.out.printf( "The lucky word is: %S\n", word );
    // The lucky word is: ABALONE

Numbered arguments

You can refer to an arbitrary argument by number from a format string using the %n$ notation. For example, the following code snippet uses the single argument three times:

    System.out.printf( "A %1$s is a %1$s is a %1$S...", "rose" );
     // A rose is a rose is a ROSE...

Numbered arguments are useful for two reasons. The first, shown here, is simply for reusing the same argument in different places and with different conversions. The usefulness of this becomes more apparent when we look at Date and Time formatting in Chapter 11, where we may refer to the same item half a dozen times to get individual fields. The second advantage is that numbered arguments give the message the flexibility to reorder the arguments. This is important when you’re using formatting strings to lay out a message for internationalization or customization purposes where convention may dictate a different ordering.

    log.format("Error %d : %s\n", errNo, errMsg );
    // Error 42 : Low Power
    log.format("%2$s (Error %1$d)\n", errNo, errMsg );
    // Low Power (Error 42)
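
The reordering idea can be sketched as a method that takes the template as a parameter, the way a localized application might load it from a resource. The class ReorderDemo and the method render() are hypothetical names for this illustration.

```java
import java.util.Locale;

class ReorderDemo {
    // The caller supplies the template; the argument order passed to
    // format() never changes, only the indexes used inside the template.
    static String render(String template, int errNo, String errMsg) {
        return String.format(Locale.US, template, errNo, errMsg);
    }

    public static void main(String[] args) {
        System.out.println(render("Error %d : %s", 42, "Low Power"));
        // Error 42 : Low Power
        System.out.println(render("%2$s (Error %1$d)", 42, "Low Power"));
        // Low Power (Error 42)
    }
}
```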

Primitive and Numeric Conversions

Table 10-3 shows character and Boolean conversion characters.

Table 10-3. Character and Boolean conversion characters

Conversion  Type       Description                                Example output
----------  ---------  -----------------------------------------  --------------
c           Character  Formats the result as a Unicode character  a
b, B        Boolean    Formats result as Boolean                  true, FALSE

The c conversion character produces a Unicode character:

    System.out.printf("The first letter is: %c\n", 'a' );

The b and B conversion characters output the Boolean value of their arguments. If the argument is null, the output is false. Strangely, if the argument is of a type other than Boolean, the output is true. B is identical to b except that it forces the output to uppercase.

    System.out.printf( "The door is open: %b\n", ( door.status() == OPEN ) );
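
The rules for b are easy to trip over, so here is a short sketch that exercises them. The class BooleanConversionDemo and the method show() are hypothetical names for this example. Note in particular that the String "false" is not a Boolean, so it formats as true.

```java
class BooleanConversionDemo {
    // Applies the %b conversion to an arbitrary argument.
    static String show(Object arg) {
        return String.format("%b", arg);
    }

    public static void main(String[] args) {
        System.out.println(show(Boolean.FALSE));           // false
        System.out.println(show(null));                    // false
        System.out.println(show("false"));                 // true (a non-null, non-Boolean argument)
        System.out.println(show(0));                       // true
        System.out.println(String.format("%B", false));    // FALSE
        System.out.println(String.format("'%6b'", true));  // '  true'
        System.out.println(String.format("%c", 65));       // A (an int code point)
    }
}
```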

As for String types, a width value can be specified on c and b conversions to pad the result to a minimum length. Table 10-4 summarizes integer type conversion characters.

Table 10-4. Integer type conversion characters

Conversion  Type               Description                                       Example output
----------  -----------------  ------------------------------------------------  --------------
d           Integer            Formats the result as an integer.                 999
x, X        Integer            Formats result as hexadecimal.                    ff, FF, 0xcafe
o           Integer            Formats result as octal integer.                  10, 010
h, H        Integer or object  Formats object as hexadecimal number. If object   7a71e498
                               is not an integer, formats its hashCode() value,
                               or “null” for a null value.

The d, x, and o conversion characters handle the integer type values byte, short, int, and long. (The d stands for decimal, meaning base 10, in contrast to the hexadecimal x and octal o conversions.) The h conversion is an oddity probably intended for debugging. Several important flags give additional control over the formatting of these numeric types. See the section “Flags” for details.

A width value can be specified on these conversions to pad the result. Precision values are not allowed on integer conversions.

Table 10-5 lists floating-point type conversion characters.

Table 10-5. Floating-point type conversion characters

Conversion  Type            Description                                        Example output
----------  --------------  -------------------------------------------------  --------------
f           Floating point  Formats result as decimal number.                  3.14
e, E        Floating point  Formats result in scientific notation.             3.000000e+08
g, G        Floating point  Formats result in either decimal or scientific     3.14, 1.00e-14
                            notation depending on value and precision.
a, A        Floating point  Formats result as hexadecimal floating-point       0x1.fep7
                            number with significand and exponent.

The f conversion character is the primary floating-point conversion character. e and g conversions allow for values to be formatted in scientific notation. a complements the ability in Java to assign floating-point values using hexadecimal significand and exponent notation, allowing bit-for-bit floating-point values to be displayed without ambiguity.

As always, a width value may be used to pad results to a minimum length. The precision value of the conversion, as its name suggests, controls the number of digits displayed after the decimal point for floating-point values. The value is rounded as necessary. If no precision value is specified, it defaults to six digits:

    printf("float is %f\n",   1.23456789); // float is 1.234568
    printf("float is %.3f\n", 1.23456789); // float is 1.235
    printf("float is %.1f\n", 1.23456789); // float is 1.2
    printf("float is %.0f\n", 1.23456789); // float is 1

The g conversion character determines whether to use decimal or scientific notation. First, the value is rounded to the specified precision. If the result is less than 10^-4 (less than .0001) or greater than or equal to 10^precision (10 to the power of the precision value), it is displayed in scientific notation. Otherwise, decimal notation is used.
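
The following sketch shows the g conversion switching between the two notations with the default precision of 6. The class GeneralDemo and the method general() are hypothetical names; Locale.US is used only to pin down the decimal separator. Notice that for g the precision counts significant digits, so trailing zeros are kept.

```java
import java.util.Locale;

class GeneralDemo {
    // Formats a value with the %g conversion at the default precision (6).
    static String general(double value) {
        return String.format(Locale.US, "%g", value);
    }

    public static void main(String[] args) {
        System.out.println(general(3.14));       // 3.14000
        System.out.println(general(0.00001234)); // 1.23400e-05 (below 10^-4)
        System.out.println(general(1234567.0));  // 1.23457e+06 (at least 10^6)
    }
}
```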

Flags

Table 10-6 summarizes supported flags to use in format strings.

Table 10-6. Flags for format strings

Flag  Arg types  Description                                                 Example output
----  ---------  ----------------------------------------------------------  --------------
-     Any        Left-justifies result (pad space on the right)              'foo '
+     Numeric    Prefixes a + sign on positive results                       +1
' '   Numeric    Prefixes a space on positive results (aligning them with    ' 1'
                 negative values)
0     Numeric    Pads number with leading zeros to accommodate width         000001
                 requirement
,     Numeric    Formats numbers with commas or other Locale-specific        1,234,567
                 grouping characters
(     Numeric    Encloses negative numbers in parentheses (a convention      (42.50)
                 used to show credits)
#     x, X, o    Uses an alternate form for octal and hexadecimal output     0xcafe, 010

As mentioned earlier, the - flag can be used to left-justify formatted output. The remaining flags affect the display of numeric types as described.

The # alternate form flag can be used to print octal and hexadecimal values with their standard prefixes—0x for hexadecimal or 0 for octal:

    System.out.printf("%1$x, %1$#x", 0xCAFE ); // cafe, 0xcafe (%#X gives 0XCAFE)
    System.out.printf("%1$o, %1$#o", 8 ); // 10, 010
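
To see each flag in isolation, the sketch below runs one value through each of them. The class FlagsDemo and the method fmt() are hypothetical names for this example, and Locale.US is assumed so that the grouping character is a comma.

```java
import java.util.Locale;

class FlagsDemo {
    // Formats a single argument with the given format specifier under Locale.US.
    static String fmt(String spec, Object arg) {
        return String.format(Locale.US, spec, arg);
    }

    public static void main(String[] args) {
        System.out.println(fmt("'%-4s'", "foo"));  // 'foo '
        System.out.println(fmt("%+d", 1));         // +1
        System.out.println(fmt("'% d'", 1));       // ' 1'
        System.out.println(fmt("%06d", 1));        // 000001
        System.out.println(fmt("%,d", 1234567));   // 1,234,567
        System.out.println(fmt("%(.2f", -42.5));   // (42.50)
        System.out.println(fmt("%#x", 0xCAFE));    // 0xcafe
        System.out.println(fmt("%#X", 0xCAFE));    // 0XCAFE
        System.out.println(fmt("%#o", 8));         // 010
    }
}
```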

Miscellaneous

Table 10-7 lists the remaining formatting items.

Table 10-7. Miscellaneous formatting items

Conversion  Description
----------  ------------------------------------------------------------------
%           Produces a literal % character (Unicode \u0025)
n           Produces the platform-specific line separator (e.g., newline or
            carriage-return, newline)
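
A brief sketch of both items in use follows. The class PercentAndNewline and the method progressLine() are hypothetical names for this example.

```java
import java.util.Locale;

class PercentAndNewline {
    // %% emits a literal percent sign; %n emits the platform's line separator.
    static String progressLine(int percent) {
        return String.format(Locale.US, "%d%% complete%n", percent);
    }

    public static void main(String[] args) {
        System.out.print(progressLine(50)); // 50% complete
    }
}
```

Using %n rather than a hardcoded \n keeps files written on different platforms consistent with local conventions.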

Formatting with the java.text Package

The java.text package includes, among other things, a set of classes designed for generating and parsing string representations of objects. In this section, we’ll talk about three classes: NumberFormat, ChoiceFormat, and MessageFormat. Chapter 11 describes the DateFormat class. As we said earlier, the classes of the java.text package overlap to a large degree with the capabilities of the Scanner and printf-style Formatter. Despite these new features, a number of areas in the parsing of currencies, dates, and times can only be handled with the java.text package.

The NumberFormat class can be used to format and parse currency, percentages, or plain old numbers. NumberFormat is an abstract class, but it has several useful factory methods that produce formatters for different types of numbers. For example, to format or parse currency strings, use getCurrencyInstance() :

    double salary = 1234.56;
    String here =     // $1,234.56
        NumberFormat.getCurrencyInstance().format(salary);
    String italy =    // for example, 1.234,56 €
        NumberFormat.getCurrencyInstance(Locale.ITALY).format(salary);

The first statement generates an American salary, with a dollar sign, a comma to separate thousands, and a period as a decimal point. The second statement presents the same value using Italian conventions, with a euro sign, a period to separate thousands, and a comma as a decimal point (the exact placement of the currency symbol depends on the Java version and its locale data). Remember that NumberFormat worries about format only; it doesn’t attempt to do currency conversion. We can go the other way and parse a formatted value using the parse() method, as we’ll see in the next example.

Likewise, getPercentInstance() returns a formatter you can use for generating and parsing percentages. If you do not specify a Locale when calling a getInstance() method, the default Locale is used:

    double progress = 0.44;
    NumberFormat pf = NumberFormat.getPercentInstance();
    System.out.println( pf.format(progress) );    // "44%"
    try {
        System.out.println( pf.parse("77.2%") );  // "0.772"
    }
    catch (ParseException e) {}

And if you just want to generate and parse plain old numbers, use a NumberFormat returned by getInstance() or its equivalent, getNumberInstance() :

    NumberFormat guiseppe = NumberFormat.getInstance(Locale.ITALY);

    // uses the default Locale (U.S. in this example)
    NumberFormat joe = NumberFormat.getInstance();

    try {
      double theValue = guiseppe.parse("34.663,252").doubleValue();
      System.out.println(joe.format(theValue));  // "34,663.252"
    }
    catch (ParseException e) {}

We use guiseppe to parse a number in Italian format (periods separate thousands, comma is the decimal point). The return type of parse() is Number, so we use the doubleValue() method to retrieve the value of the Number as a double. Then we use joe to format the number correctly for the default (U.S.) locale.
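
The same conversion can be sketched with explicit locales on both sides, so that the result does not depend on the machine's default. The class NumberFormatRoundTrip and the method italianToUs() are hypothetical names; wrapping the ParseException in an IllegalArgumentException is just one way to report bad input.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class NumberFormatRoundTrip {
    // Parses a number written with Italian conventions and
    // reformats it with U.S. conventions.
    static String italianToUs(String text) {
        try {
            Number value = NumberFormat.getInstance(Locale.ITALY).parse(text);
            return NumberFormat.getInstance(Locale.US).format(value.doubleValue());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not an Italian-formatted number: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(italianToUs("34.663,252")); // 34,663.252
        System.out.println(
            NumberFormat.getPercentInstance(Locale.US).format(0.44)); // 44%
    }
}
```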

Here’s a list of the factory methods for text formatters in the java.text package. Again, we’ll look at the DateFormat methods in the next chapter.

    NumberFormat.getCurrencyInstance()
    NumberFormat.getCurrencyInstance(Locale inLocale)
    NumberFormat.getInstance()
    NumberFormat.getInstance(Locale inLocale)
    NumberFormat.getNumberInstance()
    NumberFormat.getNumberInstance(Locale inLocale)
    NumberFormat.getPercentInstance()
    NumberFormat.getPercentInstance(Locale inLocale)

    DateFormat.getDateInstance()
    DateFormat.getDateInstance(int style)
    DateFormat.getDateInstance(int style, Locale aLocale)
    DateFormat.getDateTimeInstance()
    DateFormat.getDateTimeInstance(int dateStyle, int timeStyle)
    DateFormat.getDateTimeInstance(int dateStyle, int timeStyle, Locale aLocale)
    DateFormat.getInstance()
    DateFormat.getTimeInstance()
    DateFormat.getTimeInstance(int style)
    DateFormat.getTimeInstance(int style, Locale aLocale)

Thus far, we’ve seen how to format numbers as text. Now, we’ll take a look at a class, ChoiceFormat, that maps numerical ranges to text. ChoiceFormat is constructed by specifying the numerical ranges and the strings that correspond to them. One constructor accepts an array of doubles and an array of Strings, where each string corresponds to the range running from the matching number up to (but not including) the next number in the array:

    double[] limits = new double [] {0, 20, 40};
    String[] labels = new String [] {"young", "less young", "old"};
    ChoiceFormat cf = new ChoiceFormat(limits, labels);
    System.out.println(cf.format(12)); //"young"
    System.out.println(cf.format(26)); // "less young"

You can specify both the limits and the labels using a special string in an alternative ChoiceFormat constructor:

    ChoiceFormat cf = new ChoiceFormat("0#young|20#less young|40#old");
    System.out.println(cf.format(40)); // old
    System.out.println(cf.format(50)); // old

The limit and value pairs are separated by vertical bars (|); the number sign (#) separates each limit from its corresponding value.
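
The boundaries are worth checking explicitly: each limit is inclusive at the bottom of its range, and values below the first limit fall back to the first label. The class AgeLabels and the method label() in this sketch are hypothetical names.

```java
import java.text.ChoiceFormat;

class AgeLabels {
    private static final ChoiceFormat FORMAT =
        new ChoiceFormat("0#young|20#less young|40#old");

    // Maps an age to the label of the range it falls in.
    static String label(double age) {
        return FORMAT.format(age);
    }

    public static void main(String[] args) {
        System.out.println(label(19.99)); // young
        System.out.println(label(20));    // less young
        System.out.println(label(40));    // old
        System.out.println(label(-5));    // young (below the first limit)
    }
}
```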

ChoiceFormat is most useful for handling pluralization in messages, enabling you to avoid hideous constructions such as, “you have one file(s) open.” You can create readable error messages by using ChoiceFormat along with the MessageFormat class.

MessageFormat

MessageFormat is a string formatter that uses a pattern string in the same way that printf() formatting does. MessageFormat has largely been replaced by printf(), which has more options and is more widely used outside of Java. Nonetheless, some may still prefer MessageFormat’s style, which is a bit less cryptic than that of printf(). MessageFormat has a static formatting method, MessageFormat.format(), paralleling the printf-style formatting of String.format().

Arguments in a MessageFormat format string are delineated by curly brackets and may include information about how they should be formatted. Each argument consists of a number, an optional type, and an optional style, as summarized in Table 10-8.

Table 10-8. MessageFormat arguments

Type    Styles
------  -----------------------------------
choice  pattern
date    short, medium, long, full, pattern
number  integer, percent, currency, pattern
time    short, medium, long, full, pattern

Let’s use an example to clarify this:

    //Equivalent to String.format("You have %s messages.", "no");
    MessageFormat.format("You have {0} messages.", "no");

The special incantation {0} means “use element zero of the arguments supplied to the format() method.” When we generate a message by calling format(), we pass in values to replace the placeholders ({0}, {1}, ... ) in the template. In this case, we pass the string “no” as arguments[0], yielding the result, You have no messages.

Let’s try this example again, but this time, we’ll format a number and a date instead of a string argument:

    MessageFormat mf = new MessageFormat(
        "You have {0, number, integer} messages on {1, date, long}.");
        // "You have 93 messages on April 10, 2002."

    // The instance format() method takes its arguments as a single Object array
    System.out.println( mf.format( new Object [] { 93, new Date() } ) );

In this example, we need to fill in two spaces in the template, so we need two arguments. The first must be a number and is formatted as an integer. The second must be a Date and is printed in the long format.

This is still sloppy. What if there is only one message? To make this grammatically correct, we can embed a ChoiceFormat-style pattern string in our MessageFormat pattern string:

    MessageFormat mf = new MessageFormat(
      "You have {0, number, integer} message{0, choice, 0#s|1#|2#s}.");
    // "You have 1 message."
    System.out.println( mf.format( new Object [] { 1 } ) );

In this case, we use the first argument twice: once to supply the number of messages and once to provide input to the ChoiceFormat pattern. The pattern says to add an s if the argument has the value 0 or is 2 or more.
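
Here is a self-contained sketch of the pluralization pattern that checks all three cases. The class MessageCount and the method describe() are hypothetical names; the pattern is the one shown above with the optional spaces removed, and Locale.US is assumed for the number formatting.

```java
import java.text.MessageFormat;
import java.util.Locale;

class MessageCount {
    private static final String PATTERN =
        "You have {0,number,integer} message{0,choice,0#s|1#|2#s}.";

    // The instance format() method expects the arguments as an Object array.
    static String describe(int count) {
        return new MessageFormat(PATTERN, Locale.US).format(new Object[] { count });
    }

    public static void main(String[] args) {
        System.out.println(describe(0)); // You have 0 messages.
        System.out.println(describe(1)); // You have 1 message.
        System.out.println(describe(2)); // You have 2 messages.
    }
}
```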

When writing internationalized programs, you can use resource bundles to supply not only the text of messages, but also the format strings for your MessageFormat objects. In this way, you can automatically format messages that are in the appropriate language with dates and other language-dependent fields handled appropriately and in the appropriate order. Because arguments in the format string are numbered, you can refer to them in any location. For example, in English, you might say, “Disk C has 123 files”; in some other language, you might say, “123 files are on Disk C.” You could implement both messages with the same set of arguments:

    MessageFormat m1 = new MessageFormat(
        "Disk {0} has {1, number, integer} files.");
    MessageFormat m2 = new MessageFormat(
        "{1, number, integer} files are on disk {0}.");

In real life, the code could be more compact; you’d use only a single MessageFormat object, initialized with a string taken from a resource bundle. Or you’d likely want to use the static format method or switch to printf() entirely.

Regular Expressions

Now it’s time to take a brief detour on our trip through Java and enter the land of regular expressions. A regular expression, or regex for short, describes a text pattern. Regular expressions are used with many tools—including the java.util.regex package, text editors, and many scripting languages—to provide sophisticated text-searching and powerful string-manipulation capabilities.

If you are already familiar with the concept of regular expressions and how they are used with other languages, you may wish to skim through this section. At the very least, you’ll need to look at the “The java.util.regex API” section later in this chapter, which covers the Java classes necessary to use them. On the other hand, if you’ve come to this point on your Java journey with a clean slate on this topic and you’re wondering exactly what regular expressions are, then pop open your favorite beverage and get ready. You are about to learn about the most powerful tool in the arsenal of text manipulation and what is, in fact, a tiny language within a language, all in the span of a few pages.

Regex Notation

A regular expression describes a pattern in text. By pattern, we mean just about any feature you can imagine identifying in text from the literal characters alone, without actually understanding their meaning. This includes features, such as words, word groupings, lines and paragraphs, punctuation, case, and more generally, strings and numbers with a specific structure to them, such as phone numbers, email addresses, and quoted phrases. With regular expressions, you can search the dictionary for all the words that have the letter “q” without its pal “u” next to it, or words that start and end with the same letter. Once you have constructed a pattern, you can use simple tools to hunt for it in text or to determine if a given string matches it. A regex can also be arranged to help you dismember specific parts of the text it matched, which you could then use as elements of replacement text if you wish.

Write once, run away

Before moving on, we should say a few words about regular expression syntax in general. At the beginning of this section, we casually mentioned that we would be discussing a new language. Regular expressions do, in fact, constitute a simple form of programming language. If you think for a moment about the examples we cited earlier, you can see that something like a language is going to be needed to describe even simple patterns—such as email addresses—that have some variation in form.

A computer science textbook would classify regular expressions at the bottom of the hierarchy of computer languages, in terms of both what they can describe and what you can do with them. They are still capable of being quite sophisticated, however. As with most programming languages, the elements of regular expressions are simple, but they can be built up in combination to arbitrary complexity. And that is where things start to get sticky.

Since regexes work on strings, it is convenient to have a very compact notation that can be easily wedged between characters. But compact notation can be very cryptic, and experience shows that it is much easier to write a complex statement than to read it again later. Such is the curse of the regular expression. You may find that in a moment of late-night, caffeine-fueled inspiration, you can write a single glorious pattern to simplify the rest of your program down to one line. When you return to read that line the next day, however, it may look like Egyptian hieroglyphics to you. Simpler is generally better. If you can break your problem down and do it more clearly in several steps, maybe you should.

Escaped characters

Now that you’re properly warned, we have to throw one more thing at you before we build you back up. Not only can the regex notation get a little hairy, but it is also somewhat ambiguous with ordinary Java strings. An important part of the notation is the escaped character, a character with a backslash in front of it. For example, the escaped d character, \d, (backslash ‘d’) is shorthand that matches any single digit character (0-9). However, you cannot simply write \d as part of a Java string, because Java uses the backslash for its own special characters and to specify Unicode character sequences (\uxxxx). Fortunately, Java gives us a replacement: an escaped backslash, which is two backslashes (\\), means a literal backslash. The rule is, when you want a backslash to appear in your regex, you must escape it with an extra one:

    "\\d" // Java string that yields backslash "d"

And just to make things crazier, because regex notation itself uses backslash to denote special characters, it must provide the same “escape hatch” as well—allowing you to double up backslashes if you want a literal backslash. So if you want to specify a regular expression that includes a single literal backslash, it looks like this:

    "\\\\"  // Java string yields two backslashes; regex yields one

Most of the “magic” operator characters you read about in this section operate on the character that precedes them, so these also must be escaped if you want their literal meaning. This includes such characters as ., *, +, braces {}, and parentheses ().

If you need to create part of an expression that has lots of literal characters in it, you can use the special delimiters \Q and \E to help you. Any text appearing between \Q and \E is automatically escaped. (You still need the Java String escapes—double backslashes for backslash, but not quadruple.) There is also a static method Pattern.quote(), which does the same thing, returning a properly escaped version of whatever string you give it.

Beyond that, my only suggestion to help maintain your sanity when working with these examples is to keep two copies—a comment line showing the naked regular expression and the real Java string, where you must double up all backslashes.
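Here is a minimal sketch of that two-copies habit. The class and method names are our own, chosen only for illustration; each Java string is preceded by a comment showing the naked regex it produces.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // regex: \d+
    static final String DIGITS = "\\d+";

    // regex: \\        (matches one literal backslash)
    static final String BACKSLASH = "\\\\";

    // regex: \Q1+1=2\E (everything between \Q and \E is taken literally)
    static final String QUOTED_SUM = "\\Q1+1=2\\E";

    static boolean isDigits(String s) {
        return s.matches(DIGITS);
    }

    static boolean containsBackslash(String s) {
        return Pattern.compile(BACKSLASH).matcher(s).find();
    }

    static boolean isLiteralSum(String s) {
        // Pattern.quote() should accept exactly the same text as the \Q...\E form
        return s.matches(QUOTED_SUM) && s.matches(Pattern.quote("1+1=2"));
    }

    public static void main(String[] args) {
        System.out.println(isDigits("2024"));
        System.out.println(containsBackslash("C:\\temp"));
        System.out.println(isLiteralSum("1+1=2"));
        // Unquoted, + is an operator, so the pattern 1+1=2 matches "11=2" instead
        System.out.println("11=2".matches("1+1=2"));
    }
}
```

All four lines print true. The last one is the reason quoting matters: without it, the plus sign is read as an operator rather than as a character.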

Characters and character classes

Now, let’s dive into the actual regex syntax. The simplest form of a regular expression is plain, literal text, which has no special meaning and is matched directly (character for character) in the input. This can be a single character or more. For example, in the following string, the pattern “s” can match the character s in the words rose and is:

    "A rose is $1.99."

The pattern “rose” can match only the literal word rose. But this isn’t very interesting. Let’s crank things up a notch by introducing some special characters and the notion of character “classes.”

Any character: dot (.)

The special character dot (.) matches any single character. The pattern “.ose” matches rose, nose, _ose (space followed by ose) or any other character followed by the sequence ose. Two dots match any two characters, and so on. The dot operator is not discriminating; it normally stops only for an end-of-line character (and, optionally, you can tell it not to; we discuss that later).

We can consider “.” to represent the group or class of all characters. And regexes define more interesting character classes as well.

Whitespace or nonwhitespace character: \s, \S

The special character \s matches a literal-space character or one of the following characters: \t (tab), \r (carriage return), \n (newline), \f (formfeed), and \x0B (vertical tab). The corresponding special character \S does the inverse, matching any character except whitespace.

Digit or nondigit character: \d, \D

\d matches any of the digits 0-9. \D does the inverse, matching all characters except digits.

Word or nonword character: \w, \W

\w matches a “word” character, including upper- and lowercase letters A-Z, a-z, the digits 0-9, and the underscore character (_). \W matches everything except those characters.

Custom character classes

You can define your own character classes using the notation [...]. For example, the following class matches any of the characters a, b, c, x, y, or z:

    [abcxyz]

The special x-y range notation can be used as shorthand for the alphabetic characters. The following example defines a character class containing all upper- and lowercase letters:

    [A-Za-z]

Placing a caret (^) as the first character inside the brackets inverts the character class. This example matches any character except uppercase A-F:

    [^A-F]    //  G, H, I, ..., a, b, c, ... etc.

Nesting character classes simply adds them:

    [A-F[G-Z]]   // A-Z

The && logical AND notation can be used to take the intersection (characters in common):

    [a-p&&[l-z]]  // l, m, n, o, p
    [A-Z&&[^P]]  // A through Z except P
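To try these classes out, a small sketch like the following may help. The helper matchesChar() is hypothetical (our own name, not part of the API); it simply tests one character against a one-character pattern using String's matches() method.

```java
class CharClassDemo {
    // Test a single character against a regex that describes one character
    static boolean matchesChar(String regex, char c) {
        return String.valueOf(c).matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matchesChar("\\s", '\t'));          // regex: \s
        System.out.println(matchesChar("\\d", '7'));           // regex: \d
        System.out.println(matchesChar("\\w", '_'));           // regex: \w
        System.out.println(matchesChar("[^A-F]", 'G'));
        System.out.println(matchesChar("[A-F[G-Z]]", 'Q'));
        System.out.println(matchesChar("[a-p&&[l-z]]", 'm'));
        System.out.println(matchesChar("[A-Z&&[^P]]", 'P'));   // the only false
    }
}
```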

Position markers

The pattern “[Aa] rose” (including an upper- or lowercase A) matches three times in the following phrase:

    "A rose is a rose is a rose"

Position characters allow you to designate the relative location of a match. The most important are ^ and $, which match the beginning and end of a line, respectively:

    ^[Aa] rose  // matches "A rose" at the beginning of line
    [Aa] rose$  // matches "a rose" at end of line

By default, ^ and $ match the beginning and end of “input,” which is often a line. If you are working with multiple lines of text and wish to match the beginnings and endings of lines within a single large string, you can turn on “multiline” mode as described later in this chapter.

The position markers \b and \B match a word boundary or nonword boundary, respectively. For example, the following pattern matches rose and rosemary, but not primrose:

    \brose
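The position markers can be checked with a short sketch that counts matches. It borrows the Pattern and Matcher classes covered later in this chapter; the class name and the countMatches() helper are our own, for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionDemo {
    // Count how many times the regex is found in the text
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String phrase = "A rose is a rose is a rose";
        System.out.println(countMatches("[Aa] rose", phrase));   // 3
        System.out.println(countMatches("^[Aa] rose", phrase));  // 1
        System.out.println(countMatches("[Aa] rose$", phrase));  // 1
        // regex: \brose
        System.out.println(countMatches("\\brose", "rose rosemary primrose")); // 2
    }
}
```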

Iteration (multiplicity)

Simply matching fixed character patterns would not get us very far. Next, we look at operators that count the number of occurrences of a character (or more generally, of a pattern, as we’ll see in “Capture groups”):

Any (zero or more iterations): asterisk (*)

Placing an asterisk (*) after a character or character class means “allow any number of that type of character”—in other words, zero or more. For example, the following pattern matches a digit with any number of leading zeros (possibly none):

    0*\d   // match a digit with any number of leading zeros

Some (one or more iterations): plus sign (+)

The plus sign (+) means “one or more” iterations and is equivalent to XX* (pattern followed by pattern asterisk). For example, the following pattern matches a number with one or more digits, plus optional leading zeros:

    0*\d+   // match a number (one or more digits) with optional leading 
            // zeros

It may seem redundant to match the zeros at the beginning of an expression because zero is a digit and is thus matched by the \d+ portion of the expression anyway. However, we’ll show later how you can pick apart the string using a regex and get at just the pieces you want. In this case, you might want to strip off the leading zeros and keep only the digits.

Optional (zero or one iteration): question mark (?)

The question mark operator (?) allows exactly zero or one iteration. For example, the following pattern matches a credit-card expiration date, which may or may not have a slash in the middle:

    \d\d/?\d\d  // match four digits with an optional slash in the middle

Range (between x and y iterations, inclusive): {x,y}

The {x,y} curly-brace range operator is the most general iteration operator. It specifies a precise range to match. A range takes two arguments: a lower bound and an upper bound, separated by a comma. This regex matches any word with five to seven characters, inclusive:

    \b\w{5,7}\b  // match words with at least 5 and at most 7 characters

At least x or more iterations (y is infinite): {x,}

If you omit the upper bound, simply leaving a dangling comma in the range, the upper bound becomes infinite. This is a way to specify a minimum of occurrences with no maximum.
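The iteration operators can be exercised with a few one-line checks. This is only a sketch; the class and method names are hypothetical. Because String's matches() method requires the whole string to match, the \b markers are not needed here.

```java
class IterationDemo {
    // regex: 0*\d+
    static boolean isNumber(String s) {
        return s.matches("0*\\d+");
    }

    // regex: \d\d/?\d\d
    static boolean isExpiration(String s) {
        return s.matches("\\d\\d/?\\d\\d");
    }

    // regex: \w{5,7}
    static boolean isMediumWord(String s) {
        return s.matches("\\w{5,7}");
    }

    // regex: \d{3,}   (three or more digits, no upper bound)
    static boolean isThreeOrMoreDigits(String s) {
        return s.matches("\\d{3,}");
    }

    public static void main(String[] args) {
        System.out.println(isNumber("00042"));            // true
        System.out.println(isExpiration("12/25"));        // true
        System.out.println(isExpiration("1225"));         // true
        System.out.println(isMediumWord("elephant"));     // false: 8 characters
        System.out.println(isThreeOrMoreDigits("12"));    // false
        System.out.println(isThreeOrMoreDigits("123456")); // true
    }
}
```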

Grouping

Just as in logical or mathematical operations, parentheses can be used in regular expressions to make subexpressions or to put boundaries on parts of expressions. This power lets us extend the operators we’ve talked about to work not only on characters, but also on words or other regular expressions. For example:

    (yada)+

Here we are applying the + (one or more) operator to the whole pattern yada, not just one character. It matches yada, yadayada, yadayadayada, and so on.

Using grouping, we can start building more complex expressions. For example, while many email addresses have a three-part structure (e.g., foo@bar.com), the domain name portion can, in actuality, contain an arbitrary number of dot-separated components. To handle this properly, we can use an expression like this one:

    \w+@\w+(\.\w+)+   // Match an email address

This expression matches a word, followed by an @ symbol, followed by another word and then one or more literal dot-separated words, as in foo@bar.com or foo@mail.bar.com.
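As a rough sketch of grouping at work, the following checks a few strings against the form \w+@\w+(\.\w+)+, in which each literal dot is followed by a whole word. The class name, method names, and sample addresses are our own illustrations, and this pattern is far too simple to validate real email addresses.

```java
class GroupingDemo {
    // regex: \w+@\w+(\.\w+)+
    static final String EMAIL = "\\w+@\\w+(\\.\\w+)+";

    static boolean looksLikeEmail(String s) {
        return s.matches(EMAIL);
    }

    // regex: (yada)+
    static boolean isYada(String s) {
        return s.matches("(yada)+");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("foo@bar.com"));      // true
        System.out.println(looksLikeEmail("foo@mail.bar.com")); // true
        System.out.println(looksLikeEmail("foo@bar"));          // false: no dot-separated part
        System.out.println(isYada("yadayadayada"));             // true
        System.out.println(isYada("yadayad"));                  // false
    }
}
```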

Capture groups

In addition to basic grouping of operations, parentheses have an important, additional role: the text matched by each parenthesized subexpression can be separately retrieved. That is, you can isolate the text that matched each subexpression. There is then a special syntax for referring to each capture group within the regular expression by number. This important feature has two uses.

First, you can construct a regular expression that refers to the text it has already matched and uses this text as a parameter for further matching. This allows you to express some very powerful things. For example, we can show the dictionary example we mentioned in the introduction. Let’s find all the words that start and end with the same letter:

    \b(\w)\w*\1\b  // match words beginning and ending with the same letter

See the \1 in this expression? It’s a reference to the first capture group in the expression, (\w). References to capture groups take the form \n where n is the number of the capture group, counting from left to right. In this example, the first capture group matches a word character on a word boundary. Then we allow any number of word characters up to the special reference \1 (also followed by a word boundary). The \1 means “the value matched in capture group one.” Because these characters must be the same, this regex matches words that start and end with the same character.
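To see the back reference in action, here is a small sketch that collects every word in a sentence that starts and ends with the same character. It uses the Pattern and Matcher classes described later in this chapter; the class name, method name, and sample sentence are our own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SameLetterDemo {
    // regex: \b(\w)\w*\1\b
    static final Pattern SAME_ENDS = Pattern.compile("\\b(\\w)\\w*\\1\\b");

    // Return every word that begins and ends with the same character
    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = SAME_ENDS.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [dad, that, level, high]
        System.out.println(find("dad said that level is high"));
    }
}
```

Note that this pattern needs at least two characters, so a one-letter word such as "a" is not reported.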

The second use of capture groups is in referring to the matched portions of text while constructing replacement text. We’ll show you how to do that a bit later when we talk about the Regular Expression API.

Capture groups can contain more than one character, of course, and you can have any number of groups. You can even nest capture groups. Next, we discuss exactly how they are numbered.

Numbering

Capture groups are numbered, starting at 1, and moving from left to right, by counting the number of open parentheses it takes to reach them. The special group number 0 always refers to the entire expression match. For example, consider the following pattern:

    one ((two) (three (four)))

Matched against the string "one two three four", this pattern creates the following groups:

    Group 0: one two three four
    Group 1: two three four
    Group 2: two
    Group 3: three four
    Group 4: four

Before going on, we should note one more thing. So far in this section we’ve glossed over the fact that parentheses are doing double duty: creating logical groupings for operations and defining capture groups. What if the two roles conflict? Suppose we have a complex regex that uses parentheses to group subexpressions and to create capture groups? In that case, you can use a special noncapturing group operator (?:) to do logical grouping instead of using parentheses. You probably won’t need to do this often, but it’s good to know.
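The numbering rule, and the effect of a noncapturing group on it, can be checked with a short sketch. The groups() helper is hypothetical; it uses the Matcher methods covered later in this chapter to return group 0 followed by each numbered group.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Return group 0 (the whole match) followed by groups 1..n,
    // or an empty array if the text does not match the regex.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "one two three four";
        // [one two three four, two three four, two, three four, four]
        System.out.println(Arrays.toString(
            groups("one ((two) (three (four)))", text)));
        // With (?: ) the outer parentheses group but do not capture:
        // [one two three four, two, three four, four]
        System.out.println(Arrays.toString(
            groups("one (?:(two) (three (four)))", text)));
    }
}
```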

Alternation

The vertical bar (|) operator denotes the logical OR operation, also called alternation or choice. The | operator does not operate on individual characters but instead applies to everything on either side of it. It splits the expression in two unless constrained by parentheses grouping. For example, a slightly naive approach to parsing dates might be the following:

    \w+, \w+ \d+, \d+|\d\d/\d\d/\d\d\d\d  // pattern 1 or pattern 2

In this expression, the left matches patterns such as Fri, Oct 12, 2001, and the right matches 10/12/2001.

The following regex might be used to match email addresses with one of three domains (net, edu, and gov):

    \w+@[\w\.]*\.(net|edu|gov)  // email address ending in .net, .edu, or .gov
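The way | splits an expression unless parentheses constrain it is easy to get wrong, so a quick sketch may be useful. The "cat|dog food" patterns and the sample address are our own illustrations; the email pattern is the one shown above.

```java
class AlternationDemo {
    // regex: \w+@[\w\.]*\.(net|edu|gov)
    static final String EMAIL = "\\w+@[\\w\\.]*\\.(net|edu|gov)";

    static boolean isAllowedDomain(String address) {
        return address.matches(EMAIL);
    }

    public static void main(String[] args) {
        // Unconstrained: the alternatives are "cat" and "dog food"
        System.out.println("cat".matches("cat|dog food"));        // true
        System.out.println("cat food".matches("cat|dog food"));   // false
        // Constrained: (cat|dog) followed by " food"
        System.out.println("cat food".matches("(cat|dog) food")); // true

        System.out.println(isAllowedDomain("pat@cs.example.edu")); // true
        System.out.println(isAllowedDomain("pat@example.com"));    // false
    }
}
```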

Special options

There are several special options that affect the way the regex engine performs its matching. These options can be applied in two ways:

  • You can pass in one or more flags during the Pattern.compile() step (discussed later in this chapter).

  • You can include a special block of code in your regex.

We’ll show the latter approach here. To do this, include one or more flags in a special block (?x), where x is the flag for the option we want to turn on. Generally, you do this at the beginning of the regex. You can also turn off flags by adding a minus sign (?-x), which allows you to apply flags to select parts of your pattern.

The following flags are available:

Case-insensitive: (?i)

The (?i) flag tells the regex engine to ignore case while matching, for example:

    (?i)yahoo   // match Yahoo, yahoo, yahOO, etc.

Dot all: (?s)

The (?s) flag turns on “dot all” mode, allowing the dot character to match anything, including end-of-line characters. It is useful if you are matching patterns that span multiple lines. The s stands for “single-line mode,” a somewhat confusing name derived from Perl.

Multiline: (?m)

By default, ^ and $ don’t really match the beginning and end of lines (as defined by carriage return or newline combinations); they instead match the beginning or end of the entire input text. Turning on multiline mode with (?m) causes them to match the beginning and end of every line as well as the beginning and end of input. Specifically, this means the spot before the first character, the spot after the last character, and the spots just after and before line terminators inside the string.

Unix lines: (?d)

The (?d) flag limits the definition of the line terminator for the ^, $, and . special characters to Unix-style newline only (\n). By default, other terminators, such as a lone carriage return (\r) and carriage return newline (\r\n), are also recognized.
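The embedded flags can be compared side by side in a short sketch. The class name and count() helper are our own; the test strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Count how many times the regex is found in the text
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // (?i): ignore case
        System.out.println("YaHoO".matches("(?i)yahoo"));        // true
        // (?-i) turns the flag off again for the rest of the pattern
        System.out.println("YAHOO".matches("(?i)ya(?-i)hoo"));   // false

        // (?s): let dot match the newline between a and b
        System.out.println("a\nb".matches("a.b"));               // false
        System.out.println("a\nb".matches("(?s)a.b"));           // true

        // (?m): let ^ match at the start of every line
        String lines = "one\ntwo\nthree";
        System.out.println(count("^\\w+", lines));               // 1
        System.out.println(count("(?m)^\\w+", lines));           // 3
    }
}
```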

Greediness

We’ve seen hints that regular expressions are capable of sorting some complex patterns. But there are cases where what should be matched is ambiguous (at least to us, though not to the regex engine). Probably the most important example has to do with the number of characters the iterator operators consume before stopping. The .* operation best illustrates this. Consider the following string:

    "Now is the time for <bold>action</bold>, not words."

Suppose we want to search for all the HTML-style tags (the parts between the < and > characters), perhaps because we want to remove them.

We might naively start with this regex:

    </?.*>  // match <, optional /, and then anything up to >

We then get the following match, which is much too long:

    <bold>action</bold>

The problem is that the .* operation, like all the iteration operators, is by default “greedy,” meaning that it consumes absolutely everything it can, up until the last match for the terminating character (in this case, >) in the file or line.

There are solutions for this problem. The first is to “say what it is”: that is, to be specific about what is allowed between the brackets. The content of an HTML tag cannot include just anything; for example, it cannot include a closing bracket (>). So we could rewrite our expression as:

    </?\w*>  // match <, optional /, any number of word characters, then >

But suppose the content is not so easy to describe. For example, we might be looking for quoted strings in text, which could include just about any text. In that case, we can use a second approach and “say what it is not.” We can invert our logic from the previous example and specify that anything except a closing bracket is allowed inside the brackets:

    </?[^>]*>

This is probably the most efficient way to tell the regex engine what to do. It then knows exactly what to look for to stop reading. This approach has limitations, however. It is not obvious how to do this if the delimiter is more complex than a single character. It is also not very elegant.

Finally, we come to our general solution: the use of “reluctant” operators. For each of the iteration operators, there is an alternative, nongreedy form that consumes as few characters as possible, while still trying to get a match with what comes after it. This is exactly what we needed in our previous example.

Reluctant operators take the form of the standard operator with a “?” appended. (Yes, we know that’s confusing.) We can now write our regex as:

    </?.*?> // match <, optional /, minimum number of any chars, then >

We have appended ? to .* to cause .* to match as few characters as possible while still making the final match of >. The same technique (appending the ?) works with all the iteration operators, as in the two following examples:

    .+?   // one or more, nongreedy
    .{x,y}?  // between x and y, nongreedy
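The difference between the greedy and reluctant forms shows up clearly when we run both against the sample string. In this sketch, the class name and the firstMatch() helper are hypothetical; replaceAll() is the String convenience method for regex replacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    static final String TEXT =
        "Now is the time for <bold>action</bold>, not words.";

    // Return the text of the first match, or null if there is none
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("</?.*>", TEXT));    // <bold>action</bold>
        System.out.println(firstMatch("</?[^>]*>", TEXT)); // <bold>
        System.out.println(firstMatch("</?.*?>", TEXT));   // <bold>
        // Removing the tags with the reluctant form:
        // Now is the time for action, not words.
        System.out.println(TEXT.replaceAll("</?.*?>", ""));
    }
}
```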

Lookaheads and lookbehinds

In order to understand our next topic, let’s return for a moment to the position marking characters (^, $, \b, and \B) that we discussed earlier. Think about what exactly these special markers do for us. We say, for example, that the \b marker matches a word boundary. But the word “match” here may be a bit too strong. In reality, it “requires” a word boundary to appear at the specified point in the regex. Suppose we didn’t have \b; how could we construct it? Well, we could try constructing a regex that matches the word boundary. It might seem easy, given the word and nonword character classes (\w and \W):

    \w\W|\W\w  // match the start or end of a word

But now what? We could try inserting that pattern into our regular expressions wherever we would have used \b, but it’s not really the same. We’re actually matching those characters, not just requiring them. This regular expression matches the two characters composing the word boundary in addition to whatever else matches afterward, whereas the \b operator simply requires the word boundary but doesn’t match any text. The distinction is that \b isn’t a matching pattern but a kind of lookahead. A lookahead is a pattern that is required to match next in the string, but is not consumed by the regex engine. When a lookahead pattern succeeds, the pattern moves on, and the characters are left in the stream for the next part of the pattern to use. If the lookahead fails, the match fails (or it backtracks and tries a different approach).

We can make our own lookaheads with the lookahead operator (?=). For example, to match the letter X at the end of a word, we could use:

    (?=\w\W)X  // Find X at the end of a word

Here the regex engine requires the \w\W pattern to match but does not consume the characters, leaving them for the next part of the pattern. This effectively allows us to write overlapping patterns (like the previous example). For instance, we can match the word “Pat” only when it’s part of the word “Patrick,” like so:

    (?=Patrick)Pat  // Find Pat only in Patrick

Another operator, (?!), the negative lookahead, requires that the pattern not match. We can find all the occurrences of Pat not inside of a Patrick with this:

    (?!Patrick)Pat  // Find Pat never in Patrick

It’s worth noting that we could have written all of these examples in other ways, by simply matching a larger amount of text. For instance, in the first example we could have matched the whole word “Patrick.” But that is not as precise, and if we wanted to use capture groups to pull out the matched text or parts of it later, we’d have to play games to get what we want. For example, suppose we wanted to substitute something for Pat (say, change the font). We’d have to use an extra capture group and replace the text with itself. Using lookaheads is easier.

In addition to looking ahead in the stream, we can use the (?<=) and (?<!) lookbehind operators to look backward in the stream. For example, we can find my last name, but only when it refers to me:

    (?<=Pat )Niemeyer  // Niemeyer, only when preceded by Pat

Or we can find the string “bean” when it is not part of the phrase “Java bean”:

    (?<!Java )bean   // The word bean, not preceded by "Java "

In these cases, the lookbehind and the matched text didn’t overlap because the lookbehind was before the matched text. But you can place a lookahead or lookbehind at either point—before or after the match—for example, we could also match Pat Niemeyer like this:

    Niemeyer(?<=Pat Niemeyer)
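Because lookaheads and lookbehinds consume nothing, the easiest way to see their effect is to look at where matches start. The following sketch does that; the starts() helper and the sample sentences are our own. As an assumption on our part, the "bean" pattern here uses a single space inside the lookbehind, which keeps the lookbehind at a fixed width.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Return the starting index of every match of the regex in the text
    static List<Integer> starts(String regex, String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String names = "Pat and Patrick";
        System.out.println(starts("(?=Patrick)Pat", names)); // [8]
        System.out.println(starts("(?!Patrick)Pat", names)); // [0]

        String people = "Pat Niemeyer met Bob Niemeyer";
        System.out.println(starts("(?<=Pat )Niemeyer", people));         // [4]
        System.out.println(starts("Niemeyer(?<=Pat Niemeyer)", people)); // [4]

        // [18]: only the second "bean" is not preceded by "Java "
        System.out.println(starts("(?<!Java )bean", "Java bean, coffee bean"));
    }
}
```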

The java.util.regex API

Now that we’ve covered the theory of how to construct regular expressions, the hard part is over. All that’s left is to investigate the Java API for applying regexes: searching for them in strings, retrieving captured text, and replacing matches with substitution text.

Pattern

As we’ve said, the regex patterns that we write as strings are, in actuality, little programs describing how to match text. At runtime, the Java regex package compiles these little programs into a form that it can execute against some target text. Several simple convenience methods accept strings directly to use as patterns. More generally, however, Java allows you to explicitly compile your pattern and encapsulate it in an instance of a Pattern object. This is the most efficient way to handle patterns that are used more than once, because it eliminates needlessly recompiling the string. To compile a pattern, we use the static method Pattern.compile():

    Pattern urlPattern = Pattern.compile("\\w+://[\\w/]*");

Once you have a Pattern, you can ask it to create a Matcher object, which associates the pattern with a target string:

    Matcher matcher = urlPattern.matcher( myText );

The matcher executes the matches. We’ll talk about that next. But before we do, we’ll just mention one convenience method of Pattern. The static method Pattern.matches() simply takes two strings—a regex and a target string—and determines if the target matches the regex. This is very convenient if you want to do a quick test once in your application. For example:

    boolean match = Pattern.matches( "\\d+\\.\\d+f?", myText );

This line of code tests whether the string myText is a Java-style floating-point number such as “42.0f.” Note that the string must match completely in order to be considered a match.

The Matcher

A Matcher associates a pattern with a string and provides tools for testing, finding, and iterating over matches of the pattern against it. The Matcher is “stateful.” For example, the find() method tries to find the next match each time it is called. But you can clear the Matcher and start over by calling its reset() method.

If you’re just interested in “one big match”—that is, you’re expecting your string to either match the pattern or not—you can use matches() or lookingAt(). These correspond roughly to the methods equals() and startsWith() of the String class. The matches() method asks if the string matches the pattern in its entirety (with no string characters left over) and returns true or false. The lookingAt() method does the same, except that it asks only whether the string starts with the pattern and doesn’t care if the pattern uses up all the string’s characters.
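The three ways of asking the question can be compared in a short sketch. The class and method names are our own; the Matcher methods matches(), lookingAt(), and find() are the ones described in this section.

```java
import java.util.regex.Pattern;

class MatcherModesDemo {
    // regex: \d+
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Does the whole string match? (compare String.equals())
    static boolean whole(String s) {
        return NUMBER.matcher(s).matches();
    }

    // Does the string start with a match? (compare String.startsWith())
    static boolean prefix(String s) {
        return NUMBER.matcher(s).lookingAt();
    }

    // Is there a match anywhere in the string?
    static boolean anywhere(String s) {
        return NUMBER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("123abc"));    // false
        System.out.println(prefix("123abc"));   // true
        System.out.println(prefix("abc123"));   // false
        System.out.println(anywhere("abc123")); // true
    }
}
```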

More generally, you’ll want to be able to search through the string and find one or more matches. To do this, you can use the find() method. Each call to find() returns true or false for the next match of the pattern and internally notes the position of the matching text. You can get the starting and ending character positions with the Matcher start() and end() methods, or you can simply retrieve the matched text with the group() method. For example:

    import java.util.regex.*;

    String text="A horse is a horse, of course of course...";
    String pattern="horse|course";

    Matcher matcher = Pattern.compile( pattern ).matcher( text );
    while ( matcher.find() )
      System.out.println(
        "Matched: '"+matcher.group()+"' at position "+matcher.start() );

The previous snippet prints the starting location of the words “horse” and “course” (four in all):

    Matched: 'horse' at position 2
    Matched: 'horse' at position 13
    Matched: 'course' at position 23
    Matched: 'course' at position 33

The method to retrieve the matched text is called group() because it refers to capture group zero (the entire match). You can also retrieve the text of other numbered capture groups by giving the group() method an integer argument. You can determine how many capture groups you have with the groupCount() method:

    for (int i=1; i <= matcher.groupCount(); i++)
      System.out.println( matcher.group(i) );
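Putting find(), group(), and groupCount() together, here is a sketch that pulls name and value pairs out of a string. The pattern, the sample text, and the names used are our own assumptions for illustration. Note that groupCount() does not count group 0, so the loop runs from 1 up to and including groupCount().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupLoopDemo {
    // regex: (\w+)=(\d+)
    static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Collect the text of every numbered group from every match, in order
    static List<String> captures(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PAIR.matcher(text);
        while (matcher.find()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                found.add(matcher.group(i));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [width, 80, height, 24]
        System.out.println(captures("width=80, height=24"));
    }
}
```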

Splitting and tokenizing strings

A very common need is to parse a string into a bunch of fields based on some delimiter, such as a comma. It’s such a common problem that in Java 1.4, a method was added to the String class for doing just this. The split() method accepts a regular expression and returns an array of substrings broken around that pattern. For example:

    String text = "Foo, bar ,   blah";
    String [] fields = text.split( "\\s*,\\s*" );

yields a String array containing Foo, bar, and blah. You can control the maximum number of pieces returned, and also whether trailing “empty” strings (for text that might have appeared after a final delimiter) are kept, using an optional limit argument.

If you are going to use an operation like this more than a few times in your code, you should probably compile the pattern and use its split() method, which is identical to the version in String. The String split() method is equivalent to:

    Pattern.compile(pattern).split(string);
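The effect of the limit argument is easiest to see by example. The following sketch uses the String and Pattern split() methods; the class name and sample strings are our own.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // regex: \s*,\s*
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    public static void main(String[] args) {
        String text = "Foo, bar ,   blah";

        // [Foo, bar, blah]
        System.out.println(Arrays.toString(COMMA.split(text)));

        // A positive limit caps the number of pieces; the last piece
        // holds the rest of the input, delimiters and all
        System.out.println(Arrays.toString(text.split("\\s*,\\s*", 2)));

        // By default trailing empty strings are dropped: [a, b]
        System.out.println(Arrays.toString("a,b,,".split(",")));

        // A negative limit keeps them: four pieces, the last two empty
        System.out.println(Arrays.toString("a,b,,".split(",", -1)));
    }
}
```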

Another look at Scanner

As we mentioned when we introduced it, the Scanner class in Java 5.0 can use regular expressions to tokenize strings. You can specify a regular expression to use as the delimiter (instead of the default whitespace) either at construction time or with the useDelimiter() method. The Scanner next(), hasNext(), skip(), and findInLine() methods all take regular expressions as well. You can specify these either as strings or with a compiled Pattern object.

You can use the findInLine() method of Scanner as an improved Matcher. For example:

    Scanner scanner = new Scanner( "Quantity: 42 items, Price $2.34" );
    scanner.findInLine("[Qq]uantity[:\\s]*");
    int quantity=scanner.nextInt();
    scanner.findInLine("[Pp]rice.*\\$");
    float price=scanner.nextFloat();

The previous snippet locates the quantity and price values, allowing for variations in capitalization and spacing before the numbers.

Before we move on, we’ll also mention a “Stupid Scanner Trick” that, although we don’t recommend it, you might find amusing. Using the \A boundary marker, which denotes the beginning of input, as a delimiter, we can tell the Scanner to return the whole input as a single string. This is an easy way to read the contents of any stream into one large string:

    InputStream source  = new URL("http://www.oreilly.com/").openStream();
    String text = new Scanner( source ).useDelimiter("\\A").next();

This is probably not the most efficient or understandable way to do it, but it may save you a little typing in your experimentation.

Replacing text

A common reason that you’ll find yourself searching for a pattern in a string is to change it to something else. The regex package not only makes it easy to do this but also provides a simple notation to help you construct replacement text using bits of the matched text.

The most convenient form of this API is Matcher’s replaceAll() method, which substitutes a replacement string for each occurrence of the pattern and returns the result. For example:

    String text = "Richard Nixon's social security number is: 567-68-0515.";
    Matcher matcher =
      Pattern.compile("\\d\\d\\d-\\d\\d-\\d\\d\\d\\d").matcher( text );
    String output = matcher.replaceAll("XXX-XX-XXXX");

This code replaces all occurrences of U.S. government Social Security numbers with “XXX-XX-XXXX” (perhaps for privacy considerations).

Using captured text in a replacement

Literal substitution is nice, but we can make this more powerful by using capture groups in our substitution pattern. To do this, we use the simple convention of referring to numbered capture groups with the notation $n, where n is the group number. For example, suppose we wanted to show just a little of the Social Security number in the previous example, so that users would know if we were talking about them. We could modify our regex to catch, for example, the last four digits like so:

    \d\d\d-\d\d-(\d\d\d\d)

We can then use that in the substitution text:

    String output = matcher.replaceAll("XXX-XX-$1");

The static method Matcher.quoteReplacement() can be used to escape a literal string (so that it ignores the $ notation) before using it as replacement text.
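Here is a compact sketch of both ideas: a $1 reference in the replacement text, and quoteReplacement() for the case where a dollar sign should be taken literally. The class name, method names, and sample strings are our own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaskDemo {
    // regex: \d\d\d-\d\d-(\d\d\d\d)
    static final Pattern SSN =
        Pattern.compile("\\d\\d\\d-\\d\\d-(\\d\\d\\d\\d)");

    // Keep only the last four digits, captured as group 1
    static String mask(String text) {
        return SSN.matcher(text).replaceAll("XXX-XX-$1");
    }

    // Replace each match with a literal string, even if it contains $
    static String replaceLiteral(String text, String literal) {
        return SSN.matcher(text).replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        // number is: XXX-XX-0515.
        System.out.println(mask("number is: 567-68-0515."));
        // id $1 (hidden)
        System.out.println(replaceLiteral("id 567-68-0515", "$1 (hidden)"));
    }
}
```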

Controlling the substitution

The replaceAll() method is useful, but you may want more control over each substitution. You may want to change each match to something different or base the change on the match in some programmatic way.

To do this, you can use the Matcher appendReplacement() and appendTail() methods. These methods can be used in conjunction with the find() method as you iterate through matches to build a replacement string. appendReplacement() and appendTail() operate on a StringBuffer that you supply. The appendReplacement() method builds a replacement string by keeping track of where you are in the text and appending all nonmatched text to the buffer for you as well as the substitute text that you supply. Each call to appendReplacement() appends the intervening text since the previous match, followed by your replacement, then skips over all the matched characters to prepare for the next one. Finally, when you have reached the last match, you should call appendTail(), which appends any remaining text after the last match. We’ll show an example of this next, as we build a simple “template engine.”
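Before that fuller example, the bare mechanics can be sketched on a smaller problem: doubling every number that appears in a string. The class and method names here are hypothetical, chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendDemo {
    // Replace each run of digits with twice its numeric value
    static String doubleNumbers(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the unmatched text before this match, then our replacement
            m.appendReplacement(out, String.valueOf(doubled));
        }
        // Copies whatever follows the last match
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: 4 apples and 20 pears
        System.out.println(doubleNumbers("2 apples and 10 pears"));
    }
}
```

Each replacement is computed from the text of its own match, which is something replaceAll() with a fixed replacement string cannot do.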

Our simple template engine

Let’s tie what we’ve discussed together in a nifty example. A common problem in Java applications is working with bulky, multiline text. In general, you don’t want to store the text of messages in your application code because it makes them difficult to edit or internationalize. But when you move them to external files or resources, you need a way for your application to plug in information at runtime. The best example of this is in Java servlets; a generated HTML page is often 99% static text with only a few “variable” pieces plugged in. Technologies such as JSP and XSL were developed to address this. But these are big tools, and we have a simple problem. So let’s create a simple solution—a template engine.

Our template engine reads text containing special template tags and substitutes values that we provide. And because generating HTML or XML is one of the most important applications of this, we’ll be friendly to those formats by making our tags conform to the style of an XML comment. Specifically, our engine searches the text for tags that look like this:

    <!--TEMPLATE:name  This is the template for the user name -->

XML-style comments start with <!-- and can contain anything up to a closing -->. We’ll add the convention of requiring a TEMPLATE:name field to specify the name of the value we want to use. Aside from that, we’ll still allow any descriptive text the user wants to include. To be friendly (and consistent), we’ll allow any amount of whitespace to appear in the tags, including multiline text in the comments. We’ll also ignore the text case of the “TEMPLATE” identifier, just in case. Now, we could do this all with low-level String commands, looping over whitespace and taking many substrings. But using the power of regexes, we can do it much more cleanly and with only about seven lines of relevant code. (We’ve rounded out the example with a few more to make it more useful.)

    import java.util.*;
    import java.util.regex.*;


    public class Template
    {
        Properties values = new Properties();
        Pattern templateComment =
            Pattern.compile("(?si)<!--\\s*TEMPLATE:(\\w+).*?-->");

        public void set( String name, String value ) {
            values.setProperty( name, value );
        }

        public String fillIn( String text ) {
            Matcher matcher = templateComment.matcher( text );

            StringBuffer buffer = new StringBuffer();
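            while( matcher.find() ) {
                String name = matcher.group(1);
                // Use empty text for unknown names; a null replacement
                // would cause a NullPointerException
                String value = values.getProperty( name, "" );
                // Quote the value so that any $ or \ in it is taken literally
                matcher.appendReplacement(
                    buffer, Matcher.quoteReplacement( value ) );
            }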
            matcher.appendTail( buffer );
            return buffer.toString();
        }
    }

You’d use the Template class like this:

    String input = "<!-- TEMPLATE:name --> lives at "
       +"<!-- TEMPLATE:address -->";
    Template template = new Template();
    template.set("name", "Bob");
    template.set("address", "1234 Main St.");
    String output = template.fillIn( input );

In this code, input is a string containing tags for name and address. The set() method provides the values for those tags.

Let’s start by picking apart the regex, templateComment, in the example:

    (?si)<!--\s*TEMPLATE:(\w+).*?-->

It looks scary, but it’s actually very simple. Just start reading from left to right. First, we have the special flags declaration (?si), telling the regex engine that it should be in single-line mode, with the dot matching all characters including newlines (s), and that it should ignore case (i). Next, there is the literal <!-- followed by any amount of whitespace (\s*) and the TEMPLATE: identifier. After the colon, we have a capture group (\w+), which reads our name identifier and saves it for us to retrieve later. We allow anything (.*) up to the -->, being careful to specify that .* should be nongreedy (.*?). We don’t want .* to consume other opening and closing comment tags all the way to the last one, but instead to find the smallest match (one tag).
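
The effect of the nongreedy qualifier is easy to check by counting matches. The following sketch is only an illustration: the class GreedyDemo and its countMatches() helper are hypothetical names, and the greedy variant of the pattern exists solely for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // The pattern from the Template class, and a greedy variant for comparison.
    static final String NONGREEDY = "(?si)<!--\\s*TEMPLATE:(\\w+).*?-->";
    static final String GREEDY    = "(?si)<!--\\s*TEMPLATE:(\\w+).*-->";

    // Counts how many times the regex matches in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "<!-- TEMPLATE:name --> lives at "
            + "<!-- TEMPLATE:address -->";
        System.out.println(countMatches(NONGREEDY, input)); // 2: one per tag
        System.out.println(countMatches(GREEDY, input));    // 1: first <!-- to last -->
    }
}
```

With the greedy form, the single match swallows the text between the two tags as well, so the template engine would replace the entire line with the value for name.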

Our fillIn() method does the work, accepting a template string, searching it, and “replacing” the tag values with the values from set(), which we have stored in a Properties table. Each time fillIn() is called, it creates a Matcher to wrap the input string and get ready to apply the pattern. It then creates a temporary StringBuffer to hold the output and loops, using the Matcher find() method to get each tag. For each match, it retrieves the value of the capture group (group one) that holds the tag name. It looks up the corresponding value and replaces the tag with this value in the output string buffer using the appendReplacement() method. (Remember that appendReplacement() fills in the intervening text on each call, so we don’t have to.) All that remains is to call appendTail() at the end to get the remaining text after the last match and return the string value. That’s it!
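
For comparison, the same substitution can be written more compactly on newer platforms. The sketch below assumes Java 9 or later, where Matcher gained a replaceAll() overload that accepts a function, and where Map.of() is available. The class TemplateSketch and its static fillIn() method are hypothetical names and not part of the Template class above. It also exercises the multiline and case-insensitive behavior described earlier.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateSketch {
    static final Pattern TAG =
        Pattern.compile("(?si)<!--\\s*TEMPLATE:(\\w+).*?-->");

    // Assumes Java 9+: replaceAll() calls the function once per match.
    static String fillIn(String text, Map<String, String> values) {
        return TAG.matcher(text).replaceAll(match -> {
            // Unknown names are replaced with empty text.
            String value = values.getOrDefault(match.group(1), "");
            // The returned text is still scanned for $n, so quote it.
            return Matcher.quoteReplacement(value);
        });
    }

    public static void main(String[] args) {
        // A lowercase, multiline tag followed by a compact one.
        String input = "<!-- template:name\n  the user's name --> owes "
            + "<!--TEMPLATE:amount-->";
        Map<String, String> values = Map.of("name", "Bob", "amount", "$5");
        System.out.println(fillIn(input, values)); // Bob owes $5
    }
}
```

The explicit find() loop remains the more flexible form, because it lets you inspect or skip individual matches and works on older Java versions.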

We hope this section has shown you some of the power provided by these tools and whetted your appetite for more. Regexes allow you to work in ways you may not have considered before. Especially now, when the software world is focused on textual representations of almost everything—from data to user interfaces—via XML and HTML, having powerful text-manipulation tools is fundamental. Just remember to keep those regexes simple so you can reuse them again and again.



[29] When in doubt, measure it! If your String-manipulating code is clean and easy to understand, don’t rewrite it until someone proves to you that it is too slow. Chances are that they will be wrong. And don’t be fooled by relative comparisons. A millisecond is 1,000 times slower than a microsecond, but it still may be negligible to your application’s overall performance.

[30] On Mac OS X, the default encoding is MacRoman. In Windows, it is CP1252. On some Unix platforms it is ISO8859_1.