3. SQL Basics

Sections
SQL Syntax
SQL Standards and Conformance
Identifiers
Data Types
Character String Types
Binary Large Object Type
Exact Numeric Types
Approximate Numeric Types
Boolean Type
Datetime Types
Interval Types
Unique Identifiers
Other Data Types
Nulls

SQL is based on the relational model but doesn’t implement it faithfully. One departure from the model is that in SQL, primary keys are optional rather than mandatory. Consequently, tables without keys will accept duplicate rows, rendering some data inaccessible. A complete review of the many disparities is beyond the scope of this book (see the “Learning database design” section in Chapter 2). The upshot of these discrepancies is that DBMS users, and not the DBMS itself, are responsible for enforcing a relational structure. Another result is that the Model and SQL terms in Table 2.1 in Chapter 2 aren’t interchangeable.

With that warning, it’s time to learn SQL. An SQL program is a sequence of SQL statements executed in order. To write a program, you must know the rules that govern SQL syntax. This chapter explains how to write valid SQL statements and also covers data types and nulls.

SQL Syntax

Figure 3.1 shows the syntax of an example SQL statement (don’t worry about the actual meaning, or semantics, of the statement).

Figure 3.1An SQL statement, with a comment.

Diagram: An SQL statement and comment with parts labeled sequentially

  1. Comment.A comment is optional text that explains your program. Comments usually describe what a program does and how, or why code was changed. Comments are for humans—the DBMS ignores them. A comment is introduced by two consecutive hyphens and continues until the end of the line.
  2. SQL statement.An SQL statement is a valid combination of tokens introduced by a keyword. Tokens are the basic indivisible particles of the SQL language; they can’t be reduced grammatically. Tokens include keywords, identifiers, operators, literals (constants), and punctuation symbols.
  3. Clauses.An SQL statement has one or more clauses. In general, a clause is a fragment of an SQL statement that’s introduced by a keyword, is required or optional, and must be given in a particular order. SELECT, FROM, WHERE, and ORDER BY introduce the four clauses in this example.
  4. Keywords.A keyword is a word that SQL reserves because it has special meaning in the language. Using a keyword outside its specific context (as an identifier, for example) causes an error. DBMSs use a mix of standard and nonstandard keywords; search your DBMS documentation for keywords or reserved words. The keywords in this example are SELECT, FROM, WHERE, ORDER, and BY.
  5. Identifiers.Identifiers are words that you (or the database designer) use to name database objects such as tables, columns, aliases, indexes, and views. The identifiers in this example are au_fname, au_lname, authors, and state. For more information, see “Identifiers” later in this chapter.
  6. Terminating semicolon.An SQL statement ends with a semicolon.

SQL is a free-form language whose statements can:

Despite this flexibility, you should adopt a consistent style (Figure 3.2). I use uppercase keywords and lowercase identifiers and indent each clause on its own line; see “About This Book” for my typographic and syntax conventions.

Figure 3.2There aren’t many rules about how to format an SQL statement. This statement is equivalent to the one in Figure 3.1.

select au_fname
  ,        AU_LNAME
             FROM
 authors WhErE    state
= 'NY' order
             bY
AU_lnamE
      ;

Tips for SQL Syntax

Common Errors

Some common SQL programming errors are:

These errors usually are easy to catch and correct, even if your DBMS returns an obscure or unhelpful error message. Remember that the real error actually can occur well before the statement the DBMS flags as an error. For example, if you run

CREATE TABLE misspelled_name

then your DBMS will straightaway create a table with the bad name. Your error won’t show up until later, when you try to reference the table with, say,

SELECT * FROM correct_name

Tip: A common way to test and fix (debug) SQL programs is to use comments to temporarily stop SQL code from being executed. If you’re working on a long SQL statement and want to test only part of it, then you can comment out some of the code so that the DBMS sees it as comments and ignores it.

SQL Standards and Conformance

The ISO SQL technical committee has been revising the official SQL standard every few years since 1986. Each revision:

The standard is enormous—thousands of pages of dense specifications—and no vendor conforms (or ever will conform) to the entire thing. Instead, vendors try to conform to a subset of the standard called Core SQL. This level of conformance is the minimal category that vendors have to achieve to claim that they conform to standard SQL. The SQL-92 revision introduced levels of conformance, and later standards have them too, so when you read a DBMS’s conformance statement, note which SQL standard it’s referring to and which level. In fact, SQL-92 often is thought of as the standard because it defined many of the most vital and unchanging parts of the language. Except where noted, the SQL elements in this book are part of SQL-92 as well as later standards. The lowest level of SQL-92 conformance is called Entry (not Core).

Your programs should follow the SQL standard as closely as possible. Ideally, you should be able to write portable SQL programs without even knowing which DBMS you’re programming for. Unfortunately, the SQL committee is not made up of language theorists and relational-model purists but is top-heavy with commercial DBMS vendors, all jockeying and maneuvering. The result is that each DBMS vendor devotes resources to approach minimal Entry or Core SQL conformance requirements and then scampers off to add nonstandard features that differentiate their products in the marketplace—meaning that your SQL programs won’t be portable. These vendor-specific lock-ins often force you to modify or rewrite SQL programs to run on different DBMSs.

Of the DBMSs covered in this book, PostgreSQL is the “purest” with respect to the SQL standard. Your DBMS might offer settings that make it better conform to the SQL standard. MySQL has ANSI mode, for example, and Microsoft SQL Server has SET ANSI_DEFAULTS ON.

Tip: For information about using SQL across different versions of the same DBMS, see “Older DBMSs and Backward Compatibility of SQL”.

Identifiers

An identifier is a name that lets you refer to an object unambiguously within the hierarchy of database objects (whether a schema, database, column, key, index, view, constraint, or anything created with a CREATE statement). An identifier must be unique within its scope, which defines where and when it can be referenced. In general:

These rules let you duplicate names for objects whose scopes don’t overlap. You can give the same name to columns in different tables, for example, or to tables in different databases.

Tip: For information about addressing database objects, see Table 2.2 in Chapter 2.

DBMS scopes vary in the extent to which they require identifier names to be unique. Microsoft SQL Server requires an index name to be unique for only its table, for example, whereas Oracle and Db2 require an index name to be unique throughout the database. Search your DBMS documentation for identifiers or names.

Standard SQL has the following identifier rules for names:

Standard SQL distinguishes between reserved and non-reserved keywords. You can’t use reserved keywords as identifiers because they have special meaning in SQL. You can’t name a table “select” or a column “sum”, for example. Non-reserved keywords have a special meaning in only some contexts and can be used as identifiers in other contexts. Most non-reserved keywords actually are the names of built-in tables and functions, so it’s safest never to use them as identifiers either.

You can use a quoted identifier, also called a delimited identifier, to break some of SQL’s identifier rules. A quoted identifier is a name surrounded by double quotes. The name can contain spaces and special characters, is case sensitive, and can be a reserved keyword. Quoted identifiers can annoy other programmers and cause problems with third-party and even a vendor’s own tools, so using them usually is a bad idea.

Here’s some more advice for choosing identifier names:

Although you can’t use (unquoted) reserved words as identifiers, you can embed them in identifiers. group and max are illegal identifiers, for example, but groups and max_price are valid. If you’re worried that your identifier might be a reserved word in some other SQL dialect, then just add an underscore to the end of the name (element_, for example); no reserved keyword ends with an underscore.

You can surround Microsoft SQL Server quoted identifiers with double quotes or brackets ([]); brackets are preferred. In Db2, you can use reserved words as identifiers (but doing so isn’t a good idea because your program won’t be portable). MySQL ANSI_QUOTES mode allows double-quoted identifiers. DBMSs have their own nonstandard keywords; search your DBMS documentation for keywords or reserved words.

In MySQL, the case sensitivity of the underlying operating system determines the case sensitivity of database and table names.

The SQL standard directs DBMSs to convert identifier names to uppercase internally. So in the guts of your SQL compiler, the unquoted identifier myname is equivalent to the quoted identifier "MYNAME" (not "myname"). PostgreSQL doesn’t conform to the standard and converts to lowercase. To write portable programs, always quote a particular name or never quote it (don’t mix them). DBMSs aren’t consistent when it comes to case sensitivity, so the best practice is always to respect case for user-defined identifiers.

Data Types

Recall from “Tables, Columns, and Rows” in Chapter 2 that a domain is the set of valid values allowed in a column. To define a domain, you use a column’s data type (and constraints, described in Chapter 11). A data type, or column type, has these characteristics:

Tips for Data Types

Character String Types

Use character string data types to represent text. A character string, or simply string, has these characteristics:

Table 3.4Character String Types
Type Description
CHARACTER Represents a fixed number of characters. A string stored in a column defined as CHARACTER(length) can have up to length characters, where length is an integer greater than or equal to 1; the maximum length depends on the DBMS. When you store a string with fewer than length characters in a CHARACTER(length) column, the DBMS pads the end of the string with spaces to create a string that has exactly length characters. A CHARACTER(6) string 'Jack' is stored as 'Jack  ', for example. CHARACTER and CHAR are synonyms.
CHARACTER VARYING Represents a variable number of characters. A string stored in a column defined as CHARACTER VARYING(length) can have up to length characters, where length is an integer greater than or equal to 1; the maximum length depends on the DBMS. Unlike CHARACTER, when you store a string with fewer than length characters in a CHARACTER VARYING(length) column, the DBMS stores the string as is and doesn’t pad it with spaces. A CHARACTER VARYING(6) string 'Jack' is stored as 'Jack', for example. CHARACTER VARYING, CHAR VARYING, and VARCHAR are synonyms.
NATIONAL CHARACTER This data type is the same as CHARACTER except that it holds standardized multibyte characters or Unicode characters. In SQL statements, NATIONAL CHARACTER strings are written like CHARACTER strings but have an N in front of the first quote: N'βæþ', for example. NATIONAL CHARACTER, NATIONAL CHAR, and NCHAR are synonyms.
NATIONAL CHARACTER VARYING This data type is the same as CHARACTER VARYING except that it holds standardized multibyte characters or Unicode characters (see NATIONAL CHARACTER). NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING, and NCHAR VARYING are synonyms.
CLOB The character large object (CLOB) type is intended for use in library databases that hold vast amounts of text. A single CLOB value might hold an entire webpage, book, or genetic sequence, for example. CLOBs can’t be used as keys or in indexes and support fewer functions and operations than do CHAR and VARCHAR. In host languages, CLOBs are referenced with a unique locator (pointer) value, avoiding the overhead of transferring entire CLOBs across a client–server network. CLOB and CHARACTER LARGE OBJECT are synonyms.
NCLOB The national character large object (NCLOB) type is the same as CLOB except that it holds standardized multibyte characters or Unicode characters (see NATIONAL CHARACTER). NCLOB, NCHAR LARGE OBJECT, and NATIONAL CHARACTER LARGE OBJECT are synonyms.

Tips for Character Strings

Table 3.5DBMS Character String Types
DBMS Types
Access short text, long text (in older versions: text, memo)
SQL Server char, varchar, nchar, nvarchar
Oracle char, varchar2, clob, nchar, nvarchar2, nclob
Db2 char, varchar, clob, nchar, nvarchar, nclob, graphic, vargraphic, dbclob
MySQL char, varchar, nchar, nvarchar, tinytext, text, mediumtext, longtext
PostgreSQL char, varchar, text

Unicode

Computers store characters (letters, digits, punctuation, control characters, and other symbols) internally by assigning them unique numeric values. An encoding determines the mapping of characters to numeric values; different languages and computer operating systems use many different native encodings. Standard U.S.-English strings use ASCII encoding, which assigns values to 128 (27) different characters—not much, and not even enough to hold all the Latin characters used in modern European languages, much less all the Chinese ideographs.

Unicode is a single character set that represents the characters of almost all the world’s written languages. Unicode can encode up to about 4.3 billion (232) characters (using UTF-32 encoding). The Unicode Consortium develops and maintains the Unicode standard. The actual Unicode mappings are available in the latest online or printed edition of The Unicode Standard, available at unicode.org.

Binary Large Object Type

Use the binary large object (BLOB) data type to store binary data. A BLOB has these characteristics:

Tips for Binary Large Objects

Table 3.6DBMS BLOB Types
DBMS Types
Access ole object, attachment
SQL Server binary, varbinary
Oracle raw, long raw, blob, bfile
Db2 binary, varbinary, blob
MySQL binary, varbinary, tinyblob, blob, mediumblob, longblob
PostgreSQL bytea

Exact Numeric Types

Use exact numeric data types to represent exact numerical values. An exact numerical value has these characteristics:

Table 3.7Exact Numeric Types
Type Description
NUMERIC Represents a decimal number, stored in a column defined as NUMERIC(precision [,scale]). precision is greater than or equal to 1; the maximum precision depends on the DBMS. scale is a value from 0 to precision. If scale is omitted, then it defaults to zero (which makes the number effectively an INTEGER).
DECIMAL This data type is similar to NUMERIC, and some DBMSs define them equivalently. The difference is that the DBMS can choose a precision greater than that specified by DECIMAL(precision [,scale]), so precision specifies the minimum precision, not an exact precision as in NUMERIC. DECIMAL and DEC are synonyms.
INTEGER Represents an integer. The minimum and maximum values that can be stored in an INTEGER column depend on the DBMS. INTEGER takes no arguments. INTEGER and INT are synonyms.
SMALLINT This data type is the same as INTEGER except that it might hold a smaller range of values, depending on the DBMS. SMALLINT takes no arguments.
BIGINT This data type is the same as INTEGER except that it might hold a larger range of values, depending on the DBMS. BIGINT takes no arguments.

Tips for Exact Numeric Types

Table 3.9DBMS Exact Numeric Types
DBMS Types
Access byte, decimal, integer, long integer
SQL Server tinyint, smallint, int, bigint, bit, numeric, decimal, smallmoney, money
Oracle number
Db2 smallint, integer, bigint, decimal, numeric
MySQL tinyint, smallint, mediumint, int, bigint, decimal, numeric
PostgreSQL smallint, integer, bigint, decimal, numeric

Approximate Numeric Types

Use approximate numeric data types to represent approximate numerical values. An approximate numerical value has these characteristics:

Table 3.10Approximate Numeric Types
Type Description
FLOAT Represents a floating-point number, stored in a column defined as FLOAT(precision). precision is greater than or equal to 1 and expressed as the number of bits (not the number of digits); the maximum precision depends on the DBMS.
REAL This data type is the same as FLOAT except that the DBMS defines the precision. REAL numbers usually are called single-precision numbers. REAL takes no arguments.
DOUBLE PRECISION This data type is the same as FLOAT except that the DBMS defines the precision, which must be greater than that of REAL. DOUBLE PRECISION takes no arguments.

Tips for Approximate Numeric Types

Table 3.11DBMS Approximate Numeric Types
DBMS Types
Access single, double
SQL Server float, real
Oracle float, binary_float, binary_double
Db2 real, double, float, decfloat
MySQL float, double
PostgreSQL real, double precision

Boolean Type

Use the boolean data type to store truth values. A boolean value has these characteristics:

Tips for Boolean Values

Table 3.12DBMS Boolean Types
DBMS Types
Access yes/no
SQL Server bit
Oracle number(1)
Db2 decimal(1)
MySQL boolean, bit, tinyint(1)
PostgreSQL boolean

Datetime Types

Use datetime data types to represent the date and time of day. A datetime value has these characteristics:

Table 3.13Datetime Types
Type Description
DATE Represents a date. A date stored in a column defined as DATE has three integer fields—YEAR, MONTH, and DAY—and is formatted yyyy-mm-dd (length 10) (2006-03-17, for example). Table 3.14 lists the valid values for the fields. DATE takes no arguments.
TIME Represents a time of day. A time stored in a column defined as TIME has three fields—HOUR, MINUTE, and SECOND—and is formatted hh:mm:ss (length 8) (22:06:57, for example). You can specify fractional seconds with TIME(precision). precision is the number of fractional digits and is greater than or equal to zero. The maximum precision, which is at least 6, depends on the DBMS. HOUR and MINUTE are integers, and SECOND is a decimal number. The format is hh:mm:ss.ssss... (length 9 plus the number fractional digits) (22:06:57.1333, for example). Table 3.14 lists the valid values for the fields.
TIMESTAMP Represents a combination of DATE and TIME values separated by a space. The TIMESTAMP format is yyyy-mm-dd hh:mm:ss (length 19) (2006-03-17 22:06:57, for example). You can specify fractional seconds with TIMESTAMP(precision). The format is yyyy-mm-dd hh:mm:ss.ssss... (length 20 plus the number fractional digits).
TIME WITH TIMEZONE This data type is the same as TIME except that it adds a field, TIME_ZONE_OFFSET, to indicate the offset in hours from UTC. TIME_ZONE_OFFSET is formatted as INTERVAL HOUR TO MINUTE (see “Interval Types” later in this chapter) and can contain the values listed in Table 3.14. Append AT TIME ZONE time_zone_offset to the TIME to assign a value to the time zone (22:06:57 AT TIME ZONE -08:00, for example). Alternatively, you can append AT LOCAL to indicate that the time zone is the default for the session (22:06:57 AT LOCAL, for example). If the AT clause is omitted, then all times default to AT LOCAL.
TIMESTAMP WITH TIMEZONE This data type is the same as TIMESTAMP except that it adds a field, TIME_ZONE_OFFSET, to indicate the offset in hours from UTC. The syntax rules are the same as those of TIME WITH TIME ZONE except that you must include a date (2006-03-17 22:06:57 AT TIME ZONE -08:00, for example).
Table 3.14Valid Values for Datetime Fields
Type Description
YEAR 0001 to 9999
MONTH 01 to 12
DAY 01 to 31
HOUR 00 to 23
MINUTE 00 to 59
SECOND 00 to 61.999
TIME_ZONE_OFFSET -12:59 to +13:00

Tips for Datetime Values

Table 3.15DBMS Datetime Types
DBMS Types
Access date/time
SQL Server date, datetime, datetime2, datetimeoffset, smalldatetime, time
Oracle date, timestamp
Db2 date, time, timestamp
MySQL date, datetime, timestamp, time, year
PostgreSQL date, time, timestamp

Interval Types

DBMS conformance to standard SQL interval types is spotty or nonexistent, so you might not find this section to be useful in practice. DBMSs have their own extended data types and functions that calculate intervals and perform date and time arithmetic.

Use interval data types to represent sets of time values or spans of time. An interval value has these characteristics:

Table 3.16Interval Types
Type Description
Year-month These intervals contain only a year value, only a month value, or both. The valid column types are INTERVAL YEAR, INTERVAL YEAR(precision), INTERVAL MONTH, INTERVAL MONTH(precision), INTERVAL YEAR TO MONTH, or INTERVAL YEAR(precision) TO MONTH.
Day-time These intervals can contain a day value, hour value, minute value, second value, or some combination thereof. Some examples of the valid column types are INTERVAL MINUTE, INTERVAL DAY(precision), INTERVAL DAY TO HOUR, INTERVAL DAY(precision) TO SECOND, and INTERVAL MINUTE(precision) TO SECOND(frac_precision).

Tips for Interval Types

Table 3.17DBMS Interval Types
DBMS Types
Access Not supported
SQL Server Not supported
Oracle interval year to month, interval day to second
Db2 Not supported
MySQL Not supported
PostgreSQL interval

Unique Identifiers

Unique identifiers are used to generate primary-key values to identify rows (see “Primary Keys” in Chapter 2). An identifier can be unique universally (large random numbers unique in any context) or only within a specific table (simple serial numbers 1, 2, 3,...). Table 3.18 lists unique-identifier types and attributes for the DBMSs. See the DBMS documentation for size limits and usage restrictions. The SQL standard calls columns with auto-incrementing values identity columns. See also “Generating Sequences” in ChapterĀ 15.

Table 3.18Unique Identifiers (Types or Attributes)
DBMS Types or Attributes
Standard SQL IDENTITY
Access autonumber, replication id
SQL Server uniqueidentifier, identity
Oracle rowid, urowid, sequences
Db2 rowid, identity columns and sequences
MySQL auto_increment attribute
PostgreSQL smallserial, serial, bigserial, uuid

UUIDs

A universally unique ID is called a Universally Unique Identifier (UUID) or a Globally Unique Identifier (GUID). When you define a column to have a UUID data type, your DBMS will generate a random UUID automatically in each new row, probably according to ISO/IEC 9834-8 (iso.org) or IETF RFC 4122 (ietf.org).

A UUID in standard form looks like:

a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11

The letters actually are hexadecimal digits (a–f). Your DBMS might use an alternative form with uppercase hex digits, surrounding braces, or omitted hyphens. UUIDs aren’t technically guaranteed to be unique, but the probability of generating a duplicate ID is so tiny that they should be considered singular. For more information, read the Wikipedia article “Universally unique identifier”.

DBMSs provide functions that return UUIDs. Microsoft SQL Server has NEWID(), Oracle has SYS_GUID(), MySQL has UUID(), and PostgreSQL has gen_random_uuid(). Note that the Db2 function GENERATE_UNIQUE() returns a string that is not a UUID.

Other Data Types

The SQL standard defines other data types than the ones covered in the preceding sections, but some of them rarely are implemented or used in practice (ARRAY, MULTISET, REF, and ROW, for example). More useful are the extended (nonstandard) data types that are available in various DBMSs. Depending on your DBMS, you can find data types for:

User-Defined Types

Microsoft SQL Server, Oracle, Db2, and PostgreSQL let you create user-defined types (UDTs). The simplest UDT is a standard or built-in data type (CHARACTER, INTEGER, and so on) with additional check and other constraints. You can define the data type marital_status, for example, as a single-character CHARACTER data type that allows only the values S, M, W, D, or NULL (for single, married, widowed, divorced, or unknown). More-complex UDTs are similar to classes in object-oriented programming languages such as Java or Python. You can define a UDT once and use it in multiple tables, rather than repeat its definition in each table in which it’s used. Search your DBMS documentation for user-defined type. UDTs are created in standard SQL with the statement CREATE TYPE.

Nulls

When your data are incomplete, you can use a null to represent a missing or unknown value. A null has these characteristics:

Tips for Nulls