Without a doubt, the most common and most egregious problem I see repeated over and over in database design is using the wrong data types. This lesson will explore computer and SQL data types and provide some guidelines to avoid these common problems. The next lesson will be a “deep dive” into SQL data types and how best to use them.
What Are Types?
In computer parlance, a type is a shared representation of something, built on a lookup table indexed by a fixed number of bits. This is born not of convenience so much as necessity. There is only one native type in computers, and it’s the same in every computer: a bit. Every other type is derived from this one native item.
A bit is an artifact of computer hardware, where there are generally only two states: on and off. This is a “binary” state (one or the other), and so a bit reflects one of two binary states. For convenience, we write these states as zero and one. We’ll dig into types next, but the following sidebar looks at some of the history of using binary to represent more complex types, and that history goes back over 400 years!
A type (or, more accurately, a “data type”) can be viewed as a lookup container: a sequence of bits selects an item from the lookup table. Therefore, to create a type, you first determine how many items you need in your table. This is a key concept for typing: use the smallest table you can, to keep the bit representation as small as possible.
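To make the “smallest table” guideline concrete, here is a quick sketch (in Python, purely for illustration; the helper name is my own) of how many bits a lookup table of a given size requires:

```python
import math

def bits_needed(item_count: int) -> int:
    """Minimum number of bits whose 2**n states can index item_count items."""
    return max(1, math.ceil(math.log2(item_count)))

# A two-state flag needs 1 bit; a 32-entry table needs 5;
# 256 items fill exactly one 8-bit byte.
print(bits_needed(2))    # 1
print(bits_needed(32))   # 5
print(bits_needed(200))  # 8 (no savings over 256 -- bits come in whole units)
print(bits_needed(256))  # 8
```

Note that a 200-entry table needs just as many bits as a 256-entry one; sizing only pays off when it crosses a power-of-two boundary.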
Let’s look at some SQL examples. In SQL, there is a Bit data type which corresponds to our native bit type. It has only two possible values, zero or one (see the next sidebar for a possible third), which could correspond to True/False, Yes/No, On/Off, and other binary data options. With this data type, we consume only one bit per data field. Alternatively, with SQL numeric types, we could go with an eight-bit container called a “byte” (which, as we will see below, is how most computers store our bucket of bits); a byte can hold up to 2^8 or 256 values, and as numbers these range from zero to 255 (remember, zero is a count). As a review of SQL numeric data types will show, you can choose a container up to 8 bytes, or 64 bits, long! Of course, you could use that 8-byte, 64-bit container to hold your binary yes/no data, but this is wasted space and should be studiously avoided!
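As a sketch of these container choices, the following table computes the ranges using SQL Server’s type names (Bit, tinyint, smallint, int, bigint); other engines name and sign these containers differently (MySQL’s tinyint is signed by default, for instance), so treat it as illustrative rather than a portable specification:

```python
# Common SQL integer containers and the value ranges their bits allow.
# Names and signedness follow SQL Server conventions (an assumption).
sql_int_types = {
    "bit":      (1,  (0, 1)),
    "tinyint":  (8,  (0, 255)),              # unsigned one-byte integer
    "smallint": (16, (-2**15, 2**15 - 1)),
    "int":      (32, (-2**31, 2**31 - 1)),
    "bigint":   (64, (-2**63, 2**63 - 1)),
}

for name, (bits, (low, high)) in sql_int_types.items():
    print(f"{name:8s} {bits:2d} bits  {2**bits:,} distinct values  [{low:,} .. {high:,}]")
```

Each row has exactly 2^bits distinct values; signed types simply shift the window so roughly half the values are negative.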
Baudot’s requirements were simple: he only needed to represent upper-case alphabetic characters. Therefore, a container of five bits, giving thirty-two possible values, was adequate. How do we know that? Each location in his 5-bit container had only two possible states, so with five bits the lookup table could hold up to thirty-two values; that’s two raised to the fifth power (2^5). Good design, Baudot – too bad too many SQL developers don’t follow the guidelines you developed a century and a half ago 🙁
In our telegraphy problem, we have a stream of bits that we need to translate into text. Early devices used Baudot’s 5-bit code for upper-case characters, but demand for lower-case letters and other characters prompted later designers to move to six- and then seven-bit containers.
In the early days of electronic communication, a telegraph machine manufacturer could easily use a proprietary, custom lookup table for sending and receiving data, since the table was only used for transmission over its own equipment and did not need to interoperate with other systems. However, as the technology matured, the demand for a standard lookup table that worked across manufacturers and systems prompted the standards bodies to step in.
The American Standards Association created the American Standard Code for Information Interchange, or “ASCII”, specifying the first standard bit container: a 7-bit container holding up to 128 values – generally US alphanumeric characters plus a few special characters. Later, Extended ASCII was introduced, which used an 8-bit container. This is a key point: ASCII is just a standard lookup table indexed by a fixed count of bits.
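You can see the lookup-table nature of ASCII from any language that exposes character codes; this Python snippet is purely illustrative:

```python
# ASCII is a fixed lookup table: each 7-bit pattern maps to one character.
for ch in ["A", "a", "0"]:
    code = ord(ch)  # the character's position in the ASCII table
    print(ch, code, format(code, "07b"))

# 'A' sits at position 65, i.e. bit pattern 1000001.
# Every one of the 128 ASCII codes fits in 7 bits.
assert all(ord(c) < 128 for c in "Hello, ASCII!")
```

The table is the standard; the bits are just an index into it.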
So far, we’ve seen our technology progress from using data-typed bits for encryption to using them for telegraphic communication. The next evolution was in computer storage: disk, memory, etc. In storage, our stream of bits might become fragmented, so we need a way to store it in discrete buckets which can be spread across the storage medium. In most computer systems, this bucket was (and is) a “Byte”, an 8-bit container. It’s very important to understand that a Byte is not a data type. A Byte is simply a container of convenience in which to store an ordered stream of bits (we won’t talk here about whether we read from the left or the right; that’s a topic for another lesson). When we reconstitute our bit stream from storage, we do so using Bytes.
A key point: even though our data is stored in 8-bit “bytes”, the actual types we associate with our collection of bits can comprise more, or fewer, than these 8 bits. Bytes are important because they are the unit we use to read and write bits to and from storage. However, once the bytes are pulled from storage, they are placed into containers sized for the appropriate type, with the bytes aligned in the correct order within that container. In other words, we reconstitute our bit stream from storage using bits stored in bytes.
Let’s illustrate how these bits and bytes lead to data typing by considering a 16-bit Integer container type. We will ignore byte order (“endianness”) for now.
How many values can this container hold? 2^16 or 65,536 values (see next sidebar for discussion about “numbers”).
Let’s say we want to represent 31,312 in our 2-byte, 16-bit container. Our container should look like this: 01111010 01010000. If we had a continuous stream of bits, we’d simply take things in order and break the stream into a container every sixteen bits. But we have to contend with computer storage, and so we store bits in bytes – in this case, two bytes: Byte 1: 01111010; Byte 2: 01010000. Our computer makes our life easy by keeping track of where these two bytes are stored and in what order they need to be assembled to represent our type.
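This split-and-reassemble step can be demonstrated directly; the following Python sketch (illustrative, not part of any SQL engine) produces exactly the two bytes shown above:

```python
import struct

value = 31_312

# Split the 16-bit value into two bytes, most significant byte first.
raw = value.to_bytes(2, byteorder="big")
print([format(b, "08b") for b in raw])  # ['01111010', '01010000']

# Reconstituting the value from the stored bytes reverses the process.
assert int.from_bytes(raw, byteorder="big") == 31_312

# struct produces the same bytes: '>H' means a big-endian unsigned 16-bit int.
assert struct.pack(">H", value) == raw
```

If the two bytes were reassembled in the wrong order, we’d get a completely different number, which is why the computer must track both location and order.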
Before we move on to other types, let’s take a minute to ask: why do we care about the size of the container? Storage is cheap, right? Yes, but the little things add up, particularly in a database. We could store small numbers in large multi-byte containers and just pad them with leading zeros. But then the operating system would need to track the locations of all those needless bytes. Additionally, it would need to retrieve them from storage and hold them in volatile RAM; and when we processed the data, we’d need to move them through the CPU’s registers, which are very limited. The more bytes we have, the more instruction cycles (CPU) we use. In other words, the cost compounds at every layer – storage, memory, and CPU – so it’s critically important to use the smallest type you can for your data. The larger the data-type container we use, the more system resources it takes to store and process the data!
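The arithmetic behind “the little things add up” is easy to sketch. Assuming a hypothetical 100-million-row table with a single key column, the raw storage for just that one column at different container sizes works out as follows (type names follow SQL Server conventions):

```python
# Back-of-the-envelope cost of one oversized column in a hypothetical
# 100-million-row table.
rows = 100_000_000

for name, bytes_per_value in [("tinyint", 1), ("smallint", 2), ("bigint", 8)]:
    total_mb = rows * bytes_per_value / 1_000_000
    print(f"{name:8s} {total_mb:6,.0f} MB")

# bigint costs 4x the storage of smallint for the same key values -- and that
# difference is paid again in every index, every memory page, and every read.
```

The multiplier is paid repeatedly: once on disk, again for each index that includes the column, and again in cache and I/O on every query.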
Key Point: this is one of the most important jobs of a SQL developer and/or DBA: to correctly type the data. Mistyped data is the most common problem I see with database design and the most common performance issue. When it comes to data types, the smallest type that safely holds your data is always the best choice!
The next lesson will take a deep dive into SQL types but let me make this statement here and I will repeat it frequently. Always use numbers as the key to your tables; and use the smallest numeric type you can for your data.
To give you a sneak peek, let me ask you a question. Suppose you have a key column with a maximum of 30,000 entries, and you store the key as an nchar text value; it will take ten bytes of storage for each value. Can you figure out a better way to type this column – one that will take only 20% as much space, or just two bytes?
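If you want to check the arithmetic behind that question, this sketch assumes nchar stores UTF-16 at two bytes per character (as SQL Server does), so a five-digit key like 30000 needs an nchar(5):

```python
# Storage cost of a key whose largest value is 30,000, two ways.
max_key = 30_000

nchar5_bytes = 5 * 2    # nchar(5): five UTF-16 characters at 2 bytes each
smallint_bytes = 2      # a 16-bit integer container

# 30,000 fits comfortably in a signed 16-bit integer (max 32,767).
assert max_key <= 2**15 - 1

print(nchar5_bytes, smallint_bytes, smallint_bytes / nchar5_bytes)  # 10 2 0.2
```

Two bytes instead of ten: the 20% figure in the question drops straight out of the container sizes.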
Good technology just seems to show up over and over.
We’ve had a deep dive into data typing. We studied this in depth because choosing the wrong data types in SQL is the most common and egregious error I’ve seen. The problem is that things “just work” when the datasets are small and the database is not under load. This creates an illusion of correctness, which often leads to a terrible situation in production where things crash or slow to a crawl, and the developer responds sheepishly: “well, it worked on my machine”. Those are probably the worst words ever spoken by any developer.
In the next lesson, we’ll dig into SQL data types.