VARCHAR: Some pointers

Data types are crucial when building a data model. The right data type signals how the underlying data behaves and what constraints apply to its processing. During the modeling phase, the developer has to choose, for example, between integer and decimal, big integer and float, or numeric and double. The data type to adopt is finalized based on careful back-end analysis and user requirements.

In the majority of cases, the source data type is carried over to the target by default unless requirements force a conversion. Whenever a data type has to be converted, whether because of a requirement or a technical issue, care is taken to convert within the same family and to a wider type so that data fidelity is retained.
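As a minimal sketch of this principle in standard SQL (the table and column names here are hypothetical), widening within the numeric family, say from INTEGER to BIGINT or to a larger NUMERIC, preserves every value, whereas converting across families would not:

    -- Hypothetical staging table as delivered by the source system.
    CREATE TABLE stg_sales (
        sale_id INTEGER,
        amount  INTEGER
    );

    -- The target widens within the same family (INTEGER -> BIGINT,
    -- INTEGER -> NUMERIC(18,2)), so no data fidelity is lost.
    CREATE TABLE dw_sales (
        sale_id BIGINT,
        amount  NUMERIC(18,2)
    );

    INSERT INTO dw_sales (sale_id, amount)
    SELECT CAST(sale_id AS BIGINT),
           CAST(amount  AS NUMERIC(18,2))
    FROM stg_sales;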

[Figure: char vs. varchar storage layout, with the unused char space shown in grey]

There are subtle differences between the char and varchar data types. As the name suggests, the primary difference is the variable-length property that determines the size. Char data occupies the full allocated space, whereas varchar data consumes only as much storage as the actual data itself. This is managed by keeping a length indicator along with the actual data.
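The sketch below (standard SQL; the table and column names are invented for illustration) shows the difference in storage behavior. The char column is blank-padded to its declared length, while the varchar column keeps only the characters that were inserted plus a length indicator.

    -- Illustrative table; names are assumptions for this example.
    CREATE TABLE name_demo (
        fixed_name    CHAR(10),     -- always occupies 10 characters, blank-padded
        variable_name VARCHAR(10)   -- occupies only the stored characters plus a length indicator
    );

    -- 'Ram' is stored as 'Ram       ' (padded to 10) in fixed_name,
    -- but as just 'Ram' (length 3) in variable_name.
    INSERT INTO name_demo (fixed_name, variable_name)
    VALUES ('Ram', 'Ram');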

Varchar data types are clearly more space-efficient, as the greyed area on the right of the figure shows. This might seem like a no-brainer considering the definition of the data type itself, and it might look like the char data type can be avoided altogether. There are, however, several scenarios where the char data type is still needed.

In data warehousing environments, the varchar data type is typically used when building dimensions. There are a few performance-related considerations that affect the overall behavior of the data model over time. Listed below are some scenarios that occur frequently.

New Record Insertion: Because of the variable length, the chance of storing new data in a contiguous location is minimal. The size of every new record has to be computed first, and an appropriate storage location determined, before the record is written. The processing overhead is therefore slightly higher.

Frequently Updated Dimensions: Whenever the data changes, the size has to be recomputed and a suitable storage location assigned. If the size shrinks, the overhead is small; if it grows, the overhead increases. In the example above, if Ram changes to Raman, the existing data block has to be released and a new location with more space assigned (see the update sketch after this list). This again goes back to the first point mentioned above.

Fragmentation: Occurs when data is spread out randomly across the storage footprint. The higher the degree of fragmentation, the higher the cost of both read and write operations. There are several other causes of fragmentation as well, but that is too big a topic to discuss here.

Read: This is related to all of the issues listed above. Query performance degrades because the data has to be fetched from several scattered locations instead of a small, contiguous area.
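To make the frequently updated dimension scenario concrete, the sketch below reuses the hypothetical name_demo table from earlier. The varchar value grows from 'Ram' (3 characters) to 'Raman' (5 characters), so the engine may have to relocate the row to a larger slot, while the char column is simply rewritten in place within its fixed 10-character allocation.

    -- Growing the varchar value may force the row to be moved to a larger slot;
    -- the char value is rewritten in place within its fixed 10 characters.
    UPDATE name_demo
    SET    fixed_name    = 'Raman',
           variable_name = 'Raman'
    WHERE  variable_name = 'Ram';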

The items mentioned above really come to the forefront as bottlenecks when large volumes of data are encountered. Declaring a column as varchar is not as bad a choice as it may sound, but all of these considerations have to be kept in mind when complex data models are designed.