Database layout improvements
From mangos
Contents |
Abstract
Database models
There are several database models, relational, hierarchical, network, object-oriented, ..., each of them is just a bunch of theory and design. It's possible to combine their techniques, create custom models based on those few "standard" ones. That's - obviously - what many projects do.
Database normalization
Another subject with whole bunch of theory, three basic normal forms together with all others. Since I want this idea to remain rather simple and reasonable, I'll skip this with a link to wikipedia for those, who are interested in more.
The one wide table problem
Classic example of this one is "characters" table. Take a look what columns are in there. It may look like a good idea to have one wide table with one index on "guid" and look various information in this one table, however consider what would you gain with a more ... relational database model:
Performance
Everyone today cares about performance. It's a wide subject with many dependencies in other factors, like, for example, concurrency. When you have one table with 200 columns storing all possible data for a given player, every application, every subsystem, every thread is going to use this table. That means no or reduced amount of table locking (since we want performance), it includes not-so-easy row locking as well. Imagine 3 threads updating one row, 4th selecting something from it.
Another aspect of performance is DB server cache. InnoDB is known to have more extensive and advanced caches, but nothing's perfect. Regardless, when you request some data from a row, the storage engine has to (I may be wrong on that) read the whole row (+ cache it) and send you only the requested rows. This had practical impact on some engines in the past where you were unable to have row size larger than 8K (one of references).
The 8K page size is also related to cache. The cache is able to contain some set of rows. When your row has 3K in size, you are going to have a 2K of unused cache per page. This penalty is a lot smaller with smaller row sizes, as you can imagine.
Some MSDN reference I just found confirms it:
The number of rows that a page can hold depends on how compact each row is. A page can hold more rows if the rows are smaller. Therefore, a single disk operation on a table with compact rows can retrieve more rows, making the operation more effective. Additionally, more rows fit in the storage engine cache, potentially improving the hit ratio. Compact rows also prevent wasted space on data pages. This is common with larger rows.
Consider this extreme example: If the record size is somewhat bigger than half of a data page, almost half the space on each data page is wasted.Some database designers opt for wide table design and port their mainframe database schema down to the device. This might not be an efficient design. One possible approach is to break up the most critical tables. Suppose you have a table that has some columns with very stable values and others that change frequently. It makes sense to split the table into two: one with the frequently-referenced columns, and the other with the stable columns. By creating two tables, you have all the benefits of smaller row length. The tradeoff is that a join is required to combine the information.
Therefore, we should split wide tables to smaller ones, using relational keys between them. Those can be either several primary keys (ie. independent tables) or foreign keys.
Concurrent access
It's related to performance. As said earlier, several threads accessing the same row may need locking. If we split out unrelated data out to a separate table, this locking is more distributed or never needed. Thread1 can access table1 and - at the same time - thread2 can access table2 without ANY interference.
One may say it's best to have each field in a separate table and use JOINs where needed. Sure, but it becomes a pain to maintain this, even with views. One simple SELECT with two conditions on WHERE may easily become a 7-line double JOIN using this scheme.
So splitting wide tables is good, to some degree.
Storage efficiency
Having one wide table also involves unused values. Take character table as an example - trans_x, trans_y, trans_z, trans_o, transguid. Those are, I believe, columns for players on transports. They are 0 for .. 99% of players, yet they take up space anyway.
What about moving those columns to a separate table called player_transport and having only non-null values there? Ie. players, who aren't on any transport, simply wouldn't have an entry there.
Next example - resettalent things, which can be part of the "concurrent access" section. Wouldn't it be better to create a player_resettalents table (or similar) with guid, cost and time columns (and possible others in future)? What about taxi_path, does it have to exist for every single player all the time? Why arena_pending_points isn't in some player_arena table (with possibilities of adding other arena-related non-team data for a player).
Those examples don't need any JOIN in actual code when they are separated. They do need one JOIN in rare cases, but it's still a lot more effective and cleaner-by-design than having one wide table.
I see "characters" table as a place for generic character data, ie. account, race, gender, level, XP, ..., but not transport data, taxi, arena, .., things - for the same reason we don't store ticket, queststatus, social, .. there (taxipath is a nice candidate for a separate table - guid, node_num, node_id columns for example .. or simply using one VARBINARY column).
Configurability
Take a look at creature_ai_scripts this time. We currently have a limitation on 3 actions in one AI entry. Wouldn't it be more configurable to have creature_ai_actions with (optional unique and indexed actionid), id, action and parameters columns? It would allow creation of variable amount of actions per one AI entry (id).
Epilogue
Splitting optional data (= not always filled) out from wide tables could save us a lot of space and cache misses. I guess .. 20% of current DB data is wasted on unused values (thanks to large/wide table design). The size reduction can also be made more obvious by reducing duplicate records. Someone could redesign AI tables a bit, so that one AI "id" could be used for several creatures, effectively eliminating duplicates. Think about it.
--freghar