Saturday, January 27, 2007

Massive Scalability

Here is an interesting article about how to scale a web site from thousands to tens of millions of users. These guys redesigned their application again and again to get to the next level:
  • First generation: 2 web servers (ColdFusion), one database. Scaled up to 0.5 million users, which is pretty good. The database was the bottleneck.
  • Second generation: more web servers of course, split the database into functional areas (e.g. one database for login, one for blogs, etc.), and use a SAN instead of the machines' local disks. Scaled up to 2 million users.
  • Third generation: split the database into chunks of 1 million users each, plus one last function-specific database for login. Scaled up to 10 million users. The limit was not caused by the single login database but by the fact that the databases were not loaded evenly. An ad hoc approach to moving data between databases didn't work well because it became a full-time job for several people.
  • Fourth generation: solve the uneven load problem with a SAN from 3PAR, which can stripe a volume across thousands of disks, so a single volume/DB can deliver many more I/Os. Also add a caching tier between the web servers and the databases, and store transient user data in memory instead of in the DB, a trade-off between reliability and performance (caching could have been done earlier). Finally, they rewrote the application in C#/ASP, which probably allowed developers to optimize their code. Scaled up to 17 million users.
  • Fifth generation: switch to 4 AMD dual-core 64-bit machines with 64 GB of RAM for the databases. They have 65 databases. Scales to 26 million users.
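To make the third and fourth generations a bit more concrete, here is a minimal Python sketch of the two ideas: routing a user to a database by chunks of 1 million user IDs (with a separate login database), and a read-through cache sitting in front of the databases. This is only an illustration, not their actual code. The database names (`logindb`, `userdb_NNN`), the assumption that user IDs are sequential integers, and the `CacheAside` helper are all made up for the example.

```python
from typing import Callable, Dict

USERS_PER_SHARD = 1_000_000  # chunk size mentioned for the third generation


def shard_for_user(user_id: int) -> str:
    """Return the (hypothetical) name of the database holding this user's data."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # Users 0..999,999 go to userdb_000, 1,000,000..1,999,999 to userdb_001, etc.
    return f"userdb_{user_id // USERS_PER_SHARD:03d}"


def database_for(operation: str, user_id: int) -> str:
    """Login stays on one function-specific database; everything else is split by user."""
    if operation == "login":
        return "logindb"  # hypothetical name for the single login database
    return shard_for_user(user_id)


class CacheAside:
    """In-memory cache in front of a slow loader (stands in for the caching tier)."""

    def __init__(self, load_from_db: Callable[[int], str]) -> None:
        self._load = load_from_db
        self._cache: Dict[int, str] = {}

    def get(self, key: int) -> str:
        # Only hit the database on a cache miss.
        if key not in self._cache:
            self._cache[key] = self._load(key)
        return self._cache[key]
```

With this kind of fixed range split, nothing balances the load: if the users in one chunk are much more active than the others, that one database becomes hot, which is the uneven load problem described above.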
Interestingly, all this runs on Windows machines (I guess security is not an issue). They discovered a few funny facts about Microsoft software: for instance, if you try to open more connections than the maximum that SQL Server can handle, it crashes :) Also, Windows has a feature that makes it shut down if the rate of incoming network connections exceeds a limit, because it thinks it is the victim of a DOS attack (no pun intended).
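One common way to avoid hitting a server's connection ceiling is to cap the number of connections on the client side. The article does not say how they handled it, so the following is just a sketch of that general idea in Python, using a semaphore from the standard library. The `ConnectionLimiter` class and its `slot` method are hypothetical names, and the actual database connection is left out.

```python
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Caps how many database connections this process may hold at once."""

    def __init__(self, max_connections: int) -> None:
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def slot(self, timeout: float = 1.0):
        # Wait for a free slot instead of opening one connection too many.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free database connection slot")
        try:
            yield  # the caller opens and uses its connection here
        finally:
            self._slots.release()
```

A web server thread would wrap each database call in `with limiter.slot():`, so that a traffic spike turns into waiting (or a timeout error) on the web tier rather than into too many connections on the database.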
