Sunday, May 22, 2011

Cassandra - Part 1

As usual I will be discussing why we moved from Amazon RDS (MySQL) to Cassandra (NOSQL), our experiences with Cassandra, and how we implemented it.

The main reason we moved from RDS (MySQL) was latency/response times. Read/write operations took 80-100 ms on average, which wasn't acceptable to us, and we had to depend heavily on client cookies to store information. We started our development on RDS as a short-term choice to bring our solution to market quickly.

The second reason was cross data center replication. Though RDS supports read replicas and Multi-AZ deployment, both are limited to a single region, so we had to write a lot of batch jobs and on-the-fly tasks to sync information between regions.

These two issues pushed us toward new solutions. We researched and shortlisted Cassandra, HBase and MongoDB. We decided to go with Cassandra for these reasons: fast writes, P2P architecture (no master/slave) and built-in cross data center support. We do use HBase for OLAP and are in the process of migrating to Brisk.

Response Times:
Average response times for writes are around 8ms (roughly 10x better) and reads are around 10-15ms. We are in the process of removing memcached from our PHP front end; with it gone, our architecture will be simpler and cleaner.

P2P/Gossip:
There is no single point of failure, and if a node fails the impact is minimal. The application can communicate with any node in the cluster to read/write data, which increases DB/app throughput tremendously.

Sharding:
Cassandra takes care of sharding; it is seamless, and the application doesn't have to manage it.
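Conceptually, the partitioner hashes each row key onto a token ring, and the ring position decides which node owns the row. Here is a rough Python sketch of that idea (node names and token assignments are made up; Cassandra does all of this internally):

```python
import hashlib

# Hypothetical 4-node ring, sorted by token; tokens mimic
# RandomPartitioner's 0 .. 2**127 token space.
RING = [(0, "node1"),
        (2**127 // 4, "node2"),
        (2**127 // 2, "node3"),
        (3 * (2**127 // 4), "node4")]

def token(key: str) -> int:
    """MD5-based token for a row key, RandomPartitioner-style."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**127)

def owner(key: str) -> str:
    """First node whose token is >= the key's token, wrapping around."""
    t = token(key)
    for tok, node in RING:
        if t <= tok:
            return node
    return RING[0][1]  # wrapped past the last token
```

Because the hash spreads keys uniformly over the ring, the data (and load) spreads evenly across nodes without the application ever picking a shard.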

Cross Data Center:
We have 2 DCs and are in the process of setting up a 3rd. We use a replication factor of 2 in each DC, so at any time we have 4 copies of any given piece of data.
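To make the copy count concrete, here is a toy Python model of per-DC replica placement in the spirit of Cassandra's NetworkTopologyStrategy (DC and node names are invented; Cassandra handles this itself):

```python
# Hypothetical cluster: 3 nodes per data center.
datacenters = {"us-east": ["e1", "e2", "e3"],
               "us-west": ["w1", "w2", "w3"]}
rf_per_dc = 2  # replication factor applied independently in each DC

def replicas_for(key_token: int):
    """Pick rf_per_dc distinct nodes in every DC by walking that DC's ring."""
    chosen = []
    for dc, nodes in datacenters.items():
        start = key_token % len(nodes)
        chosen += [nodes[(start + i) % len(nodes)] for i in range(rf_per_dc)]
    return chosen

# With 2 DCs and RF=2 in each, every key ends up with 4 copies total.
```

The key point is that the replication factor is per data center, so total copies = RF × number of DCs.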

Hadoop Support:
We extract data in real time using streaming and move it to our data warehouse for analytics. It really simplified our ETL process and keeps our OLAP data near real time.

We did face quite a lot of operational challenges, which I will discuss in Part 2.

Thanks for reading my blog and have a nice time!

Wednesday, April 27, 2011

Cloud Provider - Amazon

The goal is not to talk about the benefits of the cloud, just to share our experiences with Amazon.

Good things first!

Management Interface: Amazon really stands out here in ease of use; its tools are well packaged, with an excellent API and an easy-to-use AWS Management Console.

AMI: The large number of community AMIs lets you quickly launch a VM with the required software pre-installed and configured, with minimal effort. We used both Amazon's free AMIs and third-party AMIs.

Security: A basic integrated firewall; you can easily set up security controls at the IP/port level.

Database: Managed MySQL (RDS) works. A lot of DB management tasks like backups, monitoring and replication are handled by Amazon, and we don't have to worry about them.

Functionality: You name it, and Amazon has a solution for it. They are more like a one-stop shop.

Backups: With the help of EBS and S3, backups are made easy.

Now the bad stuff!

1. Support: They have automated way too much, and it lacks a human touch. It takes days, if not weeks, to get a response from their sales support team. You also need to enroll in Gold or Platinum level support to get phone support or a quick turnaround, and that can cost up to 10% of your monthly bill.

2. Performance: I have a lot to complain about here:
  • Instance CPUs lack power compared to the competition. A process that used to take 40-50% CPU on an EC2 large instance runs quietly on a medium instance at our new cloud partner.
  • Inconsistent performance - at times we paid a high price for load issues in Amazon's cloud. We had to keep a close eye on our Keynote monitoring and divert traffic between regions.
  • RDS - a write or read operation takes up to 80-100 milliseconds on average. With a goal of response times under 100ms, we had to constantly redesign our DB operations. This pushed us to migrate to Cassandra very quickly.
3. Failures: We had been pretty happy until the recent mega crash in their East DC. Initially we weren't sure whether it was a temporary glitch or a bigger problem, and there was no clear ETA. Fortunately, we were able to divert our traffic to N. California by updating our geo-load balancing rules.

Will share our experience with our new cloud provider later.

Thanks,
Manicka

Wednesday, April 20, 2011

DB Sharding

This is not new; it has been in practice for close to a decade. The goal of this post is to share how we implemented it in the Amazon cloud.

When we started development, we evaluated NOSQL databases before settling on DB sharding. Due to a lack of internal expertise and time and resource constraints, we decided to go with the traditional approach of sharding. We did move to Cassandra in iteration 2 and threw away all the hard work of iteration 1.

We are an ad network, currently serving 10-20 million ad impressions daily for 2-3 million unique users. We have no in-house DBA and wanted our DB to be elastically scalable with minimal administrative effort. Amazon RDS is a wonderful service if you are looking for minimal upfront cost, simplified DB management, zero DB software license cost and scalability.

First, we came up with a scheme for creating cookie IDs (the primary key used for sharding) based on the region, the number of RDS servers in the region and the number of databases on each server. We made a deliberate decision to sacrifice certain user cookie data when a user gets connected to a different region; this happens only when a user travels from coast to coast.
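As an illustration of the idea (the format and counts below are placeholders, not our production layout), the shard coordinates can be baked into the cookie ID itself, so routing a request never needs a lookup table:

```python
import random
import time

# Illustrative topology: 2 RDS servers per region, 8 databases per server.
SERVERS_PER_REGION = 2
DBS_PER_SERVER = 8

def new_cookie_id(region: str) -> str:
    """Encode region, server index and db index into the cookie ID."""
    server = random.randrange(SERVERS_PER_REGION)
    db = random.randrange(DBS_PER_SERVER)
    unique = f"{int(time.time() * 1000)}{random.randrange(10**6):06d}"
    return f"{region}-{server}-{db}-{unique}"

def shard_of(cookie_id: str):
    """Recover (region, server, db) straight from the cookie ID."""
    region, server, db, _ = cookie_id.split("-", 3)
    return region, int(server), int(db)
```

Because the cookie ID is self-describing, any web server in the region can compute the right shard locally, in constant time.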

Second, per our privacy policy, cookie data is retained for only 6 months. We used this to our advantage: with MySQL table partitioning, the cookies created in a given month are stored in one physical file. A maintenance job runs every month and, in the 8th month, resets the data file holding the now-expired data. Voila! Millions of records that are no longer needed disappear in seconds, like magic, without any costly delete or truncate operation.
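The rotation arithmetic is simple. A hedged sketch, assuming 12 monthly partitions numbered 1..12 and a reset 7 months after the data was written (one month of safety margin past the 6-month retention window):

```python
def partition_to_reset(current_month: int) -> int:
    """Given the current month (1..12), return the 1-based monthly
    partition whose data is now past retention and can be wiped.
    Data written in month m is reset in month m + 7 (wrapping years)."""
    return (current_month - 7 - 1) % 12 + 1
```

For example, in August (month 8) the job wipes January's partition, which then gets reused for data written later that year.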

Third, we implemented our own Zend_DB_Table class that took care of connecting to the correct RDS server, and the correct DB within that server, based on the cookie ID and whether the operation was a read or a write.
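Our class was PHP (a Zend_DB_Table subclass), but the routing decision it made can be sketched in a few lines of Python (host and database names are made up for illustration):

```python
# Hypothetical endpoints: writes go to the regional master,
# reads go to the regional read replica.
MASTERS = {"east": "rds-master-east", "west": "rds-master-west"}
REPLICAS = {"east": "rds-replica-east", "west": "rds-replica-west"}

def route(region: str, db_index: int, write: bool):
    """Pick the host and one of the 8 per-server cookie databases."""
    host = MASTERS[region] if write else REPLICAS[region]
    return host, f"cookie_db_{db_index}"
```

Keeping this logic in one DB-access class meant the rest of the application code never knew sharding existed.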

Fourth, we used a Multi-AZ RDS deployment to get built-in failover protection. We never got to see it in action; we were fortunate enough to switch to Cassandra for OLTP quickly.

Fifth, we had two write master RDS instances (one in each region) and two read replicas (one in each region), and each write master hosted 8 instances of our cookie database. This really helped us with scale and performance: all reads and writes were screaming fast, and we never experienced deadlocks or performance issues.

With this solution we managed our explosive DB growth easily with no DBA and next to no administrative cost. So far I have shared only good things about Amazon MySQL RDS; I will share the negatives and limitations when I blog about our iteration 2 architecture.

For the next post, I am thinking of sharing our experience with the Amazon cloud in general.

Hope you enjoyed it!

Thanks,
Manicka

Tuesday, April 19, 2011

Geo-Load Balancing

During my learning years at my previous job, we did launch a secondary DC, but it was active/passive rather than active/active. We were worried about DB replication, latency and implementing sharding for transactional systems. In the end, we had solid ideas for DB sharding but couldn't implement them for various reasons.

At my current job, I was lucky to start everything from scratch. For our first iteration we designed for active/active using Amazon EC2, ELB, MySQL RDS and Dyn Inc's Dynect CDN Manager. We migrated completely to a different architecture in iteration two - more about that later.

Our goal was to do all our ad optimization processing and hand the hints to our ad server within 100 ms. To do this we had to reduce latency as much as possible, so we decided to take advantage of cloud solutions and geo load balancing.

Amazon Elastic Load Balancer (ELB) supports load balancing only within a region, so we had to look for options to handle geo load balancing. After a lot of research we found that Dyn Inc's solution was very cost efficient and had products that met our needs. We used Dynect CDN Manager since ELB doesn't have a static IP address. We defined rules in the CDN Manager to send 100% of our west coast traffic to the Amazon US-WEST ELB, 100% of our east coast traffic to the US-EAST ELB, and to split everyone else's traffic between the two regions.
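The rule set amounts to a small mapping from a visitor's geo bucket to an endpoint, with a weighted split for everyone else. A toy Python version (bucket and endpoint names are illustrative, not actual Dynect configuration):

```python
import random

# Each geo bucket maps to a list of (endpoint, weight) pairs.
RULES = {
    "us-west": [("us-west-elb", 1.0)],
    "us-east": [("us-east-elb", 1.0)],
    "other":   [("us-west-elb", 0.5), ("us-east-elb", 0.5)],
}

def pick_endpoint(geo_bucket: str) -> str:
    """Resolve a visitor's bucket to an ELB endpoint; unknown buckets
    fall back to the 50/50 split."""
    endpoints, weights = zip(*RULES.get(geo_bucket, RULES["other"]))
    return random.choices(endpoints, weights=weights, k=1)[0]
```

Diverting traffic during an outage is then just a weight change, e.g. setting the "other" split to send everything to one region.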

Behind the Amazon ELBs we took advantage of multiple zones in each region, purely for DR rather than geo load balancing. We had an equal number of web servers in all zones, used a central RDS instance in each region for DB writes, and read replicas in each zone for reads. It looks easy when you read it, but it was an interesting and tough journey for my team, who had to build and implement the entire solution in 3 months.

Note: The support we get from Dyn Inc is really good; I sometimes wonder whether our sales contact John D'Amato is a sales or a support person.

I am thinking the next topic will be either cloud providers or DB sharding. As I mentioned earlier, though we started with DB sharding, we moved to BigData.

Hope you enjoyed my first post!

Thanks
Manicka

Me too.....

After a lot of thought, I have started blogging. To start with, I am planning to share my experiences in building elastic, scalable systems. My goal is to blog about my professional experiences rather than my personal ones.

First thing first:

I have a habit of naming a product before starting it. At my current employer, we named our products mInstinct, mShare, mAudience, etc., so I thought of naming this mStory and getting the URL mStory.blogspot.com. But for now it's going to be manickababu.blogspot.com.

Next Topic:
Geo-Load Balancing