FaceStat scales!

June 6th, 2008 by Brendan O'Connor

Before last weekend, our FaceStat website was chugging away with a small but loyal userbase:

But on Sunday, an insane number of people suddenly decided to flock to our site. Extend the previous chart by two days, and a little bit of y-axis auto-scaling says it all:

Turns out the giant spike was due to our being featured via a news article on Yahoo.com’s front page!

Of course, we had to frantically rearchitect the system and scale it under this deluge of traffic. You can read the blow-by-blow account of our crazy few days on Lukas’s blog, here.

The web startup community seems pretty interested in the mad scaling issues, so I’ll respond to some of the comments on Lukas’s blog below:

Yes, we’re pretty much using Rails. We actually use an offshoot called Merb — which is a bit more efficient — on top of Thin. We find that a Rails-like platform is invaluable for rapidly prototyping a new site, especially since we started FaceStat as a pure experiment with no idea whether people would like it or not, and with a very different feature set in mind compared to what it later became. And it’s invaluable that Chris on our team is such a Ruby expert :).

However, the high-level platform really doesn’t matter compared to overall architecture: how we use the database (postgres), how much we cache (memcached/merb-cache), how we distribute load, how we deploy new systems (xen/slicehost), etc. It hasn’t been trivial, since FaceStat is write-heavy and performs fairly complex statistical calculations, and various issues remain. But we are serving many users at nearly 100x our old load, so something must be going right — at least for now!

-Brendan

p.s. Thank you, Google Analytics, for the above charts. Some day when I grow up, I hope I am wise enough to create an equally brilliant data visualization tool.

17 Responses to “FaceStat scales!”

  1. Brandon Franklin Says:

    I somewhat disagree with the title of this post. Every evening, your response time just goes to crap. The site barely seems to be hanging together at all. Building this in RoR was ridiculous, IMO. Why on earth didn’t you use something like Grails that actually DOES scale? Did you know that ehcache that comes with Grails is 500 to 1000 times faster than memcached, which I assume you’re using?

    http://gregluck.com/blog/archives/2007/05/comparing_memca.html

    I know RoR is the “hot thing” right now but seriously, it’s frustrating to see sites just get HAMMERED like this when it didn’t need to happen. Does everybody just jump straight to RoR without even researching alternatives first?

    I really enjoy the site CONCEPT that you’ve built, but my enjoyment is decreased by the horrible performance I see every day from the site.

  2. Brandon Franklin Says:

    BTW I do want to say that using Slicehost was a good call, but I do disagree with your assertion that the “high level architecture doesn’t matter”. Don’t you think those complex statistical calculations would have run better in a JVM than in a Ruby interpreter?

  3. brendano Says:

    Brandon: You’re right, Facestat is definitely having intermittent issues. We recently pushed some new features that used the DB in a too-aggressive way, which was responsible for some of the poor performance yesterday (or so; I can’t remember when, exactly.)

    I stand by my claim that Rails (or rather, Merb) has little to do with the latency problems. It’s really an issue of overall architecture, how the DB is used, etc. The main issue with the statistics calculations is that they need to access a decent chunk of data via a join, so require relatively expensive DB queries. This doesn’t have much to do with Ruby interpreter speed, as poor as it is. It does have lots to do with decisions about batch vs. incremental computation, details with what to cache when, and other things unrelated to the app server framework. In fact, many of the statistics calculations are done with SQL’s grouping and arithmetic operators, but our overall performance got *faster* when we moved parts of them to Ruby — even though Ruby is definitely slower than the PostgreSQL engine or, as you point out, a JVM — simply because the centralized DB was taking a high load and the app servers had cycles to spare.

    There are lots of issues like this that aren’t really about the app server framework. Ehcache and Grails sound interesting, but I doubt they or anything else would magically solve our problems.

    I’m curious why Rails elicits such strong reactions from people. It’s a nice tool with various limitations and quirks; no more, no less. And certainly its role in a web system is far less important compared to what you would think if all you knew about it was from reading angry Internet rants — either for or against.
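To make the batch vs. incremental distinction in the comment above concrete, here is a minimal sketch. It is not FaceStat's code: the function and class names and the sample ratings are hypothetical, and a plain mean stands in for whatever statistics the site actually computes. The batch version rescans every stored value on each read, while the incremental version pays a small cost on each write and makes reads cheap.

```python
from dataclasses import dataclass


def batch_mean(ratings):
    """Batch approach: rescan every stored rating on each request."""
    return sum(ratings) / len(ratings) if ratings else None


@dataclass
class RunningMean:
    """Incremental approach: keep a count and total, updated on each write."""
    count: int = 0
    total: float = 0.0

    def add(self, rating):
        self.count += 1
        self.total += rating

    @property
    def mean(self):
        return self.total / self.count if self.count else None


ratings = [25, 31, 28, 40]  # hypothetical ratings for a single item
running = RunningMean()
for r in ratings:
    running.add(r)

# Both approaches agree; they differ in where the work happens.
print(batch_mean(ratings), running.mean)
```

For a write-heavy site the incremental form shifts work onto every write, so which one wins depends on the read/write mix and on what can be cached.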

  4. Brandon Franklin Says:

    Thanks for the detailed response. I do wish you luck with figuring out how to tame this beast. I think it’s a lot of fun, and very addictive.

    I want to be clear that I have no problem with Rails itself. Grails is “Groovy on Rails” and it is basically using Groovy (a JVM-based dynamic language) to do all the cool stuff that Rails brought to the table.

    My problem with RoR *specifically* is that it uses one Thread per connection (as I understand it) which can quickly kill a server, whereas Grails is able to run in any standard Java App Server (like Jetty or Tomcat) and therefore instantly gains all the performance benefits of that.

    If people are anti-RoR/Grails in general then I think they’re probably just oldschool and would be the same type who claim “Java is too slow, we need to use C++” etc.

  5. Chris Van Pelt Says:

    Brandon: One thread per connection does blow, however Java blows more. I’m sure Grails does some neat stuff, but anyone who has programmed both Java and Ruby and has the thought “I enjoy writing Java more than writing Ruby” should be considered potentially harmful and dangerous.

    Ruby is getting faster, and there are many Ruby projects underway to do away with the single thread per connection problem that Rails and ActiveRecord introduce. That is one of the reasons we use Merb instead of Rails. As soon as Datamapper is in a more stable state, we can support multiple threads running off a single instance. Until then, I’ll fire up a few more instances of my application so that I can actually enjoy the language I use every day.

  6. Ed W Says:

    650K hits per day is “only” around 8 per second (assuming a 20-hour day to spike it a little). This doesn’t actually seem like all that much?

  7. James Higginbotham Says:

    Brendan,

    Congrats on the recent success! Being a Rails developer and Slicehost customer myself, I’m curious about more of the details of scaling with Slicehost.

    While I love their service, not having a virtual public IP seems to force you into using round-robin DNS between slices, correct? Or, did you do something different to allow for more than one slice to service your web tier?

  8. PJ Hyett Says:

    @brandon no one codes in java because they want to.

  9. Lukas Says:

    650k hits is actually what got through to Google analytics. My best guess is that the actual traffic was 5-10 times higher. Also the spikiness was a lot higher than reducing to 20hrs per day. Our peak hourly load even without yhoo post is several times higher than average.
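The back-of-envelope numbers in this exchange are easy to check. The sketch below uses only figures quoted in the comments (650K hits per day, a 20-hour day, and the guess that real traffic was 5-10 times higher); the function name is made up for illustration. Under the 20-hour assumption the average works out closer to 9 per second than 8, and it is still only an average, so it says nothing about the hourly peaks mentioned above.

```python
HITS_PER_DAY = 650_000  # figure reported by Google Analytics, per the comments
ACTIVE_HOURS = 20       # Ed W's assumption, to concentrate the load a little


def hits_per_second(hits_per_day, active_hours):
    """Average request rate if all hits land within the active hours."""
    return hits_per_day / (active_hours * 3600)


measured = hits_per_second(HITS_PER_DAY, ACTIVE_HOURS)
low, high = 5 * measured, 10 * measured  # Lukas's 5-10x estimate

print(f"{measured:.1f}/s measured; {low:.0f}-{high:.0f}/s if actual traffic was 5-10x")
# prints: 9.0/s measured; 45-90/s if actual traffic was 5-10x
```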

  10. brendano Says:

    James: that’s right, we currently use round-robin between web server slices. Every slice has its own public IP; I think I’m not understanding your concern. We certainly could point our DNS records at a single slice and run a reverse proxy on there.

    PJ, Chris: To be fair, Grails is in the Groovy language, not Java. I rather liked Groovy the last time I used it; the biggest issue was that the compiler was hideously slow, though I’m sure it’s faster now. Probably F#/.NET is the most advanced dynamic language platform anyway :)

  11. James Higginbotham Says:

    Brendan,

    I was just curious how you found Slicehost to work in this situation and if Slicehost had a different solution that they weren’t announcing publicly for situations like yours.

    I have personally steered away from the round robin DNS approach in the past. If one of your frontline servers is down, it can take a little time to propagate DNS updates to remove the server from the round-robin. Some DNS servers will cache the old IP for a period of time, thereby preventing access to your app until their cache is flushed or until the slice is operational once again. RR DNS is not the best solution, but it is the best one available with Slicehost, and it gets the job done in a pinch. A reverse proxy would be optimal to prevent this, as you mentioned.

    NAT is the one feature I’d love to see Slicehost add, as it would allow larger customers to add new slices behind their single public IP without depending upon a single slice to do the reverse proxy work or stale DNS situations when using RR DNS. I noticed that Amazon EC2 now has this, though I haven’t investigated it fully.
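The failure mode described in the comment above can be shown with a toy simulation. Nothing here touches real DNS or a real proxy: the IP addresses, the function names, and the assumption that the proxy already knows which backend is down are all hypothetical. It only counts how many requests land on a dead server under each scheme.

```python
from itertools import cycle

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical slice IPs
DOWN = {"10.0.0.2"}                              # one slice has failed


def round_robin_dns(servers, requests):
    """Clients with cached DNS records keep rotating over every IP."""
    rotation = cycle(servers)
    return [next(rotation) for _ in range(requests)]


def health_checked_proxy(servers, down, requests):
    """A proxy that knows which backends are down rotates over the rest."""
    rotation = cycle([s for s in servers if s not in down])
    return [next(rotation) for _ in range(requests)]


def failures(targets, down):
    """Count requests that were sent to a dead server."""
    return sum(1 for t in targets if t in down)


dns_failures = failures(round_robin_dns(SERVERS, 9), DOWN)
proxy_failures = failures(health_checked_proxy(SERVERS, DOWN, 9), DOWN)
print(dns_failures, proxy_failures)
```

With stale DNS, one request in three keeps hitting the dead slice until caches expire; the proxy avoids that, at the cost of becoming a single point of failure itself.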

  12. Brandon Franklin Says:

    LOL I’d like to see anybody code the medical imaging client/server application I work on in Ruby instead of Java and get it to perform at all!

    Besides I’m not talking about Java vs. Ruby. I’m talking about Grails vs. Ruby on Rails. Groovy != Java.

    As far as I can tell, Grails is more beautiful and elegant than RoR.

    It is amusing to see Ruby fanbois on the attack though. I’ll admit I’ve never seen that before.

  13. Brandon Franklin Says:

    Ohhh I get it, I guess the Ruby people don’t understand that “running on a JVM” doesn’t mean “coding in Java”. Might wanna do a little research first, peeps. See: JRuby, Jython, and Groovy.

  14. Dude Says:

    Brandon, why do you have to be such a dick?
    Ruby != Groovy… understandable… but…

    Ruby is an extremely beautiful language, tons of users, awesome community, with a bunch of love from Sun (ala JRuby, etc).

    Yeah, there are a lot of fan boys in every language… and you could say especially people coming from… say… PHP to Rails, many of those who don’t know anything about OOP, and still rave about how awesome it is without knowing anything about it.

    Alas… many come from Java to Ruby, for good reasons. So, don’t be such a prick.

  15. Stephan Schmidt Says:

    “I enjoy writing Java more than writing Ruby” should be considered potentially harmful and dangerous.

    I enjoy writing Java more than writing Ruby. So consider me dangerous!

    “@brandon no one codes in java because they want to.”

    I do!

    Peace
    -stephan

  16. Brian Hutchison Says:

    Hello, we’re going live with an nginx-fronted site and I’d like to know how you increased your open file handle limits. Did you just change worker_rlimit_nofile? Did you change any other parts of your conf? If you have any thoughts to share on output_buffers & postpone_output, I’d love to hear about those too! :)

    Great post.

  17. Brian Hutchison Says:

    Answering my above question, I found info on output_buffers here:
    http://www.ruby-forum.com/topic/152021#new
