-
Website
http://www.scobleizer.com/ -
Original page
http://scobleizer.com/2008/05/26/should-services-charge-super-users/ -
Every time you update, Twitter has to get a list of your 25k followers, sort out any @ replies, find out what their notification settings are, notify each and every one individually and add a message to their feed (even if it's still the same one). All this while their feeds are being hit like crazy by desktop clients.
So, Twitter is a notification system with multiple entry and exit points. FriendFeed is an aggregator. It doesn't, as far as I know, notify anyone.
Anyway, here's another really interesting conversation cluster that this post started over on FriendFeed: http://friendfeed.com/e/a2463347-f07a-ab3f-4f41...
As to architecture. OK, let's have one object:
Scoble's Tweets.
Then let's have another object.
Jane Smith's Tweets.
Now let's have a third object:
John Schmidt's Tweet page that displays both Jane's and Scoble's Tweets.
Sounds like Scoble's and Jane's Tweets are being copied, right?
No.
In fact, if John Schmidt never uses his account, nothing happens at all.
But, let's say that John Schmidt opened his Web browser and visited Twitter. Well, ONLY THEN does John Schmidt's object (which knows which Tweets it should go look for) talk to the other two objects, and say "give me your Tweets." Then John's object mashes them together and displays them to John. It also, then, closes down and releases all memory and disk space until the next time John asks for something.
This does not change if there are a million "objects" being mashed up. No copies are living permanently. Just the original objects.
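A minimal Python sketch of the read-time model described above, for illustration only. The names (`TweetStore`, `post`, `timeline`) are hypothetical and say nothing about how Twitter is actually built. It assumes each author's tweets are appended in timestamp order, stores each tweet once, and assembles a follower's page only when that follower asks for it.

```python
import heapq
from collections import defaultdict


class TweetStore:
    """Each tweet is stored once, in its author's own list (oldest first)."""

    def __init__(self):
        self.tweets = defaultdict(list)     # author -> [(timestamp, text)]
        self.following = defaultdict(set)   # reader -> authors they follow

    def post(self, author, timestamp, text):
        # Single write: nothing is copied to followers.
        self.tweets[author].append((timestamp, text))

    def follow(self, reader, author):
        self.following[reader].add(author)

    def timeline(self, reader, limit=20):
        # Work happens only when the reader asks: merge the followed
        # authors' lists, newest first. The result is not stored anywhere.
        streams = [
            [(ts, author, text) for ts, text in reversed(self.tweets[author])]
            for author in self.following[reader]
        ]
        merged = heapq.merge(*streams, reverse=True)
        return [item for _, item in zip(range(limit), merged)]


store = TweetStore()
store.follow("john", "scoble")
store.follow("john", "jane")
store.post("scoble", 1, "At the conference")
store.post("jane", 2, "Lunch time")
store.post("scoble", 3, "Keynote starting")
page = store.timeline("john")   # newest first, built on demand
```

Note that the cost of `timeline` grows with the number of people the reader follows, and it is paid again on every page view or client poll.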
Got it yet? I'll do a video, if you want to understand it more.
How do I know this? Ask the Exchange team how it keeps stuff from duplicating all over the place and filling up server disks.
Absolutely wrong.
Only gets copied if a user instantiates his object and asks for those things. Even then, it's not "copied" except to display it, and that copy is temporary and stored in your browser, or in your Google Talk account.
Why not? As they scale up their system - the number of users is growing just as fast. If they scale just quick enough to stay one step behind the problem they will continue to have issues.
I don't blame them - it's a difficult problem and not many sites have to cope with such massive growth so quickly.
And ever since the first time it went down, chances are they've been patching and optimizing things here and there, when perhaps what Twitter needs is a complete remake - which shouldn't really be THAT hard considering Twitter is, above all, a very simple application - that thing doesn't put a spacecraft on Mars - so the main focus should be scalability. Perhaps they're doing that already. If not, they should.
On the other hand, FF most likely has been created with scalability in mind, and so far, other than throwing hardware at it, as long as they're somewhat ahead of the growth game, it doesn't need anything to stay afloat as it grows. It's not rocket science either - they simply didn't (supposedly) ignore the possibility of growth when they started to write their software. Which is what everyone should do when starting a project, and there's plenty of documentation out there and plenty of great engineers who know how to architect a simple (or complex) app so that it will scale if necessary.
Leaving that aside, the business model is a very interesting and fair question. No, I don't agree with Om. Not because I think super-users shouldn't be charged, but because charging super-users doesn't fix anything, scalability-wise. I also don't think Om understands how Twitter works internally. Ok, *I* don't know how Twitter works, but if it works the way Om describes it, then the folks at Twitter absolutely definitely need to rewrite the whole thing from scratch. Personally I didn't like either Obasanjo's or Om's article at all. You? Well, you're talking about Twitter and FriendFeed, and a bit of Facebook. Thank god for that "This is why I love the tech industry" article, because it is for posts like that I'm still reading you. (No offense, I just use neither Tw nor FF, so this fun madness you guys have is completely off my radar...)
Sweet how you never had to work with an Exchange server which did exactly that, and then added 'All' as a recipient to the address book of every user.
I'll grant it doesn't do it now. But it sure as hell used to.
I know it did. Which is why some people still don't understand the architecture that Exchange uses (which is why I was "educated" on the issue).
By the way, this caused a famous and massive problem inside Microsoft when the database server filled up when someone accidentally emailed something to "all." Email went down for two days, the way I heard it.
Robert if you remember in the bad old days :-) when Blogger was crashing all the time they offered a Pro service where you paid in the hope of some reliability - fortunately Google took them out and over a period of a year or two sorted out the problems. I hope that Google do the same with Twitter :-)
As far as the workflow for Twitter vs. the workflow of FriendFeed, it's impossibly unfair to compare Twitter to FriendFeed (yet). Twitter is pushing updates the moment you send an update. FriendFeed isn't doing instant updates via XMPP (Jabber) or SMS.
Additionally, Twitter is at the "oh wow, if I follow 10,000 people I'll probably have 1,000 follow me back and I can spam them" stage. This is creating a large number of "super users", not just you, Robert :-) They're getting hammered in traffic compared to FriendFeed.
Let's compare the numbers in terms of service reliability and overall load (rounded down)... You've got 10,000 followers on FriendFeed and 20,000 on Twitter. If this is a true representation of the population on each service (it's not, but we'll pretend), this means Twitter has double the user traffic. Double the traffic, in a push-based service, does not mean double the load... There are double the updates going to double the followers.
A semi-decent formula for load based on the above:
Twitter != FriendFeed x 2
Twitter = FriendFeed x 2 x 2 (load grows with the square of the scale factor)
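A rough worked example of that arithmetic, using made-up update counts (only the 10,000 and 20,000 follower figures come from the comment). It assumes delivery work in a push-based service can be modeled as updates multiplied by followers per update, so doubling both gives four times the work.

```python
def push_deliveries(updates, followers_per_update):
    # In a push-based service each update is delivered once per follower.
    return updates * followers_per_update


# Hypothetical update counts; only the ratio matters here.
baseline = push_deliveries(updates=1_000, followers_per_update=10_000)
doubled = push_deliveries(updates=2_000, followers_per_update=20_000)
ratio = doubled / baseline
```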
I wonder how much the outages are driving people into Pownce and Jaiku. I know of at least one of my 'Twitter friends' who is going *back* to Jaiku because of the service problems.
So, Twitter is a sort of messaging system, like IM but public (though you can also set a protected status, so why are you frustrated?), and as the team itself writes, "Twitter was not architected as a messaging system":
http://dev.twitter.com/2008/05/twittering-about...
With Twitter, reading generally happens more often than writing, especially when you have desktop clients built around polling. That implies going with solution (B), which has some big problems - most databases aren't set up to deal efficiently with lots of writes.
So, you can try to work it with solution (A), but then you need lots of muscle for all these joined queries. If you're using database sharding, you'll probably need to issue queries to multiple databases running on multiple machines, and join all the results and sort them by time, per each user page refresh or desktop client poll. That's a lot of work per user.
It sounds pretty expensive - better cache it. Leads to a hybrid solution; single write, rare combination reads but not too often (i.e. not every poll or page refresh). Some risk of stale updates.
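The hybrid described above (single write, merged reads that are cached so they don't run on every poll) could be sketched roughly as follows. This is an illustration under assumptions, not Twitter's design: the class name, the 30-second lifetime and the injectable `clock` parameter are all hypothetical.

```python
import time


class CachedTimeline:
    """Single write per tweet; merged reads are cached for a short time."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.tweets = {}      # author -> list of (timestamp, text)
        self.following = {}   # reader -> set of authors
        self.cache = {}       # reader -> (built_at, merged timeline)
        self.merges = 0       # counts the expensive read-time merges

    def post(self, author, timestamp, text):
        self.tweets.setdefault(author, []).append((timestamp, text))

    def follow(self, reader, author):
        self.following.setdefault(reader, set()).add(author)

    def read(self, reader):
        now = self.clock()
        cached = self.cache.get(reader)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]              # may be stale for up to ttl seconds
        self.merges += 1
        merged = sorted(
            (
                (ts, author, text)
                for author in self.following.get(reader, ())
                for ts, text in self.tweets.get(author, [])
            ),
            reverse=True,
        )
        self.cache[reader] = (now, merged)
        return merged
```

The trade-off is the one named above: a poll inside the cache lifetime is cheap, but it can miss a tweet posted a moment earlier.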
No matter which way you look at it, though, the scaling isn't quite linear, as some of the old folks will follow new folks as they get added. It should ultimately end up as linear, though with a high constant factor, that constant determined by the average "noise threshold" per user.
Looking at the pure "unit of work", lots of writes probably beats lots of reads, because the reading solution requires sorting and, with the addition of caching layers, has cache coherency problems. Writing can be based around appending to queues.
Also, all the "extra" features that Twitter-folks (in their blogs at least) seem to think are so essential, are quite costly to implement.
Robert, as was already pointed out this was once true for Exchange, but regardless I fail to see how you can make this same assumption for Twitter.
Regardless of how many times it's stored, Twitter also has a tougher routing problem. With Exchange, the sender defines where the message will be received. Twitter is fundamentally different - the sender broadcasts the message, and then the system needs to figure out where to deliver it. This means some of your 25,000 followers - remember, it still has to figure out if I will receive the message based on whether it's an @reply and what my settings are.
Twitter also has to deliver it to the countless number of tracks. Let's assume that the average word length for English is 5.10 (http://blogamundo.net/lab/wordlengths/). On twitter, it's likely less given the 140 char limitation, we tend to use more abbreviations and generally shorter words. Taking out, let's say, 30 chars for punctuation - that means there are 20 distinct words. Twitter in turn needs to figure out who is tracking what, and the track functionality supports tracking word1+word2+word3. Obviously there are a number of ways to implement this more efficiently, but in effect Twitter has to do a fair amount of processing to see if a given message should be delivered to a given person's track queue.
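One of the "more efficient" ways hinted at above is an inverted index from tracked words to the track queries that mention them, so each message only touches queries sharing at least one of its roughly 20 words. The sketch below is hypothetical throughout: `TrackIndex` is an invented name, and it assumes `word1+word2` means every listed word must appear in the message.

```python
from collections import defaultdict


class TrackIndex:
    """Maps each tracked word to the track queries that mention it."""

    def __init__(self):
        self.by_word = defaultdict(set)   # word -> {(user, frozenset of words)}

    def track(self, user, query):
        # Assumption: "word1+word2" means all listed words must appear.
        words = frozenset(w.lower() for w in query.split("+"))
        for word in words:
            self.by_word[word].add((user, words))

    def recipients(self, message):
        # Crude tokenizer: split on whitespace, strip common punctuation.
        tokens = {t.strip(".,!?:;\"'()").lower() for t in message.split()}
        matched = set()
        for token in tokens:
            for user, words in self.by_word.get(token, ()):
                if words <= tokens:       # every tracked word is present
                    matched.add(user)
        return matched


index = TrackIndex()
index.track("ann", "twitter")
index.track("bob", "twitter+down")
index.track("cat", "friendfeed")
hits = index.recipients("Is Twitter down again?")
```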
It's clear that they have a bottleneck somewhere. Given the roots of the service, it's pretty clear the architecture didn't plan for this kind of use - and they admitted it in the link Dario posted. None of us really know what's going on behind the scenes, but based on what little evidence we have Dare's scenario seems plausible and perhaps likely.
Ignoring some of the differences in how the service is used, the other thing that FriendFeed had was the luxury of architecting their system after they saw how Twitter was being used. Twitter likely would have done things differently with the benefit of hindsight, but it sounds like (from interviews with Blaine) that much of their time was spent fighting fires as opposed to re-engineering the system.
Open up Twitter... now, did you wait several minutes for your page to appear? If not, then something's being cached on the server side. It could be via memcached, it could be via "baking" your page instead of "frying" it, or whatever. But the data isn't being collected on the fly as you seem to believe. It's being pushed into the cache when you're not around to ensure UI response times remain tolerable.
Dare's point was that Twitter was built as a micro-blogging system, and that's how blogging systems work. You cache the hell outta everything, and you make a choice... make some users wait for extended page renders, or burn cycles in the background to ensure that everyone gets equal treatment.
Twitter does store multiple copies of each message, they've said so repeatedly in various presentations.
http://highscalability.com/scaling-twitter-maki...
Dare's post would make sense if they have now moved to a sharded structure but my best guess is that they haven't had a chance to do that yet.
It seems there will be duplication at least in the caching layer (memcached):
every time Scoble sends a message, 25,000 per-user caches get invalidated and will need repopulating by new SQL queries.
Twitter are looking to get rid of the "with others" tab from a user to avoid at least some of this very type of problem, see here:
http://groups.google.com/group/twitter-developm...
I think charging heavy users is the wrong model.
- Money won't help twitter right now.
- Charging won't deter "superusers".
They shouldn't charge, they should ban.
I think you're referring to Bedlam DL3, which was quite different from this. It was summarized by the Exchange team back in 2004: http://msexchangeteam.com/archive/2004/04/08/10...
BTW, the issue you are describing with Exchange failing is documented here:
http://msexchangeteam.com/archive/2004/04/08/10...
And it wasn't a failure for the reasons you describe. It has to do with numerous issues and failures unrelated to db scale.
Open up FriendFeed. Refresh many times. Did the page change? It certainly wasn't pre-cached before I hit the servers.
Computers now are fast, if you have the right architecture.
How does Google work? It is always fast and doesn't pre-cache all my pages. To do that it'd have to know what I'm thinking before I actually searched for something.
One thing you haven't thought about is that even if everything was precached that only a small percentage of my 23,000 followers ever log into Twitter. So, if it's building a page for each of the 23,000 followers it's totally wasting resources.
Robert Scoble or Dare Obasanjo?
lol
It isn't clear to me why you are taking my post so personally. Regardless of how Twitter is implemented, allowing a user to have 25,000 followers and 25,000 people they are following will cause scale problems. There are different optimizations you could make (Single Instancing is not the panacea you claim, see my post at http://www.25hoursaday.com/weblog/2008/05/26/So... for more) but it doesn't change the fact that Twitter has made some bad design and feature decisions.
As to whether people who generate massive load on the system should be charged...isn't that a fact of life everywhere else? Internet service providers like Comcast are known to fire customers who use too much bandwidth, in fact your buddy Dave Winer just blogged about that happening to him. Flickr, Y! Mail and a bunch of other services also charge for "pro" features. Why would Twitter pursuing such a business model be so wrong? Would you prefer to have ads in your Twitter streams?
Also, many users don't even use the Web interface. Most of the time I'm looking at messages coming at me in Google Talk. Those are coming one at a time at me. Are you really seriously expecting me to believe that Twitter copies messages 23,000 times before sending them out to me via the XMPP database?
I'm not sure why you are being so assumptive about their architecture unless someone laid it out for you. Further, some of your statements in defense(?) of what they may or may not be doing don't even make sense.
I'm not being assumptive. I haven't said one way or another what they are doing because I have no idea. I only know of the massive large-scale systems we have at Microsoft and the relative pros and cons of each. I also know each is designed to meet one general architectural need, and generally these things don't translate well to serve different kinds of IO. So what you might find is that within any large system you have dozens or more subsystems, each specifically designed for one scale problem. Some of those will require creating duplicate copies of the data if read performance is required to make your application scale OR be responsive.
Have you read the article referred to by Al3x in Twitter's dev blog: http://www.hueniverse.com/hueniverse/2008/03/on...
It explains why there is a need to duplicate copies. The main reason is the speed at which other APIs (e.g. FriendFeed) can read them, which is one of the reasons there is not a huge delay (say, 10 seconds) before tweets appear in FriendFeed. Without the copying, that time would be longer.
You are trying to argue that Twitter is using a 'pivot table' - so you have one table for users, one table for messages, and a third table that describes your friend relationships. When a query comes in to see a particular user's stream, you think they 'mix' this up, so you do a many-to-many lookup: for every user followed (25k in your case) you then look in every one of those users' message queues for the most recent messages, then mash them together.
Now they may have started with an 'obvious' schema like this about a year ago, but I can assure you 1000% that this does NOT scale very far and certainly not up to the point they have got to. The reason? Many-to-many lookups in any RDBMS are extremely costly, and secondly it is very hard to scale across hardware when you build like this: it is almost impossible to shard, because the many-to-many means everyone can potentially be joined together.
The second methodology described, which you laughed at, IS SCALABLE - because you can shard to as many machines as you like. For example, let's say each shard owns 10,000 users: each message you send just has to send a tiny signal to each shard about your new message. Each shard then looks up within its own local database of 10,000 users to see if any of them are following you. It then adds your message to their queue.
This is a classic case of normalization vs de-normalization - you describe normalization in how you think it works - what I hope they are doing (and I am sure they are doing a variant of) is de-normalization.
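A small Python sketch of that sharded approach, as an illustration only. The `Shard` and `Cluster` classes, integer user ids and the id-range placement rule are all assumptions made for the example; only the 10,000-users-per-shard figure comes from the comment.

```python
from collections import defaultdict, deque

USERS_PER_SHARD = 10_000   # figure taken from the comment above


class Shard:
    """Owns a block of users: who they follow and their incoming queues."""

    def __init__(self):
        self.followers_of = defaultdict(set)   # author id -> local follower ids
        self.queues = defaultdict(deque)       # local user id -> messages

    def add_follow(self, follower_id, author_id):
        self.followers_of[author_id].add(follower_id)

    def deliver(self, author_id, message):
        # Purely local lookup: no cross-shard join is needed.
        local = self.followers_of.get(author_id, ())
        for follower_id in local:
            self.queues[follower_id].append((author_id, message))
        return len(local)


class Cluster:
    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]

    def shard_for(self, user_id):
        # Hypothetical placement rule: consecutive id ranges per shard.
        return self.shards[(user_id // USERS_PER_SHARD) % len(self.shards)]

    def follow(self, follower_id, author_id):
        # The relationship is stored on the follower's shard.
        self.shard_for(follower_id).add_follow(follower_id, author_id)

    def post(self, author_id, message):
        # One small signal per shard; each shard fans out to its own users.
        return sum(shard.deliver(author_id, message) for shard in self.shards)
```

The point of the design is that `deliver` never looks outside its own shard, so adding shards spreads the per-message work instead of making each query bigger.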
http://thoughtindustry.blogspot.com/2008/05/twi...
In my scenario there ARE copies. Just not automatic ones. Also, Twitter only needs to keep the last 10 Tweets cached on each user's page, to keep the home page fast. Other pages take forever to load, so I doubt those are cached. Even in the home page scenario my Tweets would only be copied to those users who haven't had my Tweets replaced by other users (most of the time my Tweets would be pushed lower, so there wouldn't be 23,000 copies, only, maybe 1,000).
Either way, if I'm to blame for Twitter going down, why isn't FriendFeed going down? There's a lot more activity on FriendFeed surrounding my messages (and they aren't cached in any obvious way) and it's been down about 1/100th as much as Twitter.
There are several critical differences between Twitter and e-mail, however. The push or notification aspect is one, but message size is a big one. In particular, each of the hard links pointing to a single instance of an e-mail will be bigger than the entire body of a Tweet! Duplicating messages, even in pathological cases like Scoble's, is trivial: 25,000 copies of a 140 byte message represents a mere 3.5 Mbytes, smaller than a single large e-mail body!
Similarly, I think you're overestimating the burden of keeping pre-calculated per-viewer data around: the default view has about 16 messages, each 140 bytes plus a bit of metadata (sender username/icon URL), total perhaps 3.2K. 10,000 users on the server? 32 MB! Trivial. Even ten *million* users on a single node would fit on a PC you can buy online from Dell!
The best architecture is probably a hybrid: keep the recent message queue in RAM for active users (and update in real time when those they follow post messages), and build the cache from disk when they log in. Even on a single host, with 15kRPM drives (4ms writes), 25,000 writes come to 100 spindle-seconds; with a pair of Apple's 16-drive arrays you're looking at about three seconds to process a Scoble-tweet, ignoring both write merging and RAID overhead.
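The back-of-the-envelope figures in this comment can be checked with a few lines of Python. All inputs are the comment's own estimates (140-byte tweets, 25,000 followers, about 3.2K per cached view, 4 ms per write, 32 drives); nothing here is measured.

```python
TWEET_BYTES = 140
FOLLOWERS = 25_000

# Storage cost of copying one tweet to every follower.
copy_bytes = FOLLOWERS * TWEET_BYTES            # 3,500,000 bytes, about 3.5 MB

# Per-viewer cache: about 16 messages plus metadata, the comment's estimate.
VIEW_BYTES = 3_200
cache_bytes_10k = 10_000 * VIEW_BYTES           # 32,000,000 bytes, about 32 MB
cache_bytes_10m = 10_000_000 * VIEW_BYTES       # 32,000,000,000 bytes, about 32 GB

# Disk time if every copy costs one 4 ms write.
WRITE_MS = 4
spindle_seconds = FOLLOWERS * WRITE_MS / 1000   # 100 seconds on a single drive
DRIVES = 2 * 16                                 # a pair of 16-drive arrays
wall_seconds = spindle_seconds / DRIVES         # about 3.1 seconds in parallel
```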
In reality, of course, you can omit a lot of those write-barriers and re-issue the writes from a redo log in the event of a crash, cutting the write load still further. Mirror the writes and distribute reads consistently, you get failover and gain cache hits to boot (each server only sees half as many active users).
Or you write it all in Ruby and SQL then throw a kajillion dollars worth of hardware at making it all sort of work most of the time through brute force. Even $15m can only buy you so much brute force, though...
Robert, I'm sure from reading these comments that the people talking about the technical problems understand how the normalized databases they teach you in Computer Science courses work. It's just that large systems can't use them (Flickr, for example, doesn't; it's sharded / de-normalized).
I don't think Twitter is sharded yet, since they weren't at 350,000 users (http://highscalability.com/scaling-twitter-maki...)
They certainly SHOULD be copying messages around if they are sharded.
You would think if they could get to 350,000 users on one database they could get to 1 million users by adding some database read-only slave servers.
Scaling isn't about saving disk space, CPU cycles, memory - that is being efficient, which is not the same thing. Microsoft might try that with Exchange to reduce their customers' hardware costs (not that it works, from what I hear).
Scaling is knowing you can buy a rack of servers and actually have them reduce your load.
As far as “charging super users” goes, it isn’t really worth arguing because it’s going to be different for every service.
This is why you need a business model. To determine which ways of making money will be most effective and execute on them. Charging super users will be right in some cases while being wrong in others (depending on how much value the company in question can put into the “charged” scenario)
Could a team of competent software engineers build a system which could handle this many users? Yes!
Should twitter have a system which can handle this many users? Yes!
I don't understand why people are so keen to defend poor service. If it's broke, then the Twitter guys should fix it. That means better code, more servers, whatever it takes.
If the problem is that they can't find a way to monetize it, then that's a different problem, but one where having lots of users should help, not hinder.
Charging 10 cents per month to each follower after the first would reap a far greater income for them, and annoy interesting twitterers less!
See my blog post for more:
http://falkayn.blogspot.com/2008/05/oms-got-wro...
Exchange never ever stored a message per user. If all users are on the same Exchange server and are sent a message from someone on the same Exchange server, it is only stored ONCE. That's been the case since Exchange 4.0. Bedlam had more to do with people hitting Reply All to an alias that had users on different servers. It was the message queue that caused the primary problems during Bedlam.
In Exchange 2007 there is a deemphasis on SIS--it only applies to attachments. Not sure what the scaling problems are with Twitter as I have no idea how the system is designed. But, it would be safe to figure that whether or not they use SIS is not the source of their instability.
Now, back to your regularly scheduled debate about a non-scalable, useless communication tool.
Solution? Create a new service on top of Twitter for Twitter streams, because obviously people don't get the idea behind 140-character limits (by the way, SMS has a 160-character limit), and hold their tweetstorms in a buffer to digest and spew out to followers when the server load can handle it.
Angus has the right idea - charging the followers - although I don't agree on the analysis. Still using Robert as the super-user, he should not be charged because of his tweets, but for the number of people he is following. Each tweet sent by the friends Robert is following will be copied on his queue (well the tweet ID) and the size and freshness of this queue (visually the 'With Others') can be used as the factor to charge.
Here's a fact that people are overlooking. Traffic brings revenue. Let's say that the 25,000 posts get counted. What does that really amount to?
That means that 25,000 people are looking at what Robert Scoble is saying. If I was a person that wants to get out my product name, then I think I would pay Twitter to keep the service going. Better yet I might ask and pay Robert to push my wares.
About a month back I @(replied) Robert on something. I believe it was during one of Twitter's many "problems". The exchange was short and sweet. However, when I looked at my followers an hour later, the count had jumped up by 15 (which it doesn't normally do).
I tested the water by @ another person. The same thing happened. I gain more followers by replying to high profile twitters.
Now, apply all that to a marketing model. Communication can mean $$. I guess that's why Twitter was able to raise 20 million on its own.
The problem isn't the Twitterflood. If that was the case then sites like MySpace and Facebook would be going down on a daily basis. If it DOES do what Om Malik suggests, then Twitter needs to look at their internal structure. Not at Robert Scoble, or Leo Laporte.
Limits and subscription fees are a great way to kill the idea. Some will pay for it, while others will say "See ya". Twitter will fall like a ball of flame into the Pacific Ocean.
They keep the idea fresh. To most, Twitter is an "Oh, I heard of that". People might know about it, but never signed up. Once more, Twitter can easily become a cash cow. The data that comes into Twitter is like when Daffy Duck found the Sultan's cave.
I'M RICH! I'M WEALTHY BEYOND MY WILDEST DREAMS!!!...
Keep going Scoble. I'm listening...
I don't know how Twitter is built. However, "back in the day," I was the development manager for a real-time stock quote delivery system, so I do have some experience with the architectural issues Twitter may be facing.
Let's look at the procedure Robert refers to as "remixes them." In the simplest architecture, there would be a single list (database, flat file, etc) of all the twitters created by everyone stored in chronological order. You may, as a storage optimization, just store a user id with the twitter string, and tweet time stamp (aka a tweet).
In this single-list architecture, a "remix" would require a query across all the tweets for a period of time for all people that a user follows. This query would be fairly fast when the number of tweets in the specified period of time is fairly small, and the number of users a person follows is fairly small. You can see that this type of query becomes more expensive when the number of users you follow increases and the overall number of tweets per period increases.
So to speed up this query, you could build some kind of index based on users. But maintaining this index would become expensive, especially during high incoming tweet periods.
So one might try to optimize this architecture by breaking up the universal store into a list of tweets per person. Now each incoming tweet can be easily added to the user's tweet list.
Then the "remix" of tweets of the people you follow would require a join across each list and then sorted by chronological order. This would become increasingly more expensive when a user starts to increase the number of people they follow. It would be particularly expensive for super users who follow lots of users.
A reasonable compromise might be to keep a single universal stream of tweets in chronological order and two lists for each user: a list of pointers of all their tweets, and a list of pointers to all the tweets from the people they follow.
Maintaining these three lists would look something like: sender publishes a tweet, it is added to the universal store, a pointer is then added to the sender's tweet list, and then "push the tweet to followers" by walking sender's list of followers and add a pointer to the tweet to each "follow" list.
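A minimal sketch of the three-list compromise just described: one universal store plus two pointer lists per user. The class and method names are hypothetical, a "pointer" is simply a list index here, and everything lives in memory on one machine, which a real system would not do.

```python
from collections import defaultdict


class TweetService:
    def __init__(self):
        self.store = []                        # universal store, chronological
        self.own = defaultdict(list)           # user -> pointers to own tweets
        self.follow_list = defaultdict(list)   # user -> pointers to followed tweets
        self.followers = defaultdict(set)      # user -> users following them

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def publish(self, sender, text):
        self.store.append((sender, text))      # 1. add to the universal store
        pointer = len(self.store) - 1          #    a pointer is just an index here
        self.own[sender].append(pointer)       # 2. sender's own tweet list
        for follower in self.followers[sender]:
            self.follow_list[follower].append(pointer)   # 3. fan out pointers
        return pointer

    def remix(self, user, limit=20):
        # No join or sort: the follow list is already in arrival order.
        return [self.store[p] for p in reversed(self.follow_list[user][-limit:])]


service = TweetService()
service.follow("john", "scoble")
service.follow("john", "jane")
service.publish("scoble", "one")
service.publish("jane", "two")
service.publish("john", "mine")
recent = service.remix("john")   # newest first
```

Each tweet's text is stored once; what gets copied per follower is only the pointer, and the read side becomes a simple list slice.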
This approach scales fairly well. It allows the act of updating the follow lists to be partitioned across multiple servers. Each server can just take (using shared queues) a tweet from the universal store and "fan it out" to the appropriate followers. It also separates the operation from the inbound tweet processing.
To optimize the "fan it out" process, you could use a publish-and-subscribe messaging product like JMS or TIBCO Rendezvous to broadcast the tweets to the servers that manage follow lists. This would require a universal store process to publish all tweets and a cloud of follow list managers listening (aka subscribing) to tweet broadcast updates for each followed person.
This approach also nicely addresses Twitter's need to separate outbound follower queues for users that have requested point to point delivery of messages via Instant Messaging and SMS.
For further scaling optimization, you can have several tweet stores instead of one single universal store. You just need to ensure that all incoming tweets from a particular user are added to the same store to maintain ordered delivery to followers.
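One simple way to keep each user's tweets in a single store is to pick the store from a stable hash of the user id. The sketch below is an assumption-laden example: `STORE_COUNT` and `store_for` are invented names, and CRC32 is used only because Python's built-in `hash()` for strings varies between processes.

```python
import zlib

STORE_COUNT = 4   # hypothetical number of tweet stores


def store_for(user_id: str, store_count: int = STORE_COUNT) -> int:
    # A stable hash keeps every tweet from one user in the same store,
    # so that user's tweets stay in order relative to each other.
    return zlib.crc32(user_id.encode("utf-8")) % store_count
```

Changing the number of stores moves users between stores, so a real deployment would need a migration plan or a consistent-hashing scheme; that is outside what the comment describes.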
So it is quite reasonable to copy (at least references) each of Robert's tweets 25,000 times, just do so in a scalable manner.
I must say that I have not really cared enough to read your column in the past. After reading this post, however, I will make sure never to read anything else you choose to write. What I see is a person that is clearly ignorant about a complex set of topics related to application design and scalability, speaking sophomorically about them. Perhaps you should take some time, in your case, a great deal of it, and educate yourself about these matters before speaking great volumes of nonsense relating to the technical implementation of this or any other application.
Start by reading the comments here and questioning some of the very smart people, who have graciously taken the time to try and educate you. Please, for the sake of the thousands of people that clearly believe you to be an authority on matters of technology, stop this idiocy.