We have a set of FMS's deployed on Amazon's EC2. One of the things we want to be able to do is automatically detect when we should start up another FMS instance. To do that, I've been looking for metrics I could measure on the local FMS box to help me identify "transition" points, e.g., when we should add capacity or remove excess capacity.
I ran some load tests to find the capacity limits of a particular box, but ran into a couple of problems:
* Traditional system metrics (CPU/memory/run queue length) did not do a great job of predicting when we'd hit a wall. Load average was really the only thing that seemed to climb much, and it was only at about 4 (on a 4-core box) when things went south.
* When we *did* hit a wall, it was a pretty sharp cliff. We seemed to be doing fine at 70+70 streams (~300 kbps streams in, reflected out) and at 75+75 streams, but when I went to 80+80 streams, BAM! Things just started unravelling, with very little in the way of error logs to indicate what might be happening. All of a sudden, my counters for simultaneous streams, etc. dropped from 80ish to 20ish (I was still publishing 80 to the server).
I tried bumping up the EC2 instance size (under the theory that we were being bandwidth-capped or stream-capped), but didn't really see much difference.
I see two possibilities:
* We actually are being bandwidth- or stream-capped, and going up to a bigger box didn't help.
* There are other metrics on the server I could look at that would have shown a gradual degradation.
Assuming the latter, does anyone have any suggestions for what metrics I might measure on the FMS to decide if we were starting to get loaded? For example, I've thought about comparing Stream.time to NetStream.time for streams I'm reflecting out of the server.
You can determine that your FMS server is overloaded when any of the following happens: it physically breaks down, runs out of RAM, or has no bandwidth left; you have server-side code that is inefficient; or you have a client-side file that is making 100 connections per client.
The goal here is to detect warning signs of load _before_ failure. I'm looking for a metric that would most closely match the user's subjective experience. Also, the check needs to be automatable (either via server-side ActionScript or a separate program running on the box).
Well, it's like this: you need to know how many users your server can handle before it fails, and there is no real way to load test it without having a lot of people log into your application. FMS can theoretically handle as many users as you can throw at it until your hardware either fails or runs out of resources such as RAM. So if you are unable to monitor your server and determine that, say, 100 more users logging on will make it fail, then your only options are to add a new server, load balancer, or edge server, or to put up a message saying "our site has failed, please come back when we have bought more junk to support you." By the way, Facebook, we have more users than you, hahahah. I say this because your site will probably never achieve more users than your server can handle unless you are using a desktop computer to run FMS.
We have a rough idea of how many streams our server can handle before it falls over (per my first message). To assert that we'll never have more than 75 simultaneously broadcasting users seems like a bit of a stretch given that you know nothing about our company or product.
The question here isn't about load testing. It is about measurable (preferably from server-side ActionScript) metrics that we can use to determine whether FMS performance is starting to degrade. Ideally there would be a metric X (say, average delay between a stream and the relayed version of it sent along to another server) whose value would change as the server became loaded. So X=N at normal load and X=M at failure. Then I can find a value between N and M that we'll use as a trigger to begin the process of adding another FMS to our farm. Right now, the only metric I have available is the number of simultaneous streams. I'm hoping to get suggestions of other metrics that can be programmatically measured.
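To make the "value between N and M" idea concrete, here is a minimal Python sketch of such a trigger. It is only an illustration under stated assumptions: the metric X is assumed to grow with load (so M > N), and the names `pick_trigger`, `should_add_capacity`, `fraction`, and `min_breaches` are hypothetical, not part of FMS or any existing tool. Requiring several consecutive breaches is one simple way to avoid reacting to a single noisy sample.

```python
def pick_trigger(normal, failure, fraction=0.5):
    """Return a trigger value `fraction` of the way from N (normal) to M (failure).

    Assumes the metric grows with load, i.e. failure > normal.
    """
    return normal + fraction * (failure - normal)


def should_add_capacity(samples, trigger, min_breaches=3):
    """Return True when the last `min_breaches` samples are all at or past the trigger."""
    recent = samples[-min_breaches:]
    return len(recent) == min_breaches and all(s >= trigger for s in recent)


if __name__ == "__main__":
    # Hypothetical numbers: X is 100 ms at normal load and 500 ms at failure.
    trigger = pick_trigger(100, 500, fraction=0.5)
    print(trigger, should_add_capacity([120, 310, 320, 350], trigger))
```

Where the trigger sits between N and M is a judgment call: the closer it is to N, the earlier (and more often) a new instance would be started.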
I've pretty much read the entire server-side ActionScript doc and have never seen anything like what you are looking for. You probably need to use a hardware appliance to monitor the network so you can determine when you are about to have problems. However, if you can identify the variables you wish to monitor, you might write a C++ application to alert you. It's possible to use a C++ application to authenticate users with FMS, so you might also be able to write a C++ app and use it within FMS for your purposes. I don't have much experience doing this, so I emphasize that this has medium chances of success. Waste your time with my suggestion at your own risk, so to speak.
I don't expect a direct answer (e.g., "metric X from getStats()"), as I am also quite familiar with the ActionScript doc and wouldn't have asked the question if there were an obvious answer. Rather, I'm looking for expert input -- preferably from Adobe employees, as I have to believe I'm not the first person to ask this sort of question and I'm hoping they might have some "best practice" input.
Language isn't relevant here. I have non-FMS system monitoring in place and I don't see any obvious nonlinear behavior show up in any of the system metrics I am measuring when we hit the wall. This is why I've turned to media metrics as a possible source. For example, one would expect a loaded server to have higher outbound jitter than a less-loaded server. What I'm looking for probably requires either a considerable amount of experience solving a related problem or internal knowledge of the FMS architecture. I'm hoping for someone with one or the other to take a stab at this.
EC2 provides no QoS guarantees. Your answer will be variable. That said, more than likely it is an EC2 VM or network resource being exhausted, and probably not FMS-related, except insofar as FMS is creating the load.
However, if there is something wrong in FMS, you would need to see what the logs show.
I don't remember exactly what the logs said, but I do remember that they were decidedly unhelpful.
I'm not really sure how a QoS guarantee (or lack thereof) applies here. In fact, I'm going the opposite direction. Instead of assuming that machine X can do Y things (and then measuring how close to Y I get), I want to monitor the "quality" of the things I'm doing -- under the assumption that I'll start doing those things badly when under load (and before it tips over).
For example, one of the two types of streams that arrive at our FMS is reflected to a Wowza box for downstream delivery, archiving, etc. At any point in time, I can see how far I am into the inbound stream and compare it to how far I am into the reflected stream (under the assumption that this will start to vary more widely as the FMS becomes loaded). Alternatively, I could just look at packet loss for the reflected streams (since I assume that FMS will start dropping packets for those outbound streams if it gets too far behind). But using either of these metrics is based on assumptions that are difficult to verify and I was hoping to hear from someone who has tried to solve this problem. OTOH I may just be breaking new ground here (but it seems unlikely).
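The inbound-versus-reflected comparison described above could be summarized roughly as follows. This is a sketch, not a tested monitoring tool: it assumes something else (for example server-side script) already collects paired readings of the two stream positions in seconds, and the function name `lag_stats` is made up for illustration. It reports both the average lag and how much the lag varies, since the post suggests the variation is what should widen under load.

```python
from statistics import mean, pstdev


def lag_stats(samples):
    """Summarize how far the reflected stream trails the inbound stream.

    `samples` is a list of (inbound_time, reflected_time) pairs in seconds,
    taken at the same instants. Returns (mean lag, population std dev of lag).
    """
    lags = [inbound - reflected for inbound, reflected in samples]
    return mean(lags), pstdev(lags)


if __name__ == "__main__":
    # Hypothetical readings: the reflected copy is a steady 0.5 s behind.
    steady = [(10.0, 9.5), (20.0, 19.5), (30.0, 29.5)]
    print(lag_stats(steady))
```

Whether this lag actually widens before the server falls over is exactly the unverified assumption mentioned above; the sketch only shows how the number would be computed.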
As far as an EC2 resource limit being hit, you could very well be correct, but I'd like a way of verifying that. VM seems unlikely (load and/or cpu just aren't that high), but being bandwidth capped by Amazon is possible. If that is indeed the issue, then I may be stuck with a simple "total current bandwidth consumption" metric.
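A "total current bandwidth consumption" metric could be derived from the kernel's interface byte counters. The sketch below assumes the FMS box runs Linux and exposes `/proc/net/dev`; the interface name `eth0` and the function names are illustrative assumptions. It parses the transmitted-bytes counter from the file's text and turns two readings into bits per second.

```python
def tx_bytes(proc_net_dev_text, iface):
    """Extract the transmitted-bytes counter for `iface` from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            # After the colon come 8 receive fields, then transmit bytes.
            return int(rest.split()[8])
    raise KeyError(iface)


def outbound_bps(bytes_before, bytes_after, seconds):
    """Convert two byte-counter readings taken `seconds` apart into bits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_after - bytes_before) * 8 / seconds


if __name__ == "__main__":
    import time

    # Assumes Linux and an interface named eth0; adjust for the actual box.
    with open("/proc/net/dev") as f:
        before = tx_bytes(f.read(), "eth0")
    time.sleep(5)
    with open("/proc/net/dev") as f:
        after = tx_bytes(f.read(), "eth0")
    print(outbound_bps(before, after, 5))
```

If the measured rate plateaus at the same value on every run while stream counts collapse, that would be consistent with a cap, though it would not prove one.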
Thanks, everyone, for bringing this up and getting the discussion going. Here are my quick thoughts on this one.
Any load testing, as mentioned, would start with CPU and memory metrics. The same goes for FMS.
For a live case, CPU usage with the default FMS configuration will be a little high. This is because of the aggregate messages and other queues that are maintained. One can disable these (in application.xml) to considerably reduce CPU usage.
Memory increases as more streams are served, but it will stabilize. In my experience, for 1200 connections all playing a 500 kbps stream, I would expect memory usage of somewhere around 2-3 GB (I can confirm the numbers later if accuracy is needed).
One other good thing to look at is buffer length on the subscribers. An abnormal increase in its value shows the server is unable to fill the buffer of the client well in time.
Another related option is to look for frequent NetStream.Buffer.Empty and NetStream.Buffer.Full codes. If they are coming up too fast, it means the buffer on the client side is being emptied faster than we want.
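One way to quantify "coming up too fast" is to count buffer events in a sliding time window. The Python sketch below assumes that clients (or server-side script) report each NetStream.Buffer.Empty status to a monitoring process together with a timestamp; that reporting path and the `EventRate` class are hypothetical, not something FMS provides out of the box.

```python
from collections import deque


class EventRate:
    """Count events (e.g. NetStream.Buffer.Empty reports) in a sliding window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.times = deque()

    def record(self, timestamp):
        """Register one event observed at `timestamp` (seconds)."""
        self.times.append(timestamp)
        self._trim(timestamp)

    def count(self, now):
        """Return how many recorded events fall within the window ending at `now`."""
        self._trim(now)
        return len(self.times)

    def _trim(self, now):
        # Drop events that are a full window or more older than `now`.
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()


if __name__ == "__main__":
    rate = EventRate(window_seconds=60)
    for t in (0, 10, 50):
        rate.record(t)
    print(rate.count(55), rate.count(65))
```

The per-window count can then be compared against a threshold chosen from benchmarking, in the same way as any other load metric.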
Latency is by far the best indicator. Measure how far the subscribers deviate from the 'actual' live position; queues and aggregation of messages will play a part here again.
Core logs are written for any system overload (more than 90% FMS CPU). Watch out for these logs. Until you see them, I am confident FMS is doing fine.
Another option to look at is fibers. You can enable or disable them to compare performance.
In the end, each of us has to do some benchmarking to find the just_before_fail_state. We keep doing that internally, with lots of load, expecting it to crash.
Just check this article: http://groups.google.com/group/LR-LoadRunner/browse_thread/thread/1c22b011edc7f192?pli=1 and see if it helps.
Are there any free tools for this?
(LoadRunner seems to be an HP tool and somewhat pricey.)
Can we also use fmscheck, as suggested here: http://www.richinternet.de/blog/index.cfm?entry=6EA082F4-A85E-FD95-A8AB8C7A1770D09A ?
And please, if yes, where and how should we check test success/failure?
Thank you in advance.