When does this tool help me?
Ask yourself whether you have any of the following problems:
- Does your application, built on Linux, run slowly, and do you want to find out why?
- Do you want to stay cost-effective and avoid buying an expensive, licensed profiling product?
- Do you want a reliable and accurate profiling tool that already ships with the Linux distros?
If you answered "yes" to most of the above questions, then this post should provide a gateway into the "easy world of profilers".
What does this tool do?
- It is a tool available in Linux distros (Linux 2.6+; run "uname -r" to check your kernel version) which abstracts CPU hardware differences and provides a simple way to profile your code.
- Unlike tools such as the PureCoverage/Purify/Quantify series, perf does not need your application to be recompiled.
In plain terms, this tool does not add any extra code to your program (called instrumenting) and can run against a precompiled binary. It derives all the required metrics from low-level Linux system counters, thus saving your day.
Another major worry with any instrumenting profiler is that, in the process of instrumenting the code, it slows down your application, which sometimes defeats the purpose of profiling in the first place. perf stands out as one of the big players that adds very little overhead to your application while profiling it.
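To make "low-level system counters" a little more concrete, here is a minimal sketch (my own illustration, not part of the example that follows) of the kernel interface perf is built on: the perf_event_open system call. It assumes a Linux box with perf events enabled; the busy loop is just placeholder work to be measured.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// glibc does not provide a wrapper, so call the syscall directly.
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES; // the same cycle counter perf samples
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    // Count cycles for the calling process on any CPU.
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // Placeholder work to measure -- note there is no instrumentation inside it.
    volatile uint64_t sum = 0;
    for (int i = 0; i < 10000000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    printf("CPU cycles used: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}
```

perf does essentially this (plus periodic sampling of the instruction pointer and call stack) for your unmodified binary, which is why no recompilation is needed.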
OK, show me some examples (I mean a real example)
Consider the producer-consumer problem below, where:
- The producer puts data into a mutex-protected buffer every 1 second.
- The consumer consumes the data only once the buffer reaches 10000 messages (it does not want to process every message; it wants to take the data in batches).
- We have 100 consumer-producer pairs running in our application.
```cpp
// for sleep
#include <unistd.h>
// streams
#include <iostream>
#include <sstream>
// pthread
#include <pthread.h>
// memcpy
#include <string.h>
// strings
#include <string>
#include <vector>

using std::cout;
using std::endl;
using std::string;
using std::vector;
using std::stringstream;

#define INIT_COUNT_BUF_SIZE 4096
#define ITER 10

class Mutex {
public:
    Mutex() { pthread_mutex_init(&m, NULL); }
    ~Mutex() { pthread_mutex_destroy(&m); }
    void Lock() { pthread_mutex_lock(&m); }
    void Unlock() { pthread_mutex_unlock(&m); }
    bool Trylock() { return (pthread_mutex_trylock(&m) == 0); }
private:
    pthread_mutex_t m;
};

class SyncedBuffer {
public:
    SyncedBuffer() { mBuffer.reserve(INIT_COUNT_BUF_SIZE); }
    virtual ~SyncedBuffer() { }

    size_t Size() {
        Lock();
        size_t ret = mBuffer.size();
        Unlock();
        return ret;
    }

    void Write(const char* src, size_t n) {
        Lock();
        mBuffer.push_back(string((char*)src, n));
        Unlock();
    }

    size_t Read(char* dst, size_t n) {
        Lock();
        memcpy((void*)dst, mBuffer.back().c_str(), mBuffer.back().size());
        size_t copy_size = mBuffer.back().size();
        mBuffer.pop_back();
        Unlock();
        return copy_size;
    }

    void Reset() {
        Lock();
        mBuffer.clear();
        Unlock();
    }

protected:
    void Lock() { m.Lock(); }
    void Unlock() { m.Unlock(); }

    vector<string> mBuffer;
    Mutex m;
};

void* producer(void *ptr) {
    string prefix("");
    SyncedBuffer *b = (SyncedBuffer*)ptr;
    cout << prefix << "producer started" << endl;
    string data = "simple data string #";
    do {
        static int i = 1;
        stringstream ss;
        ss << data << i++;
        string s = ss.str();
        b->Write(s.c_str(), s.size());
        cout << " P " << b->Size() << endl;
        usleep(1000000); // Producer pumps data every 1 second.
        //if(i % 100 == 0) cout << "Producer count" << i << endl;
    } while (true);
    cout << prefix << "producer finished" << endl;
    return (void*)0;
}

void* consumer(void *ptr) {
    string prefix("\t\t\t\t\t\t");
    SyncedBuffer *b = (SyncedBuffer*)ptr;
    char data[INIT_COUNT_BUF_SIZE];
    cout << prefix << "consumer started" << endl;
    size_t read = 0;
    do {
        static int i = 1;
        if (b->Size() <= 10000) {
            usleep(1); // Bad code: the consumer polls with a very short sleep.
        } else {
            read = b->Read(data, INIT_COUNT_BUF_SIZE);
            string s;
            s.append(data, read);
            i++;
        }
    } while (true);
    cout << prefix << "consumer finished" << endl;
    return (void*)0;
}

int main() {
    const int COUNT = 100;
    SyncedBuffer b[COUNT];
    pthread_t prod_thread[COUNT];
    pthread_t cons_thread[COUNT];

    for (int i = 0; i < COUNT; i++)
        pthread_create(&prod_thread[i], NULL, producer, (void*)&(b[i]));

    sleep(2); // seconds

    for (int i = 0; i < COUNT; i++)
        pthread_create(&cons_thread[i], NULL, consumer, (void*)&(b[i]));

    for (int i = 0; i < COUNT; i++) {
        pthread_join(prod_thread[i], NULL);
        pthread_join(cons_thread[i], NULL);
    }
}
```
Now you can compile the above snippet with the command line below:
```sh
g++ -o a.out buffer.cpp -Wall -lpthread
```
Strangely, when you look at the top report for the above program, it looks like this:
```text
top - 13:30:31 up 164 days,  3:59,  2 users,  load average: 0.83, 0.75, 0.39
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  5.7 sy,  0.0 ni, 93.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32767840 total, 28525000 used,  4242840 free,   519348 buffers
KiB Swap: 32772092 total,   187724 used, 32584368 free, 25386684 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23453 root      20   0 5786032   4420   2236 S 237.8  0.0   0:44.46 a.out
```
The strange thing you will notice is that the process is consuming 237.8% CPU.... Howdy!!!
Now let's see if our perf tool can find the culprit....
Running the tool
Command to record the profiling information (I will explain the parameters in more detail in a later blog post):

```sh
perf record --call-graph dwarf -F 99 ./a.out
```

Since the program never exits on its own, let it run for about 120 seconds and then stop it with Ctrl-C; perf writes its samples to perf.data.
Command to peek into what was really going on inside the program:

```sh
perf report -n -i perf.data
```
The output it produces... there goes my first clue!
```text
┌─Warning:───────────────────────────────┐
│Processed 7149 events and lost 6 chunks!│
│                                        │
│Check IO/CPU overload!                  │
│                                        │
│                                        │
│                                        │
│Press any key...                        │
└────────────────────────────────────────┘
```
This already tells us that something in the code is causing a huge CPU overload - so much that perf itself is losing events (bingo, it matches our top report).
Now, if you drill down into the report, you get the details below:
```text
Samples: 6K of event 'cycles:pp', Event count (approx.): 76746912353054071
  Children      Self  Samples  Command  Shared Object  Symbol
+   96.09%     0.09%        6  a.out    a.out          [.] consumer
+   92.53%     3.55%       30  a.out    libc-2.19.so   [.] __GI___libc_nanosleep
-   92.37%     0.00%        1  a.out    libc-2.19.so   [.] usleep
   - usleep
      - 92.37% consumer
         - start_thread
              __clone
        0.00% producer
```
Hmmm... so the consumer has a usleep that consumes almost all of the CPU.
Conclusion
- The one-microsecond sleep in the consumer thread is very inefficient.
- Most of the CPU time goes into sleeping, being woken up by the kernel, and going back to sleep. This unnecessary, inefficient polling wastes precious CPU cycles that could otherwise have been used for real work.
Extend this line of thinking a bit further: assume you have such an inefficient application running on AWS and you are paying precious $$$ for the bad code. That is a real loss of money which could have been caught just by running the same application under this tool - and that could save your day....
Homework: run the same application after changing the usleep(1) in the consumer to usleep(1000000). You will see the difference.
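Beyond the homework, an even better cure than a longer sleep is to not poll at all: let the consumer block until a full batch is ready. Below is a minimal sketch (my own, not the original code above) of how the buffer could be reworked around a pthread condition variable; the class and member names are made up for illustration and error handling is omitted.

```cpp
#include <pthread.h>
#include <string>
#include <vector>

// Sketch: replaces the consumer's usleep(1) polling loop with a
// condition variable, so the thread sleeps until a batch is ready.
class BatchedBuffer {
public:
    BatchedBuffer() : mBatchSize(10000) {
        pthread_mutex_init(&mMutex, NULL);
        pthread_cond_init(&mCond, NULL);
    }
    ~BatchedBuffer() {
        pthread_cond_destroy(&mCond);
        pthread_mutex_destroy(&mMutex);
    }

    // Producer: append one message and wake the consumer
    // only when a full batch is available.
    void Write(const std::string &msg) {
        pthread_mutex_lock(&mMutex);
        mBuffer.push_back(msg);
        if (mBuffer.size() >= mBatchSize)
            pthread_cond_signal(&mCond);
        pthread_mutex_unlock(&mMutex);
    }

    // Consumer: block (using no CPU) until a batch is ready,
    // then take the whole batch in one go.
    std::vector<std::string> ReadBatch() {
        pthread_mutex_lock(&mMutex);
        while (mBuffer.size() < mBatchSize)
            pthread_cond_wait(&mCond, &mMutex); // sleeps in the kernel
        std::vector<std::string> batch;
        batch.swap(mBuffer);
        pthread_mutex_unlock(&mMutex);
        return batch;
    }

private:
    const size_t mBatchSize;
    std::vector<std::string> mBuffer;
    pthread_mutex_t mMutex;
    pthread_cond_t mCond;
};
```

With this approach the consumer thread sits inside pthread_cond_wait and costs essentially no CPU between batches, instead of bouncing in and out of the kernel a million times a second.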
Hmmm, this looks interesting. What more can I do with it?
A lot more... wait for my next blog post.
In the meanwhile, feel free to comment about any similar free tools that can catch problems like this. :)