March 29, 2018

How to use the Linux perf tool and save dollars

When does this tool help me?

Ask yourself if you have any of the following problems:
  1. Does your application built on Linux run slow, and do you want to find out why?
  2. Do you want to be cost-effective and avoid buying an expensive, licensed profiling product?
  3. Do you want a reliable and accurate profiling tool that ships with Linux distros?
If you answered "yes" to most of the above questions, then this post should provide a gateway into the "easy world of profilers".

What does this tool do?

  • It is a tool available in Linux distros (Linux 2.6+) (How to find the Linux kernel version?) which abstracts CPU hardware differences and provides a simple way to instrument/profile your code (a quick availability check is shown just below this list).
          In plain terms, it is a tool that helps you find possible bottlenecks in your applications.
  • Unlike tools such as the PureCoverage/Purify/Quantify series, perf does not need your application to be recompiled.
          In plain terms, this tool does not add any additional code (called instrumenting) to your application and can run on a precompiled binary. It derives all required metrics from low-level Linux system parameters, thus saving your day.

         Another major worry with any instrumentation tool is that, in the process of instrumenting the code, it slows your application down, which sometimes defeats the very purpose of the tool. perf stands out as one of the big players that does not add much overhead to your application during profiling.
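A quick way to check your kernel version and whether the perf front end is installed (a minimal sanity check; the package name below is distro-specific and is my assumption, not from the original post):

uname -r          # kernel version, should be 2.6 or newer
perf --version    # confirms the perf userspace tool is installed
# On Debian/Ubuntu-style distros it typically comes from something like:
# sudo apt-get install linux-tools-common linux-tools-$(uname -r)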

OK, show me some examples (I mean a real example)

Consider the below consumer-producer problem, where
  • The producer puts data into a mutex-protected buffer every 1 second.
  • The consumer consumes the data only when the buffer reaches 10000 messages (it does not want to process every message; it prefers to take data in batches).
  • And we have 100 consumer-producer pairs running in our application.


// for sleep
#include <unistd.h>

// streams
#include <iostream>
#include <sstream>

// pthread
#include <pthread.h>

// memcpy
#include <string.h>

// strings
#include <string>
#include <vector>

using std::cout;
using std::endl;
using std::string;
using std::vector;
using std::stringstream;

#define INIT_COUNT_BUF_SIZE 4096
#define ITER 10

class Mutex {
public:
    Mutex() {
        pthread_mutex_init(&m, NULL);
    }
    ~Mutex() {
        pthread_mutex_destroy(&m);
    }

    void Lock() {
        pthread_mutex_lock(&m);
    }

    void Unlock() {
        pthread_mutex_unlock(&m);
    }

    bool Trylock() {
        return (pthread_mutex_trylock(&m) == 0);
    }
private:
    pthread_mutex_t m;
};

class SyncedBuffer {
public:
    SyncedBuffer()  {
        mBuffer.reserve(INIT_COUNT_BUF_SIZE);
    }

    virtual ~SyncedBuffer() {
    }

    size_t Size() {
        Lock();
        size_t ret = mBuffer.size();
        Unlock();
        return ret;
    }

    void Write(const char* src, size_t n) {
        Lock();

        mBuffer.push_back(string((char*)src, n));

        Unlock();
    }

    size_t Read(char* dst, size_t n) {
        Lock();

        memcpy((void*)dst, mBuffer.back().c_str(), mBuffer.back().size());
        size_t copy_size = mBuffer.back().size();
        mBuffer.pop_back();

        Unlock();

        return copy_size;
    }

    void Reset() {
        Lock();

        mBuffer.clear();

        Unlock();
    }

protected:
    void Lock() {
        m.Lock();
    }

    void Unlock() {
        m.Unlock();
    }


    vector<string> mBuffer;
    Mutex m;

};

void* producer(void *ptr) {
    string prefix("");

    SyncedBuffer *b = (SyncedBuffer*)ptr;

    cout << prefix << "producer started" << endl;

    string data = "simple data string #";

    do{
        static int i = 1;
        stringstream ss;
        ss << data << i++;
        string s = ss.str();
        b->Write(s.c_str(), s.size());
        cout<<" P "<<b->Size()<<endl;
        usleep(1000000);                              // Producer pumps data every 1 second.
        //if(i % 100 == 0) cout<<"Producer count"<<i<<endl;
    } while(true);

    cout << prefix << "producer finished" << endl;

    return (void*)0;
}

void* consumer(void *ptr) {
    string prefix("\t\t\t\t\t\t");

    SyncedBuffer *b = (SyncedBuffer*)ptr;

    char data[INIT_COUNT_BUF_SIZE];

    cout << prefix << "consumer started" << endl;

    size_t read = 0;

    do{
        static int i = 1;
        if(b->Size() <= 10000 )
        {
            usleep(1);                              // Bad code: the consumer sleeps for a tiny interval, effectively busy-waiting
        }
        else
        {
            read = b->Read(data, INIT_COUNT_BUF_SIZE);
            string s;
            s.append(data, read);
            i++;
        }
    } while(true);

    cout << prefix << "consumer finished" << endl;

    return (void*)0;
}

int main() {
    const int COUNT = 100;
    SyncedBuffer b[COUNT];

    pthread_t prod_thread[COUNT];
    pthread_t cons_thread[COUNT];

    for (int i = 0; i < COUNT; i++)
        pthread_create(&prod_thread[i], NULL, producer, (void*)&(b[i]));

    sleep(2); // seconds

    for (int i = 0; i < COUNT; i++)
        pthread_create(&cons_thread[i], NULL, consumer, (void*)&(b[i]));

    for (int i = 0; i < COUNT; i++)
    {
        pthread_join(prod_thread[i], NULL);
        pthread_join(cons_thread[i], NULL);
    }

}


Now you can compile the above snippet with the below command line:


g++ -o a.out buffer.cpp -Wall -l pthread
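A small tip (my addition, not part of the original build line): compiling with -g keeps debug information in the binary, which makes symbol names and DWARF call graphs in the perf report much more readable, and -pthread is the more idiomatic way to pull in the pthread library:

g++ -g -O2 -o a.out buffer.cpp -Wall -pthread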

Strangely, when you run top for the above program, the report looks like the below:


top - 13:30:31 up 164 days,  3:59,  2 users,  load average: 0.83, 0.75, 0.39
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  5.7 sy,  0.0 ni, 93.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32767840 total, 28525000 used,  4242840 free,   519348 buffers
KiB Swap: 32772092 total,   187724 used, 32584368 free.  25386684 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23453 root      20   0 5786032   4420   2236 S 237.8  0.0   0:44.46 a.out

A strange thing you will identify is that the CPU taken by the process is 237.8%.... Howdy!!!
Now let's see if our perf tool can find out the culprit....
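(A quick side check before reaching for perf; -H is plain top's thread view and the PID is simply the one from the report above. It shows the per-thread CPU split, but it cannot tell us which code path is burning the cycles; for that we need a profiler.)

top -H -p 23453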

Running the tool

Command to run (I will explain more about the parameters in a later blog) to record the profiling information:


./a.out &                                                  # start the application in the background
perf record --call-graph dwarf -F 99 -p $! -- sleep 120    # sample it for 120 seconds

Command to peep into the recorded data:


perf report -n -i perf.data

The output produced is shown below.... there goes my first clue!!!!


                                                 ┌─Warning:───────────────────────────────┐
                                                 │Processed 7149 events and lost 6 chunks!│
                                                 │                                        │
                                                 │Check IO/CPU overload!                  │
                                                 │                                        │
                                                 │                                        │
                                                 │                                        │
                                                 │Press any key...                        │
                                                 └────────────────────────────────────────┘

It clearly tells us that there is something in the code causing a huge CPU overload (bingo, it matches our top report).

Now, if you drill down into the report, you will get the below details:


Samples: 6K of event 'cycles:pp', Event count (approx.): 76746912353054071
  Children      Self       Samples  Command  Shared Object        Symbol                                                                    
+   96.09%     0.09%             6  a.out    a.out                [.] consumer                                                              
+   92.53%     3.55%            30  a.out    libc-2.19.so         [.] __GI___libc_nanosleep                                                 
-   92.37%     0.00%             1  a.out    libc-2.19.so         [.] usleep                                                                
   - usleep                                                                                                                                 
      - 92.37% consumer                                                                                                                     
         - start_thread                                                                                                                     
           __clone                                                                                                                          
        0.00% producer

Hmmm... so the consumer has a usleep which consumes almost all the CPU.

Conclusion

  • The microsecond sleep in the consumer (callback function) is very ineffective.
  • Most of the CPU utilization goes into sleeping, being woken up by the kernel, and going back to sleep. This unnecessary, ineffective code wastes precious CPU cycles which could otherwise have been used for other work.
Extend this line of thinking a bit further: assume you have such an ineffective application running on AWS and you are paying precious $$$ for the bad code. That is a HUGE loss of revenue which could be caught just by running the same application under this tool, saving your day....

Home-work: Run the same application after changing the usleep(1) in the consumer to usleep(1000000). You should see the change; a sketch of an even better fix follows below.
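An even better fix (my own sketch, not from the original post, using the same pthread primitives as the code above) is to stop polling altogether and let the consumer block on a condition variable until a batch is ready. Names such as CondSyncedBuffer and WaitForBatch are hypothetical:

// Sketch: a condition-variable based buffer so the consumer blocks instead of
// spinning with usleep(1).
#include <pthread.h>
#include <string>
#include <vector>

class CondSyncedBuffer {
public:
    CondSyncedBuffer() {
        pthread_mutex_init(&m, NULL);
        pthread_cond_init(&cv, NULL);
    }
    ~CondSyncedBuffer() {
        pthread_cond_destroy(&cv);
        pthread_mutex_destroy(&m);
    }

    void Write(const std::string& s) {
        pthread_mutex_lock(&m);
        mBuffer.push_back(s);
        pthread_cond_signal(&cv);       // wake the consumer; no polling needed
        pthread_mutex_unlock(&m);
    }

    // Blocks (burning no CPU) until at least `batch` messages are queued,
    // then hands the whole batch to the caller.
    void WaitForBatch(size_t batch, std::vector<std::string>& out) {
        pthread_mutex_lock(&m);
        while (mBuffer.size() < batch)
            pthread_cond_wait(&cv, &m); // sleeps until signalled by Write()
        out.swap(mBuffer);
        pthread_mutex_unlock(&m);
    }

private:
    std::vector<std::string> mBuffer;
    pthread_mutex_t m;
    pthread_cond_t cv;
};

With this, the consumer sleeps inside the kernel and is woken only when the producer has actually queued enough data, so the usleep/wake-up churn that dominated the perf report disappears.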

Hmmm, it looks interesting. What more can I do with it?

A lot more... wait for my next blog post.
In the meanwhile, you can comment about any similar free tools which can potentially catch such problems. :)

Further Reading...

How to use flame-graph?
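If you want to turn the same perf.data into a flame graph, one common approach (using Brendan Gregg's FlameGraph scripts; the clone path below is just an assumption about where you put them) is:

git clone https://github.com/brendangregg/FlameGraph
perf script -i perf.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flame.svg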

And a closing thought....


