March 31, 2018

How to use linux perf tools and save dollars - Part 2

Flame Graph - Introduction

In our previous discussion (I strongly recommend going through the linked blog), we saw how, in the real world, the Linux perf tool helps in writing very well-optimized code, and how it saved us loads of CPU cycles in the linked example.

Now we will see how to visualize the output, which helps us find the bottleneck super fast!!!

What is flame graph?

Flame graph is a tool which converts the binary data file from the perf tool into a "*.svg" file - a much more comfortable visual format. In fact, to put it more generically, flame graph is a tool which represents profile data from different profilers - perf, DTrace, etc. - in a visual format.


Ok... Tell me how to use it

Step - 1

Install FlameGraph and generate the perf data dump. For this, we reuse the example from the previous discussion and generate the perf data dump. Check the below GIF, which shows the steps executed -

Install Flame Graph and using perf generate the perf dump
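If the GIF does not load, here is a minimal command-line sketch of the same steps, assuming you keep the FlameGraph scripts under ~/FlameGraph and profile the a.out binary built in Part 1:

# Grab Brendan Gregg's FlameGraph scripts (the checkout path is my assumption)
git clone https://github.com/brendangregg/FlameGraph.git ~/FlameGraph

# Record a profile at 99 Hz with DWARF call graphs; stop with Ctrl-C after a minute or so
perf record --call-graph dwarf -F 99 ./a.out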

Step - 2

Follow the steps shown in the below GIF to convert the dump into a "*.svg" file which can be opened in any browser and analyzed.

Convert perf data to flame graph
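For reference, a hedged command-line version of the same conversion (the out.perf/out.folded/out.svg file names are just my choices):

# Dump the recorded samples from perf.data as text
perf script -i perf.data > out.perf

# Fold the stacks, then render the interactive SVG
~/FlameGraph/stackcollapse-perf.pl out.perf > out.folded
~/FlameGraph/flamegraph.pl out.folded > out.svg

# out.svg can now be opened in any browser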

Step - 3

Analysis of the graph

  • In the GIF, you can see that most of the stack in the generated graph (the stack area represents CPU cycles) is related to the usleep system call.
  • Also, usleep is triggered by the callback function of the consumer thread.
  • Comparing the total area of the consumer and the producer, we can clearly see that the total CPU consumed by the consumer is far more than the producer's.
This immediately helps us conclude that the consumer is excessively slow, and that the cause is the usleep call in the consumer callback function.

Voilà.... we get all the required answers!!!!

Conclusion

  • Use the flame graph tool effectively (even in multi-threaded applications) to visualize your application's bottlenecks and fix them.
  • Both the long-term costs and the time spent debugging will be significantly reduced with this great tool.
Keep watching this space - you will get much more information on this tool.
Would be happy to hear from you and help you out... Please leave your opinion in the comment section.

And a closing thought...



How to find linux kernel version in your distros

How do you find the Linux kernel version you are running?

Run the below command to find out the Linux kernel version.

uname -r

On my distro, I get the below response

~/ $ uname -r
3.13.0-142-generic
~/ $



Here is a short clip on how to run the command and get the details.
As an added bonus, I have covered more options for the same command.

How to find linux kernel version
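If you would rather not watch the clip, here is a quick sketch of a few of those options (all standard uname flags):

uname -r   # kernel release, e.g. 3.13.0-142-generic
uname -s   # kernel name, e.g. Linux
uname -m   # machine hardware name, e.g. x86_64
uname -a   # all of the above and more, in one line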


March 30, 2018

SBT run/test for every edit automatically

What is it about?

Imagine you have to make a huge number of small edits to your Scala source code and verify the output after each one. In most other languages, you end up doing lots of manual work that eats up lots of repetitive, boring time.

So we end up spending some time automating the change -> build -> run loop. This is handled smartly in Scala's sbt.

Ok, tell me about the savior

~ (Tilde) is our savior. 

Run the program with a prefix of ~ to make sbt continuously poll for changes and re-execute the command whenever one is detected.

Like :


sbt> ~run


Ok, show me a small clip with example

Here you go

GIF on how to use tilde and run on-the-go
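In case the GIF does not load, here is a rough sketch of what such a session looks like (the project and its output are made up):

sbt> ~run
[info] Running Main
hello world
[success] Total time: 1 s
1. Waiting for source changes... (press enter to interrupt)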


Similarly, you can do the same with the testOnly command to run your test cases on every change.

Here you go for the test suite example

GIF on how to use tilde and run test cases on the go
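The equivalent for a single suite would look something like this (com.example.MySpec is a hypothetical suite name):

sbt> ~testOnly com.example.MySpec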


Wonderful, ain't it?
Let me know your thoughts in the comment section.

March 29, 2018

How to find which flavor and which version of JDK do I have on my Ubuntu

Command to find the files containing the JDK information

Command:

file `which java javac`

Output:


~/ $ file `which java javac`
/usr/bin/java:  symbolic link to `/etc/alternatives/java'
/usr/bin/javac: symbolic link to `/etc/alternatives/javac'


Command to find the JDK version information

Command and Output


~/ $ file /etc/alternatives/java /etc/alternatives/javac
/etc/alternatives/java:  symbolic link to `/usr/lib/jvm/java-8-oracle/jre/bin/java'
/etc/alternatives/javac: symbolic link to `/usr/lib/jvm/java-8-oracle/bin/javac'
~/ $

So, you are running Oracle's Java/JDK version 8.
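Two hedged shortcuts that get you the same answer faster (the exact version string will differ on your machine):

readlink -f `which javac`    # resolves the whole symlink chain in one shot
java -version                # asks the JVM directly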

How to use linux perf tool and save dollars

When does this tool help me?

Ask yourself if you have any of the following problems
  1. Does your application built on Linux run slow, and do you want to find out the reason?
  2. Do you want to be cost-effective and avoid buying an expensive licensed profiling product?
  3. Do you want a reliable and accurate profiling tool that ships with Linux distros?
If you answered "yes" to most of the above questions, then this post should provide you a gateway into the "easy world of profilers".

What does this tool do?

  • It is a tool available in Linux distros (Linux 2.6+) (How to find linux kernel version?) which abstracts CPU hardware differences and provides a simple way to instrument/profile your code. 
          In plain terms, it is a tool which helps in finding any possible bottlenecks in your applications.
  • Unlike tools such as the PureCoverage/Purify/Quantify series, perf does not need your application to be recompiled.
          In plain terms, this tool does not add any additional code overhead (called instrumenting) and can run on a precompiled binary. It derives all required metrics from low-level Linux system counters, thus saving your day.

         Another major worry with any instrumenting tool is that, in the process of instrumenting the code, it slows down your application, which sometimes defeats the very purpose of the tool. perf stands out as one of the big guys that does not add much overhead to your application during profiling - see the quick taste below.
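As a quick taste of the "no recompilation" claim, you can point perf at any stock binary; perf stat, for example, prints a counter summary without touching the target (ls here is just a stand-in):

perf stat -- ls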

OK, show me some examples (I mean a real example)

Consider the below consumer-producer problem, where
  • The producer puts data into a mutex-protected buffer every 1 second.
  • The consumer consumes the data only when it reaches 10000 messages (it does not want to process every message; it prefers to take the data in batches).
  • And we have 100 consumer-producer pairs running in our application.


// for sleep
#include <unistd.h>

// streams
#include <iostream>
#include <sstream>

// pthread
#include <pthread.h>

// memcpy
#include <string.h>

// strings
#include <string>
#include <vector>

using std::cout;
using std::endl;
using std::string;
using std::vector;
using std::stringstream;

#define INIT_COUNT_BUF_SIZE 4096
#define ITER 10

class Mutex {
public:
    Mutex() {
        pthread_mutex_init(&m, NULL);
    }
    ~Mutex() {
        pthread_mutex_destroy(&m);
    }

    void Lock() {
        pthread_mutex_lock(&m);
    }

    void Unlock() {
        pthread_mutex_unlock(&m);
    }

    bool Trylock() {
        return (pthread_mutex_trylock(&m) == 0);
    }
private:
    pthread_mutex_t m;
};

class SyncedBuffer {
public:
    SyncedBuffer()  {
        mBuffer.reserve(INIT_COUNT_BUF_SIZE);
    }

    virtual ~SyncedBuffer() {
    }

    size_t Size() {
        Lock();
        size_t ret = mBuffer.size();
        Unlock();
        return ret;
    }

    void Write(const char* src, size_t n) {
        Lock();

        mBuffer.push_back(string((char*)src, n));

        Unlock();
    }

    size_t Read(char* dst, size_t n) {
        Lock();

        memcpy((void*)dst, mBuffer.back().c_str(), mBuffer.back().size());
        size_t copy_size = mBuffer.back().size();
        mBuffer.pop_back();

        Unlock();

        return copy_size;
    }

    void Reset() {
        Lock();

        mBuffer.clear();

        Unlock();
    }

protected:
    void Lock() {
        m.Lock();
    }

    void Unlock() {
        m.Unlock();
    }


    vector<string> mBuffer;
    Mutex m;

};

void* producer(void *ptr) {
    string prefix("");

    SyncedBuffer *b = (SyncedBuffer*)ptr;

    cout << prefix << "producer started" << endl;

    string data = "simple data string #";

    do{
        static int i = 1;                     // note: shared by all producer threads (kept as in the original example)
        stringstream ss;
        ss << data << i++;
        string s = ss.str();
        b->Write(s.c_str(), s.size());
        cout<<" P "<<b->Size()<<endl;
        usleep(1000000);                              // Producer pumps data every 1 second.
        //if(i % 100 == 0) cout<<"Producer count"<<i<<endl;
    } while(true);

    cout << prefix << "producer finished" << endl;

    return (void*)0;
}

void* consumer(void *ptr) {
    string prefix("\t\t\t\t\t\t");

    SyncedBuffer *b = (SyncedBuffer*)ptr;

    char data[INIT_COUNT_BUF_SIZE];

    cout << prefix << "consumer started" << endl;

    size_t read = 0;

    do{
        static int i = 1;
        if(b->Size() <= 10000 )
        {
            usleep(1);                              // Bad code: the consumer sleeps for a very short time
        }
        else
        {
            read = b->Read(data, INIT_COUNT_BUF_SIZE);
            string s;
            s.append(data, read);
            i++;
        }
    } while(true);

    cout << prefix << "consumer finished" << endl;

    return (void*)0;
}

int main() {
    const int COUNT = 100;
    SyncedBuffer b[COUNT];

    pthread_t prod_thread[COUNT];
    pthread_t cons_thread[COUNT];

    for (int i = 0; i < COUNT; i++)
        pthread_create(&prod_thread[i], NULL, producer, (void*)&(b[i]));

    sleep(2); // seconds

    for (int i = 0; i < COUNT; i++)
        pthread_create(&cons_thread[i], NULL, consumer, (void*)&(b[i]));

    for (int i = 0; i < COUNT; i++)
    {
        pthread_join(prod_thread[i], NULL);
        pthread_join(cons_thread[i], NULL);
    }

}


Now you can compile the above snippet with the below command line


g++ -o a.out buffer.cpp -Wall -pthread

Strangely, when you look at the top report for the above program, it looks like below


top - 13:30:31 up 164 days,  3:59,  2 users,  load average: 0.83, 0.75, 0.39
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  5.7 sy,  0.0 ni, 93.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32767840 total, 28525000 used,  4242840 free,   519348 buffers
KiB Swap: 32772092 total,   187724 used, 32584368 free. 25386684 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23453 root      20   0 5786032   4420   2236 S 237.8  0.0   0:44.46 a.out

A strange thing you will identify is that the CPU taken by the process is 237.8%.... Howdy!!!
Now let's see if our perf tool can find the culprit....

Running the tool

Command to record the profiler information (I will explain more about the parameters in a later blog)


timeout 120 perf record --call-graph dwarf -F 99 ./a.out   # profile for 120 seconds

Command to peep into the inner happenings of the tool


perf report -n -i perf.data
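As a side note, the same report can also be dumped as plain text with the standard --stdio flag, which is handy for sharing:

perf report -n -i perf.data --stdio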

The output produced is.... - there goes my first clue!!!!


                                                 ┌─Warning:───────────────────────────────┐
                                                 │Processed 7149 events and lost 6 chunks!│
                                                 │                                        │
                                                 │Check IO/CPU overload!                  │
                                                 │                                        │
                                                 │                                        │
                                                 │                                        │
                                                 │Press any key...                        │
                                                 └────────────────────────────────────────┘

It clearly tells us that there is incorrect code causing a huge CPU overload (bingo - it matches our top report).

Now, if you drill down into the report, you will see the below details


Samples: 6K of event 'cycles:pp', Event count (approx.): 76746912353054071
  Children      Self       Samples  Command  Shared Object        Symbol                                                                    
+   96.09%     0.09%             6  a.out    a.out                [.] consumer                                                              
+   92.53%     3.55%            30  a.out    libc-2.19.so         [.] __GI___libc_nanosleep                                                 
-   92.37%     0.00%             1  a.out    libc-2.19.so         [.] usleep                                                                
   - usleep                                                                                                                                 
      - 92.37% consumer                                                                                                                     
         - start_thread                                                                                                                     
           __clone                                                                                                                          
        0.00% producer

hmmm... so the consumer has a usleep which consumes almost all the CPU.

Conclusion

  • The microsecond sleep in the consumer (callback function) is very ineffective.
  • Most of the CPU utilization goes into sleeping, being woken up by the kernel, and sleeping again. This unnecessary, ineffective code wastes precious CPU cycles which could otherwise have been used for other purposes.
Extend this line of thinking a bit further: assume you have such an ineffective application running on AWS, and you are paying precious $$$ for the bad code. A HUGE loss of revenue which could be caught just by running the same application under this tool - it could save your life....

Home-work: Run the same application after changing the usleep(1) in the consumer callback to usleep(1000000). You will see the change.
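A minimal sketch of that experiment, assuming the source file is named buffer.cpp as in the compile step above:

# Bump the consumer's sleep from 1 microsecond to 1 second, rebuild, re-profile
sed -i 's/usleep(1)/usleep(1000000)/' buffer.cpp
g++ -o a.out buffer.cpp -Wall -pthread
perf record --call-graph dwarf -F 99 ./a.out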

hmmm, it looks interesting. What more can I do with it?

A lot more... wait for my next blog.
In the meanwhile, you can comment about any similar free tools which can potentially catch such problems. :)

Further Reading...

How to use flame-graph?

And a closing thought....



October 29, 2011

Static/global variable and shared library behaviour

[Note: This is a long discussion, and the reader will need a considerable amount of time to complete it.]

Problem Description
Let's take an example of a class having a static variable or a global variable.  

Header file : x.h
extern int g_s;

class XYZ
{
        static int s;
    
    private:
        static int getInt();
};
Implementation file: x.cpp


int XYZ::s = 777;
int g_s = 7;
int XYZ::getInt()
{
  return s;
}

Now let us go through the below questions (a little quiz) and answer them
  • What is the scope of the global variable (g_s) and the static class variable (XYZ::s)? 
    • Naturally, the answer would be global. 
    • For a class static data member, the scope would be global with the additional limitation of access control (i.e., all objects of XYZ in the executable share a single copy of s, accessible only through the class interface).
  • How many copies of the variable exist? 
    • One 
Are the above answers really correct? What happens in the case of an executable involving multiple shared libraries (.so files), each containing the object file x.o? 
We will explore this, then come back to the little quiz and re-answer it at the end.

Description

Consider that you are building an executable a.exe.
This executable consists of many objects and libraries, as shown in the below diagram.




  1. libXYZ.so consists of 3 objects: X.o, Y.o and Z.o 
  2. libVWX.so consists of 3 objects: V.o, W.o and X.o 
  3. libWXY.so consists of 3 objects: W.o, X.o and Y.o 
  4. and the executable a.exe consists of  
    1. the objects X.o, M.o and A.o, and  
    2. the shared libraries (linked using the -L/-l options in the makefile) libXYZ.so, libVWX.so and libWXY.so. 
From the diagram, you can see that there are 4 copies of X.o present in the final executable. Thus, in the final executable, there are 4 copies of the static variable XYZ::s in the system.
  • Does building the executable like this throw a linker error (complaining of multiple definitions)? 
    • Surprisingly, no!!! From a detailed study I have done, multiple definitions are flagged based on a certain rule set, described below.
    • If the same symbol is found among peer object files and/or static libraries, then it is flagged as an error during the build. For example,
      • only if X.o and M.o contain the same symbol is it flagged as an error - in the case of building the executable.
      • only if X.o and Y.o contain the same symbol is it flagged as an error - in the case of building a shared library.   
    • During linking (the last stage of building the executable), shared library symbols are not resolved. That means the linker will not look into the symbol definitions present in any of the shared libraries when the symbol is already present in an object file of the executable.
  • Now let us come to the most interesting fact: in such circumstances, which symbol is used? 
    • It depends on the OS. 
        SUSE 10 Linux: 
        Here, the first encountered symbol is always placed in the Global Symbol Table (GST).
        So, whenever any of the libraries resolves the symbol during execution, the GST is consulted and always yields that first symbol - so every library points to the same copy of the variable. 

        HP-UX:
        Here, when execution flows through a library, it refers to the "global" variable in the scope of that library. This can be visualized as the GST holding the global variable qualified by the library name as well.
 
        This is contrary to SUSE Linux.
         

In short, SUSE Linux keeps a single referenced copy of the global variable across all the libraries and the overlapping objects in the executable. HP-UX, however, keeps a global copy per library, and that copy is the one referenced during execution. 
  • Doesn't this lead to different behavior on SUSE and HP-UX for the same code?
    • Yes. Try it out!!!

  • How can this kind of problem be avoided? 
    • The main problem in this example is the use of global objects/static global variables in multiple shared libraries and also in the main object file. The general expectation is that the linker will complain about the problem - but it does not.
So the only other option is to build the libraries as static libraries and link them into the executable; if no error is thrown, relink with the shared libraries. A hedged way to inspect the duplication directly is shown below.
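For example, you can list the defined symbols in each binary and look for the duplicated variable (library and symbol names are taken from the example above; use nm -D instead if the libraries are stripped):

nm -C libXYZ.so libVWX.so libWXY.so | grep "XYZ::s"
nm -C a.exe | grep "XYZ::s"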
   
Conclusion 

Now let us come back to the little quiz (discussed in the beginning).
  • What is the scope of the global variable (g_s) and the static class variable (XYZ::s)?
    • (Previous answer)
      • Naturally, the answer would be global. 
      • For a class static data member, the scope would be global with the additional limitation of access control (i.e., all objects of XYZ in the executable share a single copy of s). 
    • (Now)
      • Global, within the scope of the shared library (while building the ".so") or of the executable. 
      • Symbol resolution is done among peer object files (.o) or static libraries (.a).  
      • Recursive symbol resolution is not done among linked shared libraries to detect conflicts.
  • How many copies of the variable exist? 
    • (Previous answer) One 
    • (Now) One per linkage scope (i.e., per shared library in this case, or per executable for the main object files) 
Regards,
Tech Unravel
Supreme Debugging

October 4, 2011

[C++] Additional parentheses are always safe! Really?

Background


In general, a programmer is advised to add extra parentheses in the code "when in doubt", with the presumption that this is always safe. We will see an example here of how, in certain scenarios, it is not safe, and the programmer needs to be extra careful.

Consider this code (which does nothing but fill a map of class pointers versus class pointers and then walk it)

Sample 1
1:  #include <iostream>  
2:  #include <unistd.h>  
3:  #include <map>  
4:  using namespace std;
.
.
51:  int main()  
52:  {  
53:       const unsigned int MAX_COUNT = 20000;  
54:       map <I*,J*> contextHolder;  
55:       char *p[MAX_COUNT];  
.
.
. 
62:       map <I*,J*>::iterator itr = contextHolder.begin();  
63:       map <I*,J*>::iterator itrEnd = contextHolder.end();  
64:       while(itr != itrEnd)  
65:       {  
66:            if(someCondition)  
67:            {  
68:                 itr->first->f();  
69:                 itr->second->f();  
70:                 I* pTempI = itr->first;  
71:                 J* pTempJ = itr->second;  
72:                 delete pTempI;  
73:                 delete pTempJ;  
74:                 contextHolder.erase(itr);  
75:            }  
76:            itr++;  
77:       } 
.
.
.
82:       return 0;  
83:  } 


Now, let us discuss the erase call on line 74.
contextHolder.erase(itr);

This leads to a well-known problem - a crash:
  • Line 74 erases the element that itr points to, thereby invalidating the iterator.
  • Line 76 then increments the invalidated iterator, causing the crash.
To solve this problem, developers typically fold the iterator increment into the erase operation itself, like this:

Sample 2
64:       while(itr != itrEnd)  
65:       {  
66:            if(someCondition)  
67:            {  
68:                 itr->first->f();  
69:                 itr->second->f();  
70:                 I* pTempI = itr->first;  
71:                 J* pTempJ = itr->second;  
72:                 delete pTempI;  
73:                 delete pTempJ;  
74:                 contextHolder.erase(itr++);  
75:            }  
76:          else  
77:          {  
78:           //do something else.  
79:             itr++;  
80:          }  
81:       }  

The fix on line 74 (Sample 2 versus Sample 1) solves the problem by post-incrementing the iterator inside the erase call (the increment happens before the iterator is invalidated).

Detailed Description


All is well till now. But assume that an amateur programmer, confused about how the increment works in a complex statement like line 74 of Sample 2, adds an additional pair of parentheses around the incremented iterator.


Sample 3
64:       while(itr != itrEnd)  
65:       {  
66:            if(someCondition)  
67:            {  
68:                 itr->first->f();  
69:                 itr->second->f();  
70:                 I* pTempI = itr->first;  
71:                 J* pTempJ = itr->second;  
72:                 delete pTempI;  
73:                 delete pTempJ;  
74:          contextHolder.erase((itr++));  
75:            }  
76:          else  
77:          {  
78:           //do something else.  
79:             itr++;  
80:          }  
81:       }  

We see that the application crashes at line 74.

Analysis 


Now these seemingly innocent parentheses look to do more harm than good. This is a classic coding problem - the iterator is advanced beyond the boundary (i.e., itr.end()) and then the map tries to erase that element, "causing a crash".

So adding the extra parentheses leads to a disastrous bug - one that is hard to debug and difficult to reproduce. Most static-analysis tools will also fail to unearth the problem.

Takeaways 


  • These kinds of problems are very, very difficult to catch in any phase of the SDLC. Even a reviewer with immense experience will find it difficult to catch these kinds of issues.

  • It is best to share such unique project issues as technical learnings across the organisation, so that developers are sensitised (adj: having an allergy or peculiar or excessive susceptibility :)) to such potential problems in code.

  • It is best for the developer not to blindly follow "any rule", but to understand the concept behind it and how the compiler interprets the code - in this case, the order of evaluation within a complex statement.

Regards,
Tech Unravel
Supreme Debugging