
Monday, January 24, 2011

What is pipelining? Explanation with a simple example in VHDL.

   A pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next. In most cases we create a pipeline by dividing a complex operation into simpler operations. In other words, instead of processing one large piece of work in a single step, we break it into smaller pieces and process them one after another.

   If you are familiar with microprocessor architectures then you may know about instruction pipelining. Executing an instruction in a microprocessor involves several intermediate stages: fetching the instruction from memory, decoding it, fetching any other required data from memory, processing the data and finally writing the result back to memory. Without a pipeline, a single instruction has to go through all of these stages before the next instruction is fetched from memory. With pipelining, while one instruction is being fetched from memory, the previous instruction is already being decoded. Go through the wiki definition for instruction pipelining if you are interested in the background theory.

   In this article I am not going to implement an instruction pipeline. It is fairly complicated and I don't want to confuse readers who are just learning VHDL. The VHDL code below implements the equation (a*b*c*data_in) in the simple way, without pipelining. Note that a, b and c are constants here and the variable 'data_in' changes every clock cycle. The result of the calculation is available at the port named 'data_out'.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity normal is
port (Clk : in std_logic;
        data_in : in integer;
        data_out : out integer;
        a,b,c : in integer
     );
end normal;

architecture Behavioral of normal is

signal data,result : integer := 0;

begin

data <= data_in;
data_out <= result;

--process for calculation of the equation.
process(Clk)
begin
    if(rising_edge(Clk)) then
    --multiplication is done in a single stage.
        result <= a*b*c*data;
    end if;
end process;

end Behavioral;

The above code is simple and easy to understand. I have also written a pipelined version of the same design. Check it below:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity pipelined is
port (Clk : in std_logic;
        data_in : in integer;
        data_out : out integer;
        a,b,c : in integer
     );
end pipelined;

architecture Behavioral of pipelined is

--the loop variable 'i' is declared by the for loop itself, so no signal is needed for it.
signal data,result : integer := 0;
signal temp1,temp2 : integer := 0;

begin

data <= data_in;
data_out <= result;

--process for calculation of the equation.
process(Clk)
begin
    if(rising_edge(Clk)) then
    --Implement the pipeline stages using a for loop and case statement.
    --'i' is the stage number here.
    --The multiplication is done in 3 stages here.
    --All three assignments run on every rising edge; each one reads the
    --value registered on the previous edge, which forms the pipeline.
    --See the output waveform of both the modules and compare them.
        for i in 0 to 2 loop
            case i is
                when 0 => temp1 <= a*data;
                when 1 => temp2 <= temp1*b;
                when 2 => result <= temp2*c;
                when others => null;
            end case;
        end loop;
    end if;
end process;

end Behavioral;

So what have I done differently? Our design requires 3 multiplications, and in the normal version I did them all at once. In the above code I do them stepwise: the equation is broken down into 3 separate multiplications and the result of each one is registered, so a given input needs 3 clock edges to pass through all of them. If you are wondering about the difference between the two codes, see the RTL schematics of the two designs:

[Figure: RTL schematic of the normal code (without pipelining)]
[Figure: RTL schematic of the pipelined code; note the extra flip flops]
   As you can see, the normal code is implemented by connecting 3 multipliers in a cascaded fashion with a flip flop at the end stage. For the pipelined code, we have flip flops after each multiplier. What does this mean? The extra flip flops shorten the longest combinatorial path between any two registers from three multipliers to one, and hence the pipelined code can operate at a higher clock frequency than the normal code.
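
To make the frequency argument concrete, here is a rough worked example. The delay values are made-up assumptions chosen for easy arithmetic, not measurements from the two designs: each multiplier is assumed to take t_mul = 4 ns, and the register overhead (clock-to-output plus setup time) is assumed to be t_reg = 1 ns.

```latex
\begin{align*}
T_{normal}    &= 3\,t_{mul} + t_{reg} = 3(4\,\text{ns}) + 1\,\text{ns} = 13\,\text{ns}
  &\Rightarrow\quad f_{max} &= \frac{1}{13\,\text{ns}} \approx 76.9\,\text{MHz} \\
T_{pipelined} &= t_{mul} + t_{reg} = 4\,\text{ns} + 1\,\text{ns} = 5\,\text{ns}
  &\Rightarrow\quad f_{max} &= \frac{1}{5\,\text{ns}} = 200\,\text{MHz}
\end{align*}
```

Under these assumptions the pipelined version can accept a new input every 5 ns instead of every 13 ns, while a single result takes 3 x 5 ns = 15 ns to emerge instead of 13 ns.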

   The 'normal' code takes less time to write and is mostly straightforward. But if you want your design to offer the highest speed possible, you have to think outside the box! The 'pipelined' code is a little more complicated to write. In this case we had to use a case statement and a for loop to implement a small equation, but it gives higher speed. In large projects pipelining is very important for some blocks, because a single slow block can become a bottleneck for the performance of the whole design.

   On the other side, pipelined designs have a small disadvantage: they add latency, measured in clock cycles, between input and output. For instance, we have 3 stages in the pipelined code, so an output appears only 3 clock cycles after the corresponding input is applied. This usually doesn't matter in most designs, since after the first 3 clock cycles we get a continuous stream of outputs, one per clock cycle. The delay can be seen if you check the simulation waveforms of the two designs:
waveform for normal code: the output appears on the next clock edge, with no extra delay.

waveform for pipelined code: 3 clock cycle delay.
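
The 3 cycle latency can also be checked without reading waveforms. The following is a minimal sketch of a self-checking testbench, assuming the 'pipelined' entity above is compiled into the work library. The names 'latency_tb', 'done' and 'check' are my own for illustration, and the input value 7 is arbitrary.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical self-checking testbench; not part of the original designs.
entity latency_tb is
end latency_tb;

architecture sim of latency_tb is
    signal clk : std_logic := '0';
    signal done : boolean := false;
    signal data_in, data_out : integer := 0;
    signal a : integer := 1;
    signal b : integer := 2;
    signal c : integer := 5;
    constant clk_period : time := 10 ns;
begin
    uut: entity work.pipelined port map (Clk => clk,
        data_in => data_in,
        data_out => data_out,
        a => a,
        b => b,
        c => c
        );

    -- Free-running clock that stops toggling once 'done' is set.
    clk <= not clk after clk_period/2 when not done else '0';

    check: process
    begin
        -- Apply the input away from the rising edge to avoid a race.
        wait until falling_edge(clk);
        data_in <= 7;

        -- After two rising edges the value is still inside the pipeline.
        wait until rising_edge(clk);
        wait until rising_edge(clk);
        wait for 1 ns;
        assert data_out = 0
            report "result appeared too early" severity error;

        -- The third rising edge moves the product into 'result'.
        wait until rising_edge(clk);
        wait for 1 ns;
        assert data_out = 1*2*5*7
            report "expected 70, got " & integer'image(data_out) severity error;

        done <= true;
        wait;
    end process;
end sim;
```

If both assertions stay silent, the product a*b*c*data_in reached 'data_out' on the third rising clock edge after the input was applied, and not earlier.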

The testbench code used for testing the designs is given below. Remember to change the component name if you want to test the 'normal' entity.

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_unsigned.all;

-- entity declaration for your testbench. Don't declare any ports here
ENTITY test_tb IS
END test_tb;

ARCHITECTURE behavior OF test_tb IS

   --declare inputs and initialize them
   signal clk : std_logic := '0';
   signal data_in,data_out,a,b,c : integer := 0;
   -- Clock period definitions
   constant clk_period : time := 10 ns;

BEGIN

    -- Instantiate the Unit Under Test (UUT)
     --Change the entity name below if you want to test the 'normal' entity.
   uut: entity work.pipelined port map (Clk => clk,
        data_in => data_in,
        data_out => data_out,
        a => a,
        b => b,
        c => c
        );

    a <= 1;
    b <= 2;
    c <= 5;
   -- Clock process definitions (a clock with 50% duty cycle is generated here).
   clk_process :process
   begin
        clk <= '0';
        wait for clk_period/2;  --for 5 ns signal is '0'.
        clk <= '1';
        wait for clk_period/2;  --for next 5 ns signal is '1'.
   end process;
   -- Stimulus process
  stim_proc: process
  begin
        wait for clk_period;
        data_in <= 1;
        wait for clk_period;
        data_in <= 2;
        wait for clk_period;
        data_in <= 3;
        wait for clk_period;
        data_in <= 4;
        wait for clk_period;
        data_in <= 5;
        wait for clk_period;
        data_in <= 6;
        wait for clk_period;
        data_in <= 7;
        wait for clk_period;
        data_in <= 8;
        wait for clk_period;
        data_in <= 9;
        wait;  --stimulus finished; suspend this process so it does not restart.
  end process;

END behavior;

Note:- The codes were designed and tested using the Xilinx Webpack version 12.1. The codes are also synthesisable. They should work with other tools too.

If the design is complex, always identify the smaller steps it can be broken down into, and implement them using the pipeline concept. This increases the maximum clock frequency and the throughput of the system, and it can also shorten the time the tools need to synthesise the design.
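
As one more illustration of breaking a calculation into steps, here is a small sketch that pipelines the sum of four inputs in two stages. The entity 'sum4_pipelined' and all of its port and signal names are hypothetical; they are not part of the designs above.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical example: (w+x)+(y+z) computed in two pipeline stages.
entity sum4_pipelined is
port (Clk : in std_logic;
        w,x,y,z : in integer;
        sum_out : out integer
     );
end sum4_pipelined;

architecture Behavioral of sum4_pipelined is

signal s1,s2 : integer := 0;  --stage 1 registers (partial sums)
signal total : integer := 0;  --stage 2 register (final sum)

begin

sum_out <= total;

process(Clk)
begin
    if(rising_edge(Clk)) then
        --stage 1: the two partial sums are computed in parallel.
        s1 <= w + x;
        s2 <= y + z;
        --stage 2: uses the partial sums registered on the previous edge.
        total <= s1 + s2;
    end if;
end process;

end Behavioral;
```

Here the longest register-to-register path contains one adder instead of two, at the cost of a 2 clock cycle latency. Whether that trade is worthwhile depends on the design and the target device.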


  1. Hi, thank you for your blog.
    I love it and always follow your articles.

    I am a newbie in VHDL.
    I wrote code that has 3 stages:
    in the first stage I use a generate block,
    in the second stage I use a component (port map),
    and in the last stage I again use a generate block.
    I want to use a pipeline.
    If I am right, this code (generate block and port map) can't go inside a process.
    What can I do to pipeline my code?
    I apologize for my English ;)

  2. @mohammed: You have got the wrong idea here. By stages in pipelining I didn't mean "generate", "component" etc. as stages.

    A pipeline is implemented inside process statements. Pipelining is concerned only with the logic.

    "component" etc. are declarations that tell what the program contains. It's like saying a man has 2 hands and 2 legs. It doesn't tell whether they are long, fat or black.

    In a VHDL program we define all these kinds of things (I mean what you want to do, the logic) inside the process statement, and a pipeline is used in such a case.

    If you want to use a pipeline you first have to see whether you need it or not. If yes, break down your logic into different stages. There is no single pipelining strategy. It all depends on the design.

    I am not sure whether I have clarified your doubt. Thanks for following my blog.

  3. thank you for your reply

    Maybe I did not explain my problem well.
    I have 3 sections that work separately (I use a SIG signal), and when data is driven into my design it waits until the result is driven out before accepting new data. If I remove the SIG signal from my code, will a flip flop be added at the end of each section (so that it works like a pipeline)?
    To make this clearer I will show you the code.
    My code is:
    Process (CLK) is
    If (Falling_Edge(CLK)) Then
    WHEN "00" =>
    WHEN "01" =>
    WHEN "10" =>
    WHEN "11" =>
    End If;
    End Process;

    ---------------------------------------- stage1 ------------------------------------------
    Stage1: for i in x'Range Generate
    Process (SIG) is

    if (SIG="01") then
    -- do something here and assign conclusion to A(i)

    A(i) <= (others => '0');

    end if;
    end process;

    end generate;
    ---------------------------------------- stage2 --------------------------------------------
    stage2: for i in x'Range Generate
    SF: Sigmoid_Function port map (A => A(i), B => B(i),CLK => CLK ,SIG=>SIG);
    end generate;
    ------------------------------------------ stage3 -----------------------------------------------
    stage3:process(CLK) is
    if ( Rising_Edge(CLK) and SIG = "10" ) then
    --use B(i) do something here
    end if;
    end process stage3;

  4. @mohammed: your code has many basic errors. I suggest you learn basic VHDL before going for a pipelined design. Pipelining can be understood and applied only after understanding the VHDL basics properly.

  5. Thank you for your help.

    I know it is not right to take up your time.
    This code works for me, but you are right.
    Would you tell me where my basic errors are?
    I can't find them myself.

    Thank you so much.

  6. @mohammad : Inside the case statement you are checking the SIG signal and changing it too.
    Similarly, you have used a generate statement with a process statement. I am not sure it will work.

    How do you know your code is working? Is it getting synthesised?

  7. Yes, it synthesized with no errors.

    (For the SIG signal) I simulated the code separately (ModelSim 6.5) and synthesized it (ISE 11.4).
    As for using a generate statement with a process statement, it synthesized with no errors!

    My problem with the pipeline is solved.
    I was wrong in my code (I had a mistake inside the process).

  8. I would love to see "Pipelining part 2"! Maybe You have books/manuals/other info about pipelining in VHDL?

    P.S. Why don't You use numeric_std?

  9. I don't see the point of the loop and case statement. Remove them and you'll get the same results but with drastic simplification. Something like this:

    if rising_edge(Clk) then
    temp1 <= a*data;
    temp2 <= temp1*b;
    result <= temp2*c;
    end if;

  10. @foam : Yes, I agree. But I wanted to write code that is easy to understand. Simply writing the 3 statements may not get the idea across to readers. By using the case statement and for loop I wanted to show the step by step working.

  11. I don't understand how your pipelining example increased system performance. In both of your waveforms, "data_out" is available on the following clock edge. But the pipelined version has a 3-cycle delay, so the normal version is faster.

  12. @shaunee: In terms of the number of clock cycles the normal version is faster. But once you synthesise and check the delays (or the maximum frequency of the circuit) you can see that the register-to-register delay in the pipelined circuit is smaller. This means the pipelined circuit can run with a higher clock frequency, and so it is faster.

  13. So is trial-and-error recoding and resynthesis the only way to tell how many pipeline stages you should use, and/or how much these stages increase system performance?

  14. @shaunee : Not trial and error. Normally, the more stages you add to a pipeline, the more the system speed increases. It is up to you how many stages you break a complicated single step calculation or process into.

  15. Hello. I am wondering: what if
    I have 16-bit input data? How can I modify the code? Do I need to declare each of a, b and c as 16 bit too? This question may seem lame but I am a newbie. Pardon me.
    Good day.

  16. Hello.. Can anyone send me the code for testing of SliceM in a single clb.........

  17. Hello,

    I don't understand the actual need for the use of pipelining...

    In the example, you multiply the signals: a*b*c*data (32 bit each) - Suppose that our clock runs at a high enough frequency. It might be that the product of the combinatorial circuit (a*b*c*data) won't be available soon enough (before the sampling clock edge) - This I understand well.

    Now, what I don't understand:
    after some delay and "settling time" - the correct result will be there. It will be "glitchy" - but only at start!
    After the "settling time" the product should be stable!

    So, why should we pipeline?
    Please help me understand

  18. Hi all,
    The idea of pipelining is to process smaller amounts of data in shorter periods. In this example you divide the product operation into three stages. Once the operation is divided into stages, the device can process each stage faster than the whole operation at once. The result of each stage is stored (or registered) so it will be available for the next stage. You may experience an increased delay before you get the output, but this is not as important as increasing the operating frequency and throughput of the (overall) design. The increased throughput is, again, due to the presence of registers: instead of waiting for the whole operation to complete before starting another, pipelining allows you to start another operation before the previous one is completed. I hope this helps a little to understand the concept of pipelining, because I myself was confused about it not so long ago.


  19. I've thought up an example for anyone still wondering about the use of pipelining. Imagine you have the following circuit, performing a calculation on a stored value and storing the result:

    flipflop ---> combinatorial logic (10 gates) ---> flipflop

    Let's say the combinatorial logic is a sequence of 10 logic gates, each introducing a delay of 1 ns. This means that it takes 10 ns for a change in the first flipflop to reach the second, and therefore you can only clock this circuit at 100 MHz. It takes 1 clock cycle, or 10 ns, for data to reach the output, and you have a throughput of 1 value per cycle, or 1 value per 10 ns.

    Now, split the combinatorial logic in half:

    flipflop ---> logic (5 gates) ---> flipflop ---> logic (5 gates) ---> flipflop

    It now takes 5 ns for data from the first flipflop to reach the middle, and 5 from the middle to the end. The key is that these two paths are now independent, so you can clock this circuit at 200 MHz. The latency is now 2 clock cycles, but this is still 10 ns. In addition, you still get 1 value out per cycle, so your throughput has doubled to 1 value per 5 ns. Pipelining this circuit has doubled the clock speed and throughput without affecting the latency.

    In reality, the flipflops also introduce a delay, which means that the latency of the pipelined example will be greater than the non-pipelined. You will still get increased throughput, however.

  20. Oh god..!
    When you give an input you get the correct result at output 10ns later. But you can give inputs at each 5ns! Got it, thanks!

  21. Hey there,

    I tried your code, but when I try the pipelined version the RTL schematic is the same as for the simple version. I don't have registers between each multiplier. Do you know what the problem is?

  22. We want to produce an array containing the
    distances from each element in memory1 to the element holding the same
    value in memory2, i.e. searching for similarity. The RAMs work well
    and the state machine synthesizes. I posted my code previously, assuming each
    element in the memory is std_logic_vector(0 to 7), with addresses from 0
    to 255. If the element 00010100 in mem1 has address 2 and the same
    element in mem2 has address 10, the address difference |2-10|=8
    is the distance, and so on.
    I hope you understand me. Any ideas, please?

