RISC-V public beta platform released · FFTW porting and performance comparison

Sandor · September 14, 2023, 4:01pm

Introduction to FFTW

FFT (Fast Fourier Transform) is a super classic algorithm for digital signal processing. It is used in the fields of quantum physics, spectral analysis, audio and video stream signal processing, oil exploration, earthquake prediction, weather forecast, probability theory, coding theory, medical fault diagnosis and other fields. Has a wide range of applications. In mathematics, the Fourier series is a way of representing wave-like functions as simple sine waves.

There are many fast Fourier transform algorithms, including open source libraries such as FFTW, and some chip companies have also designed FFT optimization libraries for their own chips, such as Intel’s MKL_FFT and Huawei’s KML_FFT.

Stop, there's a question here?

Since there is a universal, open source and free library like FFTW, why do chip companies still spend money to develop their own FFT optimization libraries?

Through this article, maybe you will know something about it.

FFTW (Fast Fourier Transform in the West) is an open source library for Fast Fourier Transform (FFT). It is released under a free software license (similar to GPL). You can download it on its official website or code repository Find the latest version of FFTW and use it according to the terms of the license. It can be used to perform efficient numerical calculations on a variety of computing platforms. FFTW can greatly improve the calculation speed of FFT by using pre-decomposition algorithm and caching technology. Compared with the traditional FFT algorithm, FFTW can save several times the calculation time. FFTW can be used not only for scientific computing and signal processing, but also for image processing, audio processing and other fields. The excellent performance of FFTW makes it an indispensable tool in many scientific computing and engineering applications. Official website address: http://fftw.org/ github address: GitHub - FFTW/fftw3: DO NOT CHECK OUT THESE FILES FROM GITHUB UNLESS YOU KNOW WHAT YOU ARE DOING. (See below.)

This experiment only moves FFTW to compile and run on the RISC-V hardware platform. The article compares the single-core performance of SG2042 and FT2000, just to establish an optimization benchmark first, and does not represent the performance gap between the two chips. . Usually the performance of some software tests is related to the following factors:

The hardware computing power level of the chip
Compiler (RISC-V’s GCC belongs to the first generation version)
Compilation options (especially whether SIMD acceleration is enabled)
Other factors such as operating system and dependent libraries

2. Platform environment

[Hardware parameters]

Processor: SG2042 x 1
Number of cores: 64 cores
L1 Cache: I:64KB and D:64KB
L2 Cache: 1MB/Cluster
L3 Cache: 64MB System Cache
DRAM: DDR4 16Gx4

[Software Environment]

linux version: 22.10
gcc version: 10.2.0

3. Installation process

We download the latest program package from the official website. The latest version is: 3.3.10 Download link: FFTW Download Page
The file we downloaded is: fftw-3.3.10.tar.gz
Unzip:

ubuntu@perfxlab:~$ tar zxvf fftw-3.3.10.tar.gz

Configuration:

ubuntu@perfxlab:~/fftw-3.3.10$ ./configure --prefix=/usr/fftw3

Install:

ubuntu@perfxlab:~/fftw-3.3.10$ make install

Configure environment variables:
After installation, we need to configure relevant environment variables, open ~/.bashrc, and add the following code at the end of the file:

export LD_LIBRARY_PATH=/usr/fftw3/lib:$LD_LIBRARY_PATH

Verify that the installation is successful:
Query FFTW version:

ubuntu@perfxlab:~/fftw-3.3.10$ fftw-wisdom --version

The installation is successful when the following information appears:

fftw-wisdom tool for FFTW version 3.3.10.

Copyright (c) 2003, 2007-14 Matteo Frigo
Copyright (c) 2003, 2007-14 Massachusetts Institute of Technology

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

Let’s write a simple test code to verify whether the installation is successful:

#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>

int main() {
    int N = 8; // signal length
    fftw_complex *in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);

    fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    //Initialize input signal
    for (int i = 0; i < N; i++) {
        in[i][0] = i; // real part
        in[i][1] = 0; // imaginary part
    }
    //Perform Fourier transform
    fftw_execute(plan);

    // print results
    printf("Input Signal:\n");
    for (int i = 0; i < N; i++) {
        printf("%f + %fi\n", in[i][0], in[i][1]);
    }

    printf("Transformed Signal:\n");
    for (int i = 0; i < N; i++) {
        printf("%f + %fi\n", out[i][0], out[i][1]);
    }

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);

    return 0;
}

If you want to test the float type: you need to modify the following places:
Change fftw_complex to fftwf_complex and fftw_plan_dft_1d to fftwf_plan_dft_1d.

compile options

The following is the compiled double type. If you compile float, you need to use -lfftw3f.
SG2042 compilation option (note that FFTW currently does not have a simd version that supports risc-v)

-march=rv64gcv0p7_zfh_xtheadc -mabi=lp64d -mtune=c920 -O3

FT2000 compile options

-Wall -Wextra -funroll-loops -O3 -DNDEBUG -DNOPEN_HASWELL_OPT

Compile code:

ubuntu@perfxlab:~/cheng$ gcc -o fftw_test testfftw.c -lfftw3 -lm

The results of executing the program are as follows:

ubuntu@perfxlab:~/cheng$ ./fftw_test

Input Signal:

0.000000 + 0.000000i
1.000000 + 0.000000i
2.000000 + 0.000000i
3.000000 + 0.000000i
4.000000 + 0.000000i
5.000000 + 0.000000i
6.000000 + 0.000000i
7.000000 + 0.000000i
Transformed Signal:
28.000000 + 0.000000i
-4.000000 + 9.656854i
-4.000000 + 4.000000i
-4.000000 + 1.656854i
-4.000000 + 0.000000i
-4.000000 + -1.656854i
-4.000000 + -4.000000i
-4.000000 + -9.656854i

Okay, the result is correct. The installation and testing have been completed and you can use it happily.

4. Performance testing and comparison

Below, we will evaluate the performance of SG2042 based on the FFTW open source project and make a simple comparison with FT2000.

The test code is as follows
We mainly tested the performance of the core function fftw_execute in the FFTW (Fastest Fourier Transform in the West) library. We evaluated the performance of the fftw_execute function by executing it multiple times and taking the average, and the results were measured in GFlops (giga floating-point operations per second). in_place and out_place represent whether the same memory is used.

#include <stdio.h>
#include <time.h>
#include <fftw3.h>
#include <math.h> // Include math.h to use the log2 function

int main() {
    int signal_lengths[] = {32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384}; // Different signal lengths
    int num_lengths = sizeof(signal_lengths) / sizeof(signal_lengths[0]);


    for (int i = 0; i < num_lengths; i++) {
        int N = signal_lengths[i];
        fftw_complex *in = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N);
        fftw_complex *out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N);

        // Fill in input data (using double precision floating point numbers)
        for (int j = 0; j < N; j++) {
            in[j][0] = 1.0; // real part (double)
            in[j][1] = 0.0; // imaginary part (double)
        }

        double total_elapsed_time = 0.0;
        int num_iterations = 20; // Each length signal is executed 20 times
        //Perform the Fourier transform of a double-precision floating point number
        fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
        for (int j = 0; j < num_iterations; j++) {
            clock_t start_time = clock();
            fftw_execute(plan);
            clock_t end_time = clock();
            double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
            total_elapsed_time += elapsed_time;
        }
        fftw_destroy_plan(plan);
        double average_time = total_elapsed_time / num_iterations;

        // Calculate FLOPS (floating point operations per second)
        double flops = 5.0 * N * log2(N) / average_time; // Assume that the complexity of the FFT algorithm is 5*N*log2(N)

        // Calculate GFLOPS (gillions of floating point operations per second)
        double gflops = flops / 1e9;

        printf("Average FFTW execution time (length %d): %f seconds GFLOPS: %f\n", N, average_time, gflops);

        fftw_free(in);
        fftw_free(out);
    }

    return 0;
}

If you want to test the float type: you need to modify the following places:
Change fftw_complex to fftwf_complex and fftw_plan_dft_1d to fftwf_plan_dft_1d.

compile options
The following is the compiled double type. If you compile float, you need to use -lfftw3f.
SG2042 compilation option (note that FFTW currently does not have a simd version that supports risc-v)

-march=rv64gcv0p7_zfh_xtheadc -mabi=lp64d -mtune=c920 -O3

FT2000 compile options

-Wall -Wextra -funroll-loops -O3 -DNDEBUG -DNOPEN_HASWELL_OPT

Performance graphs and comparisons
The data type is float in_place:

5. Brief summary

The experimental process of migrating FFTW to the RISC-V processor SG2042 is relatively simple, thanks to the continuous improvement of the RISC-V software ecosystem. But there are still some problems:

Judging from the above test results, when the data type is float, the performance of SG2042 is significantly different from that of FT2000; when the data type is double, as the signal length increases, the performance gap gradually decreases and approaches.
FFTW currently does not support the Vector version of RISC-V, and there is still great performance potential to be tapped. We either wait for the FFTW release to support RISC-V; or someone develops a RISC-V FFT optimization algorithm library by themselves. Back to the question at the beginning of the article,
“Since there is a universal, open source, and free library like FFTW, why do chip companies still spend money to develop their own FFT optimization libraries?” In fact, there are many similar ones like this. Computing libraries have similar problems.
End of text

Translated From: RISC-V公测平台发布 · FFTW的移植和性能对比