
AOCL 2.2 changes - Majorly include LAPACK 3.9.0 support#36

Open
rsanagap wants to merge 1485 commits into flame:master from amd:master

rsanagap commented Jul 5, 2020

Field,

Please review and merge them.

Thanks

samahmad and others added 30 commits October 9, 2024 02:48
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-5556
Change-Id: I774e1a39b12feb6aef65d964704017aea7a45579
Change-Id: I74067c461c68cff55c6b5f02a9080e3df9352b9e
-fopenmp was missing from the compiler flags. Added the flag to both the automake and CMake configuration files.

CPUPL-5677
Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: I444abc4ce35dbbce1b45bdde27faf2a5f3f250f2
  1. Enabled multithreading in DGETRF when both M and N are greater than 160.
  2. Set the default and maximum thread count to 64.
  3. 32 threads gave the best benchmarking results, so inputs in the 160 to 6500 range use 32 threads.

AMD Internal : [CPUPL-4978]

Change-Id: I7c68982bfe295e95563f40d77547ef101fbd97e2
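To make the three rules above concrete, here is a minimal C sketch of a thread-count selector. The function and macro names are hypothetical, and two details are assumptions not stated in the commit: both checks are keyed on min(M, N), and sizes above 6500 fall back to the default of 64 threads.

```c
/* Hypothetical sketch of the DGETRF thread-count heuristic described above.
 * Keying both checks on min(M, N) and using 64 threads above 6500 are
 * assumptions; this is not libflame's actual code. */
#define GETRF_MT_MIN_DIM  160   /* multithread only when M and N both exceed this */
#define GETRF_MT_MID_DIM  6500  /* upper end of the range that uses 32 threads */
#define GETRF_MID_THREADS 32    /* best-performing count in the 160..6500 range */
#define GETRF_MAX_THREADS 64    /* default and maximum thread count */

/* Returns the number of threads for an M x N DGETRF, never more than
 * `available` (the threads the caller can actually use). */
int getrf_thread_count(int m, int n, int available)
{
    int d = m < n ? m : n;
    int wanted;

    if (d <= GETRF_MT_MIN_DIM)
        return 1;                       /* small input: stay single-threaded */
    wanted = (d <= GETRF_MT_MID_DIM) ? GETRF_MID_THREADS : GETRF_MAX_THREADS;
    if (available < 1)
        available = 1;
    if (wanted > available)
        wanted = available;             /* never exceed what the runtime allows */
    return wanted;
}
```

A caller would typically pass something like omp_get_max_threads() as `available`, so the heuristic never requests more threads than the OpenMP runtime permits.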
AOCL-LAPACK Version upgraded from 4.2.1 to 5.0.1
Signed-off-by: tprnaidu <tprnaidu@amd.com>

Change-Id: I066ae2d790f55b135ac00b868ef9acd03ba1762d
Gains of up to 15% were observed for the following block size/input size
combinations on the lib-genoa-05 machine:

1. For the float datatype, block sizes that work well are:
   a) 60 for sizes greater than 3000 x 3000
   b) 48 for sizes greater than 1000 x 1000
   c) 24 for all other cases
2. For the double datatype, block sizes that work well are:
   a) 64 for sizes greater than 2000 x 2000
   b) 60 for sizes greater than 1000 x 1000
   c) 32 for all other cases
3. For the complex datatype, block sizes that work well are:
   a) 64 for sizes greater than 600 x 600
   b) 48 for sizes greater than 100 x 100
   c) 32 for all other cases
4. For the double complex datatype, block sizes that work well are:
   a) 64 for sizes greater than 400 x 400
   b) 48 for sizes greater than 100 x 100
   c) 32 for all other cases

Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-5876

Change-Id: Ide6acd9d41faafac6a84bdd18d38849789636fbe
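The block-size table above maps naturally onto a small ordered rule list per datatype. The C sketch below is illustrative only: the function names are hypothetical, and reading "size greater than K x K" as min(M, N) > K is an assumption.

```c
/* Hypothetical per-type block-size selection built from the measurements
 * above. Not libflame's code; names are illustrative. */
struct bs_rule {
    int min_dim; /* rule applies when min(M, N) > min_dim */
    int block;   /* block size to use */
};

/* Return the block size of the first matching rule; rules are ordered
 * from the largest threshold down. */
static int pick_block(const struct bs_rule *rules, int count, int fallback,
                      int m, int n)
{
    int d = m < n ? m : n;
    for (int i = 0; i < count; ++i)
        if (d > rules[i].min_dim)
            return rules[i].block;
    return fallback; /* "all other cases" */
}

int sblock_size(int m, int n) /* float */
{
    static const struct bs_rule r[] = {{3000, 60}, {1000, 48}};
    return pick_block(r, 2, 24, m, n);
}

int dblock_size(int m, int n) /* double */
{
    static const struct bs_rule r[] = {{2000, 64}, {1000, 60}};
    return pick_block(r, 2, 32, m, n);
}

int cblock_size(int m, int n) /* complex */
{
    static const struct bs_rule r[] = {{600, 64}, {100, 48}};
    return pick_block(r, 2, 32, m, n);
}

int zblock_size(int m, int n) /* double complex */
{
    static const struct bs_rule r[] = {{400, 64}, {100, 48}};
    return pick_block(r, 2, 32, m, n);
}
```

Keeping the thresholds in data tables makes it easy to retune them for another machine without touching the selection logic.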
… multi-thread is enabled

CPUPL-5677
Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: Iaa1b79ca8e7e7c15df31b02e7fef450eb530a4e7
Added LAPACKE interface testing support in the main test suite
for GBTRF and GBTRS APIs.

Signed-off-by: Sridhar Govindaswamy <Sridhar.Govindaswamy@amd.com>
Change-Id: I97b7aaf8b697a1d1779bbebb69dc7cec9348d1f8
Corrected the condition checks for invalid input parameters in the
sgesdd_fla_check and dgesdd_fla_check APIs. The ldvt check is now based
on jobvt instead of jobu.

Modified the jobz comparison logic to be case-insensitive.

AMD-Internal: CPUPL-5889
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
Change-Id: Ib3fed47fbd05163aa2b496cec6cd59cd79156880
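The commit does not show the corrected condition. As an illustration only, the sketch below validates LDVT against JOBVT using the requirements the reference LAPACK documentation gives for ?GESVD (LDVT >= 1; LDVT >= N if JOBVT = 'A'; LDVT >= min(M, N) if JOBVT = 'S'), and compares option characters case-insensitively in the spirit of LAPACK's LSAME. Both function names are hypothetical.

```c
#include <ctype.h>

/* Case-insensitive comparison of a LAPACK-style option character. */
static int same_option(char c, char expected)
{
    return toupper((unsigned char)c) == toupper((unsigned char)expected);
}

/* Hypothetical check: validate LDVT based on JOBVT (not JOBU).
 * Returns 0 if LDVT is acceptable, -1 otherwise. */
int check_ldvt(char jobvt, int m, int n, int ldvt)
{
    int minmn = m < n ? m : n;

    if (ldvt < 1)
        return -1;
    if (same_option(jobvt, 'A') && ldvt < n)
        return -1;                     /* all N rows of VT^T are stored */
    if (same_option(jobvt, 'S') && ldvt < minmn)
        return -1;                     /* only the first min(M, N) rows */
    return 0;
}
```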
Update License text with latest Third Party Notices

Change-Id: I79589b0821cab802157406415d49d898a7a83d2f
This reverts commit 70132a4.

Change-Id: I1002f00295f77caf2097ff67bf1b51889067bf96
NOTICES file with Third Party Licence information

Change-Id: I8b91edd3481bef3e5378a2c91e67c1b4eb81d1b1
APIs from AOCL-Utils to get ISA information have changed in
4.2 release and the ones used in libflame have been deprecated.
Use the latest AOCL-Utils API.

AMD Internal : [CPUPL-5906]

Change-Id: I84d576f9749a399aea23d96b5d2d636497bed540
Previously, these APIs were just wrappers around sgelss/dgelss.
The gelsd APIs provide significant performance gains over the gelss
APIs.

Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-5766
Change-Id: I867c107816fcd50fb16a890ca5947c6b9ff80e3d
Corrected an error in the calculation of the length parameter when
handling values smaller than safe-min in the norm calculation.

Also fixed the usage of lda in a macro nested inside another macro.

Signed-off-by: Vasanthakumar R <varajago@amd.com>
AMD-Internal: SWLCSG-3226, SWLCSG-3217
Change-Id: Ic75a10aa7020f043e71f1501bb4219f4498be901
   1. Optimized the LAPACK_GETRI_SMALL_D_3x3 kernel by eliminating repeated internal memory loads/stores

AMD Internal : [CPUPL-5865]

Change-Id: Ib4e33cccdc4a78820014a676bcb9808f4c685797
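The commit only states that repeated loads/stores were removed. The sketch below illustrates that pattern on a plain cofactor-based 3x3 inverse: every element is loaded into a local once and every result is stored once. Note that ?GETRI itself works from the LU factors produced by ?GETRF, so this hypothetical `inv3x3` is not a drop-in replacement for the kernel.

```c
/* In-place inverse of a 3x3 column-major matrix A, A(i,j) = a[i + j*lda].
 * Returns 0 on success, 1 if A is exactly singular (A is left unchanged). */
int inv3x3(double *a, int lda)
{
    /* Load each element exactly once. */
    double a00 = a[0],       a10 = a[1],           a20 = a[2];
    double a01 = a[lda],     a11 = a[1 + lda],     a21 = a[2 + lda];
    double a02 = a[2 * lda], a12 = a[1 + 2 * lda], a22 = a[2 + 2 * lda];

    /* Cofactors C_ij = (-1)^(i+j) * minor_ij. */
    double c00 = a11 * a22 - a12 * a21;
    double c01 = -(a10 * a22 - a12 * a20);
    double c02 = a10 * a21 - a11 * a20;
    double det = a00 * c00 + a01 * c01 + a02 * c02; /* expand along row 0 */
    if (det == 0.0)
        return 1;
    double c10 = -(a01 * a22 - a02 * a21);
    double c11 = a00 * a22 - a02 * a20;
    double c12 = -(a00 * a21 - a01 * a20);
    double c20 = a01 * a12 - a02 * a11;
    double c21 = -(a00 * a12 - a02 * a10);
    double c22 = a00 * a11 - a01 * a10;

    /* inv(A)(i,j) = C_ji / det; store each result exactly once. */
    double r = 1.0 / det;
    a[0]       = c00 * r; a[1]           = c01 * r; a[2]           = c02 * r;
    a[lda]     = c10 * r; a[1 + lda]     = c11 * r; a[2 + lda]     = c12 * r;
    a[2 * lda] = c20 * r; a[1 + 2 * lda] = c21 * r; a[2 + 2 * lda] = c22 * r;
    return 0;
}
```

Keeping all nine inputs in locals lets the compiler hold them in registers, which is the kind of saving the commit describes.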
… major

Fixed the init_matrix_from_file() API to read the matrix from the input file in
column-major format.

Added a fix for the gtsv AOCC issue.

AMD-Internal: CPUPL-5922
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
Change-Id: I159583837028dd11ee1dbc844b0d495d0c1be1fc
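As an illustration of the column-major fix, the hypothetical reader below assumes the input file lists the matrix row by row (the real file format is not shown in the commit) and stores element (i, j) at a[i + j*lda], the column-major layout LAPACK routines expect.

```c
#include <stdio.h>

/* Hypothetical reader: the file lists an m x n matrix row by row; entries are
 * stored column-major, A(i,j) = a[i + j*lda].
 * Returns 0 on success, -1 on bad arguments or short input. */
int read_matrix_col_major(FILE *fp, int m, int n, double *a, int lda)
{
    if (fp == NULL || a == NULL || m < 0 || n < 0 || lda < (m > 1 ? m : 1))
        return -1;
    for (int i = 0; i < m; ++i)          /* file order: row i ... */
        for (int j = 0; j < n; ++j)      /* ... column j */
            if (fscanf(fp, "%lf", &a[i + (size_t)j * lda]) != 1)
                return -1;
    return 0;
}
```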
Added initialization of the first column/row of U.
A previous optimization that removed this initialization
caused test failures in a few performance cases.
Corrected the sign modification of Vt in 2x2 cases.

Signed-off-by: Vasanthakumar R <varajago@amd.com>
AMD-Internal: CPUPL-6074
Change-Id: I19b7044029b4580013ea7decb58c3500097d0f88
- Overflow/underflow tests for sygvd/hegvd
- Memory leak fixes for sygvd test cases
- Enabling lapacke interfaces for sygvd/hegvd

Change-Id: I79ec9c009e6ba52df17bc6247bb726e60193d5ed
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-6037
… cases

Fixed the copy matrix sizes in validate_gesdd/gesvd to avoid out-of-bounds
memory access while testing the following corner cases:
for gesdd, jobz = O with m >= n and ldu = 1, or m < n and ldvt = 1
for gesvd, jobu/jobvt = O with ldu/ldvt = 1

Fixed gtsv test2 in validate_gtsv(). Scaled down the residual
by a factor of 10 so that it falls within the expected threshold range,
since the input matrix Xact is randomly generated.

AMD-Internal: CPUPL-5926
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
Change-Id: Ia99bb6d81b76de394265ffded0069fb440de979f
details: Datatype alignment changes for the structures used in test suite
Signed-off-by: ksaithar <katteboina.saitharun@amd.com>
Change-Id: I08ce86f5d642189b6f9142c74af41a633415b1f3
Components added:
1. Test run/validation
2. Negative test cases
3. Extreme test cases
4. Overflow/Underflow tests
5. LAPACKE tests

Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-5903

Change-Id: I38fa28ac0216740e0669e41509ca7870fd3adab8
1. Moved the block size computation to a separate function for each of the
   4 types.
2. Optimal block sizes for various input sizes change as OMP_NUM_THREADS
   varies. Optimal block sizes based on input size ranges are set only
   when OMP_NUM_THREADS=1.
3. For small sizes, the un-optimized path is taken, because the
   optimized path shows regressions due to the overhead of OpenMP
   calls.

Gains for single-threaded runs: up to 15% on Genoa and 28%
on Turin.

Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-5876

Change-Id: I8fdeccdf0debdacec3913f8192711d86e9d62314
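The three dispatch rules above can be sketched as follows. This is a hypothetical helper, not libflame's code: the small-size cutoff and both block values are caller-supplied placeholders because the commit does not state them, and an unset or unparsable OMP_NUM_THREADS (including list values such as "4,2") is conservatively treated as "not single-threaded".

```c
#include <stdlib.h>

/* Interpret an OMP_NUM_THREADS value. Returns the thread count, or 0 if the
 * value is unset or not a single positive integer. */
static int parse_num_threads(const char *s)
{
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return 0;
    v = strtol(s, &end, 10);
    if (*end != '\0' || v < 1 || v > 4096)
        return 0;
    return (int)v;
}

/* Returns 0 to select the un-optimized path, otherwise the block size. */
int select_blocking(int m, int n, const char *omp_num_threads,
                    int small_cutoff, int tuned_block, int default_block)
{
    int d = m < n ? m : n;

    if (d <= small_cutoff)
        return 0;                    /* small: OpenMP overhead would dominate */
    if (parse_num_threads(omp_num_threads) == 1)
        return tuned_block;          /* size-tuned blocks only for 1 thread */
    return default_block;
}
```

A caller would pass getenv("OMP_NUM_THREADS") as the third argument; taking the string as a parameter keeps the selection logic testable without modifying the environment.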
Port Netlib Lapack 3.12 FORTRAN code to C files for double precision APIs

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-5708]
Change-Id: If2ddc85a9ad0818c96155945340b9cea23b40c8e
- Added AVX2, AVX-512 and parallel versions of sgetrf

Change-Id: I724cc5c9bf98f42014bcaf680a2fa7373195f10d
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-6060
Following features are implemented in this commit:
1. The library and include paths for aocl-utils and BLIS can be inferred automatically while building libflame if pkg-config files
for these libraries are available. This works only on Linux for now.
2. Various CMake configure/build/install/test/workflow presets.
3. CMake presets for Windows (MSVC and Ninja). Test presets do not work yet.
4. Minimum CMake version upgraded to 3.26.0.

Preset names follow the convention: <os>-<make/ninja>-<compiler>-<st/mt>-<lp/ilp>-<static/shared>-<isa-mode>-<other optional commands>

Usage:
$ cmake --build --list-presets

-- Without aocl-utils pkgconfig file
$ cmake --preset {chosen-preset} -DLIBAOCLUTILS_INCLUDE_PATH={aoclutils header path} -DLIBAOCLUTILS_LIBRARY_PATH={aoclutils library path}

-- With aocl-utils pkgconfig file
$ cmake --preset {chosen-preset}

$ cmake --build --preset {chosen-preset}

Build and test workflow

-- If aocl-utils and blis pkg-config files are available
$ cmake --workflow --preset {chosen-preset}

More info in BUILD.md

Change-Id: I8bc54b3eabed9a18c305e911df9aa76d8ff746d0
Signed-off-by: samahmad <Sameer.Ahmad@amd.com>
AMD-Internal: CPUPL-5862
Ported the double-precision Fortran files newly added in Netlib LAPACK 3.12 to C.
Files added: dgedmd.c, dgedmdq.c, dgeqp3rk.c, dlaqp2rk.c, dlaqp3rk.c.
Netlib tests for LAPACK 3.12 are included.

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-5708]
Change-Id: I60b5c47505162882a19f2086e4842c858e0586e8
a. Added separate invoke functions in CPP for each API
b. Added support for cmake and make
c. Added support for --interface in cmd line
d. Resolved warnings in existing interface header file
e. Resolved errors/warnings in Windows
f. Added ENABLE_CPP_TEST flag in cmake, make files to enable/disable CPP test interface.
g. Updated Readme as per latest changes.
h. Added CPP changes for 25+ test APIs.

Change-Id: I6b77d24e204833134401c69813a3a1672de02c18
Port Netlib Lapack 3.12 FORTRAN code to C files for double precision complex APIs
Note: Retained the LAPACK 3.11 version of zlaqr5.c to avoid Netlib test failures.

Signed-off-by: Venkatesha <vprasada@amd.com>
AMD-Internal: [CPUPL-6150]
Change-Id: I40df2b270d82159cd0bc16a0158951139054b90a
Signed-off-by: Vibhav Gupta <Vibhav.Gupta@amd.com>
AMD-Internal: CPUPL-6155

Change-Id: Icbd84fdfa434875bf3bfd2072b09d8e77c326702
Ahmad, Sameer and others added 30 commits October 31, 2025 12:18
Handled a case where invalid arguments to dlacpy
could cause segmentation faults

AMD-Internal: CPUPL-7560
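Reference LAPACK's DLACPY does not validate its arguments, so invalid dimensions or leading dimensions can cause out-of-bounds accesses. The sketch below shows one way to guard such a copy. `guarded_copy` is hypothetical, omits DLACPY's UPLO argument, and reports errors as negative positions in its own parameter list, in the LAPACK INFO style.

```c
#include <stddef.h>

/* Hypothetical guarded copy of an m x n column-major block A into B.
 * Returns 0 on success or -(position of the first invalid argument). */
int guarded_copy(int m, int n, const double *a, int lda, double *b, int ldb)
{
    int lmin = m > 1 ? m : 1;         /* LAPACK rule: LDx >= max(1, M) */

    if (m < 0) return -1;
    if (n < 0) return -2;
    if (lda < lmin) return -4;
    if (ldb < lmin) return -6;
    if (m == 0 || n == 0) return 0;   /* quick return: nothing to copy */
    if (a == NULL) return -3;
    if (b == NULL) return -5;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            b[i + (size_t)j * ldb] = a[i + (size_t)j * lda];
    return 0;
}
```

The quick return comes before the pointer checks so that empty copies with NULL buffers remain legal, matching the usual LAPACK convention for zero-sized problems.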
-> Updated BUILD.md to be consistent with the user guide.
-> Removed ENABLE_EMBED_AOCLUTILS-related information from BUILD.md.

AMD-Internal: [CPUPL-7491]
DGESDD was reusing the SVD threshold for small-size paths.
It now uses an SDD-specific value, making it independent of SVD.
Threshold macro names now include the full API name.

AMD Internal: CPUPL-7572
…eration and skipping validation (#136)

* AOCL LAPACK Testsuite: Implemented test mode support

Implemented the --test-mode flag in place of --random_init;
it enables random input generation and/or
skips output validation.

AMD-Internal: CPUPL-7435
Signed-off-by: dnikku <Deepika.Nikku@amd.com>
* Add ai-pr-platform-app.yml

* Update ai-pr-platform-app.yml

* Update ai-pr-platform-app.yml

* Update ai-pr-platform-app.yml

* Create ai_coverity.json

Add AI Coverity Fix - Historical Analysis related changes

* Update ai-pr-platform-app.yml

* Update ai-pr-platform-app.yml

* Update ai-pr-platform-app.yml

Removed email

* Update ai_coverity.json

* Update ai-pr-platform-app.yml

* Delete .github/psdb-jenkins-trigger.yml

* Delete .github/recipients.yaml

* Delete .github/workflows/ai-code-review-trigger.yml

* Delete .github/self_enablement_config.yaml

---------

Co-authored-by: Tyagi, Shubham <Shubham.Tyagi@amd.com>
Co-authored-by: Prabhu, Anantha <Anantha.Prabhu@amd.com>
…ode instead of FLA code (#162)

The underlying APIs for DGELSS (DGEBRD, DGEBD2 and DORGBR) were using FLA code paths.
Using the LAPACK code path instead was found to significantly improve performance for small matrices.
A threshold has been added to switch between the LAPACK and FLA code paths for optimal performance.
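A threshold switch of this kind can be sketched as a one-line selector. The threshold value and the use of max(M, N) below are placeholders, since the commit states neither; the names are hypothetical.

```c
/* Hypothetical code-path selector for the DGELSS building blocks
 * (DGEBRD, DGEBD2, DORGBR). Threshold value is an assumption. */
#define GELSS_LAPACK_PATH_MAX_DIM 128

typedef enum { PATH_LAPACK, PATH_FLA } codepath_t;

codepath_t select_gelss_path(int m, int n)
{
    int d = m > n ? m : n;
    /* Small problems: the LAPACK path was measured to be faster. */
    return d <= GELSS_LAPACK_PATH_MAX_DIM ? PATH_LAPACK : PATH_FLA;
}
```

In practice the crossover point would be chosen from benchmarks on the target machines rather than fixed in advance.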
* Addition of GEQPF in GEQP3 files

* Corrected RWORK allocation for CPP

* Removed strcasecmp()

* Correct workspace allocation for complex in LAPACKE mode

* Added same_string function in test_common

* Restored comments

* Removed redundant comments + Removed unused lwork parameter from GEQPF

* Removed static variable 'geqpf_mode' to avoid race conditions
Root cause: The residual value is NaN/INF because the norm value is 0, so the test case fails.

Solution: Check that the norm value is not 0 before calculating the residual.
Added the FLA_COMPUTE_RESIDUAL macro for residual computation.

AMD-Internal: CPUPL-7592
Used existing AVX2 code for medium sizes.
AMD-Internal: CPUPL-7516
* AOCL LAPACK Test suite: Addition of API-wise CTests

Revamped the test suite to generate API-wise CTests
for faster Jenkins execution.
One CTest is added for each available code path.
GETRF and SYEV are included in this commit.
* AOCL-LAPACK: Build warnings fixing

* Update src/base/flamec/main/FLA_Obj.c

Co-authored-by: ai-pr-platform[bot] <217061646+ai-pr-platform[bot]@users.noreply.github.com>

* Update src/map/lapack2flamec/f2c/c/dlaswp.c

Co-authored-by: ai-pr-platform[bot] <217061646+ai-pr-platform[bot]@users.noreply.github.com>

* Update test/legacyflame/src/test_apcaqutinc.c

Co-authored-by: ai-pr-platform[bot] <217061646+ai-pr-platform[bot]@users.noreply.github.com>

* Update test/legacyflame/src/test_trinv.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test/legacyflame/src/test_lyap.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test/legacyflame/src/test_apcaqutinc.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test/legacyflame/src/test_trinv.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test/legacyflame/src/test_lyap.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test/legacyflame/src/test_apcaqutinc.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test/legacyflame/src/test_apqutinc.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/base/flamec/main/FLA_Obj.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* AOCL-LAPACK: Fix review comments

* Update src/map/lapack2flamec/f2c/c/dlaswp.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* AOCL-LAPACK: Fixing review comments

* Update test/legacyflame/src/test_apcaqutinc.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/base/flamec/main/FLA_Obj.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test/legacyflame/src/test_apcaqutinc.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* AOCL-LAPACK: Addressed review suggestions

* Update src/base/flamec/main/FLA_Obj.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/base/flamec/main/FLA_Obj.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update src/base/flamec/main/FLA_Obj.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* AOCL-LAPACK: Error Fixes

* Update test/legacyflame/src/test_caqrutinc.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test/legacyflame/src/test_apcaqutinc.c

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* AOCL-LAPACK: Addressed review comments

* Reverted type casting from uintptr_t to unsigned long

* Update CMakeLists.txt

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: tprnaidu <tprnaidu@amd.com>
Co-authored-by: ai-pr-platform[bot] <217061646+ai-pr-platform[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Updated License and TPN files according to latest requirements.
* Resolving merge conflicts

* replaced GEMM switches with fla_invoke_gemm

* BDSQR: Added conformance tests

* BDSQR: Added Overflow/Underflow tests

* BDSQR: Clang-format

* U and VT are orthogonal, D and E are derived from GEBRD. Validation fixed

* Used build_bidiagonal_matrix function from test_common

* Used build_bidiagonal_matrix function from test_common

* Added BRT support

* Fixed BDSQR BRT Tests + Corrected GEJSV api config file test parameters

* Resolving conflicts

* resolving conflicts

* used fla_mem_alloc in test_gejsv and addressed negative test failing in BDSQR

* Removed trailing whitespaces.

* U and VT from GEBRD/ORGBR + Added C matrix validations.

* Used GETRI for C matrix validation + SVD reconstruction test

* Fixed conformance tests and validation

* Addressed segfault for lapacke_row

* Removed redundant variables.

* Correct matrix dimension allocation.

* Added U and VT reconstruction tests

* Fixed zero residual issue

* Fixed compute_matrix_norm call

* Added compute_matrix_inverse to test_common

* Removed reset_matrix calls

* Fixed memory leaks
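The validation above relies on U and VT being orthogonal. A common way to measure this is the largest entry of U^T U - I, compared against a small multiple of k * eps. The hypothetical helper below computes it for an m x k column-major U; for VT, whose rows are orthonormal, apply it to the transpose.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical orthogonality check: returns max_{i,j} |(U^T U - I)_{ij}|
 * for an m x k column-major matrix U with leading dimension ldu. */
double orthogonality_error(int m, int k, const double *u, int ldu)
{
    double worst = 0.0;

    for (int i = 0; i < k; ++i)
        for (int j = 0; j < k; ++j) {
            double dot = 0.0;          /* (U^T U)_{ij} = column i . column j */
            for (int p = 0; p < m; ++p)
                dot += u[p + (size_t)i * ldu] * u[p + (size_t)j * ldu];
            double e = fabs(dot - (i == j ? 1.0 : 0.0));
            if (e > worst)
                worst = e;
        }
    return worst;
}
```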
Addressed alignment of key x86-optimized functions called by DGELS.
This improves performance for small dgels problems.
Also removed warnings in the CPP interface for deprecated APIs.

Root cause: The residual value is NaN/INF because the norm value is 0, so the test case fails.

Solution: Added a check that the norm value is greater than safe_min before calculating the residual.
Added common eps (E, P) and safe-min values for residual computation across all APIs.

AMD-Internal: CPUPL-7742
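The guard described above can be sketched as a normalized residual ||R|| / (||A|| * n * eps) that only divides by ||A|| when it exceeds the safe minimum, and otherwise falls back to an absolute residual. This is a hypothetical helper, not the FLA_COMPUTE_RESIDUAL macro; DBL_EPSILON and DBL_MIN from <float.h> are assumed stand-ins for LAPACK's DLAMCH('P') and DLAMCH('S').

```c
#include <float.h>

/* Hypothetical normalized residual that never divides by a (near-)zero norm. */
double normalized_residual(double norm_r, double norm_a, int n)
{
    const double eps = DBL_EPSILON;       /* stand-in for dlamch("P") */
    const double safe_min = DBL_MIN;      /* stand-in for dlamch("S") */
    double scale = (n > 0 ? n : 1) * eps;

    if (norm_a > safe_min)
        return norm_r / (norm_a * scale); /* usual relative residual */
    return norm_r / scale;                /* ||A|| ~ 0: absolute residual */
}
```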
Fixed build error when libflame is built for AVX2 with
ENABLE_AOCL_BLAS option.

Fix: Used macro BLIS_KERNELS_ZEN4 to run direct BLIS
kernels for zen4 and above.

Change-Id: I19e4cea12becd9daac7dc85f4ce3c0ebb5d366c9
AMD-Internal: CPUPL-7475
* AOCL-LAPACK: Fix for GELSS validation failure for Rank < N

-> Updated validation formula for rank < n case.
-> Updated fla_invoke_gemm function to have alpha and beta parameters.

AMD-Internal: [CPUPL-7571]
…… (#220)

AOCL-LAPACK: Fix for Reproducibility failures observed on POTRF, POTF2, POTRI, POTRS, GEBRD APIs

-> aocl_fla_init() was not invoked before checking the minimum arch ID; added this call to the POTRF and POTRI code.
-> Fix for GEBRD reproducibility failures

AMD-Internal: [CPUPL-7493]
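The POTRF/POTRI fix follows a general rule: library initialization must run before any state it sets up (here, the detected arch ID) is consulted. The sketch below shows a thread-safe, idempotent version of that pattern using C11 atomics. `lib_init` and `arch_at_least` are hypothetical stand-ins; the actual aocl_fla_init() implementation is not shown in the commit.

```c
#include <stdatomic.h>

static int arch_id = -1;                /* valid only after lib_init() */
static atomic_int init_state = 0;       /* 0 = not started, 1 = running, 2 = done */

static void lib_init_body(void)
{
    arch_id = 4;                        /* placeholder for real ISA detection */
}

/* Runs lib_init_body() exactly once, even with concurrent callers. */
void lib_init(void)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(&init_state, &expected, 1)) {
        lib_init_body();
        atomic_store(&init_state, 2);   /* publish the initialized state */
    } else {
        while (atomic_load(&init_state) != 2)
            ;                           /* another thread is initializing */
    }
}

/* The fix pattern: initialize before consulting the arch ID. */
int arch_at_least(int min_arch)
{
    lib_init();
    return arch_id >= min_arch;
}
```

Without the lib_init() call, arch_at_least() would read the placeholder value -1 and wrongly reject every code path, which is the kind of inconsistency that shows up as reproducibility failures.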

Co-authored-by: Reshi Krish <reshikrish.thangajawahar@amd.com>
Updated version string in the so_version file and wherever applicable.
Support for the Autotools-based build has been removed,
so this file is removed as well.
Co-authored-by: tprnaidu <tprnaidu@amd.com>

9 participants