VFP/NEON stuff
|
This page collects links and optimisation ideas I found or worked out while writing VFP/NEON code on RISC OS. Some of the links also cover ARM assembler topics outside VFP/NEON.
|
Links (14/02/2015)
Pulsar's webpage - lots of NEON coding optimisations and findings
MATH NEON library - Lachlan Tychsen's port of typical C math functions to NEON assembler and C
Shervin Emami's webpage - ARM assembly and NEON pages
ARM's Infocenter - all official instruction references and documentation you'll need
ARM's NEON Programmer’s Guide - (you need to register to download the document)
Project Ne10 - software library project for NEON (signal processing, vector/matrix math, physics and image processing)
Agner Fog's Software optimization resources - great resource on low-level optimization, written for x86 but sometimes still useful
Flatassembler Forum about non-x86 architectures - forum on the multi platform Flatassembler application, section mostly about ARM
Gameboy Assembler Forum - Gameboy Advance Assembler Forum
Gameboy Advance Programming - Gameboy Advance Programming including a tour of ARM Assembly
ARM Assembly Language Programming - Book from Knaggs/Welsh (2004)
Coding for NEON - Part 1 - load and stores
Coding for NEON - Part 2 - dealing with leftovers
Coding for NEON - Part 3 - matrix multiplication
Coding for NEON - Part 4 - shifting left and right
Coding for NEON - Part 5 - rearranging vectors
Condition Codes - 4 - floating point comparisons using VFP
|
VFP/NEON Register Mapping PDF (03/01/2014)
VFP/NEON Register Mapping PDF - A PDF I created for my own coding, with space to note the constants or variables held in each register. It is quite helpful for keeping an overview of the contents of all the registers in a routine.
|
Code Optimisation Findings (03/01/2014)
Square root Calculation
NEON only provides a reciprocal square root estimate (VRSQRTE), so to get an actual square root one would have to follow it with the reciprocal estimate (VRECPE).
For better accuracy and faster code (especially when multiple VRECPS refinement steps would otherwise be needed) you can replace the reciprocal with a single multiply, since
SQRT(x) = x * 1/SQRT(x)
resulting in
VRSQRTE.F32 D1,D0      ; D1 = estimate of 1/SQRT(D0), both lanes
VMUL.F32    D1,D1,D0   ; D1 = D0 * 1/SQRT(D0) = approx. SQRT(D0)
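The same idea can be sketched in C. This is only a rough sketch and not code from my routines: the helper name sqrt2_approx is made up, and the single VRSQRTS refinement step is an addition that the two-instruction version above does not have. On a NEON build it uses the arm_neon.h intrinsics. On other hosts it falls back to a scalar model in which the well-known 0x5f3759df bit trick merely stands in for VRSQRTE, so the code can be compiled and checked anywhere.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: approximate sqrt(x) for two positive floats as
 * x * (1/sqrt(x)), with one Newton-Raphson step on the estimate. */
static void sqrt2_approx(const float in[2], float out[2])
{
#if defined(__ARM_NEON)
    float32x2_t x = vld1_f32(in);
    float32x2_t e = vrsqrte_f32(x);                   /* VRSQRTE: coarse 1/sqrt(x)   */
    e = vmul_f32(e, vrsqrts_f32(vmul_f32(x, e), e));  /* VRSQRTS: e *= (3 - x*e*e)/2 */
    vst1_f32(out, vmul_f32(x, e));                    /* VMUL: x * 1/sqrt(x)         */
#else
    /* Scalar stand-in for hosts without NEON (an assumption for testing
     * only; the bit trick is not what VRSQRTE does internally). */
    for (int i = 0; i < 2; i++) {
        float x = in[i], e;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        bits = 0x5f3759dfu - (bits >> 1);             /* coarse 1/sqrt(x)            */
        memcpy(&e, &bits, sizeof e);
        e = e * (1.5f - 0.5f * x * e * e);            /* same Newton-Raphson step    */
        out[i] = x * e;                               /* x * 1/sqrt(x)               */
    }
#endif
}
```

The result is still only an approximation, so compare it against a tolerance rather than for equality. Also note that x = 0 is a special case on NEON: VRSQRTE returns infinity there, and 0 * infinity gives NaN instead of 0.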
Finding the Maximum/Minimum of 4 single precision floats in one Qx register
This might be trivial, but I still want to point out the useful pairwise VPMAX/VPMIN instructions. They are ideal for finding the maximum/minimum of the 4 floats in one Qx register (here Q0, i.e. D0 and D1) with 2 consecutive instructions:
VPMAX.F32 D2,D0,D1   ; D2 = [max(D0[0],D0[1]), max(D1[0],D1[1])]
VPMAX.F32 D2,D2,D2   ; both lanes of D2 = maximum of all 4 floats
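To make the lane handling of the two VPMAX steps explicit, here is a small C sketch. The helper name max4 is hypothetical and not from my code. With NEON available it uses the vpmax_f32 intrinsic on the low and high halves of the Q register. Otherwise it models the same two pairwise steps in plain scalar code.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: maximum of the four floats v[0..3],
 * reduced with two pairwise-max steps as in the VPMAX sequence. */
static float max4(const float v[4])
{
#if defined(__ARM_NEON)
    float32x4_t q = vld1q_f32(v);                     /* Q0 = D0 (low), D1 (high)    */
    float32x2_t m = vpmax_f32(vget_low_f32(q),
                              vget_high_f32(q));      /* VPMAX.F32 D2,D0,D1          */
    m = vpmax_f32(m, m);                              /* VPMAX.F32 D2,D2,D2          */
    return vget_lane_f32(m, 0);                       /* both lanes hold the maximum */
#else
    /* Scalar model of the same two steps for hosts without NEON. */
    float d2[2];
    d2[0] = v[0] > v[1] ? v[0] : v[1];                /* pairwise max within D0      */
    d2[1] = v[2] > v[3] ? v[2] : v[3];                /* pairwise max within D1      */
    return d2[0] > d2[1] ? d2[0] : d2[1];             /* pairwise max within D2      */
#endif
}
```

For the minimum, swap in VPMIN (vpmin_f32 in the intrinsics, or the reversed comparisons in the scalar model).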
|
|