VFP/NEON stuff
This page collects links and optimisation ideas I gathered while coding VFP/NEON routines on RISC OS. Some of the links also cover non-VFP/NEON assembler topics.
Links (14/02/2015)
Pulsar's webpage - lots of NEON coding optimisations and findings
MATH NEON library - Lachlan Tychsen's port of typical math.c functions to NEON assembler and C
Shervin Emami's webpage - Shervin Emami ARM Assembly webpage
ARM's Infocenter - all official instruction references and documentation you'll need
ARM's NEON Programmer's Guide - (you need to register to download the document)
Project Ne10 - software library project for NEON (signal processing, vector/matrix math, physics and image processing)
Agner Fog's Software optimization resources - great resource on low level optimization, written for x86, but still useful sometimes
Flatassembler Forum about non-x86 architectures - forum on the multi platform Flatassembler application, section mostly about ARM
Gameboy Assembler Forum - Gameboy Advance Assembler Forum
Gameboy Advance Programming - Gameboy Advance Programming including a tour of ARM Assembly
ARM Assembly Language Programming - Book from Knaggs/Welsh (2004)
Coding for NEON - Part 1 - load and stores
Coding for NEON - Part 2 - dealing with leftovers
Coding for NEON - Part 3 - matrix multiplication
Coding for NEON - Part 4 - shifting left and right
Coding for NEON - Part 5 - rearranging vectors
Condition Codes - 4 - floating point comparisons using vfp
VFP/NEON Register Mapping PDF (03/01/2014)
VFP/NEON Register Mapping PDF
A PDF I created for my own coding work, with room to note which constants or variables each register holds. It is quite helpful for keeping an overview of the contents of all the registers in a routine.
Code Optimisation Findings (03/01/2014)
Square root Calculation

Because NEON only provides a reciprocal square root estimate instruction (VRSQRTE), one would have to apply the reciprocal estimate instruction (VRECPE) to its result to finally get the square root. To get better accuracy and faster code (especially when multiple VRECPS refinement steps would otherwise be needed), you can simply replace that reciprocal with a single multiply, using

SQRT(x) = x * 1/SQRT(x)

resulting in

VMUL.F32    D1,D1,D0
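
To illustrate why the multiply is enough, here is a minimal scalar sketch in portable C rather than NEON assembler. It assumes the reciprocal square root estimate has already been produced (as VRSQRTE would do) and refines it with the Newton-Raphson step that VRSQRTS is documented to compute, (3 - a*b)/2. The function names rsqrt_refine and sqrt_via_rsqrt are made up for this example, and the register roles in the VMUL above (x in one operand, the refined 1/SQRT(x) in the other) are my assumption.

```c
#include <math.h>

/* Scalar model of one refinement step for an estimate e of 1/sqrt(x).
   On NEON this corresponds to: VMUL t,e,x / VRSQRTS t,t,e / VMUL e,e,t,
   where VRSQRTS computes (3 - a*b) / 2. */
static float rsqrt_refine(float x, float e)
{
    return e * ((3.0f - (x * e) * e) * 0.5f);
}

/* sqrt(x) computed as x * (1/sqrt(x)).
   'estimate' stands in for the VRSQRTE result (hypothetical input here),
   'steps' is the number of refinement steps to apply.
   Only meaningful for finite x > 0: for x == 0 the true 1/sqrt(x)
   is infinite and 0 * inf gives NaN. */
static float sqrt_via_rsqrt(float x, float estimate, int steps)
{
    float e = estimate;
    for (int i = 0; i < steps; ++i) {
        e = rsqrt_refine(x, e);
    }
    return x * e;   /* the single multiply that replaces the reciprocal */
}
```

For example, sqrt_via_rsqrt(2.0f, 0.7f, 2) starts from the rough estimate 0.7 for 1/SQRT(2) and ends up very close to sqrtf(2.0f). Note the x = 0 case mentioned in the comment: if zero inputs can occur, they need separate handling.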

Finding the Maximum/Minimum of 4 single precision floats in one Qx register

This might be trivial, but I still want to point out the useful pairwise VPMAX/VPMIN instructions. They are ideal for finding the maximum/minimum of 4 floats in one Qx register (here Q0, i.e. D0 and D1) with 2 consecutive instructions:

VPMAX.F32 D2,D0,D1
VPMAX.F32 D2,D2,D2
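
The following is a small scalar sketch in portable C of what the two instructions above do, assuming the 4 floats sit in Q0 (D0 = lower pair, D1 = upper pair). The type d_reg and the functions vpmax_f32 and max_of_4 are invented for this illustration; they only model the pairwise behaviour and are not NEON intrinsics.

```c
/* Model of a 64-bit D register holding two single precision lanes. */
typedef struct { float lane[2]; } d_reg;

static float max2(float a, float b)
{
    return a > b ? a : b;
}

/* Model of VPMAX.F32 Dd,Dn,Dm:
   lane 0 = max of the two lanes of Dn, lane 1 = max of the two lanes of Dm. */
static d_reg vpmax_f32(d_reg n, d_reg m)
{
    d_reg d;
    d.lane[0] = max2(n.lane[0], n.lane[1]);
    d.lane[1] = max2(m.lane[0], m.lane[1]);
    return d;
}

/* Maximum of 4 floats, with (a, b) playing D0 and (c, d) playing D1. */
static float max_of_4(float a, float b, float c, float d)
{
    d_reg d0 = { { a, b } };
    d_reg d1 = { { c, d } };
    d_reg d2 = vpmax_f32(d0, d1);   /* VPMAX.F32 D2,D0,D1: two pair maxima */
    d2 = vpmax_f32(d2, d2);         /* VPMAX.F32 D2,D2,D2: max of those two */
    return d2.lane[0];              /* both lanes now hold the same value   */
}
```

After the second step both lanes of D2 hold the overall maximum, so either lane can be used. Replacing max2 with a minimum gives the matching model for VPMIN.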