MiKuSite - VFP/NEON stuff




VFP/NEON stuff
This page collects links and optimisation ideas I came across while coding my VFP/NEON routines on RISC OS. Some of the links also cover non-VFP/NEON assembler topics.
Links (14/02/2015)
Pulsar's webpage - lots of NEON coding optimisations and findings
MATH NEON library - Lachlan Tychsen's port of typical math.c functions to NEON assembler and C
Shervin Emami's webpage - Shervin Emami ARM Assembly webpage
ARM's Infocenter - all official instruction references and documentation you'll need
ARM's NEON Programmer’s Guide - (you need to register to download the document)
Project Ne10 - software library project for NEON (signal processing, vector/matrix math, physics and image processing)
Agner Fog's Software optimization resources - great resource on low level optimization, written for x86, but still useful sometimes
Flatassembler Forum about non-x86 architectures - forum on the multi platform Flatassembler application, section mostly about ARM
Gameboy Assembler Forum - Gameboy Advance Assembler Forum
Gameboy Advance Programming - Gameboy Advance Programming including a tour of ARM Assembly
ARM Assembly Language Programming - Book from Knaggs/Welsh (2004)
Coding for NEON - Part 1 - load and stores
Coding for NEON - Part 2 - dealing with leftovers
Coding for NEON - Part 3 - matrix multiplication
Coding for NEON - Part 4 - shifting left and right
Coding for NEON - Part 5 - rearranging vectors
Condition Codes - 4 - floating point comparisons using VFP
VFP/NEON Register Mapping PDF (03/01/2014)
VFP/NEON Register Mapping PDF - a PDF I created for my own coding, for noting down which constants or variables each register holds. Quite helpful for keeping an overview of the contents of all the registers in a routine.
Code Optimisation Findings (03/01/2014)
Square root calculation
NEON only provides a reciprocal square root estimate instruction (VRSQRTE), so the obvious way to get SQRT(x) is to follow it with the reciprocal estimate instruction (VRECPE) and its VRECPS refinement steps. That reciprocal can simply be replaced by a multiply, since `SQRT(x) = x * 1/SQRT(x)`. This is faster (especially when multiple VRECPS steps would be used) and more accurate, because no second estimate is stacked on top of the first. With x in D0, the result ends up in D1:
`VRSQRTE.F32 D1,D0`
`VMUL.F32 D1,D1,D0`
Finding the maximum/minimum of 4 single precision floats in one Qx register
This might be trivial, but I still want to point out the useful pairwise VPMAX/VPMIN instructions. They find the maximum (or minimum) of the 4 floats in one Qx register, here Q0 (= D0,D1), with 2 consecutive instructions. After the second instruction both lanes of D2 hold the result:
`VPMAX.F32 D2,D0,D1`
`VPMAX.F32 D2,D2,D2`
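To check both findings without ARM hardware, here is a minimal scalar sketch in portable C. It is only a model, not NEON code. All function names are made up for illustration. rsqrt_estimate() is a stand-in for VRSQRTE that uses the well-known integer bit trick, so it does not reproduce the hardware's estimate. The optional refinement loop uses the VRSQRTS formula (3 - a*b) / 2, which the two-instruction sequence above does not include; pass steps = 0 to model that sequence as written. NaN handling and x <= 0 are ignored.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for VRSQRTE.F32 (assumption: NOT the hardware estimate).
   Returns a coarse approximation of 1/sqrt(x) for x > 0. */
static float rsqrt_estimate(float x)
{
    uint32_t bits;
    float e;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&e, &bits, sizeof e);
    return e;
}

/* Models the arithmetic of VRSQRTS.F32: (3 - a*b) / 2 */
static float rsqrts_step(float a, float b)
{
    return (3.0f - a * b) * 0.5f;
}

/* SQRT(x) = x * 1/SQRT(x), with 'steps' optional refinements
   of the reciprocal square root before the final multiply. */
float sqrt_via_rsqrt(float x, int steps)
{
    float e = rsqrt_estimate(x);            /* VRSQRTE.F32 D1,D0      */
    int i;
    for (i = 0; i < steps; i++)
        e = e * rsqrts_step(x * e, e);      /* e * (3 - x*e*e) / 2    */
    return x * e;                           /* VMUL.F32 D1,D1,D0      */
}

/* Models VPMAX.F32 Dd,Dn,Dm on 2-lane registers:
   d[0] = max(n[0], n[1]), d[1] = max(m[0], m[1]).
   Both results are computed before writing, so d may alias n and m. */
static void vpmax_f32(float d[2], const float n[2], const float m[2])
{
    float lo = n[0] > n[1] ? n[0] : n[1];
    float hi = m[0] > m[1] ? m[0] : m[1];
    d[0] = lo;
    d[1] = hi;
}

/* Maximum of the 4 floats of one Q register (q[0..1] = D0, q[2..3] = D1). */
float hmax4(const float q[4])
{
    float d2[2];
    vpmax_f32(d2, &q[0], &q[2]);            /* VPMAX.F32 D2,D0,D1 */
    vpmax_f32(d2, d2, d2);                  /* VPMAX.F32 D2,D2,D2 */
    return d2[0];                           /* d2[1] holds the same value */
}
```

Calling sqrt_via_rsqrt(x, 0) shows how coarse an unrefined estimate leaves the result, while a few refinement steps bring it close to the true square root. The minimum case works the same way with the comparisons in vpmax_f32() reversed, mirroring VPMIN.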