Post by JF Mezei
Post by Alan Browne
Well, that is the point. Hand code in assembler so that every possible
instruction cycle is optimal.
Have you coded on Alpha? Have you coded on Itanium? VAX?
Irrelevant to the x86 / ARM transition. And read the news: Alpha is OLD
and dead technology.
The most recent Itanium iteration had no improvement other than clock
speed. And that was in 2017. That is the end of that product. It's
dead, Jim.
Stop bringing it up.
And yes, I've coded on VAX VMS, though exclusively in HOL; I had to
understand the stack operations to integrate across languages (calling
Fortran S/Rs from Pascal, mainly). I've looked at the PDP-11 and VAX
instruction sets and there was nothing especially daunting about them -
indeed quite friendly, as I recall.
Post by JF Mezei
The Macro (VAX assembly language) *compiler* for IA64 produced
faster/more efficient code than hand-coding native IA64 assembler
because the LLVM/compilers on Itanium spent the time to order and block
the operations properly to allow Itanic chips to run fast.
For lazy assembler writers, sure. Don't forget, at heart, the compiler
writes machine instructions that are entirely expressible in assembler.
Thus a good assembler programmer would do fine.
In reality, writing assembler is too expensive (man hour cost) to
warrant it in most cases. So good optimizing compilers are more than
good enough and will do the most useful optimizations most often.
Post by JF Mezei
Doing the work native on IA64 would require you to not only translate
your idea into individual opcodes, but also know what type of operations
to do in what order, and insert the IA64-specific operations to tell the
chip which operation depends on which one.
As I've explained several times, writing good efficient code is the goal
of writing assembler. There's a big man-hour cost in design, coding,
testing and debugging, not to mention long-term life cycle costs. And
of course it's less portable than HOL.
Post by JF Mezei
Post by Alan Browne
Adding to that you have no clue what translators may be doing to
optimize code.
Apple has given sufficient hints on Rosetta 2, namely that all system
calls are linked to a special library that accepts the call with the
Intel argument-passing mechanism and then issues the corresponding call
to the "real" routine with the ARM argument-passing standard.
It's JIT translation as well as translate-on-install, so code is
translated a single time in install cases.
QUOTE
Rosetta 2 can convert an application right at installation time,
effectively creating an ARM-optimized version of the app before you’ve
opened it. (It can also translate on the fly for apps that can’t be
translated ahead of time, such as browser, Java, and Javascript
processes, or if it encounters other new code that wasn’t translated at
install time.) With Rosetta 2 frontloading a bulk of the work, we may
see better performance from translated apps.
ENDQUOTE
https://www.theverge.com/21304182/apple-arm-mac-rosetta-2-emulation-app-converter-explainer
And finally, developers who want to keep up, will be recompiling for M1
- if they haven't already.
(Don't bring up Adobe. We know).
Post by JF Mezei
This means the translator really only translates existing functions
without understanding what they do. Optimizing is much easier at a
higher level because concepts such as loops are understood by the compiler.
You don't know how sophisticated the translator is. I suspect it has
some very, very clever tricks up its sleeve.
Post by JF Mezei
Post by Alan Browne
For example they could easily decide to remove a bunch
of pushes onto the stack because more registers are available than the
original target.
A language like Postscript is based on a stack, as is a reverse polish
calculator. That stack is part of the logic, not just some means to deal
with shortage of registers. A translator cannot know if you are using
the stack as temporary storage or whether it is an integral part of your
logic. The translator must maintain functionality.
I was referring to stack machine call conventions for parameter passing
and saving return addresses, registers, etc. You're referring to an
implementation abstraction.
An RPN calculator, in HOL code, emulates a calculator stack in vars
(usually a linked list of some kind). This is not the same as the
machine stack, but instead an abstraction of a calculator stack usually
implemented in a HOL such as C, Fortran, Pascal, etc.
And yes, if it recurses (as it should) then the machine stack is used
for that, but the abstraction of the RPN is not on the machine stack.
It is in program memory (Data segment, not stack segment). It could
also use allocated memory if the stack is extremely deep (ie: not likely
to be done but is certainly "doable" for the exercise). This is
typically in the "extra segment" (_x86 speak) and based off of that
register pointer.
Postscript would also implement a stack structure (linked list probably)
to save and restore states through the document.
Post by JF Mezei
A compiler generating code for x86 may decide to use a stack mechanism
to store a value, and the same code, with the same compiler targeting
ARM, may use a register. But that is a decision made by the compiler,
which understands the desired goal of the source code.
You don't need to know the "goal", only that a particular stack-located
variable can instead be put into a register. That saves a push/pop and,
more importantly, is much faster than a memory-located var. Going from
the 16 x 64b registers of the x86 to the 29 x 64b (available) registers
of the ARM will afford a lot of opportunity for the translator to do the same.
Even my _x86 Pascal compiler does this. Example:
.globl _P$RUNFILM_$$_GETFILERECPOINTER$WORD$$PRECPTR
_P$RUNFILM_$$_GETFILERECPOINTER$WORD$$PRECPTR:
# Temps allocated between rsp+0 and rsp+56
# [733] BEGIN
pushq %rbx
pushq %r12
pushq %r13
pushq %r14
pushq %r15
leaq -64(%rsp),%rsp
# Var v located in register r14w
# Var $result located in register rax
# Var TLim located in register xmm0
# Var T located in register xmm0
# Var i located in register eax
# Var R located in register eax
# Var Rc located in register eax
# Var TCount located in register eax
# Var Gr located in register r12b
# Var State located in register r15b
# Var found located in register r13b
i.e.: all these vars are usually stack-located. Now (with the right
switch), they are register-located. Two of those are pointers, and that
makes for extraordinary speed improvements in accessing and processing
data, esp. with double indirect operations.
The ARM will just add 13 more registers for such optimization!
Post by JF Mezei
A translator of already-compiled binary code doesn't. If it sees use of
the stack, it doesn't know whether it was meant as temporary storage, or
if it was truly meant as the LIFO storage logic desired by the program.
A well-designed translator can optimize to the extent that the resulting
ops run fast enough to repay the cost of implementing it.
Post by JF Mezei
Post by Alan Browne
Post by JF Mezei
This is why Digital died.
Yep. They are dead. Get over it.
Not sure where you got that quote, but it was not me who said this in
this thread.
<whoosh>
Post by JF Mezei
Post by Alan Browne
Not at all. I probably have more experience with hardware and assembler
on a range of processor types than pretty much everyone here
Simple "embedded device" processors tend to not have very fancy logic
and it is straightforward to code for them. Once you get into high
performance processors (or the Itanic, where Intel tried high
performance), it gets very messy because of how a processor reacts to
instructions.
No. It's just expensive for the vast majority of applications so a
compiler is used.
Post by JF Mezei
When you code for a processor that faults when you try to access
memory that isn't quadword aligned (64 bits), your fancy assembler code
that ran well on a less complex CPU suddenly runs like molasses even
though that processor is supposed to be high performance. This is
something that won't happen with a higher-level language, because the
compiler and LLVM know to align all memory access to a quadword to avoid
this, and it is done automatically for you. So if you need the 3rd
byte, it will fetch 8 bytes from memory into a register and do the
shifts to get the byte you want, to avoid the memory fault.
Properly written and tested assembler code will avoid such faults.
Further, alignment pragmas have long been available in assemblers, for
when one wants speed over storage. These are just design trade-off
decisions.
[AAA]
Post by JF Mezei
Post by Alan Browne
Post by JF Mezei
In the case of OS-X, much already exists from IOS, so much could be
re-used, but there is still code needed for the new
thunderbolt/USB-4 drivers and all the variants attached to it,
including ethernet drivers attached to the new thunderbolt/USB-4
IO interface.
Which, as I point out elsewhere, is almost certainly in HOL. High speed
I/O is most often via some variant of DMA.
[See AAA above]
Post by JF Mezei
Photoshop is not a low-level device driver. Most of a device driver is
I was replying to your I/O points that you snipped out and that I
restored above [AAA]. Again: device drivers are less and less in
assembler and more and more in HOL.
Post by JF Mezei
now at a higher-level language, with only the very lowest level in
assembler, where you only do a few instructions.
As also explained to you a couple times, image processing is very large
integer arrays. So assembler is a great way to do some functions on
such data.
Going from x86_64 to ARM with 29 available 64b registers will seem like
a gift from heaven to Adobe for such processing and will help it blaze.
Post by JF Mezei
Post by Alan Browne
Assembler allows the designer to carefully optimize every tiny step and
if desired model every machine cycle, etc.
"allows" is the keyword. The problem is that it requires you to have
intimate knowledge of the whole architecture to know what combinations
of OP codes you can use and in what order, and to know how the CPU will
pipeline them into different instruction prefetches etc.
A good programmer should have such knowledge - as I've pointed out
elsewhere.
Post by JF Mezei
When you are dealing with simpler embedded device CPUs, life is much
simpler and you focus on optimizing logic, because every assembler
instruction is executed sequentially anyway.
Even the simplest microcontrollers these days have optimizations and
it's entirely independent of the language used.
Assembler is not worth the cost in 99.99% of cases. Adobe, who cater to
a huge audience including very high-end photography and marketing
departments, want to not only do the most, but do it fast. So their core
processes are worth the man-hour investment - which is part of the
relatively high price we pay for many Adobe products.
--
"...there are many humorous things in this world; among them the white
man's notion that he is less savage than the other savages."
-Samuel Clemens