
1268 Views, 21 Replies. Latest reply: Jul 15, 2010 11:44 AM by Bernd Paradies
bkamenov User 20 posts since
Jun 10, 2010

Jun 15, 2010 12:27 PM

inline functions in C, gcc optimization and floating point arithmetic issues

Over the past several days I have really become a fan of Alchemy. But after intensive testing I have found several issues that I'd like to solve, and I can't do it without help.

 

So... I'm porting an old game console emulator that I wrote in ANSI C. The code builds with both gcc and Visual Studio without any modifications or cross-compile macros. The only platform-specific code is the audio and video output, which is out of scope because I have already ported audio and video to AS3.

 

Here are the issues:

 

1. Inline functions - Having even a single inline function makes the code work incorrectly (although it does not crash), whether or not optimization is enabled (-O0 or -O3). My current workaround is converting the inline functions to macros, which achieves the same effect. Any ideas why inline functions break the code?
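A minimal sketch of the macro workaround described above, assuming a hypothetical clamp helper (not taken from the emulator): the same logic written once as a static inline function and once as a macro. The macro avoids the inline keyword entirely, at the cost of evaluating its argument more than once.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a value to the signed 16-bit range.
   Version 1: a static inline function (the form that misbehaved). */
static inline int32_t clamp16_fn(int32_t v)
{
    return v < -32768 ? -32768 : (v > 32767 ? 32767 : v);
}

/* Version 2: the macro workaround. Every use of the argument is
   parenthesized, but v is evaluated more than once, so never pass an
   expression with side effects such as x++. */
#define CLAMP16(v) ((v) < -32768 ? -32768 : ((v) > 32767 ? 32767 : (v)))
```

Both forms should produce identical results for plain arguments; only the function version is safe with side-effecting arguments.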

 

2. Compiler optimizations - my project consists of many C files, one of which is called flash.c; it contains main and the exported functions. I build the project as follows:

 

gcc -c flash.c -O0 -o flash.o     # Please note the -O0 option!
gcc -c file1.c -O3 -o file1.o
gcc -c file2.c -O3 -o file2.o
# ... and so on

gcc *.o -swc -O0 -o emu.swc       # Please note the -O0 option again!

mxmlc.exe -library-path+=emu.swc --target-player=10.0.0 Emu.as

for file in *.o    # Removes the obj files
    do
        rm "$file"
    done

 

If I use any option other than -O0 in gcc -c flash.c -O0 -o flash.o, the program stops working correctly, exactly as with the inline functions (but it still does not crash or print any errors in debug mode). flash.c has 4 static functions to be exported to AS3, plus the main function. Do you know why?

If I use any option other than -O0 in gcc *.o -swc -O0 -o emu.swc, the program also stops working correctly, exactly as above, but with -O1, -O2 or -O3 the SWC file gets smaller, up to 2x smaller with -O3. Why? Is there a way to optimize all the object files except flash.o? I suspect a similar issue to the one I get when compiling it.

 

3. Floating point issues - this is the worst one. My code is mainly based on integer arithmetic, but in one or two places it requires floating point arithmetic. One of them is the conversion of a 16-bit 44.1 kHz sound buffer to a float buffer with the same sample rate but with samples in the range -1.0 to 1.0.

 

My code:

 

void audio_prepare_as()
{
    uint32 i;
   
    for(i=0;i<audioSamples;i+=2)
    {
        audiobuffer[i] = (float)snd.buffer[i]/32768;
        audiobuffer[i+1] = (float)snd.buffer[i+1]/32768;
    }
}
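For reference, a sketch of the same conversion as a flat loop, assuming snd.buffer holds interleaved signed 16-bit samples and audioSamples counts individual samples (the names below are hypothetical). Dividing by 32768 maps -32768 to exactly -1.0 and 32767 to 32767/32768, so the output stays in [-1.0, 1.0).

```c
#include <stddef.h>
#include <stdint.h>

/* Convert interleaved signed 16-bit PCM samples to floats in [-1.0, 1.0).
   Multiplying by 2^-15 is exact and equals dividing by 32768. */
static void pcm16_to_float(const int16_t *in, float *out, size_t count)
{
    const float scale = 1.0f / 32768.0f;  /* exactly representable */
    for (size_t i = 0; i < count; i++)
        out[i] = (float)in[i] * scale;
}
```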

 

My audio playback is working perfectly, but not when using the above conversion. I have inspected the float numbers: all incorrect and invalid. I tried other code with simple floats: same story. It is as if Alchemy refuses to work with floats. What is wrong? There is another place where I must resize the framebuffer, and a float is involved there too: same crap. Please help!

 

Found the floating point problem: audiobuffer is written to a ByteArray and then used in AS3. But C floats (32-bit) are obviously not the same as AS3 Numbers (64-bit doubles). Now the floating point issue is resolved.
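To make the size mismatch concrete: a C float occupies 4 bytes, while an AS3 Number is an 8-byte double. The sketch below (hypothetical helper name) writes one float as 4 little-endian bytes. On the AS3 side such data would presumably be read back with ByteArray.readFloat() after setting endian to Endian.LITTLE_ENDIAN, since ByteArray defaults to big-endian; treat that pairing as an assumption to verify against your own code.

```c
#include <stdint.h>
#include <string.h>

/* Store one 32-bit IEEE 754 float into dst[0..3], least significant
   byte first. memcpy copies the raw bits without any value conversion. */
static void put_f32_le(uint8_t *dst, float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    dst[0] = (uint8_t)(bits & 0xFF);
    dst[1] = (uint8_t)((bits >> 8) & 0xFF);
    dst[2] = (uint8_t)((bits >> 16) & 0xFF);
    dst[3] = (uint8_t)((bits >> 24) & 0xFF);
}
```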

 

The optimization issues remain! I really need to speed up my code.

 

Thanks in advance!

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Hello bkamenov,

     

    you mention three problem areas trying to use Alchemy for your game.

    I don't know whether this will be helpful, but let me throw out some suggestions and ideas.

     

     

    RE: Inline functions. Would you please check whether those inline functions that cause trouble use llvm Standard C Library Intrinsics? Those intrinsic functions are memcpy, memmove, memset, sqrt, powi, sin, cos, and pow (for more details see http://llvm.org/docs/LangRef.html#int_libc). We have seen problems with those. You can avoid those problems by forcing llc to not inline those functions (see next section).

     

     

    RE: Compiler optimizations. Would you please check whether your flash.c contains any of those Standard C Library Intrinsics (see above). If flash.c uses memcpy, memmove, or memset, then this workaround might work.

    Copy and paste this snippet into flash.c after your includes:

     

    static void * custom_memmove( void * destination, const void * source, size_t num ) {
     
      void *result;
      __asm__("%0 memmove(%1, %2, %3)\n" : "=r"(result) : "r"(destination), "r"(source), "r"(num));
      return result;
    }
     
     
    static void * custom_memcpy ( void * destination, const void * source, size_t num ) {
      void *result;
     
      __asm__("%0 memcpy(%1, %2, %3)\n" : "=r"(result) : "r"(destination), "r"(source), "r"(num));
      return result;
    }
     
     
     
    static void * custom_memset ( void * ptr, int value, size_t num ) {
      void *result;
      __asm__("%0 memset(%1, %2, %3)\n" : "=r"(result) : "r"(ptr), "r"(value), "r"(num));
      return result;
    }
     
     
    #define memmove custom_memmove
    #define memcpy custom_memcpy
    #define memset custom_memset
    

     

     

    RE: Floating point issues. We did see some problems with bit casting i64 values and this problem looks similar. In this case you might be able to get the correct results by using code that is less ambiguous about casting versus converting. Here is my suggestion:

     

    void audio_prepare_as()
    {
        uint32 i;
        float f0;
        const float f1 = 32768;
     
        for(i=0;i<audioSamples;i+=2)
        {
          f0 = snd.buffer[i];
          audiobuffer[i] = f0/f1;
     
     
          f0 = snd.buffer[i+1];
          audiobuffer[i+1] = f0/f1;
        }
    }
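To illustrate the distinction between casting and converting mentioned above, here is a small sketch (function names hypothetical). Converting keeps the numeric value and changes the bit pattern; bit-casting keeps the bit pattern and changes the meaning. The rewritten loop above only asks for value conversions: the sample is assigned to a float variable and then divided by a float constant.

```c
#include <stdint.h>
#include <string.h>

/* Converting: the numeric value is preserved. (float)1 is 1.0f,
   whose bit pattern is 0x3F800000. */
static float convert_i32_to_float(int32_t i)
{
    return (float)i;
}

/* Bit-casting: the 32 bits are preserved and reinterpreted as a float.
   memcpy is the portable way to do this in C; a pointer cast such as
   *(float *)&bits violates strict aliasing. */
static float bitcast_u32_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```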
     
    

     

    Best regards,

     

    - Bernd

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Hello bkamenov,

     

    I'll take a look at the code you were sharing tomorrow.

    One thing I noticed: you inserted my code snippet with the memcpy/memset/memmove workarounds before the includes.

    They should be after the last include.

     

    Would you please try that?

     

    Thanks,

     

    - Bernd

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Very cool!

     

    I get 13-14 fps in Chrome on OSX. Not too shabby.

     

    If you want to do some performance tuning it might be worth setting ACHACKS_TMPS=1 and studying the resulting *.achacks.as file.

    You can experiment and manually fine tune the generated ActionScript code and recompile it using parts of the gcc script.

    This script might help, pass your as file as the first parameter:

     

    #!/bin/bash
     
    SRC=`basename $1 ".as"`
    java -Xms256M -Xmx2048M -jar ${ALCHEMY_HOME}/bin/asc.jar -AS3 -strict -import ${ALCHEMY_HOME}/flashlibs/global.abc -import ${ALCHEMY_HOME}/flashlibs/playerglobal.abc -config Alchemy::Debugger=false -config Alchemy::NoDebugger=true -config Alchemy::Shell=false -config Alchemy::NoShell=true -config Alchemy::LogLevel=10 -config Alchemy::Vector=true -config Alchemy::NoVector=false -config Alchemy::SetjmpAbuse=false -swf cmodule.${SRC}.ConSprite,800,600,60 ${SRC}.as
     
    # open ${SRC}.swf 
     
     
     
    

     

    Good luck!

     

    - Bernd

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Yes, I do like the idea of your emulator!

     

    I also understand that you feel that optimizing the generated code is too much work and probably not worth the effort. But a 3x speedup might be within reach. For example, if you compile your project without llc -avm2-use-memuser, your program will slow down significantly; I would estimate by a factor of 5x to 10x. The reason is that -avm2-use-memuser tells the Alchemy backend to use fast memory ops instead of slow ByteArray calls.

     

    I would encourage you to search your generated ActionScript file for "gstate.ds.write" and "gstate.ds.read" and you'll see that there are still a lot of places in your ActionScript file that can be replaced with _asm() instructions. I bet that you'll get a much faster SWF just by reimplementing MemUser in "ActionScript assembler".

     

    I don't have a lot of time right now. But I can post an assembler implementation of MemUser if anybody is interested.

     

    Regards,

     

    - Bernd

  • AlphaTrion Calculating status... 7 posts since
    Jun 18, 2010

    Bernd, your improvements sound great. I'm looking forward to reading more from you!

     

    ... waiting impatiently

     

    Regards

    Bastian

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Hello everybody and thanks for your interest in this perhaps rather exotic Alchemy topic.

     

    I was planning on writing up a separate post that goes into more details about what I call the "Alchemy assembler language" but I might not be able to get to that in the next few days. Instead of letting you guys wait forever I am throwing this short version over the fence. Please keep in mind that the "Alchemy assembler language" deserves a much better write up than I am going to do now.

     

    That said, here is my short version: Alchemy does not just translate C code to ActionScript. As part of that transformation process every function becomes a finite state machine (FSM), and all FSMs use the same contiguous large memory block (similar to the tape of a Turing machine) for allocating new objects and passing parameters. In this post I won't go into detail about why FSMs are necessary. Scott explains that in his talk at the 2008 LLVM Dev Conference; please watch this talk:

     

         Flash C Compiler: Compiling C code to the Adobe Flash Virtual Machine

         http://llvm.org/devmtg/2008-08/
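The FSM idea can be pictured with a toy example. The sketch below is illustrative only and is NOT Alchemy's generated code; all names are hypothetical. A loop is rewritten as a resumable state machine whose locals live in a struct rather than on the native stack, which is one plausible reason for the FSM form (execution can stop after a time slice and resume later); see the talk above for the actual rationale.

```c
#include <stdint.h>

enum { ST_INIT, ST_LOOP, ST_DONE };

typedef struct {
    int state;            /* which "basic block" runs next */
    const int32_t *data;
    int32_t n, i;         /* locals kept across suspensions */
    int64_t sum;
} SumFsm;

/* Runs at most `budget` loop iterations; returns 1 when finished,
   0 when it suspended and must be called again to resume. */
static int sum_fsm_step(SumFsm *m, int budget)
{
    for (;;) {
        switch (m->state) {
        case ST_INIT:
            m->i = 0;
            m->sum = 0;
            m->state = ST_LOOP;
            break;
        case ST_LOOP:
            if (m->i >= m->n) { m->state = ST_DONE; break; }
            if (budget-- <= 0) return 0;  /* suspend: yield to caller */
            m->sum += m->data[m->i++];
            break;
        case ST_DONE:
            return 1;
        }
    }
}
```

Calling sum_fsm_step repeatedly with a small budget yields the same sum as a plain loop, just spread over several calls.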

     

    Now, Alchemy offers two compile switches that drive how that contiguous large memory block is accessed by all FSMs. With the -avm2-use-memuser option (which is the default, because it is faster than the other option) the Alchemy LLVM backend generates ActionScript source code that contains inline assembler instructions for ultra-fast memory access. Before I explain those low-level memory ops let me point out two important things:

     

    1. Only Flash Player versions 10 and higher and AIR 1.5 and higher support those fast memory ops.

    2. Only the Alchemy version of asc.jar is capable of compiling inline assembler instructions into ABC and SWF.

     

    If you don't specify -avm2-use-memuser option the Alchemy LLVM backend will generate ActionScript source code that uses regular ActionScript ByteArray operations for reading and writing to memory. That method is significantly slower as Boris has pointed out.

     

    After this introduction let's jump into the details of the "Alchemy assembler language" with regards to reading from and writing to the memory "band".

    The memory op codes are as follows:

     

    li8     0x35     load integer, 8 bits
    li16    0x36     load integer, 16 bits
    li32    0x37     load integer, 32 bits
    lf32    0x38     load float, 32 bits
    lf64    0x39     load double, 64 bits
    si8     0x3a     store integer, 8 bits
    si16    0x3b     store integer, 16 bits
    si32    0x3c     store integer, 32 bits
    sf32    0x3d     store float, 32 bits
    sf64    0x3e     store double, 64 bits
    

    The inline assembler instructions for reading and writing to the memory band are:

     

    Read i32 from ByteArray[addr]:
    __xasm<int>(push(addr), op(0x37));
     
    Write i32 val to ByteArray[addr]:
    __asm(push(val), push(addr), op(0x3c));
    

    With that information you can now write your own assembler version of MemUser:

     

    public class MemUser
    {
         public final function _mr32(addr:int):int { return __xasm<int>(push(addr), op(0x37)); } // li32
         public final function _mru16(addr:int):int { return __xasm<int>(push(addr), op(0x36)); } // li16
         public final function _mrs16(addr:int):int { return (__xasm<int>(push(addr), op(0x36)) << 16) >> 16; } // li16, then sign-extend
         public final function _mru8(addr:int):int { return __xasm<int>(push(addr), op(0x35)); } // li8
         public final function _mrs8(addr:int):int { return (__xasm<int>(push(addr), op(0x35)) << 24) >> 24; } // li8, then sign-extend
         public final function _mrf(addr:int):Number { return __xasm<Number>(push(addr), op(0x38)); } // lf32
         public final function _mrd(addr:int):Number { return __xasm<Number>(push(addr), op(0x39)); } // lf64
         public final function _mw32(addr:int, val:int):void { __asm(push(val), push(addr), op(0x3c)); } // si32
         public final function _mw16(addr:int, val:int):void { __asm(push(val), push(addr), op(0x3b)); } // si16
         public final function _mw8(addr:int, val:int):void { __asm(push(val), push(addr), op(0x3a)); } // si8
         public final function _mwf(addr:int, val:Number):void { __asm(push(val), push(addr), op(0x3d)); } // sf32
         public final function _mwd(addr:int, val:Number):void { __asm(push(val), push(addr), op(0x3e)); } // sf64
    }
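The difference between the _mru and _mrs readers can be modeled in plain C. The sketch below treats the memory band as a flat little-endian byte array; the names are hypothetical C stand-ins for the AS3 methods, not part of Alchemy. As far as I know li8/li16 zero-extend the loaded value, so the signed variants need an extra sign extension, which the casts to int8_t and int16_t perform here.

```c
#include <stdint.h>

static uint8_t band[64];  /* stand-in for the shared memory band */

static int32_t mru8(uint32_t a)  { return band[a]; }          /* like li8  */
static int32_t mrs8(uint32_t a)  { return (int8_t)band[a]; }  /* li8 + sign */
static int32_t mru16(uint32_t a) { return band[a] | (band[a + 1] << 8); }
static int32_t mrs16(uint32_t a) { return (int16_t)mru16(a); }

/* like li32: four bytes, least significant first */
static int32_t mr32(uint32_t a)
{
    return (int32_t)((uint32_t)band[a] | ((uint32_t)band[a + 1] << 8) |
                     ((uint32_t)band[a + 2] << 16) |
                     ((uint32_t)band[a + 3] << 24));
}

/* like si32 */
static void mw32(uint32_t a, int32_t v)
{
    uint32_t u = (uint32_t)v;
    band[a] = (uint8_t)u;
    band[a + 1] = (uint8_t)(u >> 8);
    band[a + 2] = (uint8_t)(u >> 16);
    band[a + 3] = (uint8_t)(u >> 24);
}
```

Writing -2 and reading it back through each helper shows why both signed and unsigned readers exist: the unsigned ones return 0xFE and 0xFFFE, the signed ones return -2.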
    

     

    As I pointed out earlier in this thread, even with -avm2-use-memuser you'll end up with ActionScript code that still contains parts that don't take advantage of the fast memory ops. MemUser is the most obvious candidate. But there are other places where you can replace gstate.ds.read/write calls with inline assembler code. Would it be worth your time? Maybe. It depends on how desperate you are for increased performance.

     

    I hope that with the information above the task of fine tuning your ActionScript using inline assembler instructions has become less mysterious.

     

    Best wishes,

     

    - Bernd

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Hmm, I find it hard to believe that the inline assembler version of MemUser would give you a slower SWF. I recommend using the script above for compiling modified AS files instead of patching gcc:

     

     

    #!/bin/bash
     
    SRC=`basename $1 ".as"`
    java -Xms256M -Xmx2048M -jar ${ALCHEMY_HOME}/bin/asc.jar -AS3 -strict -import ${ALCHEMY_HOME}/flashlibs/global.abc -import ${ALCHEMY_HOME}/flashlibs/playerglobal.abc -config Alchemy::Debugger=false -config Alchemy::NoDebugger=true -config Alchemy::Shell=false -config Alchemy::NoShell=true -config Alchemy::LogLevel=0 -config Alchemy::Vector=true -config Alchemy::NoVector=false -config Alchemy::SetjmpAbuse=false -swf cmodule.${SRC}.ConSprite,800,600,60 ${SRC}.as

     

     

    Would you please try that?

     

    Thanks,

     

    - Bernd

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Hello Boris,

     

    the script I provided creates a SWF and not a SWC. It sounds like you are building a SWC and then link that to your Flex/Flash app.

    Let's take this discussion temporarily offline and present the results after everything has been resolved (or not).

    Please contact me directly at bparadie at adobe dot com and send me a zip file of your AS file if you like.

    I'll have a look at it (but probably on Monday).

     

    Thanks!

     

    - Bernd

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Hello Boris,

     

    congratulations on getting the sound part working and thank you for summarizing your findings. I am sure a lot of folks appreciate the detailed information; at least I do. Let me just add that the source for Michael Rennie's Quake port can be found on GitHub:

     

    http://github.com/mkr3142/QuakeFlash/commits/master

     

    The compiled version is available at:

    http://www.newgrounds.com/portal/view/470460

     

    It seems that the original problem of making your SWF faster hasn't been resolved yet.

    That means: back to the drawing board! I will look at your AS code later.

     

    Best wishes,

     

    - Bernd

  • Seikent2 Calculating status... 5 posts since
    Jul 14, 2010

    Hey,

     

    I've been working with Alchemy for a while. I'm not an expert, but it is great to be able to use C++ code in Flash.

     

    This thread has been very interesting. I'm not as worried about fps as Boris, but I think it could become very important in the future.

     

    I have two questions and would be grateful for your answers:

     

    - I saw that the Quake project makefile has the -DFLASH -DNO_ASM flags. What are they? I guess something like "debug flash" and "don't use asm when debugging".

     

    - How do you set -avm2-use-memuser? I have seen that it is an llc flag, but I don't use llc explicitly: I produce the .swc using Alchemy's g++ and then include the .swc in a Flex project. I have the feeling that it is already being used, but I want to be sure.

     

    Thanks!

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Hello Seikent2,

     

    Boris is right: in order to set or unset the -avm2-use-memuser flag for llc you need to patch the gcc script, or copy gcc and modify that script. As far as I know -avm2-use-memuser is set by default and the reason why Boris changed those flags was more in the context of performance experiments. He found that the SWF gets significantly slower if you don't use that flag.

     

    In other words: the -avm2-use-memuser flag is set by default, so you don't have to do anything. Regarding your other question about -DFLASH and -DNO_ASM: those are CFLAGS used by Michael Rennie (the author of QuakeFlash) to cleanly separate his code modifications from the original Quake code. This is very good practice, and in my opinion Michael Rennie did a fantastic job of modifying the code and commenting the changes.
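As a rough illustration of how such flags separate code paths: passing -DNAME to gcc defines the preprocessor macro NAME (as 1), so source files can test it with #ifdef. The functions below are hypothetical and not taken from QuakeFlash; what NO_ASM actually switches off there is presumably the hand-written assembly, but check the makefile and sources to be sure.

```c
/* Built with -DFLASH: Flash-only changes are compiled in; the original
   code path is kept for native builds. */
static int platform_is_flash(void)
{
#ifdef FLASH
    return 1;
#else
    return 0;
#endif
}

/* Built with -DNO_ASM: use a portable C fallback instead of any
   hand-written assembly routine. */
static int uses_asm(void)
{
#ifdef NO_ASM
    return 0;
#else
    return 1;
#endif
}
```

Compiled without either flag, both functions report the native defaults; adding -DFLASH -DNO_ASM to CFLAGS flips them.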

     

    Here is the source code - in case somebody is wondering where it is located:

    http://github.com/mkr3142/QuakeFlash

     

    Best regards,

     

    - Bernd

  • Bernd Paradies Adobe Employee 471 posts since
    Jun 18, 2008

    Hello,

     

    on 6/19 Boris and I took this discussion offline with the plan to share the results of our findings later. He sent me his source code and I poked around a little bit. The results of his and my efforts are now summarized in this post:

     

    Are AS3 timers make use of multi processor cores?

    http://forums.adobe.com/message/2976678#2976678

     

    Best wishes,

     

    - Bernd
