Jun 15, 2010 12:27 PM
inline functions in C, gcc optimization and floating point arithmetic issues
Over the last several days I have become a real fan of Alchemy. After intensive testing, though, I have found several issues that I can't solve without help.
So...I'm porting an old game console emulator that I wrote in ANSI C. The code works with both gcc and Visual Studio without any modification or cross-compile macros. The only platform-specific code is the audio and video output, which is out of scope because I have ported audio and video within AS3.
Here are the issues:
1. Inline functions - Having even a single inline function makes the code work incorrectly (although it does not crash), regardless of the optimization level (-O0 or -O3). My current workaround is converting the inline functions to macros, which achieves the same effect. Any ideas why inline functions break the code?
2. Compiler optimizations - well, my project consists of many C files one of which is called flash.c and it contains the main and exported functions. I build the project as follows:
gcc -c flash.c -O0 -o flash.o //Please note the -O0 option!!!
gcc -c file1.c -O3 -o file1.o
gcc -c file2.c -O3 -o file2.o
... and so on
gcc *.o -swc -O0 -o emu.swc //Please note the -O0 option again!!!
mxmlc.exe -library-path+=emu.swc --target-player=10.0.0 Emu.as
for file in $( ls *.o ) //Removes the obj files
do
rm $file
done
If I define any option different from -O0 in gcc -c flash.c -O0 -o flash.o, the program stops working correctly, exactly as with the inline functions (but it still does not crash or print any errors in debug). flash.c has 4 static functions to be exported to AS3 and the main function. Do you know why?
If I define any option different from -O0 in gcc *.o -swc -O0 -o emu.swc, the program stops working correctly exactly as above, but if I specify -O1, -O2 or -O3 the SWC file gets smaller, up to 2x for -O3. Why? Is there a way to optimize all the obj files except flash.o? I suspect an issue similar to the one when compiling it.
3. Floating point issues - this is the worst one. My code is mainly based on integer arithmetic, but in one or two places it requires floating point arithmetic. One of them is the conversion of a 16-bit 44.1 kHz sound buffer to a float buffer with the same sample rate but with samples in the range from -1.0 to 1.0.
My code:
void audio_prepare_as()
{
uint32 i;
for(i=0;i<audioSamples;i+=2)
{
audiobuffer[i] = (float)snd.buffer[i]/32768;
audiobuffer[i+1] = (float)snd.buffer[i+1]/32768;
}
}
My audio playback works perfectly, but not when using the above conversion. I have inspected the float numbers: all incorrect and invalid. I tried other code with simple floats - same story. It is as if Alchemy refuses to work with floats. What is wrong? I have another place where I must resize the framebuffer and there I have a float involved - same problem. Please help!
Found the floating point problem: audiobuffer is written to a ByteArray and then read in AS3, and the C floats do not arrive in the form AS3 expects. The floating point issue is now resolved.
The optimization issues remain! I really need to speed up my code.
Thank you in advance!
Hello bkamenov,
you mention three problem areas trying to use Alchemy for your game.
I don't know whether this will be helpful, but let me throw out some suggestions and ideas.
RE: Inline functions. Would you please check whether those inline functions that cause trouble use llvm Standard C Library Intrinsics? Those intrinsic functions are memcpy, memmove, memset, sqrt, powi, sin, cos, and pow (for more details see http://llvm.org/docs/LangRef.html#int_libc). We have seen problems with those. You can avoid those problems by forcing llc to not inline those functions (see next section).
RE: Compiler optimizations. Would you please check whether your flash.c contains any of those Standard C Library Intrinsics (see above). If flash.c uses memcpy, memmove, or memset, then this workaround might work.
Copy and paste this snippet into flash.c after your includes:
static void * custom_memmove( void * destination, const void * source, size_t num ) {
void *result;
__asm__("%0 memmove(%1, %2, %3)\n" : "=r"(result) : "r"(destination), "r"(source), "r"(num));
return result;
}
static void * custom_memcpy ( void * destination, const void * source, size_t num ) {
void *result;
__asm__("%0 memcpy(%1, %2, %3)\n" : "=r"(result) : "r"(destination), "r"(source), "r"(num));
return result;
}
static void * custom_memset ( void * ptr, int value, size_t num ) {
void *result;
__asm__("%0 memset(%1, %2, %3)\n" : "=r"(result) : "r"(ptr), "r"(value), "r"(num));
return result;
}
#define memmove custom_memmove
#define memcpy custom_memcpy
#define memset custom_memset
RE: Floating point issues. We did see some problems with bit casting i64 values and this problem looks similar. In this case you might be able to get the correct results by using code that is less ambiguous about casting versus converting. Here is my suggestion:
void audio_prepare_as()
{
uint32 i;
float f0;
const float f1 = 32768;
for(i=0;i<audioSamples;i+=2)
{
f0 = snd.buffer[i];
audiobuffer[i] = f0/f1;
f0 = snd.buffer[i+1];
audiobuffer[i+1] = f0/f1;
}
}
Best regards,
- Bernd
Dear Bernd,
I am still unable to enable the optimizations or the inline functions. None of the inline functions contain any stdlib function, just pure assignments, reads, simple arithmetic and bitwise operations.
In fact, the file containing the main function and the functions exported to AS3 did have memset and memcpy. I tried your suggestion and put the code above the functions calling memset and memcpy. It did not work, so I put the code in a header which is included topmost in each C file. The only system header I use is malloc.h and it is included topmost. In another C file I use pow, sin and log10 from math.h, but I removed it and did the same thing:
//shared.h
#ifndef _SHARED_H_
#define _SHARED_H_
#include <malloc.h>
static void * custom_memmove( void * destination, const void * source, unsigned int num ) {
void *result;
__asm__("%0 memmove(%1, %2, %3)\n" : "=r"(result) : "r"(destination), "r"(source), "r"(num));
return result;
}
static void * custom_memcpy ( void * destination, const void * source, unsigned int num ) {
void *result;
__asm__("%0 memcpy(%1, %2, %3)\n" : "=r"(result) : "r"(destination), "r"(source), "r"(num));
return result;
}
static void * custom_memset ( void * ptr, int value, unsigned int num ) {
void *result;
__asm__("%0 memset(%1, %2, %3)\n" : "=r"(result) : "r"(ptr), "r"(value), "r"(num));
return result;
}
static float custom_pow(float x, int y) {
static double custom_sin(double x) {
static double custom_log10(double x) {
#define memmove custom_memmove
#endif /* _SHARED_H_ */
It still behaves the same way, as if nothing was changed (it works incorrectly: it displays junk which does not move, whereas the image is supposed to move).
As I am porting an emulator (Sega Mega Drive), I use many arrays of function pointers for implementing the opcodes of the CPUs. Could this be an issue?
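For readers unfamiliar with the pattern mentioned here, this is a minimal sketch of opcode dispatch through an array of function pointers. The CPU state, the three toy opcodes and all names are hypothetical and not taken from the emulator; a real 68000 core has a far larger table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU state; names are illustrative only. */
typedef struct {
    uint32_t acc;   /* accumulator */
    uint32_t pc;    /* program counter, index into the program */
} cpu_t;

typedef void (*op_handler)(cpu_t *cpu);

static void op_nop(cpu_t *cpu) { cpu->pc += 1; }
static void op_inc(cpu_t *cpu) { cpu->acc += 1; cpu->pc += 1; }
static void op_dbl(cpu_t *cpu) { cpu->acc <<= 1; cpu->pc += 1; }

/* One handler per opcode value. */
static const op_handler op_table[4] = { op_nop, op_inc, op_dbl, op_nop };

/* Run until the program counter passes the end of the program. */
static void cpu_run(cpu_t *cpu, const uint8_t *program, uint32_t length)
{
    assert(cpu != NULL && program != NULL);
    while (cpu->pc < length) {
        uint8_t opcode = program[cpu->pc] & 3u;  /* mask to table size */
        op_table[opcode](cpu);                   /* indirect call */
    }
}
```

Each emulated instruction costs one indirect call, so how well the target platform handles indirect calls matters for this style of core.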
I did a workaround for the floating point problem, but processing is very slow so I hear only bzzt bzzt; this is out of scope for now. The emulator compiled with gcc runs at 300 fps on a 1.3 GHz machine, whereas my non-optimized AVM2 code compiled by Alchemy produces 14 fps. The pure rendering is super fast and the problem lies in the computational power of the AVM. The frame buffer and the emulation are generated in the C code and only the pixels are copied to AS3, where they are plotted in a BitmapData. On a 2.0 GHz dual core I achieved only 21 fps. The goal is 60 fps to have smooth audio and video. But this is off topic. After all, everything works (slowly) without optimization, and I would like to turn it on somehow. Suggestions?
Here is the file with the main function:
#include "shared.h"
#include "AS3.h"
#define FRAMEBUFFER_LENGTH (320*240*4)
static uint8* framebuffer;
static uint32 audioSamples;
AS3_Val sega_rom(void* self, AS3_Val args)
{
int size, offset, i;
uint8 hardware;
uint8 country;
uint8 header[0x200];
uint8 *ptr;
AS3_Val length;
AS3_Val ba;
AS3_ArrayValue(args, "AS3ValType", &ba);
country = 0;
offset = 0;
length = AS3_GetS(ba, "length");
size = AS3_IntValue(length);
ptr = (uint8*)malloc(size);
AS3_SetS(ba, "position", AS3_Int(0));
AS3_ByteArray_readBytes(ptr, ba, size);
//FILE* f = fopen("boris_dump.bin", "wb");
//fwrite(ptr, size, 1, f);
//fclose(f);
if((size / 512) & 1)
{
size -= 512;
offset += 512;
memcpy(header, ptr, 512);
for(i = 0; i < (size / 0x4000); i += 1)
{
deinterleave_block(ptr + offset + (i * 0x4000));
}
}
memset(cart_rom, 0, 0x400000);
if(size > 0x400000) size = 0x400000;
memcpy(cart_rom, ptr + offset, size);
/* Free allocated file data */
free(ptr);
hardware = 0;
for (i = 0x1f0; i < 0x1ff; i++)
switch (cart_rom[i]) {
case 'U':
hardware |= 4;
break;
case 'J':
hardware |= 1;
break;
case 'E':
hardware |= 8;
break;
}
if (cart_rom[0x1f0] >= '1' && cart_rom[0x1f0] <= '9') {
hardware = cart_rom[0x1f0] - '0';
} else if (cart_rom[0x1f0] >= 'A' && cart_rom[0x1f0] <= 'F') {
hardware = cart_rom[0x1f0] - 'A' + 10;
}
if (country) hardware=country; //simple autodetect override
//From PicoDrive
if (hardware&8)
{
hw=0xc0; vdp_pal=1;
} // Europe
else if (hardware&4)
{
hw=0x80; vdp_pal=0;
} // USA
else if (hardware&2)
{
hw=0x40; vdp_pal=1;
} // Japan PAL
else if (hardware&1)
{
hw=0x00; vdp_pal=0;
} // Japan NTSC
else
hw=0x80; // USA
if (vdp_pal) {
vdp_rate = 50;
lines_per_frame = 312;
} else {
vdp_rate = 60;
lines_per_frame = 262;
};
/*SRAM*/
if(cart_rom[0x1b1] == 'A' && cart_rom[0x1b0] == 'R')
{
save_start = cart_rom[0x1b4] << 24 | cart_rom[0x1b5] << 16 |
cart_rom[0x1b6] << 8 | cart_rom[0x1b7] << 0;
save_len = cart_rom[0x1b8] << 24 | cart_rom[0x1b9] << 16 |
cart_rom[0x1ba] << 8 | cart_rom[0x1bb] << 0;
// Make sure start is even, end is odd, for alignment
// A ROM that I came across had the start and end bytes of
// the save ram the same and wouldn't work. Fix this as seen
// fit, I know it could probably use some work. [PKH]
if(save_start != save_len)
{
if(save_start & 1) --save_start;
if(!(save_len & 1)) ++save_len;
save_len -= (save_start - 1);
saveram = (unsigned char*)malloc(save_len);
// If save RAM does not overlap main ROM, set it active by default since
// a few games can't manage to properly switch it on/off.
if(save_start >= (unsigned)size)
save_active = 1;
}
else
{
save_start = save_len = 0;
saveram = NULL;
}
}
else
{
save_start = save_len = 0;
saveram = NULL;
}
return AS3_Int(0);
}
AS3_Val sega_init(void* self, AS3_Val args)
{
system_init();
audioSamples = (44100 / vdp_rate)*2;
framebuffer = (uint8*)malloc(FRAMEBUFFER_LENGTH);
return AS3_Int(vdp_rate);
}
AS3_Val sega_reset(void* self, AS3_Val args)
{
system_reset();
return AS3_Int(0);
}
AS3_Val sega_frame(void* self, AS3_Val args)
{
uint32 width;
uint32 height;
uint32 x, y;
uint32 di, si, r;
uint16 p;
AS3_Val fb_ba;
AS3_ArrayValue(args, "AS3ValType", &fb_ba);
system_frame(0);
AS3_SetS(fb_ba, "position", AS3_Int(0));
width = (reg[12] & 1) ? 320 : 256;
height = (reg[1] & 8) ? 240 : 224;
for(y=0;y<240;y++)
{
for(x=0;x<320;x++)
{
di = 1280*y + (x<<2);
si = (y << 10) + ((x + bitmap.viewport.x) << 1);
p = *((uint16*)(bitmap.data + si));
framebuffer[di + 3] = (uint8)((p & 0x1f) << 3);
framebuffer[di + 2] = (uint8)(((p >> 5) & 0x1f) << 3);
framebuffer[di + 1] = (uint8)(((p >> 10) & 0x1f) << 3);
}
}
AS3_ByteArray_writeBytes(fb_ba, framebuffer, FRAMEBUFFER_LENGTH);
AS3_SetS(fb_ba, "position", AS3_Int(0));
r = (width << 16) | height;
return AS3_Int(r);
}
AS3_Val sega_audio(void* self, AS3_Val args)
{
AS3_Val ab_ba;
AS3_ArrayValue(args, "AS3ValType", &ab_ba);
AS3_SetS(ab_ba, "position", AS3_Int(0));
AS3_ByteArray_writeBytes(ab_ba, snd.buffer, audioSamples*sizeof(int16));
AS3_SetS(ab_ba, "position", AS3_Int(0));
return AS3_Int(0);
}
int main()
{
AS3_Val romMethod = AS3_Function(NULL, sega_rom);
AS3_Val initMethod = AS3_Function(NULL, sega_init);
AS3_Val resetMethod = AS3_Function(NULL, sega_reset);
AS3_Val frameMethod = AS3_Function(NULL, sega_frame);
AS3_Val audioMethod = AS3_Function(NULL, sega_audio);
// construct an object that holds references to the functions
AS3_Val result = AS3_Object("sega_rom: AS3ValType, sega_init: AS3ValType, sega_reset: AS3ValType, sega_frame: AS3ValType, sega_audio: AS3ValType",
romMethod, initMethod, resetMethod, frameMethod, audioMethod);
// Release
AS3_Release(romMethod);
AS3_Release(initMethod);
AS3_Release(resetMethod);
AS3_Release(frameMethod);
AS3_Release(audioMethod);
// notify that we initialized -- THIS DOES NOT RETURN!
AS3_LibInit(result);
// should never get here!
return 0;
}
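The pixel loop in sega_frame can be factored into two small helpers, sketched below under some assumptions: the helper names are made up, the source pixel is taken to be 15-bit RGB as the shifts in the loop suggest, and the opaque alpha byte is an addition, since the posted loop never writes byte 0. One detail worth keeping in mind when computing offsets: in C, + binds tighter than <<, so an expression of the form a + x<<2 parses as (a + x) << 2.

```c
#include <assert.h>
#include <stdint.h>

#define FB_WIDTH  320
#define FB_HEIGHT 240

/* Byte offset of pixel (x, y) in a 32-bit-per-pixel framebuffer. */
static uint32_t pixel_offset(uint32_t x, uint32_t y)
{
    assert(x < FB_WIDTH && y < FB_HEIGHT);
    return (y * FB_WIDTH + x) * 4u;
}

/* Expand one 15-bit pixel into four bytes in A, R, G, B order.
   Alpha = 0xFF is an assumption; the posted loop leaves byte 0 unset. */
static void rgb555_to_argb(uint16_t p, uint8_t out[4])
{
    out[0] = 0xFF;
    out[1] = (uint8_t)(((p >> 10) & 0x1f) << 3);  /* red   */
    out[2] = (uint8_t)(((p >> 5) & 0x1f) << 3);   /* green */
    out[3] = (uint8_t)((p & 0x1f) << 3);          /* blue  */
}
```

The largest offset this produces is FRAMEBUFFER_LENGTH - 4, so a full 320x240 pass stays inside the malloc'ed buffer.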
Hello bkamenov,
I'll take a look at the code you were sharing tomorrow.
One thing I noticed: you inserted my code snippet with the memcpy/memset/memmove workarounds before the includes.
They should be after the last include.
Would you please try that?
Thanks,
- Bernd
Hello Bernd,
I have tried this, but the results are the same. However, I was able to enable the optimization in another way:
I put all my code in a single file flash.c and then used: "gcc flash.c -O2 -Wall -swc -o emu.swc". My code ran correctly, but still as slowly as before. The only difference is that the swf file became half the size. Inline functions still did not work, as before. Surprisingly, they are not even executed because of an if statement I use, so they are only part of the final code without ever being executed at runtime. Something is wrong with code generation in Alchemy when inline functions are present.
Finally, I suppose AVM2 is still more than 20x slower than native code. My code uses sin, pow and log10 only on init, and at runtime it uses only reads, assignments, simple arithmetic and bitwise operations. No stdlib functions are used at runtime, not even a single memset or memcpy in the core emulation code. So, as I said, the AVM still has poor performance for real-time apps, being more than 20x slower than a native application in my test. Even a debug build in Visual Studio without any optimization or inline functions, drawing in software mode, is still 9 times faster than AVM2.
I really had a hope for bringing Sega Mega Drive to browser multiplayer gaming...
Here is my example without controls though... just a speed test
http://www.mailaufzeit.de/flash.html
Would all users please report the FPS count and the processor you are using.
Very cool!
I get 13-14 fps in Chrome on OSX. Not too shabby.
If you want to do some performance tuning it might be worth setting ACHACKS_TMPS=1 and studying the resulting *.achacks.as file.
You can experiment and manually fine tune the generated ActionScript code and recompile it using parts of the gcc script.
This script might help, pass your as file as the first parameter:
#!/bin/bash
SRC=`basename $1 ".as"`
java -Xms256M -Xmx2048M -jar ${ALCHEMY_HOME}/bin/asc.jar -AS3 -strict -import ${ALCHEMY_HOME}/flashlibs/global.abc -import ${ALCHEMY_HOME}/flashlibs/playerglobal.abc -config Alchemy::Debugger=false -config Alchemy::NoDebugger=true -config Alchemy::Shell=false -config Alchemy::NoShell=true -config Alchemy::LogLevel=10 -config Alchemy::Vector=true -config Alchemy::NoVector=false -config Alchemy::SetjmpAbuse=false -swf cmodule.${SRC}.ConSprite,800,600,60 ${SRC}.as
# open ${SRC}.swf
Good luck!
- Bernd
Many thanks, but optimizing the generated code is too much effort. I need a speed of at least 60 fps, which is a 3x speedup. I do not think that manual edits will do the job...and the C code is ~50 000 lines of code...too much.
Btw, do you like the idea of the emulator?
Yes, I do like the idea of your emulator!
I also understand that you feel that optimizing the generated code is too much work and probably not worth the effort. But 3x speedup might be within reach. For example if you compile your project without llc -avm2-use-memuser your program will slow down by a significant factor - I would estimate between 5x to 10x. The reason is that -avm2-use-memuser tells the Alchemy backend to use fast memory ops instead of slow ByteArray calls.
I would encourage you to search your generated ActionScript file for "gstate.ds.write" and "gstate.ds.read" and you'll see that there are still a lot of places in your ActionScript file that can be replaced with _asm() instructions. I bet that you'll get a much faster SWF just by reimplementing MemUser in "ActionScript assembler".
I don't have a lot of time right now. But I can post an assembler implementation of MemUser if anybody is interested.
Regards,
- Bernd
Wow, you are telling very interesting stories...
Unfortunately, I did not understand a word. What should be reimplemented, where, and how? Please provide a step-by-step description! How do I compile with -avm2-use-memuser - again step by step?
What I found in the generated file is:
public class MemUser
{
public final function _mr32(addr:int):int { gstate.ds.position = addr; return gstate.ds.readInt(); }
public final function _mru16(addr:int):int { gstate.ds.position = addr; return gstate.ds.readUnsignedShort(); }
public final function _mrs16(addr:int):int { gstate.ds.position = addr; return gstate.ds.readShort(); }
public final function _mru8(addr:int):int { gstate.ds.position = addr; return gstate.ds.readUnsignedByte(); }
public final function _mrs8(addr:int):int { gstate.ds.position = addr; return gstate.ds.readByte(); }
public final function _mrf(addr:int):Number { gstate.ds.position = addr; return gstate.ds.readFloat(); }
public final function _mrd(addr:int):Number { gstate.ds.position = addr; return gstate.ds.readDouble(); }
public final function _mw32(addr:int, val:int):void { gstate.ds.position = addr; gstate.ds.writeInt(val); }
public final function _mw16(addr:int, val:int):void { gstate.ds.position = addr; gstate.ds.writeShort(val); }
public final function _mw8(addr:int, val:int):void { gstate.ds.position = addr; gstate.ds.writeByte(val); }
public final function _mwf(addr:int, val:Number):void { gstate.ds.position = addr; gstate.ds.writeFloat(val); }
public final function _mwd(addr:int, val:Number):void { gstate.ds.position = addr; gstate.ds.writeDouble(val); }
}
and those methods are used in each push and pop! I am waiting impatiently for your answer.
Thanks in advance!
Bernd, your improvements sound great, I'm looking forward to reading more from you!
... waiting impatiently as well
Regards
Bastian
Some update - I'm always using -avm2-use-memuser, but for fun I tried without it and slow down was slightly above 3x! Would this mean that the ASM reimplementation may reach at least 3x speed up?
We are waiting even more impatiently for your ASM code and instructions on how to use it.
Regards
Boris
Hello everybody and thanks for your interest in this perhaps rather exotic Alchemy topic.
I was planning on writing up a separate post that goes into more details about what I call the "Alchemy assembler language" but I might not be able to get to that in the next few days. Instead of letting you guys wait forever I am throwing this short version over the fence. Please keep in mind that the "Alchemy assembler language" deserves a much better write up than I am going to do now.
That said, here is my short version: Alchemy does not just translate C code to ActionScript. As part of that transformation process every function becomes a finite state machine (FSM), which uses the same continuous large memory block (similar to the Turing Machine) for allocating new objects and passing parameters. In this post I won't go into detail why FSMs are necessary. Scott explains that in his talk at the 2008 LLVM Dev Conference, please watch this talk:
Flash C Compiler: Compiling C code to the Adobe Flash Virtual Machine
http://llvm.org/devmtg/2008-08/
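To make the idea of "a function as a state machine over one shared memory block" more concrete, here is a hand-written C sketch. It is only an illustration of the concept and not what the Alchemy backend actually emits; the slot layout, state names and function names are all invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* One shared memory block used for parameters and locals. */
static int32_t memory[16];

enum { ST_INIT, ST_LOOP, ST_DONE };

typedef struct {
    int state;     /* current state of the machine */
    int32_t base;  /* index of this machine's slots in `memory` */
} sum_fsm;

/* Advance the machine by one step; returns 1 while work remains.
   Slots: base+0 = n (input), base+1 = i, base+2 = running total. */
static int sum_fsm_step(sum_fsm *m)
{
    int32_t *slot = &memory[m->base];
    switch (m->state) {
    case ST_INIT:
        slot[1] = 1;
        slot[2] = 0;
        m->state = ST_LOOP;
        return 1;
    case ST_LOOP:
        if (slot[1] > slot[0]) {
            m->state = ST_DONE;
            return 0;
        }
        slot[2] += slot[1];
        slot[1] += 1;
        return 1;
    default:
        return 0;
    }
}

/* Same result as: for (i = 1; i <= n; i++) total += i; */
static int32_t sum_to(int32_t n)
{
    sum_fsm m = { ST_INIT, 0 };
    assert(n >= 0);
    memory[0] = n;                 /* pass the parameter via memory */
    while (sum_fsm_step(&m)) { }   /* drive the machine to completion */
    return memory[2];
}
```

Because all state lives in the struct and the memory block, the driver loop could stop after any step and resume later, which is the property a state machine gives you over a plain function call.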
Now, Alchemy offers two compile switches that drive how that continuous large memory block will be accessed by all FSMs. With the -avm2-use-memuser option (which is the default, because it is faster than the other option) the Alchemy LLVM backend will generate ActionScript source code that contains inline assembler instructions for ultra-fast memory access. Before I explain those low level memory ops let me point out two important things:
1. Only Flash Player versions 10 and higher and AIR 1.5 and higher support those fast memory ops.
2. Only the Alchemy version of asc.jar is capable of compiling inline assembler instructions into ABC and SWF.
If you don't specify -avm2-use-memuser option the Alchemy LLVM backend will generate ActionScript source code that uses regular ActionScript ByteArray operations for reading and writing to memory. That method is significantly slower as Boris has pointed out.
After this introduction let's jump into the details of the "Alchemy assembler language" with regards to reading from and writing to the memory "band".
The memory op codes are as follows:
li8 0x35 load integer, 8 bits
li16 0x36 load integer, 16 bits
li32 0x37 load integer, 32 bits
lf32 0x38 load float, 32 bits
lf64 0x39 load double, 64 bits
si8 0x3a store integer, 8 bits
si16 0x3b store integer, 16 bits
si32 0x3c store integer, 32 bits
sf32 0x3d store float, 32 bits
sf64 0x3e store double, 64 bits
The inline assembler instructions for reading and writing to the memory band are:
Read i32 from ByteArray[addr]:
__xasm<int>(push(addr), op(0x37));
Write i32 val to ByteArray[addr]:
__asm(push(val), push(addr), op(0x3c));
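As a mental model, the load and store ops can be pictured in plain C as reads and writes on a byte array. The sketch below is not Alchemy code: the function names are invented, and it assumes that the memory block is little endian and that a raw 16-bit load zero-extends, so a signed read needs an explicit sign extension on top.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a 32-bit store: least significant byte first. */
static void store_i32(uint8_t *mem, size_t addr, uint32_t val)
{
    assert(mem != NULL);
    mem[addr]     = (uint8_t)(val);
    mem[addr + 1] = (uint8_t)(val >> 8);
    mem[addr + 2] = (uint8_t)(val >> 16);
    mem[addr + 3] = (uint8_t)(val >> 24);
}

/* Model of a 32-bit load. */
static uint32_t load_i32(const uint8_t *mem, size_t addr)
{
    assert(mem != NULL);
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* Model of a raw 16-bit load: the result is zero-extended. */
static uint32_t load_u16(const uint8_t *mem, size_t addr)
{
    assert(mem != NULL);
    return (uint32_t)mem[addr] | ((uint32_t)mem[addr + 1] << 8);
}

/* Signed 16-bit read: raw load followed by sign extension. */
static int32_t load_s16(const uint8_t *mem, size_t addr)
{
    uint32_t v = load_u16(mem, addr);
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}
```

Under these assumptions the bytes FE FF read back as 65534 through load_u16 but as -2 through load_s16, which is the difference between an unsigned and a signed 16-bit read.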
With that information you can now write your own assembler version of MemUser:
public class MemUser
{
public final function _mr32(addr:int):int { return __xasm<int>(push(addr), op(0x37)); } // li32
public final function _mru16(addr:int):int { return __xasm<int>(push(addr), op(0x36)); } // li16
public final function _mrs16(addr:int):int { return __xasm<int>(push(addr), op(0x36)); } // li16
public final function _mru8(addr:int):int { return __xasm<int>(push(addr), op(0x35)); } // li8
public final function _mrs8(addr:int):int { return __xasm<int>(push(addr), op(0x35)); } // li8
public final function _mrf(addr:int):Number { return __xasm<Number>(push(addr), op(0x38)); } // lf32
public final function _mrd(addr:int):Number { return __xasm<Number>(push(addr), op(0x39)); } // lf64
public final function _mw32(addr:int, val:int):void { __asm(push(val), push(addr), op(0x3c)); } // si32
public final function _mw16(addr:int, val:int):void { __asm(push(val), push(addr), op(0x3b)); } // si16
public final function _mw8(addr:int, val:int):void { __asm(push(val), push(addr), op(0x3a)); } // si8
public final function _mwf(addr:int, val:Number):void { __asm(push(val), push(addr), op(0x3d)); } // sf32
public final function _mwd(addr:int, val:Number):void { __asm(push(val), push(addr), op(0x3e)); } // sf64
}
As I pointed out earlier in this thread, even when specifying -avm2-use-memuser you'll end up with ActionScript code that still contains parts that don't take advantage of the fast memory ops. MemUser is the most obvious candidate. But there are other places where you can replace gstate.ds.read/write calls with inline assembler code. Would it be worth your time? Maybe. It depends on how desperate you are for increasing performance.
I hope that with the information above the task of fine tuning your ActionScript using inline assembler instructions has become less mysterious.
Best wishes,
- Bernd
I have changed the MemUser class with the ASM version and I got 4 fps slower SWF.
My AS file is called 1620.achacks.as so
I modified the gcc compilation to:
#sys(@llc, "-o=".($last = "1620.achacks.as"), @ll, @oo); //Do not create the as file to avoid overwrite
$last = "1620.achacks.as";
And renamed all $$.achacks strings to 1620.achacks
gstate.ds.write is referenced extra in memset, memmove and memcpy. But these functions are used only on init where speed is irrelevant.
It is terrible...
Thank you Bernd
Hmm, I find it hard to believe that the inline assembler version of MemUser will give you a slower SWF. I recommend using the script above for compiling modified AS files instead of patching gcc:
#!/bin/bash
SRC=`basename $1 ".as"`
java -Xms256M -Xmx2048M -jar ${ALCHEMY_HOME}/bin/asc.jar -AS3 -strict -import ${ALCHEMY_HOME}/flashlibs/global.abc -import ${ALCHEMY_HOME}/flashlibs/playerglobal.abc -config Alchemy::Debugger=false -config Alchemy::NoDebugger=true -config Alchemy::Shell=false -config Alchemy::NoShell=true -config Alchemy::LogLevel=0 -config Alchemy::Vector=true -config Alchemy::NoVector=false -config Alchemy::SetjmpAbuse=false -swf cmodule.${SRC}.ConSprite,800,600,60 ${SRC}
Would you please try that?
Thanks,
- Bernd
Dear Bernd,
please provide the whole bash script, from compiling the modified AS file to the creation of the SWC. I am a Windows guy with no knowledge of bash and its black magic. I was able to build with Alchemy only by strictly following the instructions that the manual and you provide. Would you take a look at the AS files if I send them to you by email? It would be very kind.
Regards
Boris
Hello Boris,
the script I provided creates a SWF and not a SWC. It sounds like you are building a SWC and then link that to your Flex/Flash app.
Let's take this discussion temporarily offline and present the results after everything has been resolved (or not).
Please contact me directly at bparadie at adobe dot com and send me a zip file of your AS file if you like.
I'll have a look at it (but probably on Monday).
Thanks!
- Bernd
Dear friends,
there is some news regarding the issues discussed in this topic:
1. Floating point - there is no error when working with floats/doubles, as I initially thought. I wanted to write a float array to an AS byte array in C:
static AS3_Val sega_audio(void* self, AS3_Val args)
{
uint32 i;
AS3_Val ab_ba;
AS3_ArrayValue(args, "AS3ValType", &ab_ba);
AS3_SetS(ab_ba, "position", AS3_Int(0));
for(i=0;i<audioSamples;i++)
audiobuffer[i] = (float)snd.output[i] / 32768.0f;
AS3_ByteArray_writeBytes(ab_ba, audiobuffer, audioSamples*sizeof(float));
AS3_SetS(ab_ba, "position", AS3_Int(0));
return AS3_Int(0);
}
The result was noisy, incorrect sound. Inspecting the floats showed incorrect numbers. Thanks to Bernd, who pointed me to the code of Doom ported to Alchemy, I discovered that byte order was the problem: a Flash ByteArray reads multi-byte values as big endian by default, while the C code stores floats in little endian, so every float arrived byte-reversed. All the hours spent trying to fix the audio code were in vain. The solution was a simple float reverse function:
STATIC_INLINE float FloatSwap (float f)
{
union
{
float f;
uint8 b[4];
} dat1, dat2;
dat1.f = f;
dat2.b[0] = dat1.b[3];
dat2.b[1] = dat1.b[2];
dat2.b[2] = dat1.b[1];
dat2.b[3] = dat1.b[0];
return dat2.f;
}
static AS3_Val sega_audio(void* self, AS3_Val args)
{
uint32 i;
AS3_Val ab_ba;
AS3_ArrayValue(args, "AS3ValType", &ab_ba);
AS3_SetS(ab_ba, "position", AS3_Int(0));
for(i=0;i<audioSamples;i++)
audiobuffer[i] = FloatSwap((float)snd.output[i] / 32768.0f);
AS3_ByteArray_writeBytes(ab_ba, audiobuffer, audioSamples*4);
AS3_SetS(ab_ba, "position", AS3_Int(0));
return AS3_Int(0);
}
So, my floating point problems were RESOLVED!
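A variation on the same fix, sketched here with hypothetical helper names: instead of keeping a byte-swapped bit pattern in a float variable, the sample can be serialized straight into a byte buffer in big-endian order. This assumes float is a 32-bit IEEE 754 value. Another option, on the ActionScript side, would be to set the ByteArray's endian property to Endian.LITTLE_ENDIAN before reading, which avoids swapping in C altogether.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write one float to dst as four bytes, most significant byte first. */
static void write_float_be(uint8_t *dst, float f)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);   /* reinterpret without aliasing issues */
    dst[0] = (uint8_t)(bits >> 24);
    dst[1] = (uint8_t)(bits >> 16);
    dst[2] = (uint8_t)(bits >> 8);
    dst[3] = (uint8_t)(bits);
}

/* Convert signed 16-bit samples to big-endian floats in [-1.0, 1.0).
   dst must hold at least 4 * count bytes. */
static void samples_to_be_floats(const int16_t *src, uint8_t *dst, uint32_t count)
{
    uint32_t i;
    for (i = 0; i < count; i++)
        write_float_be(dst + 4u * i, (float)src[i] / 32768.0f);
}
```

The resulting byte buffer can then be handed to AS3_ByteArray_writeBytes in the same way as audiobuffer above.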
2. Inline functions - Well, I was able to resolve the issue of code not working with inlines for one particular project (see above). Actually, there are two solutions, although the second is not very useful for large projects with many inline functions because you will spend a lifetime correcting the code by hand, which is also error prone. Here are the solutions:
Solution 1: Keep your includes as clean as possible! Do not make a single header that includes all others and then use it everywhere. This may not be a problem for MSVC or normal GCC, but for the Alchemy version it might (but need not!) be a problem.
Solution 2: Convert your inline functions to macros. This achieves the same speed as inline functions. But keep in mind that inlines are functions, so their arguments are copied on each call, which a macro does not do! Another problem arises for inline functions that return a value and have more than one statement. You must then rewrite them, and possibly make all the needed changes in your code where you call the function (now a macro). This may be painful - please believe me!
This issue was RESOLVED, too! But please see point 3 below!
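The argument-copying difference between an inline function and a macro is easy to trip over, so here is a small self-contained illustration. The names are made up for this example and have nothing to do with the emulator code.

```c
#include <assert.h>

/* Function-like macro: the argument text is pasted in twice. */
#define SQUARE_MACRO(x) ((x) * (x))

/* Inline function: the argument is evaluated once and copied. */
static inline int square_inline(int x) { return x * x; }

static int calls = 0;
static int next_value(void) { return ++calls; }  /* side effect per call */

/* Returns 1: next_value() runs once, yielding 1, and 1 * 1 = 1. */
static int demo_inline(void)
{
    int r;
    calls = 0;
    r = square_inline(next_value());
    assert(calls == 1);
    return r;
}

/* Returns 2: next_value() runs twice, yielding 1 and 2, and 1 * 2 = 2. */
static int demo_macro(void)
{
    int r;
    calls = 0;
    r = SQUARE_MACRO(next_value());
    assert(calls == 2);
    return r;
}
```

When converting an inline function to a macro, any call site that passes an expression with side effects has to be rewritten to store the value in a temporary first.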
3. GCC optimization problems - here things are still not really clear to me. I am porting a Sega Mega Drive emulator to Flash, which is still very slow. As an experiment I ported an emulator for the predecessor of the Mega Drive, the Sega Master System. Initially, it was slow, too. I was compiling each C file separately with -O3 and linking them together into a SWC which is later used in the actual AS project. It was again slow. I turned on the inline functions - again slow. Then I included all C files in the main.c file and tried again, compiling with:
gcc main.c -O3 -Wall -swc -o emu.swc
A great speedup was obvious, but still not enough to have smooth graphics and sound. Then I discovered that my approach to the game (main) loop was wrong. Basically it was:
private var thread:Timer = new Timer(1);
thread.start();
thread.addEventListener(TimerEvent.TIMER, run);
private function run(e:TimerEvent):void
{
if(is time so that this will run at 60 fps)
{
//One frame is done here
}
else
return;
}
This is WRONG!!!
Basically, a game must run at 50 or 60 fps depending on whether it uses the PAL or NTSC TV standard. After looking at the code of Doom I saw that they do something else to draw the frames:
addEventListener(Event.ENTER_FRAME, onFrame);
private function onFrame(event:Event):void
{
//One frame is done here
}
So I did the same. But the frame rate never exceeded 24 fps. Then I discovered that a sprite object has a stage member where you can get/set frameRate. I printed it and it was 24. So before executing the game I set it to 50 or 60 depending on the TV standard. And it worked! Full speed and sound and everything (but still everything included in one single file) and built with:
gcc main.c -O3 -Wall -swc -o emu.swc
Compiling separately leads to slow code.
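The 50/60 fps requirement ties directly to values the C side already computes: the frame rate, the scanlines per frame, and the number of audio values produced per frame. A small sketch, using a hypothetical struct and function name but the same numbers as sega_rom and sega_init above:

```c
#include <assert.h>
#include <stdint.h>

#define AUDIO_RATE 44100u   /* output sample rate in Hz */

typedef struct {
    uint32_t frame_rate;         /* frames per second */
    uint32_t lines_per_frame;    /* scanlines per frame */
    uint32_t samples_per_frame;  /* audio values per frame, 2 channels */
} frame_timing;

/* Timing parameters for PAL (is_pal != 0) or NTSC (is_pal == 0). */
static frame_timing timing_for(int is_pal)
{
    frame_timing t;
    t.frame_rate = is_pal ? 50u : 60u;
    t.lines_per_frame = is_pal ? 312u : 262u;
    /* Both rates divide 44100 evenly, so no samples are lost to rounding. */
    assert(AUDIO_RATE % t.frame_rate == 0);
    t.samples_per_frame = (AUDIO_RATE / t.frame_rate) * 2u;
    return t;
}
```

If the player calls the frame function less often than frame_rate times per second, fewer than 44100 sample periods are produced per second, which is consistent with audio that plays correctly but slowly.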
Anyway, I was very happy and spammed Bernd with cheerful messages. I then tried the same with the Mega Drive emulator but in reverse order: stop using the thread timer, fix the sound, and then include everything in main.c and compile it.
No visible speed change was detected after changing to ENTER_FRAME, but the sound worked (slow but correct). After including everything in the main file, the code compiled without errors or warnings but stopped working (no image, no sound), though it was still running. Then I reverted to separate compilation but with inline functions enabled. NO SPEED UP! On native platforms enabling them gives at least a 2x faster run. With -O3 turned on - a 6x speedup. Here, nothing. I build the code with a file containing the following:
echo Compilling main.c
gcc -c main.c -O3 -o main.o
echo Compilling fm.c
gcc -c fm.c -O3 -o fm.o
echo Compilling genesis.c
gcc -c genesis.c -O3 -o genesis.o
echo Compilling input.c
gcc -c input.c -O3 -o input.o
echo Compilling io.c
gcc -c io.c -O3 -o io.o
echo Compilling loadrom.c
gcc -c loadrom.c -O3 -o loadrom.o
echo Compilling m68kcpu.c
gcc -c m68kcpu.c -O3 -o m68kcpu.o
echo Compilling m68kopac.c
gcc -c m68kopac.c -O3 -o m68kopac.o
echo Compilling m68kopdm.c
gcc -c m68kopdm.c -O3 -o m68kopdm.o
echo Compilling m68kopnz.c
gcc -c m68kopnz.c -O3 -o m68kopnz.o
echo Compilling m68kops.c
gcc -c m68kops.c -O3 -o m68kops.o
echo Compilling mem68k.c
gcc -c mem68k.c -O3 -o mem68k.o
echo Compilling membnk.c
gcc -c membnk.c -O3 -o membnk.o
echo Compilling memvdp.c
gcc -c memvdp.c -O3 -o memvdp.o
echo Compilling memz80.c
gcc -c memz80.c -O3 -o memz80.o
echo Compilling render.c
gcc -c render.c -O3 -o render.o
echo Compilling sn76496.c
gcc -c sn76496.c -O3 -o sn76496.o
echo Compilling sound.c
gcc -c sound.c -O3 -o sound.o
echo Compilling system.c
gcc -c system.c -O3 -o system.o
echo Compilling vdp.c
gcc -c vdp.c -O3 -o vdp.o
echo Compilling z80.c
gcc -c z80.c -O3 -o z80.o
echo Building SWC file...
gcc *.o -Wall -swc -O3 -o emu.swc
echo Building SWF file...
mxmlc.exe -library-path+=emu.swc --target-player=10.0.0 -default-size 320 240 Emu.as
echo Removing temporary data...
for file in $( ls *.o )
do
rm $file
done
echo Ready!
Is this correct? Ideas how to make it work faster?
Bernd, did you have time to check the AS code?
Hello Boris,
congratulations for getting the sound part working and thank you for summarizing your findings. I am sure a lot of folks appreciate the detailed information - at least I do. Let me just add that the source for Michael Rennie's Quake port can be found at github:
http://github.com/mkr3142/QuakeFlash/commits/master
The compiled version is available at:
http://www.newgrounds.com/portal/view/470460
It seems that the original problem of making your SWF faster hasn't been resolved yet.
That means: Back to drawing board! I will look at your AS code later.
Best wishes,
- Bernd
Hey,
I've been working with Alchemy for a while. I'm not an expert, but it is good to be able to work with C++ code in Flash.
This thread has been very interesting. I'm not as worried about fps as Boris, but I think that in the future it could be very important.
I have 2 questions, and I would be grateful for your answers:
- I saw the quake project makefile and it has the -DFLASH -DNO_ASM flags, what are they? I guess that something like "debug flash" and "no use asm if debug".
- How do you set the -avm2-use-memuser ? I mean, I have seen that it is a llc flag, but I don't use llc explicitly, I produce the .swc using alchemy's g++ and then I include the .swc in a flex project. I have the feeling that it is being already used, but I want to be sure.
Thanks!
Hello,
about your second question:
You are using llc correctly. If you take a look in your "alchemy\achacks" dir, you'll find a file called "gcc". There llc is called with all the options you do not write explicitly.
And yes, I think writing code in C/C++ for the web is very cool. I, personally, hate Java and never succeeded in making use of MS Silverlight. Alchemy is just so cute! Bravo to the team which created it and shared it with us!
Hello Seikent2,
Boris is right: in order to set or unset the -avm2-use-memuser flag for llc you need to patch the gcc script, or copy gcc and modify that script. As far as I know -avm2-use-memuser is set by default and the reason why Boris changed those flags was more in the context of performance experiments. He found that the SWF gets significantly slower if you don't use that flag.
In other words: the -avm2-use-memuser flag is set by default. You don't have to do anything. In regards to your other question about -DFLASH and -DNO_ASM. Those are CFLAGS used by Michael Rennie (the author of QuakeFlash) in order to cleanly separate his code modifications from the original Quake code. This is very good practice and in my opinion Michael Rennie did a fantastic job of modifying the code and commenting the changes.
Here is the source code - in case somebody is wondering where it is located:
http://github.com/mkr3142/QuakeFlash
Best regards,
- Bernd
Hello,
on 6/19 Boris and I took this discussion offline with the plan to share the results of our findings later. He sent me his source code and I poked around a little bit. The results of his and my efforts are now summarized in this post:
Are AS3 timers make use of multi processor cores?
http://forums.adobe.com/message/2976678#2976678
Best wishes,
- Bernd