[CUDA] RC6 / Serpent / Twofish / AES / MARS

AW: [CUDA] RC6 / Serpent

Okay. Ich packe dir auch den RC6 dazu. Ich habe nämlich dort jetzt auch wirklich vor ALLE cuda Aufrufe ein HandleError geschrieben und dabei ist noch 1 Fehler in Zeile 150 rausgekommen. Der müsste in RC6 noch weg, und in Serpent eben das Problem mit dem key ...

Bitte verrate mir dann auch deine Grafik-Hardware und deine Performance-Ergebnisse! Würde mich echt interessieren. Langsamer als mein Laptop hier kann's ja nicht mehr sein. ;)

Anhang anzeigen RC6_Serpent.zip
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent

Also in deinem RC6 Code sind mir noch zwei Fehler aufgefallen. Einmal ist deine Berechnung für den Datendurchsatz falsch. Du hast hinten stehten elapsedTime*1000 aber von Millisekunden zu Sekunden musst du natürlich elapsedTime/1000 rechnen. Außerdem kommt bei "After decryption" nicht wieder überall 0 raus, da du in deiner rc6dec funktion nicht rc6_dec sondern rc6_enc aufrufst. Statt einmal zu verschlüsseln und wieder zu entschlüsseln hast du 2mal verschlüsselt. Wenn ich das änder dann scheint aber alles zu laufen. Performance auf meiner GTX 560 Ti (zufälligerweise genau die Karte in dem System wo du das nacher testen willst) sind 1097MB/s.

Jetzt zu Serpent. In Zeile 326 und 327 fehlt noch das __device__ damit die Warnings weg gehen. Außerdem gleicher Fehler bei der Berechung des Datendurchsatzes. Der andere Fehler warum es abstürzt ist wohl bei der Größe von ks_var. Leider kann ich da grade auch keinen genauen Wert finden wie groß ks_var sein muss aber 1024 Elemente scheint zu reichen. Da werden wohl ähnlich wie bei RC6 mehrere Rundenschlüssel generiert weshalb 32 Byte nicht reichen. Performance ist bei mir 4675 MB/s.
 
AW: [CUDA] RC6 / Serpent

Alter Schwede, wenn ich in RC6 noch die beiden von dir genannten Fehler ausbessere, komme ich auf eine völlig utopische Performance:

snapshot28.png

^^ Da wäre ja meine 9600M GT um Lichtjahre schneller als deine GTX 560 Ti. Da stimmt noch irgendwas nicht?!

Und warnings gibt's nach den beiden Änderungen auch, die ich zuvor nicht hatte:

snapshot29.png
-----------

In Serpent komme ich nach dem Ausbessern der Fehler auf eine ähnlich unrealistische Performance:

snapshot30.png

:huh:

[Edit]
Sehe gerade, dass beide Codes bei mir die exakt selbe, falsche Performance liefern. Also haben wohl beide Codes bei mir noch 1 und den selben Fehler?!

[Edit2]
Bei weiteren Durchläufen erscheinen jetzt bei keinem Code mehr warnings, aber die Performance bleibt bei beiden völlig unrealistisch.
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent

Ich vermute es liegt daran, dass deine GPU gar keine 1024 Threads pro Block unterstützt und der Kernel deshalb gar nicht startet. Mich wundert nur, dass es keine Fehlermeldung deshalb gibt. Mach mal ein HandleError(cudaGetLastError()); nach den beiden Kernelaufrufen. Du solltest jedenfalls auch mal mit weniger Threads pro Block probieren.
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent

Habe jetzt noch bei beiden Codes die zusätzliche HandleError Zeile eingefügt und 512 Threads sowie 12800 Blocks eingestellt.

Serpent läuft damit scheinbar fehlerfrei durch und liefert folgendes durchaus realistisches Ergebnis:

snapshot33.png

RC6 liefert in Zeile 178 folgenden Fehler:

snapshot34.png

^^ Habe ich da immer noch irgendwelche Daten vergessen in den VRAM zu kopieren, oder was heißt das diesmal?
---------

BTW: Kannst du eigentlich deine Performance-Ergebnisse reproduzieren? Wenn du die Codes jeweils 10 Mal hintereinander laufen lässt, kriegst du dann immer exakt 1097 MB/s bzw. 4675 MB/s raus, oder unterschiedliche aber sehr ähnliche Werte?

Bei mir war's im fehlerhaften RC6 so, dass ich im Bereich von 5 MB/s - 75 MB/s alles mögliche rausgekriegt habe, und jetzt im "fehlerfreien" Serpent immer ganz exakt 22 MB/s rauskriege. RC6 läuft ja jetzt, nach den Korrekturen, noch nicht wieder. :(
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent

Wenn du die beiden Sachen zu RC6 behoben hast und sonst der Code ist wie im Archiv kann es eigentlich nur an der Hardware liegen, die irgendwas nicht unterstützt. Dazu habe ich den Code nochmal auf meinem Laptop mit einer nVidia NVS 3100M (unterstützt Compute Capability 1.2 also noch immer etwas neuer als deine GPU) probiert aber selbst dort läuft alles mit 512 Threads pro Block. Ich kann mir also nicht erklären woher der Fehler kommt. Ich werde aber nochmal versuchen herauszufinden ob sich irgendwas wichtiges zwischen Compute Capability 1.0 und 1.2 geändert hat. Ansonsten musst du dich damit abfinden, dass es auf deinem PC wohl nicht läuft aber dann auf dem späteren Testsystem schon.

Die Performance Ergebnisse sind bei mir ziemlich reproduzierbar. Jetzt auf dem Laptop kriege ich eine Performance von konstant 40MB/s bei RC6 aber wenn ich mir ein paar mehr Nachkommastellen anzeigen lasse sieht man, dass es leicht schwankt:
Performance: 40.271107 MB/sec
Performance: 40.271900 MB/sec
Performance: 40.271191 MB/sec

Edit: Du kannst mal versuchen dein RC6 Programm mit cuda-memcheck zu prüfen. Dazu einfach mit einer Console in den Ordner wechseln wo das fertig compilierte Programm ist und "cuda-memcheck *programm name*" eintippen. Meistens hat unspecified launch failure irgendwas mit fehlerhaften Speicherzugriffen zu tun. Eventuell bietet deine Entwicklungsumgebung auch direkt möglichkeiten an Code zu debuggen. Das wäre in jedem Fall nötig jetzt, da ich den Fehler nicht reproduzieren kann.
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent

Es geht im Prinzip eh nur darum, dass das dann auf dem Benchmarksystem einwandfrei läuft. Jetzt beim Arbeiten ist es halt blöd, dass ich nur die 9600M GT im Laptop als einzige nVidia Hardware habe und die am untersten Limit werkt ...

Wenn die Ergebnisse bei dir so konstant bleiben, brauche ich mich nicht mehr zu wundern, wenn's bei mir jetzt auch so ist. Mit mehreren Nachkommastellen sieht man auch, dass die 22 MB/sec nur gerundet waren:

snapshot35.png

Eine einzige Vorgabe des Projekts fehlt uns jetzt noch: Der Rundenschlüssel soll jedes Mal inkrementiert werden.

RC6 läuft defaultmäßig mit 20 Runden. Der erste Schlüssel errechnet sich nach meinem Verständnis aus unserer key-Phrase?! Aber wo enstehen die weiteren Schlüssel? Ich kann ja nicht einfach am Ende von rc6_initl() "key++;" sagen?!

In serpent muss man sich wohl serpent_set_key() anschauen, allerdings habe ich noch nicht durchschaut, mit wie vielen Runden unser Serpent-Code läuft ...
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent

Da ich die genaue Aufgabenstellung nicht kenne kann ich nur spekulieren was gemeint ist. Den Schlüssel jedes mal zu ändern ist eigentlich unüblich und vermutlich nicht gemeint. Die Schlüssel für die 20 Runden bei RC6 werden nach eine festen Regel aus dem Key-Phrase generiert. Ist eventuell das hier gemeint?
 
AW: [CUDA] RC6 / Serpent

^^ Wenn ich das selbst so genau wüsste. Das ist der einzige Teil der Vorgabe, den ich nicht kapiert habe ...
Soweit ich mich ans erste Projektmeeting erinnern kann, ist es NICHT erforderlich, dass wir uns mit den Cipher Modes beschäftigen. Aber ansonsten fällt mir auch nichts ein, was wir mit dem key machen sollen. Ab morgen erreiche ich den Lehrer wieder per email. Heben wir uns das bis zu seiner Antwort auf.

Ich schaue mir jetzt mal vorsichtig Twofish an und melde mich wieder, sobald ich kuriose Ergebnisse kriege, oder wo anstehe. ;)
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent /Twofish

Twofish

Habe mich jetzt für einen anderen Twofish-Code entschieden, als der, den ich mittlerweile aus dem vorigen Posting wieder gelöscht habe. Ich will nicht mit ~20 Files arbeiten müssen, wenn's auch mit 5 geht. Außerdem ist es mir mittlerweile lieber, einen C Code nach CUDA zu portieren, als einen fremden CUDA-Code umzubasteln ...

Jedenfalls läuft das Portieren von Twofish bis jetzt ganz gut. Ich bin nur gerade auf folgendes Problem gestoßen (siehe Errors):

snapshot37.png

Kann ich das einfach lösen, indem ich ReverseRoundSubkeys() auch zu einer __device__ Funktion mache?
Was die warnings bedeuten ist eh klar. cipherInstance *cipher und keyInstance *key muss ich noch initialisieren. int inputLen ist hoffentlich richtig initialisiert.

Hier der Code mit meinen Ergänzungen:
Code:
/***************************************************************************
    TWOFISH2.C    -- Optimized C API calls for TWOFISH AES submission

    Submitters:
        Bruce Schneier, Counterpane Systems
        Doug Whiting,    Hi/fn
        John Kelsey,    Counterpane Systems
        Chris Hall,        Counterpane Systems
        David Wagner,    UC Berkeley

    Code Author:        Doug Whiting,    Hi/fn

    Version  1.00        April 1998

    Copyright 1998, Hi/fn and Counterpane Systems.  All rights reserved.

    Notes:
        *    Optimized version
        *    Tab size is set to 4 characters in this file

***************************************************************************/
#include    "AES.h"
#include    "TABLE.h"

#include    <memory.h>
#include    <assert.h>

[COLOR=royalblue][B]#define HandleError(x) HandleErrorImpl(x, __FILE__, __LINE__)[/B]
#if   defined(min_key)  && !defined(MIN_KEY)
#define    MIN_KEY        1            /* toupper() */
#elif defined(part_key) && !defined(PART_KEY)
#define    PART_KEY    1
#elif defined(zero_key) && !defined(ZERO_KEY)
#define    ZERO_KEY    1
#endif


#ifdef USE_ASM
extern    int    useAsm;                /* ok to use ASM code? */

typedef    int cdecl CipherProc
   (cipherInstance *cipher, keyInstance *key,BYTE *input,int inputLen,BYTE *outBuffer);
typedef int    cdecl KeySetupProc(keyInstance *key);

extern CipherProc    *blockEncrypt_86;    /* ptr to ASM functions */
extern CipherProc    *blockDecrypt_86;
extern KeySetupProc    *reKey_86;
extern DWORD        cdecl TwofishAsmCodeSize(void);
#endif

/*
+*****************************************************************************
*            Constants/Macros/Tables
-****************************************************************************/

#define        CONST                    /* help syntax from C++, NOP here */

CONST        fullSbox MDStab;        /* not actually const.  Initialized ONE time */
int            needToBuildMDS=1;        /* is MDStab initialized yet? */

#define        BIG_TAB        0

#if BIG_TAB
BYTE        bigTab[4][256][256];    /* pre-computed S-box */
#endif

/* number of rounds for various key sizes:  128, 192, 256 */
/* (ignored for now in optimized code!) */
CONST int    numRounds[4]= {0,ROUNDS_128,ROUNDS_192,ROUNDS_256};

#if REENTRANT
#define        _sBox_     key->sBox8x32
#else
static        fullSbox _sBox_;        /* permuted MDStab based on keys */
#endif
#define _sBox8_(N) (((BYTE *) _sBox_) + (N)*256)

/*------- see what level of S-box precomputation we need to do -----*/
#if   defined(ZERO_KEY)
#define    MOD_STRING    "(Zero S-box keying)"
#define    Fe32_128(x,R)    \
    (    MDStab[0][p8(01)[p8(02)[_b(x,R  )]^b0(SKEY[1])]^b0(SKEY[0])] ^    \
        MDStab[1][p8(11)[p8(12)[_b(x,R+1)]^b1(SKEY[1])]^b1(SKEY[0])] ^    \
        MDStab[2][p8(21)[p8(22)[_b(x,R+2)]^b2(SKEY[1])]^b2(SKEY[0])] ^    \
        MDStab[3][p8(31)[p8(32)[_b(x,R+3)]^b3(SKEY[1])]^b3(SKEY[0])] )
#define    Fe32_192(x,R)    \
    (    MDStab[0][p8(01)[p8(02)[p8(03)[_b(x,R  )]^b0(SKEY[2])]^b0(SKEY[1])]^b0(SKEY[0])] ^ \
        MDStab[1][p8(11)[p8(12)[p8(13)[_b(x,R+1)]^b1(SKEY[2])]^b1(SKEY[1])]^b1(SKEY[0])] ^ \
        MDStab[2][p8(21)[p8(22)[p8(23)[_b(x,R+2)]^b2(SKEY[2])]^b2(SKEY[1])]^b2(SKEY[0])] ^ \
        MDStab[3][p8(31)[p8(32)[p8(33)[_b(x,R+3)]^b3(SKEY[2])]^b3(SKEY[1])]^b3(SKEY[0])] )
#define    Fe32_256(x,R)    \
    (    MDStab[0][p8(01)[p8(02)[p8(03)[p8(04)[_b(x,R  )]^b0(SKEY[3])]^b0(SKEY[2])]^b0(SKEY[1])]^b0(SKEY[0])] ^ \
        MDStab[1][p8(11)[p8(12)[p8(13)[p8(14)[_b(x,R+1)]^b1(SKEY[3])]^b1(SKEY[2])]^b1(SKEY[1])]^b1(SKEY[0])] ^ \
        MDStab[2][p8(21)[p8(22)[p8(23)[p8(24)[_b(x,R+2)]^b2(SKEY[3])]^b2(SKEY[2])]^b2(SKEY[1])]^b2(SKEY[0])] ^ \
        MDStab[3][p8(31)[p8(32)[p8(33)[p8(34)[_b(x,R+3)]^b3(SKEY[3])]^b3(SKEY[2])]^b3(SKEY[1])]^b3(SKEY[0])] )

#define    GetSboxKey    DWORD SKEY[4];    /* local copy */ \
                    memcpy(SKEY,key->sboxKeys,sizeof(SKEY));
/*----------------------------------------------------------------*/
#elif defined(MIN_KEY)
#define    MOD_STRING    "(Minimal keying)"
#define    Fe32_(x,R)(MDStab[0][p8(01)[_sBox8_(0)[_b(x,R  )]] ^ b0(SKEY0)] ^ \
                   MDStab[1][p8(11)[_sBox8_(1)[_b(x,R+1)]] ^ b1(SKEY0)] ^ \
                   MDStab[2][p8(21)[_sBox8_(2)[_b(x,R+2)]] ^ b2(SKEY0)] ^ \
                   MDStab[3][p8(31)[_sBox8_(3)[_b(x,R+3)]] ^ b3(SKEY0)])
#define sbSet(N,i,J,v) { _sBox8_(N)[i+J] = v; }
#define    GetSboxKey    DWORD SKEY0    = key->sboxKeys[0]        /* local copy */
/*----------------------------------------------------------------*/
#elif defined(PART_KEY)
#define    MOD_STRING    "(Partial keying)"
#define    Fe32_(x,R)(MDStab[0][_sBox8_(0)[_b(x,R  )]] ^ \
                   MDStab[1][_sBox8_(1)[_b(x,R+1)]] ^ \
                   MDStab[2][_sBox8_(2)[_b(x,R+2)]] ^ \
                   MDStab[3][_sBox8_(3)[_b(x,R+3)]])
#define sbSet(N,i,J,v) { _sBox8_(N)[i+J] = v; }
#define    GetSboxKey
/*----------------------------------------------------------------*/
#else    /* default is FULL_KEY */
#ifndef FULL_KEY
#define    FULL_KEY    1
#endif
#if BIG_TAB
#define    TAB_STR        " (Big table)"
#else
#define    TAB_STR
#endif
#ifdef COMPILE_KEY
#define    MOD_STRING    "(Compiled subkeys)" TAB_STR
#else
#define    MOD_STRING    "(Full keying)" TAB_STR
#endif
/* Fe32_ does a full S-box + MDS lookup.  Need to #define _sBox_ before use.
   Note that we "interleave" 0,1, and 2,3 to avoid cache bank collisions
   in optimized assembly language.
*/
#define    Fe32_(x,R) (_sBox_[0][2*_b(x,R  )] ^ _sBox_[0][2*_b(x,R+1)+1] ^    \
                    _sBox_[2][2*_b(x,R+2)] ^ _sBox_[2][2*_b(x,R+3)+1])
        /* set a single S-box value, given the input byte */
#define sbSet(N,i,J,v) { _sBox_[N&2][2*i+(N&1)+2*J]=MDStab[N][v]; }
#define    GetSboxKey
#endif

CONST        char *moduleDescription    ="Optimized C ";
CONST        char *modeString        =MOD_STRING;


/* macro(s) for debugging help */
#define        CHECK_TABLE        0        /* nonzero --> compare against "slow" table */
#define        VALIDATE_PARMS    0        /* disable for full speed */

#include    "DEBUG.h"                /* debug display macros */

/* end of debug macros */

#ifdef GetCodeSize
extern DWORD Here(DWORD x);            /* return caller's address! */
DWORD TwofishCodeStart(void) { return Here(0); }
#endif

/*
+*****************************************************************************
*
* Function Name:    TableOp
*
* Function:            Handle table use checking
*
* Arguments:        op    =    what to do    (see TAB_* defns in AES.H)
*
* Return:            TRUE --> done (for TAB_QUERY)
*
* Notes: This routine is for use in generating the tables KAT file.
*         For this optimized version, we don't actually track table usage,
*         since it would make the macros incredibly ugly.  Instead we just
*         run for a fixed number of queries and then say we're done.
*
-****************************************************************************/
int TableOp(int op)
    {
    static int queryCnt=0;

    switch (op)
        {
        case TAB_DISABLE:
            break;
        case TAB_ENABLE:
            break;
        case TAB_RESET:
            queryCnt=0;
            break;
        case TAB_QUERY:
            queryCnt++;
            if (queryCnt < TAB_MIN_QUERY)
                return FALSE;
        }
    return TRUE;
    }


/*
+*****************************************************************************
*
* Function Name:    ParseHexDword
*
* Function:            Parse ASCII hex nibbles and fill in key/iv dwords
*
* Arguments:        bit            =    # bits to read
*                    srcTxt        =    ASCII source
*                    d            =    ptr to dwords to fill in
*                    dstTxt        =    where to make a copy of ASCII source
*                                    (NULL ok)
*
* Return:            Zero if no error.  Nonzero --> invalid hex or length
*
* Notes:  Note that the parameter d is a DWORD array, not a byte array.
*    This routine is coded to work both for little-endian and big-endian
*    architectures.  The character stream is interpreted as a LITTLE-ENDIAN
*    byte stream, since that is how the Pentium works, but the conversion
*    happens automatically below.
*
-****************************************************************************/
int ParseHexDword(int bits,CONST char *srcTxt,DWORD *d,char *dstTxt)
    {
    int i;
    char c;
    DWORD b;

    union    /* make sure LittleEndian is defined correctly */
        {
        BYTE  b[4];
        DWORD d[1];
        } v;
    v.d[0]=1;
    if (v.b[0 ^ ADDR_XOR] != 1)
        return BAD_ENDIAN;        /* make sure compile-time switch is set ok */

#if VALIDATE_PARMS
  #if ALIGN32
    if (((int)d) & 3)
        return BAD_ALIGN32;
  #endif
#endif

    for (i=0;i*32<bits;i++)
        d[i]=0;                    /* first, zero the field */

    for (i=0;i*4<bits;i++)        /* parse one nibble at a time */
        {                        /* case out the hexadecimal characters */
        c=srcTxt[i];
        if (dstTxt) dstTxt[i]=c;
        if ((c >= '0') && (c <= '9'))
            b=c-'0';
        else if ((c >= 'a') && (c <= 'f'))
            b=c-'a'+10;
        else if ((c >= 'A') && (c <= 'F'))
            b=c-'A'+10;
        else
            return BAD_KEY_MAT;    /* invalid hex character */
        /* works for big and little endian! */
        d[i/8] |= b << (4*((i^1)&7));
        }

    return 0;                    /* no error */
    }


#if CHECK_TABLE
/*
+*****************************************************************************
*
* Function Name:    f32
*
* Function:            Run four bytes through keyed S-boxes and apply MDS matrix
*
* Arguments:        x            =    input to f function
*                    k32            =    pointer to key dwords
*                    keyLen        =    total key length (k32 --> keyLey/2 bits)
*
* Return:            The output of the keyed permutation applied to x.
*
* Notes:
*    This function is a keyed 32-bit permutation.  It is the major building
*    block for the Twofish round function, including the four keyed 8x8
*    permutations and the 4x4 MDS matrix multiply.  This function is used
*    both for generating round subkeys and within the round function on the
*    block being encrypted.
*
*    This version is fairly slow and pedagogical, although a smartcard would
*    probably perform the operation exactly this way in firmware.   For
*    ultimate performance, the entire operation can be completed with four
*    lookups into four 256x32-bit tables, with three dword xors.
*
*    The MDS matrix is defined in TABLE.H.  To multiply by Mij, just use the
*    macro Mij(x).
*
-****************************************************************************/
DWORD f32(DWORD x,CONST DWORD *k32,int keyLen)
    {
    BYTE  b[4];

    /* Run each byte thru 8x8 S-boxes, xoring with key byte at each stage. */
    /* Note that each byte goes through a different combination of S-boxes.*/

    *((DWORD *)b) = Bswap(x);    /* make b[0] = LSB, b[3] = MSB */
    switch (((keyLen + 63)/64) & 3)
        {
        case 0:        /* 256 bits of key */
            b[0] = p8(04)[b[0]] ^ b0(k32[3]);
            b[1] = p8(14)[b[1]] ^ b1(k32[3]);
            b[2] = p8(24)[b[2]] ^ b2(k32[3]);
            b[3] = p8(34)[b[3]] ^ b3(k32[3]);
            /* fall thru, having pre-processed b[0]..b[3] with k32[3] */
        case 3:        /* 192 bits of key */
            b[0] = p8(03)[b[0]] ^ b0(k32[2]);
            b[1] = p8(13)[b[1]] ^ b1(k32[2]);
            b[2] = p8(23)[b[2]] ^ b2(k32[2]);
            b[3] = p8(33)[b[3]] ^ b3(k32[2]);
            /* fall thru, having pre-processed b[0]..b[3] with k32[2] */
        case 2:        /* 128 bits of key */
            b[0] = p8(00)[p8(01)[p8(02)[b[0]] ^ b0(k32[1])] ^ b0(k32[0])];
            b[1] = p8(10)[p8(11)[p8(12)[b[1]] ^ b1(k32[1])] ^ b1(k32[0])];
            b[2] = p8(20)[p8(21)[p8(22)[b[2]] ^ b2(k32[1])] ^ b2(k32[0])];
            b[3] = p8(30)[p8(31)[p8(32)[b[3]] ^ b3(k32[1])] ^ b3(k32[0])];
        }

    /* Now perform the MDS matrix multiply inline. */
    return    ((M00(b[0]) ^ M01(b[1]) ^ M02(b[2]) ^ M03(b[3]))      ) ^
            ((M10(b[0]) ^ M11(b[1]) ^ M12(b[2]) ^ M13(b[3])) <<  8) ^
            ((M20(b[0]) ^ M21(b[1]) ^ M22(b[2]) ^ M23(b[3])) << 16) ^
            ((M30(b[0]) ^ M31(b[1]) ^ M32(b[2]) ^ M33(b[3])) << 24) ;
    }
#endif    /* CHECK_TABLE */


/*
+*****************************************************************************
*
* Function Name:    RS_MDS_encode
*
* Function:            Use (12,8) Reed-Solomon code over GF(256) to produce
*                    a key S-box dword from two key material dwords.
*
* Arguments:        k0    =    1st dword
*                    k1    =    2nd dword
*
* Return:            Remainder polynomial generated using RS code
*
* Notes:
*    Since this computation is done only once per reKey per 64 bits of key,
*    the performance impact of this routine is imperceptible. The RS code
*    chosen has "simple" coefficients to allow smartcard/hardware implementation
*    without lookup tables.
*
-****************************************************************************/
DWORD RS_MDS_Encode(DWORD k0,DWORD k1)
    {
    int i,j;
    DWORD r;

    for (i=r=0;i<2;i++)
        {
        r ^= (i) ? k0 : k1;            /* merge in 32 more key bits */
        for (j=0;j<4;j++)            /* shift one byte at a time */
            RS_rem(r);
        }
    return r;
    }


/*
+*****************************************************************************
*
* Function Name:    BuildMDS
*
* Function:            Initialize the MDStab array
*
* Arguments:        None.
*
* Return:            None.
*
* Notes:
*    Here we precompute all the fixed MDS table.  This only needs to be done
*    one time at initialization, after which the table is "CONST".
*
-****************************************************************************/
void BuildMDS(void)
    {
    int i;
    DWORD d;
    BYTE m1[2],mX[2],mY[4];

    for (i=0;i<256;i++)
        {
        m1[0]=P8x8[0][i];        /* compute all the matrix elements */
        mX[0]=(BYTE) Mul_X(m1[0]);
        mY[0]=(BYTE) Mul_Y(m1[0]);

        m1[1]=P8x8[1][i];
        mX[1]=(BYTE) Mul_X(m1[1]);
        mY[1]=(BYTE) Mul_Y(m1[1]);

#undef    Mul_1                    /* change what the pre-processor does with Mij */
#undef    Mul_X
#undef    Mul_Y
#define    Mul_1    m1                /* It will now access m01[], m5B[], and mEF[] */
#define    Mul_X    mX
#define    Mul_Y    mY

#define    SetMDS(N)                    \
        b0(d) = M0##N[P_##N##0];    \
        b1(d) = M1##N[P_##N##0];    \
        b2(d) = M2##N[P_##N##0];    \
        b3(d) = M3##N[P_##N##0];    \
        MDStab[N][i] = d;

        SetMDS(0);                /* fill in the matrix with elements computed above */
        SetMDS(1);
        SetMDS(2);
        SetMDS(3);
        }
#undef    Mul_1
#undef    Mul_X
#undef    Mul_Y
#define    Mul_1    Mx_1            /* re-enable true multiply */
#define    Mul_X    Mx_X
#define    Mul_Y    Mx_Y

#if BIG_TAB
    {
    int j,k;
    BYTE *q0,*q1;

    for (i=0;i<4;i++)
        {
        switch (i)
            {
            case 0:    q0=p8(01); q1=p8(02);    break;
            case 1:    q0=p8(11); q1=p8(12);    break;
            case 2:    q0=p8(21); q1=p8(22);    break;
            case 3:    q0=p8(31); q1=p8(32);    break;
            }
        for (j=0;j<256;j++)
            for (k=0;k<256;k++)
                bigTab[i][j][k]=q0[q1[k]^j];
        }
    }
#endif

    needToBuildMDS=0;            /* NEVER modify the table again! */
    }

/*
+*****************************************************************************
*
* Function Name:    ReverseRoundSubkeys
*
* Function:            Reverse order of round subkeys to switch between encrypt/decrypt
*
* Arguments:        key        =    ptr to keyInstance to be reversed
*                    newDir    =    new direction value
*
* Return:            None.
*
* Notes:
*    This optimization allows both blockEncrypt and blockDecrypt to use the same
*    "fallthru" switch statement based on the number of rounds.
*    Note that key->numRounds must be even and >= 2 here.
*
-****************************************************************************/
void ReverseRoundSubkeys(keyInstance *key,BYTE newDir)
    {
    DWORD t0,t1;
    register DWORD *r0=key->subKeys+ROUND_SUBKEYS;
    register DWORD *r1=r0 + 2*key->numRounds - 2;

    for (;r0 < r1;r0+=2,r1-=2)
        {
        t0=r0[0];            /* swap the order */
        t1=r0[1];
        r0[0]=r1[0];        /* but keep relative order within pairs */
        r0[1]=r1[1];
        r1[0]=t0;
        r1[1]=t1;
        }

    key->direction=newDir;
    }

/*
+*****************************************************************************
*
* Function Name:    Xor256
*
* Function:            Copy an 8-bit permutation (256 bytes), xoring with a byte
*
* Arguments:        dst        =    where to put result
*                    src        =    where to get data (can be same asa dst)
*                    b        =    byte to xor
*
* Return:            None
*
* Notes:
*     BorlandC's optimization is terrible!  When we put the code inline,
*    it generates fairly good code in the *following* segment (not in the Xor256
*    code itself).  If the call is made, the code following the call is awful!
*    The penalty is nearly 50%!  So we take the code size hit for inlining for
*    Borland, while Microsoft happily works with a call.
*
-****************************************************************************/
#if defined(__BORLANDC__)    /* do it inline */
#define Xor32(dst,src,i) { ((DWORD *)dst)[i] = ((DWORD *)src)[i] ^ tmpX; }
#define    Xor256(dst,src,b)                \
    {                                    \
    register DWORD tmpX=0x01010101u * b;\
    for (i=0;i<64;i+=4)                    \
        { Xor32(dst,src,i  ); Xor32(dst,src,i+1); Xor32(dst,src,i+2); Xor32(dst,src,i+3); }    \
    }
#else                        /* do it as a function call */
void Xor256(void *dst,void *src,BYTE b)
    {
    register DWORD    x=b*0x01010101u;    /* replicate byte to all four bytes */
    register DWORD *d=(DWORD *)dst;
    register DWORD *s=(DWORD *)src;
#define X_8(N)    { d[N]=s[N] ^ x; d[N+1]=s[N+1] ^ x; }
#define X_32(N)    { X_8(N); X_8(N+2); X_8(N+4); X_8(N+6); }
    X_32(0 ); X_32( 8); X_32(16); X_32(24);    /* all inline */
    d+=32;    /* keep offsets small! */
    s+=32;
    X_32(0 ); X_32( 8); X_32(16); X_32(24);    /* all inline */
    }
#endif

/*
+*****************************************************************************
*
* Function Name:    reKey
*
* Function:            Initialize the Twofish key schedule from key32
*
* Arguments:        key            =    ptr to keyInstance to be initialized
*
* Return:            TRUE on success
*
* Notes:
*    Here we precompute all the round subkeys, although that is not actually
*    required.  For example, on a smartcard, the round subkeys can
*    be generated on-the-fly    using f32()
*
-****************************************************************************/
int reKey(keyInstance *key)
    {
    int        i,j,k64Cnt,keyLen;
    int        subkeyCnt;
    DWORD    A=0,B=0,q;
    DWORD    sKey[MAX_KEY_BITS/64],k32e[MAX_KEY_BITS/64],k32o[MAX_KEY_BITS/64];
    BYTE    L0[256],L1[256];    /* small local 8-bit permutations */

#if VALIDATE_PARMS
  #if ALIGN32
    if (((int)key) & 3)
        return BAD_ALIGN32;
    if ((key->keyLen % 64) || (key->keyLen < MIN_KEY_BITS))
        return BAD_KEY_INSTANCE;
  #endif
#endif

    if (needToBuildMDS)            /* do this one time only */
        BuildMDS();

#define    F32(res,x,k32)    \
    {                                                            \
    DWORD t=x;                                                    \
    switch (k64Cnt & 3)                                            \
        {                                                        \
        case 0:  /* same as 4 */                                \
                    b0(t)   = p8(04)[b0(t)] ^ b0(k32[3]);        \
                    b1(t)   = p8(14)[b1(t)] ^ b1(k32[3]);        \
                    b2(t)   = p8(24)[b2(t)] ^ b2(k32[3]);        \
                    b3(t)   = p8(34)[b3(t)] ^ b3(k32[3]);        \
                 /* fall thru, having pre-processed t */        \
        case 3:        b0(t)   = p8(03)[b0(t)] ^ b0(k32[2]);        \
                    b1(t)   = p8(13)[b1(t)] ^ b1(k32[2]);        \
                    b2(t)   = p8(23)[b2(t)] ^ b2(k32[2]);        \
                    b3(t)   = p8(33)[b3(t)] ^ b3(k32[2]);        \
                 /* fall thru, having pre-processed t */        \
        case 2:     /* 128-bit keys (optimize for this case) */    \
            res=    MDStab[0][p8(01)[p8(02)[b0(t)] ^ b0(k32[1])] ^ b0(k32[0])] ^    \
                    MDStab[1][p8(11)[p8(12)[b1(t)] ^ b1(k32[1])] ^ b1(k32[0])] ^    \
                    MDStab[2][p8(21)[p8(22)[b2(t)] ^ b2(k32[1])] ^ b2(k32[0])] ^    \
                    MDStab[3][p8(31)[p8(32)[b3(t)] ^ b3(k32[1])] ^ b3(k32[0])] ;    \
        }                                                        \
    }


#if !CHECK_TABLE
#if defined(USE_ASM)                /* only do this if not using assember */
if (!(useAsm & 4))
#endif
#endif
    {
    subkeyCnt = ROUND_SUBKEYS + 2*key->numRounds;
    keyLen=key->keyLen;
    k64Cnt=(keyLen+63)/64;            /* number of 64-bit key words */
    for (i=0,j=k64Cnt-1;i<k64Cnt;i++,j--)
        {                            /* split into even/odd key dwords */
        k32e[i]=key->key32[2*i  ];
        k32o[i]=key->key32[2*i+1];
        /* compute S-box keys using (12,8) Reed-Solomon code over GF(256) */
        sKey[j]=key->sboxKeys[j]=RS_MDS_Encode(k32e[i],k32o[i]);    /* reverse order */
        }
    }

#ifdef USE_ASM
if (useAsm & 4)
    {
    #if defined(COMPILE_KEY) && defined(USE_ASM)
        key->keySig        = VALID_SIG;            /* show that we are initialized */
        key->codeSize    = sizeof(key->compiledCode);    /* set size */
    #endif
    reKey_86(key);
    }
else
#endif
    {
    for (i=q=0;i<subkeyCnt/2;i++,q+=SK_STEP)
        {                            /* compute round subkeys for PHT */
        F32(A,q        ,k32e);        /* A uses even key dwords */
        F32(B,q+SK_BUMP,k32o);        /* B uses odd  key dwords */
        B = ROL(B,8);
        key->subKeys[2*i  ] = A+B;    /* combine with a PHT */
        B = A + 2*B;
        key->subKeys[2*i+1] = ROL(B,SK_ROTL);
        }
#if !defined(ZERO_KEY)
    switch (keyLen)    /* case out key length for speed in generating S-boxes */
        {
        case 128:
        #if defined(FULL_KEY) || defined(PART_KEY)
#if BIG_TAB
            #define    one128(N,J)    sbSet(N,i,J,L0[i+J])
            #define    sb128(N) {                        \
                BYTE *qq=bigTab[N][b##N(sKey[1])];    \
                Xor256(L0,qq,b##N(sKey[0]));        \
                for (i=0;i<256;i+=2) { one128(N,0); one128(N,1); } }
#else
            #define    one128(N,J)    sbSet(N,i,J,p8(N##1)[L0[i+J]]^k0)
            #define    sb128(N) {                    \
                Xor256(L0,p8(N##2),b##N(sKey[1]));    \
                { register DWORD k0=b##N(sKey[0]);    \
                for (i=0;i<256;i+=2) { one128(N,0); one128(N,1); } } }
#endif
        #elif defined(MIN_KEY)
            #define    sb128(N) Xor256(_sBox8_(N),p8(N##2),b##N(sKey[1]))
        #endif
            sb128(0); sb128(1); sb128(2); sb128(3);
            break;
        case 192:
        #if defined(FULL_KEY) || defined(PART_KEY)
            #define one192(N,J) sbSet(N,i,J,p8(N##1)[p8(N##2)[L0[i+J]]^k1]^k0)
            #define    sb192(N) {                        \
                Xor256(L0,p8(N##3),b##N(sKey[2]));    \
                { register DWORD k0=b##N(sKey[0]);    \
                  register DWORD k1=b##N(sKey[1]);    \
                  for (i=0;i<256;i+=2) { one192(N,0); one192(N,1); } } }
        #elif defined(MIN_KEY)
            #define one192(N,J) sbSet(N,i,J,p8(N##2)[L0[i+J]]^k1)
            #define    sb192(N) {                        \
                Xor256(L0,p8(N##3),b##N(sKey[2]));    \
                { register DWORD k1=b##N(sKey[1]);    \
                  for (i=0;i<256;i+=2) { one192(N,0); one192(N,1); } } }
        #endif
            sb192(0); sb192(1); sb192(2); sb192(3);
            break;
        case 256:
        #if defined(FULL_KEY) || defined(PART_KEY)
            #define one256(N,J) sbSet(N,i,J,p8(N##1)[p8(N##2)[L0[i+J]]^k1]^k0)
            #define    sb256(N) {                                        \
                Xor256(L1,p8(N##4),b##N(sKey[3]));                    \
                for (i=0;i<256;i+=2) {L0[i  ]=p8(N##3)[L1[i]];        \
                                      L0[i+1]=p8(N##3)[L1[i+1]]; }    \
                Xor256(L0,L0,b##N(sKey[2]));                        \
                { register DWORD k0=b##N(sKey[0]);                    \
                  register DWORD k1=b##N(sKey[1]);                    \
                  for (i=0;i<256;i+=2) { one256(N,0); one256(N,1); } } }
        #elif defined(MIN_KEY)
            #define one256(N,J) sbSet(N,i,J,p8(N##2)[L0[i+J]]^k1)
            #define    sb256(N) {                                        \
                Xor256(L1,p8(N##4),b##N(sKey[3]));                    \
                for (i=0;i<256;i+=2) {L0[i  ]=p8(N##3)[L1[i]];        \
                                      L0[i+1]=p8(N##3)[L1[i+1]]; }    \
                Xor256(L0,L0,b##N(sKey[2]));                        \
                { register DWORD k1=b##N(sKey[1]);                    \
                  for (i=0;i<256;i+=2) { one256(N,0); one256(N,1); } } }
        #endif
            sb256(0); sb256(1);    sb256(2); sb256(3);
            break;
        }
#endif
    }

#if CHECK_TABLE                        /* sanity check  vs. pedagogical code*/
    {
    GetSboxKey;
    for (i=0;i<subkeyCnt/2;i++)
        {
        A = f32(i*SK_STEP        ,k32e,keyLen);    /* A uses even key dwords */
        B = f32(i*SK_STEP+SK_BUMP,k32o,keyLen);    /* B uses odd  key dwords */
        B = ROL(B,8);
        assert(key->subKeys[2*i  ] == A+  B);
        assert(key->subKeys[2*i+1] == ROL(A+2*B,SK_ROTL));
        }
  #if !defined(ZERO_KEY)            /* any S-boxes to check? */
    for (i=q=0;i<256;i++,q+=0x01010101)
        assert(f32(q,key->sboxKeys,keyLen) == Fe32_(q,0));
  #endif
    }
#endif /* CHECK_TABLE */

    DebugDumpKey(key);

    if (key->direction == DIR_ENCRYPT)
        ReverseRoundSubkeys(key,DIR_ENCRYPT);    /* reverse the round subkey order */

    return TRUE;
    }
/*
+*****************************************************************************
*
* Function Name:    makeKey
*
* Function:            Initialize the Twofish key schedule
*
* Arguments:        key            =    ptr to keyInstance to be initialized
*                    direction    =    DIR_ENCRYPT or DIR_DECRYPT
*                    keyLen        =    # bits of key text at *keyMaterial
*                    keyMaterial    =    ptr to hex ASCII chars representing key bits
*
* Return:            TRUE on success
*                    else error code (e.g., BAD_KEY_DIR)
*
* Notes:    This parses the key bits from keyMaterial.  Zeroes out unused key bits
*
-****************************************************************************/
int makeKey(keyInstance *key, BYTE direction, int keyLen,CONST char *keyMaterial)
    {
#if VALIDATE_PARMS                /* first, sanity check on parameters */
    if (key == NULL)
        return BAD_KEY_INSTANCE;/* must have a keyInstance to initialize */
    if ((direction != DIR_ENCRYPT) && (direction != DIR_DECRYPT))
        return BAD_KEY_DIR;        /* must have valid direction */
    if ((keyLen > MAX_KEY_BITS) || (keyLen < 8) || (keyLen & 0x3F))
        return BAD_KEY_MAT;        /* length must be valid */
    key->keySig = VALID_SIG;    /* show that we are initialized */
  #if ALIGN32
    if ((((int)key) & 3) || (((int)key->key32) & 3))
        return BAD_ALIGN32;
  #endif
#endif

    key->direction    = direction;/* set our cipher direction */
    key->keyLen        = (keyLen+63) & ~63;        /* round up to multiple of 64 */
    key->numRounds    = numRounds[(keyLen-1)/64];
    memset(key->key32,0,sizeof(key->key32));    /* zero unused bits */
    key->keyMaterial[MAX_KEY_SIZE]=0;    /* terminate ASCII string */

    if ((keyMaterial == NULL) || (keyMaterial[0]==0))
        return TRUE;            /* allow a "dummy" call */

    if (ParseHexDword(keyLen,keyMaterial,key->key32,key->keyMaterial))
        return BAD_KEY_MAT;

    return reKey(key);            /* generate round subkeys */
    }


/*
+*****************************************************************************
*
* Function Name:    cipherInit
*
* Function:            Initialize the Twofish cipher in a given mode
*
* Arguments:        cipher        =    ptr to cipherInstance to be initialized
*                    mode        =    MODE_ECB, MODE_CBC, or MODE_CFB1
*                    IV            =    ptr to hex ASCII test representing IV bytes
*
* Return:            TRUE on success
*                    else error code (e.g., BAD_CIPHER_MODE)
*
-****************************************************************************/
int cipherInit(cipherInstance *cipher, BYTE mode,CONST char *IV)
    {
    int i;
#if VALIDATE_PARMS                /* first, sanity check on parameters */
    if (cipher == NULL)
        return BAD_PARAMS;        /* must have a cipherInstance to initialize */
    if ((mode != MODE_ECB) && (mode != MODE_CBC) && (mode != MODE_CFB1))
        return BAD_CIPHER_MODE;    /* must have valid cipher mode */
    cipher->cipherSig    =    VALID_SIG;
  #if ALIGN32
    if ((((int)cipher) & 3) || (((int)cipher->IV) & 3) || (((int)cipher->iv32) & 3))
        return BAD_ALIGN32;
  #endif
#endif

    if ((mode != MODE_ECB) && (IV))    /* parse the IV */
        {
        if (ParseHexDword(BLOCK_SIZE,IV,cipher->iv32,NULL))
            return BAD_IV_MAT;
        for (i=0;i<BLOCK_SIZE/32;i++)    /* make byte-oriented copy for CFB1 */
            ((DWORD *)cipher->IV)[i] = Bswap(cipher->iv32[i]);
        }

    cipher->mode        =    mode;

    return TRUE;
    }

/*
+*****************************************************************************
*
* Function Name:    blockEncrypt
*
* Function:            Encrypt block(s) of data using Twofish
*
* Arguments:        cipher        =    ptr to already initialized cipherInstance
*                    key            =    ptr to already initialized keyInstance
*                    input        =    ptr to data blocks to be encrypted
*                    inputLen    =    # bits to encrypt (multiple of blockSize)
*                    outBuffer    =    ptr to where to put encrypted blocks
*
* Return:            # bits ciphered (>= 0)
*                    else error code (e.g., BAD_CIPHER_STATE, BAD_KEY_MATERIAL)
*
* Notes: The only supported block size for ECB/CBC modes is BLOCK_SIZE bits.
*         If inputLen is not a multiple of BLOCK_SIZE bits in those modes,
*         an error BAD_INPUT_LEN is returned.  In CFB1 mode, all block
*         sizes can be supported.
*
-****************************************************************************/
[COLOR=royalblue][B]__device__[/B] int blockEncrypt(cipherInstance *cipher, keyInstance *key,CONST BYTE *input,
                int inputLen, BYTE *outBuffer)
    {
    int   i,n;                        /* loop counters */
    DWORD x[BLOCK_SIZE/32];            /* block being encrypted */
    DWORD t0,t1;                    /* temp variables */
    int      rounds=key->numRounds;    /* number of rounds */
    BYTE  bit,bit0,ctBit,carry;        /* temps for CFB */

    /* make local copies of things for faster access */
    int      mode = cipher->mode;
    DWORD sk[TOTAL_SUBKEYS];
    DWORD IV[BLOCK_SIZE/32];

    GetSboxKey;

#if VALIDATE_PARMS
    if ((cipher == NULL) || (cipher->cipherSig != VALID_SIG))
        return BAD_CIPHER_STATE;
    if ((key == NULL) || (key->keySig != VALID_SIG))
        return BAD_KEY_INSTANCE;
    if ((rounds < 2) || (rounds > MAX_ROUNDS) || (rounds&1))
        return BAD_KEY_INSTANCE;
    if ((mode != MODE_CFB1) && (inputLen % BLOCK_SIZE))
        return BAD_INPUT_LEN;
  #if ALIGN32
    if ( (((int)cipher) & 3) || (((int)key      ) & 3) ||
         (((int)input ) & 3) || (((int)outBuffer) & 3))
        return BAD_ALIGN32;
  #endif
#endif

    if (mode == MODE_CFB1)
        {    /* use recursion here to handle CFB, one block at a time */
        cipher->mode = MODE_ECB;    /* do encryption in ECB */
        for (n=0;n<inputLen;n++)
            {
            blockEncrypt(cipher,key,cipher->IV,BLOCK_SIZE,(BYTE *)x);
            bit0  = 0x80 >> (n & 7);/* which bit position in byte */
            ctBit = (input[n/8] & bit0) ^ ((((BYTE *) x)[0] & 0x80) >> (n&7));
            outBuffer[n/8] = (outBuffer[n/8] & ~ bit0) | ctBit;
            carry = ctBit >> (7 - (n&7));
            for (i=BLOCK_SIZE/8-1;i>=0;i--)
                {
                bit = cipher->IV[i] >> 7;    /* save next "carry" from shift */
                cipher->IV[i] = (cipher->IV[i] << 1) ^ carry;
                carry = bit;
                }
            }
        cipher->mode = MODE_CFB1;    /* restore mode for next time */
        return inputLen;
        }

    /* here for ECB, CBC modes */
    if (key->direction != DIR_ENCRYPT)
        ReverseRoundSubkeys(key,DIR_ENCRYPT);    /* reverse the round subkey order */

#ifdef USE_ASM
    if ((useAsm & 1) && (inputLen))
  #ifdef COMPILE_KEY
        if (key->keySig == VALID_SIG)
            return ((CipherProc *)(key->encryptFuncPtr))(cipher,key,input,inputLen,outBuffer);
  #else
        return (*blockEncrypt_86)(cipher,key,input,inputLen,outBuffer);
  #endif
#endif
    /* make local copy of subkeys for speed */
    memcpy(sk,key->subKeys,sizeof(DWORD)*(ROUND_SUBKEYS+2*rounds));
    if (mode == MODE_CBC) {
        BlockCopy(IV,cipher->iv32);
    } else {
        IV[0]=IV[1]=IV[2]=IV[3]=0;
    }

    for (n=0;n<inputLen;n+=BLOCK_SIZE,input+=BLOCK_SIZE/8,outBuffer+=BLOCK_SIZE/8)
        {
#ifdef DEBUG
        DebugDump(input,"\n",-1,0,0,0,1);
        if (cipher->mode == MODE_CBC)
            DebugDump(cipher->iv32,"",IV_ROUND,0,0,0,0);
#endif
#define    LoadBlockE(N)  x[N]=Bswap(((DWORD *)input)[N]) ^ sk[INPUT_WHITEN+N] ^ IV[N]
        LoadBlockE(0);    LoadBlockE(1);    LoadBlockE(2);    LoadBlockE(3);
        DebugDump(x,"",0,0,0,0,0);
#define    EncryptRound(K,R,id)    \
            t0       = Fe32##id(x[K  ],0);                    \
            t1       = Fe32##id(x[K^1],3);                    \
            x[K^3] = ROL(x[K^3],1);                            \
            x[K^2]^= t0 +   t1 + sk[ROUND_SUBKEYS+2*(R)  ];    \
            x[K^3]^= t0 + 2*t1 + sk[ROUND_SUBKEYS+2*(R)+1];    \
            x[K^2] = ROR(x[K^2],1);                            \
            DebugDump(x,"",rounds-(R),0,0,1,0);
#define        Encrypt2(R,id)    { EncryptRound(0,R+1,id); EncryptRound(2,R,id); }

#if defined(ZERO_KEY)
        switch (key->keyLen)
            {
            case 128:
                for (i=rounds-2;i>=0;i-=2)
                    Encrypt2(i,_128);
                break;
            case 192:
                for (i=rounds-2;i>=0;i-=2)
                    Encrypt2(i,_192);
                break;
            case 256:
                for (i=rounds-2;i>=0;i-=2)
                    Encrypt2(i,_256);
                break;
            }
#else
        Encrypt2(14,_);
        Encrypt2(12,_);
        Encrypt2(10,_);
        Encrypt2( 8,_);
        Encrypt2( 6,_);
        Encrypt2( 4,_);
        Encrypt2( 2,_);
        Encrypt2( 0,_);
#endif

        /* need to do (or undo, depending on your point of view) final swap */
#if LittleEndian
#define    StoreBlockE(N)    ((DWORD *)outBuffer)[N]=x[N^2] ^ sk[OUTPUT_WHITEN+N]
#else
#define    StoreBlockE(N)    { t0=x[N^2] ^ sk[OUTPUT_WHITEN+N]; ((DWORD *)outBuffer)[N]=Bswap(t0); }
#endif
        StoreBlockE(0);    StoreBlockE(1);    StoreBlockE(2);    StoreBlockE(3);
        if (mode == MODE_CBC)
            {
            IV[0]=Bswap(((DWORD *)outBuffer)[0]);
            IV[1]=Bswap(((DWORD *)outBuffer)[1]);
            IV[2]=Bswap(((DWORD *)outBuffer)[2]);
            IV[3]=Bswap(((DWORD *)outBuffer)[3]);
            }
#ifdef DEBUG
        DebugDump(outBuffer,"",rounds+1,0,0,0,1);
        if (cipher->mode == MODE_CBC)
            DebugDump(cipher->iv32,"",IV_ROUND,0,0,0,0);
#endif
        }

    if (mode == MODE_CBC)
        BlockCopy(cipher->iv32,IV);

    return inputLen;
    }

/*
+*****************************************************************************
*
* Function Name:    blockDecrypt
*
* Function:            Decrypt block(s) of data using Twofish
*
* Arguments:        cipher        =    ptr to already initialized cipherInstance
*                    key            =    ptr to already initialized keyInstance
*                    input        =    ptr to data blocks to be decrypted
*                    inputLen    =    # bits to encrypt (multiple of blockSize)
*                    outBuffer    =    ptr to where to put decrypted blocks
*
* Return:            # bits ciphered (>= 0)
*                    else error code (e.g., BAD_CIPHER_STATE, BAD_KEY_MATERIAL)
*
* Notes: The only supported block size for ECB/CBC modes is BLOCK_SIZE bits.
*         If inputLen is not a multiple of BLOCK_SIZE bits in those modes,
*         an error BAD_INPUT_LEN is returned.  In CFB1 mode, all block
*         sizes can be supported.
*
-****************************************************************************/
[COLOR=royalblue][B]__device__[/B] int blockDecrypt(cipherInstance *cipher, keyInstance *key,CONST BYTE *input,
                int inputLen, BYTE *outBuffer)
    {
    int   i,n;                        /* loop counters */
    DWORD x[BLOCK_SIZE/32];            /* block being encrypted */
    DWORD t0,t1;                    /* temp variables */
    int      rounds=key->numRounds;    /* number of rounds */
    BYTE  bit,bit0,ctBit,carry;        /* temps for CFB */

    /* make local copies of things for faster access */
    int      mode = cipher->mode;
    DWORD sk[TOTAL_SUBKEYS];
    DWORD IV[BLOCK_SIZE/32];

    GetSboxKey;

#if VALIDATE_PARMS
    if ((cipher == NULL) || (cipher->cipherSig != VALID_SIG))
        return BAD_CIPHER_STATE;
    if ((key == NULL) || (key->keySig != VALID_SIG))
        return BAD_KEY_INSTANCE;
    if ((rounds < 2) || (rounds > MAX_ROUNDS) || (rounds&1))
        return BAD_KEY_INSTANCE;
    if ((cipher->mode != MODE_CFB1) && (inputLen % BLOCK_SIZE))
        return BAD_INPUT_LEN;
  #if ALIGN32
    if ( (((int)cipher) & 3) || (((int)key      ) & 3) ||
         (((int)input)  & 3) || (((int)outBuffer) & 3))
        return BAD_ALIGN32;
  #endif
#endif

    if (cipher->mode == MODE_CFB1)
        {    /* use blockEncrypt here to handle CFB, one block at a time */
        cipher->mode = MODE_ECB;    /* do encryption in ECB */
        for (n=0;n<inputLen;n++)
            {
            blockEncrypt(cipher,key,cipher->IV,BLOCK_SIZE,(BYTE *)x);
            bit0  = 0x80 >> (n & 7);
            ctBit = input[n/8] & bit0;
            outBuffer[n/8] = (outBuffer[n/8] & ~ bit0) |
                             (ctBit ^ ((((BYTE *) x)[0] & 0x80) >> (n&7)));
            carry = ctBit >> (7 - (n&7));
            for (i=BLOCK_SIZE/8-1;i>=0;i--)
                {
                bit = cipher->IV[i] >> 7;    /* save next "carry" from shift */
                cipher->IV[i] = (cipher->IV[i] << 1) ^ carry;
                carry = bit;
                }
            }
        cipher->mode = MODE_CFB1;    /* restore mode for next time */
        return inputLen;
        }

    /* here for ECB, CBC modes */
    if (key->direction != DIR_DECRYPT)
        ReverseRoundSubkeys(key,DIR_DECRYPT);    /* reverse the round subkey order */
#ifdef USE_ASM
    if ((useAsm & 2) && (inputLen))
  #ifdef COMPILE_KEY
        if (key->keySig == VALID_SIG)
            return ((CipherProc *)(key->decryptFuncPtr))(cipher,key,input,inputLen,outBuffer);
  #else
        return (*blockDecrypt_86)(cipher,key,input,inputLen,outBuffer);
  #endif
#endif
    /* make local copy of subkeys for speed */
    memcpy(sk,key->subKeys,sizeof(DWORD)*(ROUND_SUBKEYS+2*rounds));
    if (mode == MODE_CBC) {
        BlockCopy(IV,cipher->iv32);
    } else {
        IV[0]=IV[1]=IV[2]=IV[3]=0;
    }

    for (n=0;n<inputLen;n+=BLOCK_SIZE,input+=BLOCK_SIZE/8,outBuffer+=BLOCK_SIZE/8)
        {
        DebugDump(input,"\n",rounds+1,0,0,0,1);
#define LoadBlockD(N) x[N^2]=Bswap(((DWORD *)input)[N]) ^ sk[OUTPUT_WHITEN+N]
        LoadBlockD(0);    LoadBlockD(1);    LoadBlockD(2);    LoadBlockD(3);

#define    DecryptRound(K,R,id)                                \
            t0       = Fe32##id(x[K  ],0);                    \
            t1       = Fe32##id(x[K^1],3);                    \
            DebugDump(x,"",(R)+1,0,0,1,0);                    \
            x[K^2] = ROL (x[K^2],1);                        \
            x[K^2]^= t0 +   t1 + sk[ROUND_SUBKEYS+2*(R)  ];    \
            x[K^3]^= t0 + 2*t1 + sk[ROUND_SUBKEYS+2*(R)+1];    \
            x[K^3] = ROR (x[K^3],1);                        \

#define        Decrypt2(R,id)    { DecryptRound(2,R+1,id); DecryptRound(0,R,id); }

#if defined(ZERO_KEY)
        switch (key->keyLen)
            {
            case 128:
                for (i=rounds-2;i>=0;i-=2)
                    Decrypt2(i,_128);
                break;
            case 192:
                for (i=rounds-2;i>=0;i-=2)
                    Decrypt2(i,_192);
                break;
            case 256:
                for (i=rounds-2;i>=0;i-=2)
                    Decrypt2(i,_256);
                break;
            }
#else
        {
        Decrypt2(14,_);
        Decrypt2(12,_);
        Decrypt2(10,_);
        Decrypt2( 8,_);
        Decrypt2( 6,_);
        Decrypt2( 4,_);
        Decrypt2( 2,_);
        Decrypt2( 0,_);
        }
#endif
        DebugDump(x,"",0,0,0,0,0);
        if (cipher->mode == MODE_ECB)
            {
#if LittleEndian
#define    StoreBlockD(N)    ((DWORD *)outBuffer)[N] = x[N] ^ sk[INPUT_WHITEN+N]
#else
#define    StoreBlockD(N)    { t0=x[N]^sk[INPUT_WHITEN+N]; ((DWORD *)outBuffer)[N] = Bswap(t0); }
#endif
            StoreBlockD(0);    StoreBlockD(1);    StoreBlockD(2);    StoreBlockD(3);
#undef  StoreBlockD
            DebugDump(outBuffer,"",-1,0,0,0,1);
            continue;
            }
        else
            {
#define    StoreBlockD(N)    x[N]   ^= sk[INPUT_WHITEN+N] ^ IV[N];    \
                        IV[N]   = Bswap(((DWORD *)input)[N]);    \
                        ((DWORD *)outBuffer)[N] = Bswap(x[N]);
            StoreBlockD(0);    StoreBlockD(1);    StoreBlockD(2);    StoreBlockD(3);
#undef  StoreBlockD
            DebugDump(outBuffer,"",-1,0,0,0,1);
            }
        }
    if (mode == MODE_CBC)    /* restore iv32 to cipher */
        BlockCopy(cipher->iv32,IV);

    return inputLen;
    }

#ifdef GetCodeSize
DWORD TwofishCodeSize(void)
    {
    DWORD x= Here(0);
#ifdef USE_ASM
    if (useAsm & 3)
        return TwofishAsmCodeSize();
#endif
    return x - TwofishCodeStart();
    };
#endif
[COLOR=royalblue][B]
__global__ void encrypt(cipherInstance *cipher, keyInstance *key, CONST BYTE *input, int inputLen, BYTE *outBuffer) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    blockEncrypt(cipher, key, input + 16*b, inputLen, outBuffer);
}

__global__ void decrypt(cipherInstance *cipher, keyInstance *key, CONST BYTE *input, int inputLen, BYTE *outBuffer) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    blockDecrypt(cipher, key, input + 16*b, inputLen, outBuffer);
}

inline void HandleErrorImpl(cudaError error, const char* file, int line) {
    if (error != cudaSuccess) {
        printf("%s: %d %s", file, line, cudaGetErrorString(error));
        exit(1);
    }
}

int main(void) {
    char* initKey = "meingeheimespasswort";
    int threads = 512;
    int blocks = 12800;
    cipherInstance *cipher;
    keyInstance *key;
    int inputLen = 16*threads*blocks;

    CONST BYTE *input;
    HandleError(cudaMalloc(&input, 16*threads*blocks));
    HandleError(cudaMemset(input, 0, 16*threads*blocks));

    BYTE *outBuffer;
    HandleError(cudaMalloc(&outBuffer, 16*threads*blocks));

    char *test1 = (char*)malloc(16*threads*blocks);
    HandleError(cudaMemcpy(test1, input, 16*threads*blocks, cudaMemcpyDeviceToHost));

    printf("First 16 bytes of input data: ");
    for (int i=0; i<16; i++) {
        printf("%d, ", test1[i]);
    }

    cudaEvent_t start, stop;
    HandleError(cudaEventCreate(&start));
    HandleError(cudaEventCreate(&stop));
    HandleError(cudaEventRecord( start, 0 ));

    encrypt<<<blocks,threads>>>(cipher, key, input, inputLen, outBuffer);
    decrypt<<<blocks,threads>>>(cipher, key, outBuffer, inputLen, input);

    HandleError(cudaGetLastError());

    HandleError(cudaEventRecord( stop, 0 ));
    HandleError(cudaEventSynchronize( stop ));
    float elapsedTime;
    HandleError(cudaEventElapsedTime( &elapsedTime, start, stop ));

    printf("\nPerformance: %.6f MB/sec\n", 2*((16*threads*blocks)/1048576)/(elapsedTime/1000));

    HandleError(cudaEventDestroy( start ));
    HandleError(cudaEventDestroy( stop ));

    char *test2 = (char*)malloc(16*threads*blocks);
    HandleError(cudaMemcpy(test2, outBuffer, 16*threads*blocks, cudaMemcpyDeviceToHost));

    printf("After decryption: ");
    for (int i=0; i<16; i++) {
        printf("%d, ", test2[i]);
    }

    free(test1);
    free(test2);

    HandleError(cudaFree(input));
    HandleError(cudaFree(outBuffer));

    return 0;
}[/B]
Anhang anzeigen Twofish.zip

[Edit]
Wenn ich ReverseRoundSubkeys() zu einer __host__ __device__ Funktion mache, dann kriege ich folgenden Error:

snapshot38.png

Kann ich's trotzdem so lassen? Hier meint einer, dass Rekursion auf Hardware ab Compute capability 2.0 unterstützt wird. Kannst du das bitte mal mit deiner GTX 560 Ti testen? Ist halt blöd, dass ich das jetzt wieder nicht testen kann und mit dem Error weiterprogrammieren muss. *seufz*
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent /Twofish

Der rekursive Teil scheint nur für den CFB Modus benutzt zu werden. Da du den aber eh nicht nutzen willst kannst du den ganzen if-Block mit "if (cipher->mode == MODE_CFB1)" rausnehmen. Dann hast du keine Rekusion mehr aber der Code sieht nicht sonderlich gut aus um nach CUDA portiert zu werden. Die blockDecrypt sind wohl eher dazu gedacht gleich mehr Daten zu ver- und entschlüsseln und du willst eigentlich nur einen Block verarbeiten. Außerdem sehe ich grade noch irgendwelche Debug Outputs und sonstwas das man wohl alles rauswerfen müsste.

Außerdem verstehe ich deine Aussage zu fremden CUDA-Code nicht. Der Serpent Code war definitiv nicht für CUDA sonst hättest du nicht überall __device__ und sonstwas ergänzen müssen. Wenn du fertigen CUDA-Code findest ist schon alles fertig und du musst nur noch deine Kernel schreiben und dich nicht um komische Fehler kümmern weil die Funktionen komische Sachen machen die so auf GPU nicht funktionieren.
 
AW: [CUDA] RC6 / Serpent /Twofish

Zu Serpent: Was ich gefunden hatte, war ein .cu File, aber ohne irgendwas darin, das auf CUDA hingedeute hätte. Keine __device__, keine __global__, nichts. Ich war mir dann auch nicht sicher, was das war. C Code in einem .cu File?! Egal ...

Zu Twofish: Tja, ich hätte heute Früh einen anderen Code gehabt, der sich aber aus über 20 einzelnen Files zusammengesetzt hatte. Mit sowas wollte ich echt nicht arbeiten. Ich hatte da gar keinen Überblick. Dann habe ich beim Weitersuchen den hier mit 5 Files gefunden. Meinst du, ich solle hier weiterbasteln, oder lieber noch einen anderen Twofish suchen?
 
AW: [CUDA] RC6 / Serpent /Twofish

^^ Hm, ich habe mir den Code jetzt mal angeschaut und auch die ganzen unnötigen Leerzeilen gelöscht, allerdings findet Nsight Eclipse 83 Error und im Code selber werden zumindest 2 Dinge rot unterwellt:
Code:
/* This is an independent implementation of the encryption algorithm:   */
/*                                                                      */
/*         Twofish by Bruce Schneier and colleagues                     */
/*                                                                      */
/* which is a candidate algorithm in the Advanced Encryption Standard   */
/* programme of the US National Institute of Standards and Technology.  */
/*                                                                      */
/* Copyright in this implementation is held by Dr B R Gladman but I     */
/* hereby give permission for its free direct or derivative use subject */
/* to acknowledgment of its origin and compliance with any conditions   */
/* that the originators of t he algorithm place on its exploitation.     */
/*                                                                      */
/* My thanks to Doug Whiting and Niels Ferguson for comments that led   */
/* to improvements in this implementation.                              */
/*                                                                      */
/* Dr Brian Gladman (gladman@seven77.demon.co.uk) 14th January 1999     */

/* Timing data for Twofish (twofish.c)

128 bit key:
Key Setup:    8414 cycles
Encrypt:       376 cycles =    68.1 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          375 cycles =    68.3 mbits/sec

192 bit key:
Key Setup:   11628 cycles
Encrypt:       376 cycles =    68.1 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          375 cycles =    68.3 mbits/sec

256 bit key:
Key Setup:   15457 cycles
Encrypt:       381 cycles =    67.2 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          378 cycles =    67.8 mbits/sec

*/

#include "std_defs.h"
#include <stdio.h>

#define Q_TABLES
#define M_TABLE
#define MK_TABLE
#define ONE_STEP

static char *alg_name[] = { "twofish", "twofish.c", "twofish" };

char **cipher_name()
{
    [COLOR=red][B]return alg_name;[/B]}

u4byte  k_len;
u4byte  l_key[40];
u4byte  s_key[4];

/* finite field arithmetic for GF(2**8) with the modular    */
/* polynomial x^8 + x^6 + x^5 + x^3 + 1 (0x169)             */

#define G_M 0x0169

u1byte  tab_5b[4] = { 0, G_M >> 2, G_M >> 1, (G_M >> 1) ^ (G_M >> 2) };
u1byte  tab_ef[4] = { 0, (G_M >> 1) ^ (G_M >> 2), G_M >> 1, G_M >> 2 };

#define ffm_01(x)    (x)
#define ffm_5b(x)   ((x) ^ ((x) >> 2) ^ tab_5b[(x) & 3])
#define ffm_ef(x)   ((x) ^ ((x) >> 1) ^ ((x) >> 2) ^ tab_ef[(x) & 3])

u1byte ror4[16] = { 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15 };
u1byte ashx[16] = { 0, 9, 2, 11, 4, 13, 6, 15, 8, 1, 10, 3, 12, 5, 14, 7 };

u1byte qt0[2][16] =
{   { 8, 1, 7, 13, 6, 15, 3, 2, 0, 11, 5, 9, 14, 12, 10, 4 },
    { 2, 8, 11, 13, 15, 7, 6, 14, 3, 1, 9, 4, 0, 10, 12, 5 }
};

u1byte qt1[2][16] =
{   { 14, 12, 11, 8, 1, 2, 3, 5, 15, 4, 10, 6, 7, 0, 9, 13 },
    { 1, 14, 2, 11, 4, 12, 3, 7, 6, 13, 10, 5, 15, 9, 0, 8 }
};

u1byte qt2[2][16] =
{   { 11, 10, 5, 14, 6, 13, 9, 0, 12, 8, 15, 3, 2, 4, 7, 1 },
    { 4, 12, 7, 5, 1, 6, 9, 10, 0, 14, 13, 8, 2, 11, 3, 15 }
};

u1byte qt3[2][16] =
{   { 13, 7, 15, 4, 1, 2, 6, 14, 9, 11, 3, 0, 8, 5, 12, 10 },
    { 11, 9, 5, 1, 12, 3, 13, 14, 6, 4, 7, 15, 2, 0, 8, 10 }
};

u1byte qp(const u4byte n, const u1byte x)
{   u1byte  a0, a1, a2, a3, a4, b0, b1, b2, b3, b4;

    a0 = x >> 4; b0 = x & 15;
    a1 = a0 ^ b0; b1 = ror4[b0] ^ ashx[a0];
    a2 = qt0[n][a1]; b2 = qt1[n][b1];
    a3 = a2 ^ b2; b3 = ror4[b2] ^ ashx[a2];
    a4 = qt2[n][a3]; b4 = qt3[n][b3];

    return (b4 << 4) | a4;
};

#ifdef  Q_TABLES

u4byte  qt_gen = 0;
u1byte  q_tab[2][256];

#define q(n,x)  q_tab[n][x]

void gen_qtab(void)
{   u4byte  i;

    for(i = 0; i < 256; ++i)
    {
        q(0,i) = qp(0, (u1byte)i);
        q(1,i) = qp(1, (u1byte)i);
    }
};

#else
#define q(n,x)  qp(n, x)
#endif

#ifdef  M_TABLE

u4byte  mt_gen = 0;
u4byte  m_tab[4][256];

void gen_mtab(void)
{   u4byte  i, f01, f5b, fef;

    for(i = 0; i < 256; ++i)
    {
        f01 = q(1,i); f5b = ffm_5b(f01); fef = ffm_ef(f01);
        m_tab[0][i] = f01 + (f5b << 8) + (fef << 16) + (fef << 24);
        m_tab[2][i] = f5b + (fef << 8) + (f01 << 16) + (fef << 24);

        f01 = q(0,i); f5b = ffm_5b(f01); fef = ffm_ef(f01);
        m_tab[1][i] = fef + (fef << 8) + (f5b << 16) + (f01 << 24);
        m_tab[3][i] = f5b + (f01 << 8) + (fef << 16) + (f5b << 24);
    }
};

#define mds(n,x)    m_tab[n][x]

#else

#define fm_00   ffm_01
#define fm_10   ffm_5b
#define fm_20   ffm_ef
#define fm_30   ffm_ef

#define q_0(x)  q(1,x)

#define fm_01   ffm_ef
#define fm_11   ffm_ef
#define fm_21   ffm_5b
#define fm_31   ffm_01

#define q_1(x)  q(0,x)

#define fm_02   ffm_5b
#define fm_12   ffm_ef
#define fm_22   ffm_01
#define fm_32   ffm_ef

#define q_2(x)  q(1,x)

#define fm_03   ffm_5b
#define fm_13   ffm_01
#define fm_23   ffm_e
#define fm_33   ffm_5b

#define q_3(x)  q(0,x)

#define f_0(n,x)    ((u4byte)fm_0##n(x))
#define f_1(n,x)    ((u4byte)fm_1##n(x) << 8)
#define f_2(n,x)    ((u4byte)fm_2##n(x) << 16)
#define f_3(n,x)    ((u4byte)fm_3##n(x) << 24)

#define mds(n,x)    f_0(n,q_##n(x)) ^ f_1(n,q_##n(x)) ^ f_2(n,q_##n(x)) ^ f_3(n,q_##n(x))

#endif

u4byte h_fun(const u4byte x, const u4byte key[])
{   u4byte  b0, b1, b2, b3;

#ifndef M_TABLE
    u4byte  m5b_b0, m5b_b1, m5b_b2, m5b_b3;
    u4byte  mef_b0, mef_b1, mef_b2, mef_b3;
#endif

    b0 = byte(x, 0); b1 = byte(x, 1); b2 = byte(x, 2); b3 = byte(x, 3);

    switch(k_len)
    {
    case 4: b0 = q(1, b0) ^ byte(key[3],0);
            b1 = q(0, b1) ^ byte(key[3],1);
            b2 = q(0, b2) ^ byte(key[3],2);
            b3 = q(1, b3) ^ byte(key[3],3);

    case 3: b0 = q(1, b0) ^ byte(key[2],0);
            b1 = q(1, b1) ^ byte(key[2],1);
            b2 = q(0, b2) ^ byte(key[2],2);
            b3 = q(0, b3) ^ byte(key[2],3);

    case 2: b0 = q(0,q(0,b0) ^ byte(key[1],0)) ^ byte(key[0],0);
            b1 = q(0,q(1,b1) ^ byte(key[1],1)) ^ byte(key[0],1);
            b2 = q(1,q(0,b2) ^ byte(key[1],2)) ^ byte(key[0],2);
            b3 = q(1,q(1,b3) ^ byte(key[1],3)) ^ byte(key[0],3);
    }

#ifdef  M_TABLE

    return  mds(0, b0) ^ mds(1, b1) ^ mds(2, b2) ^ mds(3, b3);

#else

    b0 = q(1, b0); b1 = q(0, b1); b2 = q(1, b2); b3 = q(0, b3);
    m5b_b0 = ffm_5b(b0); m5b_b1 = ffm_5b(b1); m5b_b2 = ffm_5b(b2); m5b_b3 = ffm_5b(b3);
    mef_b0 = ffm_ef(b0); mef_b1 = ffm_ef(b1); mef_b2 = ffm_ef(b2); mef_b3 = ffm_ef(b3);
    b0 ^= mef_b1 ^ m5b_b2 ^ m5b_b3; b3 ^= m5b_b0 ^ mef_b1 ^ mef_b2;
    b2 ^= mef_b0 ^ m5b_b1 ^ mef_b3; b1 ^= mef_b0 ^ mef_b2 ^ m5b_b3;

    return b0 | (b3 << 8) | (b2 << 16) | (b1 << 24);

#endif
};

#ifdef  MK_TABLE

#ifdef  ONE_STEP

u4byte  mk_tab[4][256];

#else

u1byte  sb[4][256];

#endif

#define q20(x)  q(0,q(0,x) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q21(x)  q(0,q(1,x) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q22(x)  q(1,q(0,x) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q23(x)  q(1,q(1,x) ^ byte(key[1],3)) ^ byte(key[0],3)

#define q30(x)  q(0,q(0,q(1, x) ^ byte(key[2],0)) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q31(x)  q(0,q(1,q(1, x) ^ byte(key[2],1)) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q32(x)  q(1,q(0,q(0, x) ^ byte(key[2],2)) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q33(x)  q(1,q(1,q(0, x) ^ byte(key[2],3)) ^ byte(key[1],3)) ^ byte(key[0],3)

#define q40(x)  q(0,q(0,q(1, q(1, x) ^ byte(key[3],0)) ^ byte(key[2],0)) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q41(x)  q(0,q(1,q(1, q(0, x) ^ byte(key[3],1)) ^ byte(key[2],1)) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q42(x)  q(1,q(0,q(0, q(0, x) ^ byte(key[3],2)) ^ byte(key[2],2)) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q43(x)  q(1,q(1,q(0, q(1, x) ^ byte(key[3],3)) ^ byte(key[2],3)) ^ byte(key[1],3)) ^ byte(key[0],3)

[COLOR=red][B]gen_mk_tab(u4byte key[])[/B]{   u4byte  i;
    u1byte  by;

    switch(k_len)
    {
    case 2: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;

#ifdef ONE_STEP

                mk_tab[0][i] = mds(0, q20(by)); mk_tab[1][i] = mds(1, q21(by));
                mk_tab[2][i] = mds(2, q22(by)); mk_tab[3][i] = mds(3, q23(by));

#else

                sb[0][i] = q20(by); sb[1][i] = q21(by);
                sb[2][i] = q22(by); sb[3][i] = q23(by);

#endif
            }
            break;

    case 3: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;

#ifdef ONE_STEP

                mk_tab[0][i] = mds(0, q30(by)); mk_tab[1][i] = mds(1, q31(by));
                mk_tab[2][i] = mds(2, q32(by)); mk_tab[3][i] = mds(3, q33(by));

#else

                sb[0][i] = q30(by); sb[1][i] = q31(by);
                sb[2][i] = q32(by); sb[3][i] = q33(by);

#endif
            }
            break;

    case 4: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;

#ifdef ONE_STEP

                mk_tab[0][i] = mds(0, q40(by)); mk_tab[1][i] = mds(1, q41(by));
                mk_tab[2][i] = mds(2, q42(by)); mk_tab[3][i] = mds(3, q43(by));

#else

                sb[0][i] = q40(by); sb[1][i] = q41(by);
                sb[2][i] = q42(by); sb[3][i] = q43(by);

#endif
            }
    }
};

#  ifdef ONE_STEP
#    define g0_fun(x) ( mk_tab[0][byte(x,0)] ^ mk_tab[1][byte(x,1)] \
                      ^ mk_tab[2][byte(x,2)] ^ mk_tab[3][byte(x,3)] )
#    define g1_fun(x) ( mk_tab[0][byte(x,3)] ^ mk_tab[1][byte(x,0)] \
                      ^ mk_tab[2][byte(x,1)] ^ mk_tab[3][byte(x,2)] )
#  else
#    define g0_fun(x) ( mds(0, sb[0][byte(x,0)]) ^ mds(1, sb[1][byte(x,1)]) \
                      ^ mds(2, sb[2][byte(x,2)]) ^ mds(3, sb[3][byte(x,3)]) )
#    define g1_fun(x) ( mds(0, sb[0][byte(x,3)]) ^ mds(1, sb[1][byte(x,0)]) \
                      ^ mds(2, sb[2][byte(x,1)]) ^ mds(3, sb[3][byte(x,2)]) )
#  endif

#else

#define g0_fun(x)   h_fun(x,s_key)
#define g1_fun(x)   h_fun(rotl(x,8),s_key)

#endif

/* The (12,8) Reed Soloman code has the generator polynomial
  g(x) = x^4 + (a + 1/a) * x^3 + a * x^2 + (a + 1/a) * x + 1

where the coefficients are in the finite field GF(2^8) with a
modular polynomial a^8 + a^6 + a^3 + a^2 + 1. To generate the
remainder we have to start with a 12th order polynomial with our
eight input bytes as the coefficients of the 4th to 11th terms.
That is:

  m[7] * x^11 + m[6] * x^10 ... + m[0] * x^4 + 0 * x^3 +... + 0

We then multiply the generator polynomial by m[7] * x^7 and subtract
it - xor in GF(2^8) - from the above to eliminate the x^7 term (the
artihmetic on the coefficients is done in GF(2^8). We then multiply
the generator polynomial by x^6 * coeff(x^10) and use this to remove
the x^10 term. We carry on in this way until the x^4 term is removed
so that we are left with:

  r[3] * x^3 + r[2] * x^2 + r[1] 8 x^1 + r[0]

which give the resulting 4 bytes of the remainder. This is equivalent
to the matrix multiplication in the Twofish description but much faster
to implement.

*/

#define G_MOD   0x0000014d

u4byte mds_rem(u4byte p0, u4byte p1)
{   u4byte  i, t, u;

    for(i = 0; i < 8; ++i)
    {
        t = p1 >> 24;   // get most significant coefficient
        p1 = (p1 << 8) | (p0 >> 24); p0 <<= 8;  // shift others up

        // multiply t by a (the primitive element - i.e. left shift)

        u = (t << 1);

        if(t & 0x80)            // subtract modular polynomial on overflow
            u ^= G_MOD;

        p1 ^= t ^ (u << 16);    // remove t * (a * x^2 + 1)
        u ^= (t >> 1);          // form u = a * t + t / a = t * (a + 1 / a);

        if(t & 0x01)            // add the modular polynomial on underflow
            u ^= G_MOD >> 1;

        p1 ^= (u << 24) | (u << 8); // remove t * (a + 1/a) * (x^3 + x)
    }

    return p1;

};

/* initialise the key schedule from the user supplied key   */

u4byte *set_key(const u4byte in_key[], const u4byte key_len)
{   u4byte  i, a, b, me_key[4], mo_key[4];

#ifdef Q_TABLES

    if(!qt_gen)
    {
        gen_qtab(); qt_gen = 1;
    }

#endif

#ifdef M_TABLE

    if(!mt_gen)
    {
        gen_mtab(); mt_gen = 1;
    }

#endif

    k_len = key_len / 64;   /* 2, 3 or 4 */

    for(i = 0; i < k_len; ++i)
    {
        a = in_key[i + i];     me_key[i] = a;
        b = in_key[i + i + 1]; mo_key[i] = b;
        s_key[k_len - i - 1] = mds_rem(a, b);
    }

    for(i = 0; i < 40; i += 2)
    {
        a = 0x01010101 * i; b = a + 0x01010101;
        a = h_fun(a, me_key);
        b = rotl(h_fun(b, mo_key), 8);
        l_key[i] = a + b;
        l_key[i + 1] = rotl(a + 2 * b, 9);
    }

#ifdef MK_TABLE

    gen_mk_tab(s_key);

#endif

    return l_key;

};

/* encrypt a block of text  */

#define f_rnd(i)                                                    \
    t1 = g1_fun(blk[1]); t0 = g0_fun(blk[0]);                       \
    blk[2] = rotr(blk[2] ^ (t0 + t1 + l_key[4 * (i) + 8]), 1);      \
    blk[3] = rotl(blk[3], 1) ^ (t0 + 2 * t1 + l_key[4 * (i) + 9]);  \
    t1 = g1_fun(blk[3]); t0 = g0_fun(blk[2]);                       \
    blk[0] = rotr(blk[0] ^ (t0 + t1 + l_key[4 * (i) + 10]), 1);     \
    blk[1] = rotl(blk[1], 1) ^ (t0 + 2 * t1 + l_key[4 * (i) + 11])

void encrypt(const u4byte in_blk[4], u4byte out_blk[])
{   u4byte  t0, t1, blk[4];

    blk[0] = in_blk[0] ^ l_key[0];
    blk[1] = in_blk[1] ^ l_key[1];
    blk[2] = in_blk[2] ^ l_key[2];
    blk[3] = in_blk[3] ^ l_key[3];

    f_rnd(0); f_rnd(1); f_rnd(2); f_rnd(3);
    f_rnd(4); f_rnd(5); f_rnd(6); f_rnd(7);

    out_blk[0] = blk[2] ^ l_key[4];
    out_blk[1] = blk[3] ^ l_key[5];
    out_blk[2] = blk[0] ^ l_key[6];
    out_blk[3] = blk[1] ^ l_key[7];
};

/* decrypt a block of text  */

#define i_rnd(i)                                                        \
        t1 = g1_fun(blk[1]); t0 = g0_fun(blk[0]);                       \
        blk[2] = rotl(blk[2], 1) ^ (t0 + t1 + l_key[4 * (i) + 10]);     \
        blk[3] = rotr(blk[3] ^ (t0 + 2 * t1 + l_key[4 * (i) + 11]), 1); \
        t1 = g1_fun(blk[3]); t0 = g0_fun(blk[2]);                       \
        blk[0] = rotl(blk[0], 1) ^ (t0 + t1 + l_key[4 * (i) +  8]);     \
        blk[1] = rotr(blk[1] ^ (t0 + 2 * t1 + l_key[4 * (i) +  9]), 1)

void decrypt(const u4byte in_blk[4], u4byte out_blk[4])
{   u4byte  t0, t1, blk[4];

    blk[0] = in_blk[0] ^ l_key[4];
    blk[1] = in_blk[1] ^ l_key[5];
    blk[2] = in_blk[2] ^ l_key[6];
    blk[3] = in_blk[3] ^ l_key[7];

    i_rnd(7); i_rnd(6); i_rnd(5); i_rnd(4);
    i_rnd(3); i_rnd(2); i_rnd(1); i_rnd(0);

    out_blk[0] = blk[2] ^ l_key[0];
    out_blk[1] = blk[3] ^ l_key[1];
    out_blk[2] = blk[0] ^ l_key[2];
    out_blk[3] = blk[1] ^ l_key[3];
};

int main(void) {

    printf("Hallo!\n");

    return 0;
}
Code:
/* 1. Standard types for AES cryptography source code               */

typedef unsigned char   u1byte; /* an 8 bit unsigned character type */
typedef unsigned short  u2byte; /* a 16 bit unsigned integer type   */
typedef unsigned long   u4byte; /* a 32 bit unsigned integer type   */
typedef signed char     s1byte; /* an 8 bit signed character type   */
typedef signed short    s2byte; /* a 16 bit signed integer type     */
typedef signed long     s4byte; /* a 32 bit signed integer type     */

/* 2. Standard interface for AES cryptographic routines             */
/* These are all based on 32 bit unsigned values and will therefore */
/* require endian conversions for big-endian architectures          */

#ifdef  __cplusplus
    extern "C"
    {
#endif

    char **cipher_name(void);
    u4byte *set_key(const u4byte in_key[], const u4byte key_len);
    void encrypt(const u4byte in_blk[4], u4byte out_blk[4]);
    void decrypt(const u4byte in_blk[4], u4byte out_blk[4]);

#ifdef  __cplusplus
    };
#endif

/* 3. Basic macros for speeding up generic operations               */
/* Circular rotate of 32 bit values                                 */

#ifdef _MSC_VER

#  include <stdlib.h>

#  pragma intrinsic(_lrotr,_lrotl)
#  define rotr(x,n) _lrotr(x,n)
#  define rotl(x,n) _lrotl(x,n)

#else

#define rotr(x,n)   (((x) >> ((int)(n))) | ((x) << (32 - (int)(n))))
#define rotl(x,n)   (((x) << ((int)(n))) | ((x) >> (32 - (int)(n))))

#endif

/* Invert byte order in a 32 bit variable                           */
#define bswap(x)    (rotl(x, 8) & 0x00ff00ff | rotr(x, 8) & 0xff00ff00)

/* Extract byte from a 32 bit quantity (little endian notation)     */
#define byte(x,n)   ((u1byte)((x) >> (8 * n)))

/* For inverting byte order in input/output 32 bit words if needed  */
#ifdef  BLOCK_SWAP
#define BYTE_SWAP
#define WORD_SWAP
#endif

#ifdef  BYTE_SWAP
#define io_swap(x)  bswap(x)
#else
#define io_swap(x)  (x)
#endif

/* For inverting the byte order of input/output blocks if needed    */
#ifdef  WORD_SWAP
#define get_block(x)                            \
    ((u4byte*)(x))[0] = io_swap(in_blk[3]);     \
    ((u4byte*)(x))[1] = io_swap(in_blk[2]);     \
    ((u4byte*)(x))[2] = io_swap(in_blk[1]);     \
    ((u4byte*)(x))[3] = io_swap(in_blk[0])

#define put_block(x)                            \
    out_blk[3] = io_swap(((u4byte*)(x))[0]);    \
    out_blk[2] = io_swap(((u4byte*)(x))[1]);    \
    out_blk[1] = io_swap(((u4byte*)(x))[2]);    \
    out_blk[0] = io_swap(((u4byte*)(x))[3])

#define get_key(x,len)                          \
    ((u4byte*)(x))[4] = ((u4byte*)(x))[5] =     \
    ((u4byte*)(x))[6] = ((u4byte*)(x))[7] = 0;  \
    switch((((len) + 63) / 64)) {                  \

    case 2:                                     \
    ((u4byte*)(x))[0] = io_swap(in_key[3]);     \
    ((u4byte*)(x))[1] = io_swap(in_key[2]);     \
    ((u4byte*)(x))[2] = io_swap(in_key[1]);     \
    ((u4byte*)(x))[3] = io_swap(in_key[0]);     \
    break;                                      \

    case 3:                                     \
    ((u4byte*)(x))[0] = io_swap(in_key[5]);     \
    ((u4byte*)(x))[1] = io_swap(in_key[4]);     \
    ((u4byte*)(x))[2] = io_swap(in_key[3]);     \
    ((u4byte*)(x))[3] = io_swap(in_key[2]);     \
    ((u4byte*)(x))[4] = io_swap(in_key[1]);     \
    ((u4byte*)(x))[5] = io_swap(in_key[0]);     \
    break;                                      \

    case 4:                                     \
    ((u4byte*)(x))[0] = io_swap(in_key[7]);     \
    ((u4byte*)(x))[1] = io_swap(in_key[6]);     \
    ((u4byte*)(x))[2] = io_swap(in_key[5]);     \
    ((u4byte*)(x))[3] = io_swap(in_key[4]);     \
    ((u4byte*)(x))[4] = io_swap(in_key[3]);     \
    ((u4byte*)(x))[5] = io_swap(in_key[2]);     \
    ((u4byte*)(x))[6] = io_swap(in_key[1]);     \
    ((u4byte*)(x))[7] = io_swap(in_key[0]);     \
    }

#else

#define get_block(x)                            \
    ((u4byte*)(x))[0] = io_swap(in_blk[0]);     \
    ((u4byte*)(x))[1] = io_swap(in_blk[1]);     \
    ((u4byte*)(x))[2] = io_swap(in_blk[2]);     \
    ((u4byte*)(x))[3] = io_swap(in_blk[3])

#define put_block(x)                            \
    out_blk[0] = io_swap(((u4byte*)(x))[0]);    \
    out_blk[1] = io_swap(((u4byte*)(x))[1]);    \
    out_blk[2] = io_swap(((u4byte*)(x))[2]);    \
    out_blk[3] = io_swap(((u4byte*)(x))[3])

#define get_key(x,len)                          \
    ((u4byte*)(x))[4] = ((u4byte*)(x))[5] =     \
    ((u4byte*)(x))[6] = ((u4byte*)(x))[7] = 0;  \

   [COLOR=red][B] switch((((len) + 63) / 64)) {   [/B]            \
    case 4:                                     \
    ((u4byte*)(x))[6] = io_swap(in_key[6]);     \
    ((u4byte*)(x))[7] = io_swap(in_key[7]);     \

    case 3:                                     \
    ((u4byte*)(x))[4] = io_swap(in_key[4]);     \
    ((u4byte*)(x))[5] = io_swap(in_key[5]);     \

    case 2:                                     \
    ((u4byte*)(x))[0] = io_swap(in_key[0]);     \
    ((u4byte*)(x))[1] = io_swap(in_key[1]);     \
    ((u4byte*)(x))[2] = io_swap(in_key[2]);     \
    ((u4byte*)(x))[3] = io_swap(in_key[3]);     \
    }

#endif
Im .cu File heißt's beim ersten roten: identifier alg_name is undefined
Beim zweiten roten: explicit type is missing ("int" assumed) --> könnte ich mit int oder void lösen. Was nehmen?

Im .h File heißt's beim switch: expected a declaration.
Was soll ich denn einfügen? Ist eh alles da?! Habe auch schon die Klammern abgezählt; müsste stimmen?!

Und jedes Mal, wenn ich den Code ausführen will, sehe ich das hier:

snapshot39.png

[Edit]
Im .cu File scheint der Compiler nicht über Zeile 48 hinauszukommen. Da kommt ne Meldung von wegen "previous syntax error", aber vor Zeile 48 gibt's doch noch fast nichts, wo was falsch sein könnte. Ich sehe davor keinen Fehler.

[Edit2]
Zeile 40 scheint das Problem zu sein. Wenn ich statt #include "std_defs.h" sage: #include "../std_defs.h", dann verschwinden in beiden Files die Errors, aber das .h File wird vom .cu File verständlicherweise nicht mehr gefunden. :(
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent /Twofish

Bitte sag mir, dass du nicht die ganzen doppelten Zeilenumbrüche von Hand weggemacht hast? In der std_defs.h, bei dem switch ist nämlich eine falsche Leerzeile, die nicht mit \ endet und damit das Makro beendet. Hier Anhang anzeigen twofish.zip mal die von mir bereinigte Version. Der einzige Fehler neben den doppelten Zeilenumbrüchen war ein fehlendes "void" bei der Funktion "gen_mk_tab(u4byte key[])". So compiliert aber wenigstens alles ohne Fehler oder Warnungen.
 
AW: [CUDA] RC6 / Serpent /Twofish

Okay, danke. Abgesehen von zig warnings habe ich's bis auf 2 Errors hingekriegt:
Code:
/* This is an independent implementation of the encryption algorithm:   */
/*                                                                      */
/*         Twofish by Bruce Schneier and colleagues                     */
/*                                                                      */
/* which is a candidate algorithm in the Advanced Encryption Standard   */
/* programme of the US National Institute of Standards and Technology.  */
/*                                                                      */
/* Copyright in this implementation is held by Dr B R Gladman but I     */
/* hereby give permission for its free direct or derivative use subject */
/* to acknowledgment of its origin and compliance with any conditions   */
/* that the originators of t he algorithm place on its exploitation.    */
/*                                                                      */
/* My thanks to Doug Whiting and Niels Ferguson for comments that led   */
/* to improvements in this implementation.                              */
/*                                                                      */
/* Dr Brian Gladman (gladman@seven77.demon.co.uk) 14th January 1999     */

/* Timing data for Twofish (twofish.c)

128 bit key:
Key Setup:    8414 cycles
Encrypt:       376 cycles =    68.1 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          375 cycles =    68.3 mbits/sec

192 bit key:
Key Setup:   11628 cycles
Encrypt:       376 cycles =    68.1 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          375 cycles =    68.3 mbits/sec

256 bit key:
Key Setup:   15457 cycles
Encrypt:       381 cycles =    67.2 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          378 cycles =    67.8 mbits/sec

*/

#include "std_defs.h"
[COLOR=royalblue][B]#include <stdio.h>[/B]
#define Q_TABLES
#define M_TABLE
#define MK_TABLE
#define ONE_STEP

[COLOR=royalblue][B]#define HandleError(x) HandleErrorImpl(x, __FILE__, __LINE__)[/B]
static char *alg_name[] = { "twofish", "twofish.c", "twofish" };

char **cipher_name()
{
    return alg_name;
}

u4byte  k_len;
u4byte  l_key[40];
u4byte  s_key[4];

/* finite field arithmetic for GF(2**8) with the modular    */
/* polynomial x^8 + x^6 + x^5 + x^3 + 1 (0x169)             */

#define G_M 0x0169

u1byte  tab_5b[4] = { 0, G_M >> 2, G_M >> 1, (G_M >> 1) ^ (G_M >> 2) };
u1byte  tab_ef[4] = { 0, (G_M >> 1) ^ (G_M >> 2), G_M >> 1, G_M >> 2 };

#define ffm_01(x)    (x)
#define ffm_5b(x)   ((x) ^ ((x) >> 2) ^ tab_5b[(x) & 3])
#define ffm_ef(x)   ((x) ^ ((x) >> 1) ^ ((x) >> 2) ^ tab_ef[(x) & 3])

u1byte ror4[16] = { 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15 };
u1byte ashx[16] = { 0, 9, 2, 11, 4, 13, 6, 15, 8, 1, 10, 3, 12, 5, 14, 7 };

u1byte qt0[2][16] =
{   { 8, 1, 7, 13, 6, 15, 3, 2, 0, 11, 5, 9, 14, 12, 10, 4 },
    { 2, 8, 11, 13, 15, 7, 6, 14, 3, 1, 9, 4, 0, 10, 12, 5 }
};

u1byte qt1[2][16] =
{   { 14, 12, 11, 8, 1, 2, 3, 5, 15, 4, 10, 6, 7, 0, 9, 13 },
    { 1, 14, 2, 11, 4, 12, 3, 7, 6, 13, 10, 5, 15, 9, 0, 8 }
};

u1byte qt2[2][16] =
{   { 11, 10, 5, 14, 6, 13, 9, 0, 12, 8, 15, 3, 2, 4, 7, 1 },
    { 4, 12, 7, 5, 1, 6, 9, 10, 0, 14, 13, 8, 2, 11, 3, 15 }
};

u1byte qt3[2][16] =
{   { 13, 7, 15, 4, 1, 2, 6, 14, 9, 11, 3, 0, 8, 5, 12, 10 },
    { 11, 9, 5, 1, 12, 3, 13, 14, 6, 4, 7, 15, 2, 0, 8, 10 }
};

u1byte qp(const u4byte n, const u1byte x)
{   u1byte  a0, a1, a2, a3, a4, b0, b1, b2, b3, b4;

    a0 = x >> 4; b0 = x & 15;
    a1 = a0 ^ b0; b1 = ror4[b0] ^ ashx[a0];
    a2 = qt0[n][a1]; b2 = qt1[n][b1];
    a3 = a2 ^ b2; b3 = ror4[b2] ^ ashx[a2];
    a4 = qt2[n][a3]; b4 = qt3[n][b3];
    return (b4 << 4) | a4;
};

#ifdef  Q_TABLES

u4byte  qt_gen = 0;
u1byte  q_tab[2][256];

#define q(n,x)  q_tab[n][x]

void gen_qtab(void)
{   u4byte  i;

    for(i = 0; i < 256; ++i)
    {
        q(0,i) = qp(0, (u1byte)i);
        q(1,i) = qp(1, (u1byte)i);
    }
};

#else

#define q(n,x)  qp(n, x)

#endif

#ifdef  M_TABLE

u4byte  mt_gen = 0;
u4byte  m_tab[4][256];

void gen_mtab(void)
{   u4byte  i, f01, f5b, fef;

    for(i = 0; i < 256; ++i)
    {
        f01 = q(1,i); f5b = ffm_5b(f01); fef = ffm_ef(f01);
        m_tab[0][i] = f01 + (f5b << 8) + (fef << 16) + (fef << 24);
        m_tab[2][i] = f5b + (fef << 8) + (f01 << 16) + (fef << 24);

        f01 = q(0,i); f5b = ffm_5b(f01); fef = ffm_ef(f01);
        m_tab[1][i] = fef + (fef << 8) + (f5b << 16) + (f01 << 24);
        m_tab[3][i] = f5b + (f01 << 8) + (fef << 16) + (f5b << 24);
    }
};

#define mds(n,x)    m_tab[n][x]

#else

#define fm_00   ffm_01
#define fm_10   ffm_5b
#define fm_20   ffm_ef
#define fm_30   ffm_ef
#define q_0(x)  q(1,x)

#define fm_01   ffm_ef
#define fm_11   ffm_ef
#define fm_21   ffm_5b
#define fm_31   ffm_01
#define q_1(x)  q(0,x)

#define fm_02   ffm_5b
#define fm_12   ffm_ef
#define fm_22   ffm_01
#define fm_32   ffm_ef
#define q_2(x)  q(1,x)

#define fm_03   ffm_5b
#define fm_13   ffm_01
#define fm_23   ffm_ef
#define fm_33   ffm_5b
#define q_3(x)  q(0,x)

#define f_0(n,x)    ((u4byte)fm_0##n(x))
#define f_1(n,x)    ((u4byte)fm_1##n(x) << 8)
#define f_2(n,x)    ((u4byte)fm_2##n(x) << 16)
#define f_3(n,x)    ((u4byte)fm_3##n(x) << 24)

#define mds(n,x)    f_0(n,q_##n(x)) ^ f_1(n,q_##n(x)) ^ f_2(n,q_##n(x)) ^ f_3(n,q_##n(x))

#endif

u4byte h_fun(const u4byte x, const u4byte key[])
{   u4byte  b0, b1, b2, b3;

#ifndef M_TABLE
    u4byte  m5b_b0, m5b_b1, m5b_b2, m5b_b3;
    u4byte  mef_b0, mef_b1, mef_b2, mef_b3;
#endif

    b0 = byte(x, 0); b1 = byte(x, 1); b2 = byte(x, 2); b3 = byte(x, 3);

    switch(k_len)
    {
    case 4: b0 = q(1, b0) ^ byte(key[3],0);
            b1 = q(0, b1) ^ byte(key[3],1);
            b2 = q(0, b2) ^ byte(key[3],2);
            b3 = q(1, b3) ^ byte(key[3],3);
    case 3: b0 = q(1, b0) ^ byte(key[2],0);
            b1 = q(1, b1) ^ byte(key[2],1);
            b2 = q(0, b2) ^ byte(key[2],2);
            b3 = q(0, b3) ^ byte(key[2],3);
    case 2: b0 = q(0,q(0,b0) ^ byte(key[1],0)) ^ byte(key[0],0);
            b1 = q(0,q(1,b1) ^ byte(key[1],1)) ^ byte(key[0],1);
            b2 = q(1,q(0,b2) ^ byte(key[1],2)) ^ byte(key[0],2);
            b3 = q(1,q(1,b3) ^ byte(key[1],3)) ^ byte(key[0],3);
    }
#ifdef  M_TABLE

    return  mds(0, b0) ^ mds(1, b1) ^ mds(2, b2) ^ mds(3, b3);

#else

    b0 = q(1, b0); b1 = q(0, b1); b2 = q(1, b2); b3 = q(0, b3);
    m5b_b0 = ffm_5b(b0); m5b_b1 = ffm_5b(b1); m5b_b2 = ffm_5b(b2); m5b_b3 = ffm_5b(b3);
    mef_b0 = ffm_ef(b0); mef_b1 = ffm_ef(b1); mef_b2 = ffm_ef(b2); mef_b3 = ffm_ef(b3);
    b0 ^= mef_b1 ^ m5b_b2 ^ m5b_b3; b3 ^= m5b_b0 ^ mef_b1 ^ mef_b2;
    b2 ^= mef_b0 ^ m5b_b1 ^ mef_b3; b1 ^= mef_b0 ^ mef_b2 ^ m5b_b3;

    return b0 | (b3 << 8) | (b2 << 16) | (b1 << 24);

#endif
};

#ifdef  MK_TABLE

#ifdef  ONE_STEP
u4byte  mk_tab[4][256];
#else
u1byte  sb[4][256];
#endif

#define q20(x)  q(0,q(0,x) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q21(x)  q(0,q(1,x) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q22(x)  q(1,q(0,x) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q23(x)  q(1,q(1,x) ^ byte(key[1],3)) ^ byte(key[0],3)

#define q30(x)  q(0,q(0,q(1, x) ^ byte(key[2],0)) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q31(x)  q(0,q(1,q(1, x) ^ byte(key[2],1)) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q32(x)  q(1,q(0,q(0, x) ^ byte(key[2],2)) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q33(x)  q(1,q(1,q(0, x) ^ byte(key[2],3)) ^ byte(key[1],3)) ^ byte(key[0],3)

#define q40(x)  q(0,q(0,q(1, q(1, x) ^ byte(key[3],0)) ^ byte(key[2],0)) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q41(x)  q(0,q(1,q(1, q(0, x) ^ byte(key[3],1)) ^ byte(key[2],1)) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q42(x)  q(1,q(0,q(0, q(0, x) ^ byte(key[3],2)) ^ byte(key[2],2)) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q43(x)  q(1,q(1,q(0, q(1, x) ^ byte(key[3],3)) ^ byte(key[2],3)) ^ byte(key[1],3)) ^ byte(key[0],3)

void gen_mk_tab(u4byte key[])
{   u4byte  i;
    u1byte  by;

    switch(k_len)
    {
    case 2: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q20(by)); mk_tab[1][i] = mds(1, q21(by));
                mk_tab[2][i] = mds(2, q22(by)); mk_tab[3][i] = mds(3, q23(by));
#else
                sb[0][i] = q20(by); sb[1][i] = q21(by);
                sb[2][i] = q22(by); sb[3][i] = q23(by);
#endif
            }
            break;

    case 3: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q30(by)); mk_tab[1][i] = mds(1, q31(by));
                mk_tab[2][i] = mds(2, q32(by)); mk_tab[3][i] = mds(3, q33(by));
#else
                sb[0][i] = q30(by); sb[1][i] = q31(by);
                sb[2][i] = q32(by); sb[3][i] = q33(by);
#endif
            }
            break;

    case 4: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q40(by)); mk_tab[1][i] = mds(1, q41(by));
                mk_tab[2][i] = mds(2, q42(by)); mk_tab[3][i] = mds(3, q43(by));
#else
                sb[0][i] = q40(by); sb[1][i] = q41(by);
                sb[2][i] = q42(by); sb[3][i] = q43(by);
#endif
            }
    }
};

#  ifdef ONE_STEP
#    define g0_fun(x) ( mk_tab[0][byte(x,0)] ^ mk_tab[1][byte(x,1)] \
                      ^ mk_tab[2][byte(x,2)] ^ mk_tab[3][byte(x,3)] )
#    define g1_fun(x) ( mk_tab[0][byte(x,3)] ^ mk_tab[1][byte(x,0)] \
                      ^ mk_tab[2][byte(x,1)] ^ mk_tab[3][byte(x,2)] )
#  else
#    define g0_fun(x) ( mds(0, sb[0][byte(x,0)]) ^ mds(1, sb[1][byte(x,1)]) \
                      ^ mds(2, sb[2][byte(x,2)]) ^ mds(3, sb[3][byte(x,3)]) )
#    define g1_fun(x) ( mds(0, sb[0][byte(x,3)]) ^ mds(1, sb[1][byte(x,0)]) \
                      ^ mds(2, sb[2][byte(x,1)]) ^ mds(3, sb[3][byte(x,2)]) )
#  endif

#else

#define g0_fun(x)   h_fun(x,s_key)
#define g1_fun(x)   h_fun(rotl(x,8),s_key)

#endif

/* The (12,8) Reed Soloman code has the generator polynomial

  g(x) = x^4 + (a + 1/a) * x^3 + a * x^2 + (a + 1/a) * x + 1

where the coefficients are in the finite field GF(2^8) with a
modular polynomial a^8 + a^6 + a^3 + a^2 + 1. To generate the
remainder we have to start with a 12th order polynomial with our
eight input bytes as the coefficients of the 4th to 11th terms.
That is:

  m[7] * x^11 + m[6] * x^10 ... + m[0] * x^4 + 0 * x^3 +... + 0

We then multiply the generator polynomial by m[7] * x^7 and subtract
it - xor in GF(2^8) - from the above to eliminate the x^7 term (the
artihmetic on the coefficients is done in GF(2^8). We then multiply
the generator polynomial by x^6 * coeff(x^10) and use this to remove
the x^10 term. We carry on in this way until the x^4 term is removed
so that we are left with:

  r[3] * x^3 + r[2] * x^2 + r[1] 8 x^1 + r[0]

which give the resulting 4 bytes of the remainder. This is equivalent
to the matrix multiplication in the Twofish description but much faster
to implement.

*/

#define G_MOD   0x0000014d

u4byte mds_rem(u4byte p0, u4byte p1)
{   u4byte  i, t, u;

    for(i = 0; i < 8; ++i)
    {
        t = p1 >> 24;   // get most significant coefficient

        p1 = (p1 << 8) | (p0 >> 24); p0 <<= 8;  // shift others up

        // multiply t by a (the primitive element - i.e. left shift)

        u = (t << 1);

        if(t & 0x80)            // subtract modular polynomial on overflow

            u ^= G_MOD;

        p1 ^= t ^ (u << 16);    // remove t * (a * x^2 + 1)

        u ^= (t >> 1);          // form u = a * t + t / a = t * (a + 1 / a);

        if(t & 0x01)            // add the modular polynomial on underflow

            u ^= G_MOD >> 1;

        p1 ^= (u << 24) | (u << 8); // remove t * (a + 1/a) * (x^3 + x)
    }

    return p1;
};

/* initialise the key schedule from the user supplied key   */

u4byte *set_key(const u4byte in_key[], const u4byte key_len)
{   u4byte  i, a, b, me_key[4], mo_key[4];

#ifdef Q_TABLES
    if(!qt_gen)
    {
        gen_qtab(); qt_gen = 1;
    }
#endif

#ifdef M_TABLE
    if(!mt_gen)
    {
        gen_mtab(); mt_gen = 1;
    }
#endif

    k_len = key_len / 64;   /* 2, 3 or 4 */

    for(i = 0; i < k_len; ++i)
    {
        a = in_key[i + i];     me_key[i] = a;
        b = in_key[i + i + 1]; mo_key[i] = b;
        s_key[k_len - i - 1] = mds_rem(a, b);
    }

    for(i = 0; i < 40; i += 2)
    {
        a = 0x01010101 * i; b = a + 0x01010101;
        a = h_fun(a, me_key);
        b = rotl(h_fun(b, mo_key), 8);
        l_key[i] = a + b;
        l_key[i + 1] = rotl(a + 2 * b, 9);
    }

#ifdef MK_TABLE
    gen_mk_tab(s_key);
#endif

    return l_key;
};

/* encrypt a block of text  */

#define f_rnd(i)                                                    \
    t1 = g1_fun(blk[1]); t0 = g0_fun(blk[0]);                       \
    blk[2] = rotr(blk[2] ^ (t0 + t1 + l_key[4 * (i) + 8]), 1);      \
    blk[3] = rotl(blk[3], 1) ^ (t0 + 2 * t1 + l_key[4 * (i) + 9]);  \
    t1 = g1_fun(blk[3]); t0 = g0_fun(blk[2]);                       \
    blk[0] = rotr(blk[0] ^ (t0 + t1 + l_key[4 * (i) + 10]), 1);     \
    blk[1] = rotl(blk[1], 1) ^ (t0 + 2 * t1 + l_key[4 * (i) + 11])

[COLOR=royalblue][B]__device__[/B] void encrypt(const u4byte in_blk[4], u4byte out_blk[4])
{   u4byte  t0, t1, blk[4];

    blk[0] = in_blk[0] ^ l_key[0];
    blk[1] = in_blk[1] ^ l_key[1];
    blk[2] = in_blk[2] ^ l_key[2];
    blk[3] = in_blk[3] ^ l_key[3];

    f_rnd(0); f_rnd(1); f_rnd(2); f_rnd(3);
    f_rnd(4); f_rnd(5); f_rnd(6); f_rnd(7);

    out_blk[0] = blk[2] ^ l_key[4];
    out_blk[1] = blk[3] ^ l_key[5];
    out_blk[2] = blk[0] ^ l_key[6];
    out_blk[3] = blk[1] ^ l_key[7];
};

/* decrypt a block of text  */

#define i_rnd(i)                                                        \
        t1 = g1_fun(blk[1]); t0 = g0_fun(blk[0]);                       \
        blk[2] = rotl(blk[2], 1) ^ (t0 + t1 + l_key[4 * (i) + 10]);     \
        blk[3] = rotr(blk[3] ^ (t0 + 2 * t1 + l_key[4 * (i) + 11]), 1); \
        t1 = g1_fun(blk[3]); t0 = g0_fun(blk[2]);                       \
        blk[0] = rotl(blk[0], 1) ^ (t0 + t1 + l_key[4 * (i) +  8]);     \
        blk[1] = rotr(blk[1] ^ (t0 + 2 * t1 + l_key[4 * (i) +  9]), 1)

[COLOR=royalblue][B]__device__ [/B]void decrypt(const u4byte in_blk[4], u4byte out_blk[4])
{   u4byte  t0, t1, blk[4];

    blk[0] = in_blk[0] ^ l_key[4];
    blk[1] = in_blk[1] ^ l_key[5];
    blk[2] = in_blk[2] ^ l_key[6];
    blk[3] = in_blk[3] ^ l_key[7];

    i_rnd(7); i_rnd(6); i_rnd(5); i_rnd(4);
    i_rnd(3); i_rnd(2); i_rnd(1); i_rnd(0);

    out_blk[0] = blk[2] ^ l_key[0];
    out_blk[1] = blk[3] ^ l_key[1];
    out_blk[2] = blk[0] ^ l_key[2];
    out_blk[3] = blk[1] ^ l_key[3];
};
[COLOR=royalblue][B]
__global__ void twofish_enc(const u4byte in_blk[4], u4byte out_blk[4]) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    encrypt(in_blk[4] + 16*b, out_blk[4] + 16*b);
}

__global__ void twofish_dec(const u4byte in_blk[4], u4byte out_blk[4]) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    decrypt(in_blk[4] + 16*b, out_blk[4] + 16*b);
}

inline void HandleErrorImpl(cudaError error, const char* file, int line) {
    if (error != cudaSuccess) {
        printf("%s: %d %s", file, line, cudaGetErrorString(error));
        exit(1);
    }
}

int main(void) {
    char* key = "meingeheimespasswort";
    int threads = 512;
    int blocks = 12800;

    u4byte* in_blk[4];
    HandleError(cudaMalloc(&in_blk[4], 16*threads*blocks));
    HandleError(cudaMemset(in_blk[4], 0, 16*threads*blocks));

    u4byte* out_blk[4];
    HandleError(cudaMalloc(&out_blk[4], 16*threads*blocks));

    char *test1 = (char*)malloc(16*threads*blocks);
    HandleError(cudaMemcpy(test1, in_blk[4], 16*threads*blocks, cudaMemcpyDeviceToHost));

    printf("First 16 bytes of input data: ");
    for (int i=0; i<16; i++) {
        printf("%d, ", test1[i]);
    }

    cudaEvent_t start, stop;
    HandleError(cudaEventCreate(&start));
    HandleError(cudaEventCreate(&stop));
    HandleError(cudaEventRecord( start, 0 ));

    twofish_enc<<<blocks,threads>>>(in_blk[4], out_blk[4]);
    twofish_dec<<<blocks,threads>>>(out_blk[4], in_blk[4]);

    HandleError(cudaGetLastError());

    HandleError(cudaEventRecord( stop, 0 ));
    HandleError(cudaEventSynchronize( stop ));
    float elapsedTime;
    HandleError(cudaEventElapsedTime( &elapsedTime, start, stop ));

    printf("\nPerformance: %.6f MB/sec\n", 2*((16*threads*blocks)/1048576)/(elapsedTime/1000));

    HandleError(cudaEventDestroy( start ));
    HandleError(cudaEventDestroy( stop ));

    char *test2 = (char*)malloc(16*threads*blocks);
    HandleError(cudaMemcpy(test2, in_blk[4], 16*threads*blocks, cudaMemcpyDeviceToHost));

    printf("After decryption: ");
    for (int i=0; i<16; i++) {
        printf("%d, ", test2[i]);
    }

    free(test1);
    free(test2);

    HandleError(cudaFree(in_blk[4]));
    HandleError(cudaFree(out_blk[4]));

    return 0;
}[/B]
snapshot40.png

^^ Wenn wir die noch wegbringen, fehlt in Twofish m. E. nur noch die Schlüsselinitialisierung ...

[Edit]
Key-Initialisierung habe ich in der main gerade noch so gelöst:
Code:
set_key((const u4byte*)key, strlen(key));
Scheint zu funktionieren.
 
AW: [CUDA] RC6 / Serpent /Twofish

Du kannst nicht einfach überall das mit dem Array so stehen lassen. Deine twofish_enc/dec Funktionen sollten als Parameter nur ein Pointer auf ein u4byte haben. Außerdem muss dann die Berechnung der Positionen etwas angepasst werden. In Twofish ist ein Block scheinbar auch 16 Bytes aber du arbeitest hier schon mit 4 Byte großen Typen also musst du pro Block nur 4 Werte weiter gehen. So würde denn etwa die twofish_enc Funktion aussehen:
Code:
__global__ void twofish_enc(const u4byte* in_blk, u4byte* out_blk) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    encrypt(in_blk + 4*b, out_blk + 4*b);
}
Gleiche Sache in der main Funktion. Du willst kein Array mit vier u4byte Pointern drin sondern nur ein Pointer auf u4byte. So etwa
Code:
    u4byte* in_blk;
    HandleError(cudaMalloc(&in_blk, 16*threads*blocks));
    HandleError(cudaMemset(in_blk, 0, 16*threads*blocks));
Die Sache mit der Schlüsselinitialisierung scheint wieder etwas fummeliger zu sein, da der Schlüssel nicht an die Funktionen übergeben wird sondern als globale Variable vorliegt (twofish.cu Zeile 55-57). Ich würde sagen das einfachste ist diese 3 Variablen mit __device__ zu markieren und die set_key Funktion auch zu einer device Funktion zu machen.

Edit: Irgendwie scheint diese Implementierung doch nicht so besonders gut geeignet zu sein nach CUDA portiert zu werden, da es viele globale Variablen gibt.
 
Zuletzt bearbeitet:
AW: [CUDA] RC6 / Serpent /Twofish

So, das hat jetzt die alten Errors weg gebracht, aber neue mit sich gebracht ... *seufz*
Code:
/* This is an independent implementation of the encryption algorithm:   */
/*                                                                      */
/*         Twofish by Bruce Schneier and colleagues                     */
/*                                                                      */
/* which is a candidate algorithm in the Advanced Encryption Standard   */
/* programme of the US National Institute of Standards and Technology.  */
/*                                                                      */
/* Copyright in this implementation is held by Dr B R Gladman but I     */
/* hereby give permission for its free direct or derivative use subject */
/* to acknowledgment of its origin and compliance with any conditions   */
/* that the originators of t he algorithm place on its exploitation.    */
/*                                                                      */
/* My thanks to Doug Whiting and Niels Ferguson for comments that led   */
/* to improvements in this implementation.                              */
/*                                                                      */
/* Dr Brian Gladman (gladman@seven77.demon.co.uk) 14th January 1999     */

/* Timing data for Twofish (twofish.c)

128 bit key:
Key Setup:    8414 cycles
Encrypt:       376 cycles =    68.1 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          375 cycles =    68.3 mbits/sec

192 bit key:
Key Setup:   11628 cycles
Encrypt:       376 cycles =    68.1 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          375 cycles =    68.3 mbits/sec

256 bit key:
Key Setup:   15457 cycles
Encrypt:       381 cycles =    67.2 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          378 cycles =    67.8 mbits/sec

*/

#include "std_defs.h"
#include <stdio.h>

#define Q_TABLES
#define M_TABLE
#define MK_TABLE
#define ONE_STEP

#define HandleError(x) HandleErrorImpl(x, __FILE__, __LINE__)

static char *alg_name[] = { "twofish", "twofish.c", "twofish" };

char **cipher_name()
{
    return alg_name;
}

__device__ u4byte  k_len;
__device__ u4byte  l_key[40];
__device__ u4byte  s_key[4];

/* finite field arithmetic for GF(2**8) with the modular    */
/* polynomial x^8 + x^6 + x^5 + x^3 + 1 (0x169)             */

#define G_M 0x0169

u1byte  tab_5b[4] = { 0, G_M >> 2, G_M >> 1, (G_M >> 1) ^ (G_M >> 2) };
u1byte  tab_ef[4] = { 0, (G_M >> 1) ^ (G_M >> 2), G_M >> 1, G_M >> 2 };

#define ffm_01(x)    (x)
#define ffm_5b(x)   ((x) ^ ((x) >> 2) ^ tab_5b[(x) & 3])
#define ffm_ef(x)   ((x) ^ ((x) >> 1) ^ ((x) >> 2) ^ tab_ef[(x) & 3])

u1byte ror4[16] = { 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15 };
u1byte ashx[16] = { 0, 9, 2, 11, 4, 13, 6, 15, 8, 1, 10, 3, 12, 5, 14, 7 };

u1byte qt0[2][16] =
{   { 8, 1, 7, 13, 6, 15, 3, 2, 0, 11, 5, 9, 14, 12, 10, 4 },
    { 2, 8, 11, 13, 15, 7, 6, 14, 3, 1, 9, 4, 0, 10, 12, 5 }
};

u1byte qt1[2][16] =
{   { 14, 12, 11, 8, 1, 2, 3, 5, 15, 4, 10, 6, 7, 0, 9, 13 },
    { 1, 14, 2, 11, 4, 12, 3, 7, 6, 13, 10, 5, 15, 9, 0, 8 }
};

u1byte qt2[2][16] =
{   { 11, 10, 5, 14, 6, 13, 9, 0, 12, 8, 15, 3, 2, 4, 7, 1 },
    { 4, 12, 7, 5, 1, 6, 9, 10, 0, 14, 13, 8, 2, 11, 3, 15 }
};

u1byte qt3[2][16] =
{   { 13, 7, 15, 4, 1, 2, 6, 14, 9, 11, 3, 0, 8, 5, 12, 10 },
    { 11, 9, 5, 1, 12, 3, 13, 14, 6, 4, 7, 15, 2, 0, 8, 10 }
};

u1byte qp(const u4byte n, const u1byte x)
{   u1byte  a0, a1, a2, a3, a4, b0, b1, b2, b3, b4;

    a0 = x >> 4; b0 = x & 15;
    a1 = a0 ^ b0; b1 = ror4[b0] ^ ashx[a0];
    a2 = qt0[n][a1]; b2 = qt1[n][b1];
    a3 = a2 ^ b2; b3 = ror4[b2] ^ ashx[a2];
    a4 = qt2[n][a3]; b4 = qt3[n][b3];
    return (b4 << 4) | a4;
};

#ifdef  Q_TABLES

u4byte  qt_gen = 0;
u1byte  q_tab[2][256];

#define q(n,x)  q_tab[n][x]

void gen_qtab(void)
{   u4byte  i;

    for(i = 0; i < 256; ++i)
    {
        q(0,i) = qp(0, (u1byte)i);
        q(1,i) = qp(1, (u1byte)i);
    }
};

#else

#define q(n,x)  qp(n, x)

#endif

#ifdef  M_TABLE

u4byte  mt_gen = 0;
u4byte  m_tab[4][256];

void gen_mtab(void)
{   u4byte  i, f01, f5b, fef;

    for(i = 0; i < 256; ++i)
    {
        f01 = q(1,i); f5b = ffm_5b(f01); fef = ffm_ef(f01);
        m_tab[0][i] = f01 + (f5b << 8) + (fef << 16) + (fef << 24);
        m_tab[2][i] = f5b + (fef << 8) + (f01 << 16) + (fef << 24);

        f01 = q(0,i); f5b = ffm_5b(f01); fef = ffm_ef(f01);
        m_tab[1][i] = fef + (fef << 8) + (f5b << 16) + (f01 << 24);
        m_tab[3][i] = f5b + (f01 << 8) + (fef << 16) + (f5b << 24);
    }
};

#define mds(n,x)    m_tab[n][x]

#else

#define fm_00   ffm_01
#define fm_10   ffm_5b
#define fm_20   ffm_ef
#define fm_30   ffm_ef
#define q_0(x)  q(1,x)

#define fm_01   ffm_ef
#define fm_11   ffm_ef
#define fm_21   ffm_5b
#define fm_31   ffm_01
#define q_1(x)  q(0,x)

#define fm_02   ffm_5b
#define fm_12   ffm_ef
#define fm_22   ffm_01
#define fm_32   ffm_ef
#define q_2(x)  q(1,x)

#define fm_03   ffm_5b
#define fm_13   ffm_01
#define fm_23   ffm_ef
#define fm_33   ffm_5b
#define q_3(x)  q(0,x)

#define f_0(n,x)    ((u4byte)fm_0##n(x))
#define f_1(n,x)    ((u4byte)fm_1##n(x) << 8)
#define f_2(n,x)    ((u4byte)fm_2##n(x) << 16)
#define f_3(n,x)    ((u4byte)fm_3##n(x) << 24)

#define mds(n,x)    f_0(n,q_##n(x)) ^ f_1(n,q_##n(x)) ^ f_2(n,q_##n(x)) ^ f_3(n,q_##n(x))

#endif

u4byte h_fun(const u4byte x, const u4byte key[])
{   u4byte  b0, b1, b2, b3;

#ifndef M_TABLE
    u4byte  m5b_b0, m5b_b1, m5b_b2, m5b_b3;
    u4byte  mef_b0, mef_b1, mef_b2, mef_b3;
#endif

    b0 = byte(x, 0); b1 = byte(x, 1); b2 = byte(x, 2); b3 = byte(x, 3);

    switch(k_len)
    {
    case 4: b0 = q(1, b0) ^ byte(key[3],0);
            b1 = q(0, b1) ^ byte(key[3],1);
            b2 = q(0, b2) ^ byte(key[3],2);
            b3 = q(1, b3) ^ byte(key[3],3);
    case 3: b0 = q(1, b0) ^ byte(key[2],0);
            b1 = q(1, b1) ^ byte(key[2],1);
            b2 = q(0, b2) ^ byte(key[2],2);
            b3 = q(0, b3) ^ byte(key[2],3);
    case 2: b0 = q(0,q(0,b0) ^ byte(key[1],0)) ^ byte(key[0],0);
            b1 = q(0,q(1,b1) ^ byte(key[1],1)) ^ byte(key[0],1);
            b2 = q(1,q(0,b2) ^ byte(key[1],2)) ^ byte(key[0],2);
            b3 = q(1,q(1,b3) ^ byte(key[1],3)) ^ byte(key[0],3);
    }
#ifdef  M_TABLE

    return  mds(0, b0) ^ mds(1, b1) ^ mds(2, b2) ^ mds(3, b3);

#else

    b0 = q(1, b0); b1 = q(0, b1); b2 = q(1, b2); b3 = q(0, b3);
    m5b_b0 = ffm_5b(b0); m5b_b1 = ffm_5b(b1); m5b_b2 = ffm_5b(b2); m5b_b3 = ffm_5b(b3);
    mef_b0 = ffm_ef(b0); mef_b1 = ffm_ef(b1); mef_b2 = ffm_ef(b2); mef_b3 = ffm_ef(b3);
    b0 ^= mef_b1 ^ m5b_b2 ^ m5b_b3; b3 ^= m5b_b0 ^ mef_b1 ^ mef_b2;
    b2 ^= mef_b0 ^ m5b_b1 ^ mef_b3; b1 ^= mef_b0 ^ mef_b2 ^ m5b_b3;

    return b0 | (b3 << 8) | (b2 << 16) | (b1 << 24);

#endif
};

#ifdef  MK_TABLE

#ifdef  ONE_STEP
u4byte  mk_tab[4][256];
#else
u1byte  sb[4][256];
#endif

#define q20(x)  q(0,q(0,x) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q21(x)  q(0,q(1,x) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q22(x)  q(1,q(0,x) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q23(x)  q(1,q(1,x) ^ byte(key[1],3)) ^ byte(key[0],3)

#define q30(x)  q(0,q(0,q(1, x) ^ byte(key[2],0)) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q31(x)  q(0,q(1,q(1, x) ^ byte(key[2],1)) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q32(x)  q(1,q(0,q(0, x) ^ byte(key[2],2)) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q33(x)  q(1,q(1,q(0, x) ^ byte(key[2],3)) ^ byte(key[1],3)) ^ byte(key[0],3)

#define q40(x)  q(0,q(0,q(1, q(1, x) ^ byte(key[3],0)) ^ byte(key[2],0)) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q41(x)  q(0,q(1,q(1, q(0, x) ^ byte(key[3],1)) ^ byte(key[2],1)) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q42(x)  q(1,q(0,q(0, q(0, x) ^ byte(key[3],2)) ^ byte(key[2],2)) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q43(x)  q(1,q(1,q(0, q(1, x) ^ byte(key[3],3)) ^ byte(key[2],3)) ^ byte(key[1],3)) ^ byte(key[0],3)

void gen_mk_tab(u4byte key[])
{   u4byte  i;
    u1byte  by;

    switch(k_len)
    {
    case 2: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q20(by)); mk_tab[1][i] = mds(1, q21(by));
                mk_tab[2][i] = mds(2, q22(by)); mk_tab[3][i] = mds(3, q23(by));
#else
                sb[0][i] = q20(by); sb[1][i] = q21(by);
                sb[2][i] = q22(by); sb[3][i] = q23(by);
#endif
            }
            break;

    case 3: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q30(by)); mk_tab[1][i] = mds(1, q31(by));
                mk_tab[2][i] = mds(2, q32(by)); mk_tab[3][i] = mds(3, q33(by));
#else
                sb[0][i] = q30(by); sb[1][i] = q31(by);
                sb[2][i] = q32(by); sb[3][i] = q33(by);
#endif
            }
            break;

    case 4: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q40(by)); mk_tab[1][i] = mds(1, q41(by));
                mk_tab[2][i] = mds(2, q42(by)); mk_tab[3][i] = mds(3, q43(by));
#else
                sb[0][i] = q40(by); sb[1][i] = q41(by);
                sb[2][i] = q42(by); sb[3][i] = q43(by);
#endif
            }
    }
};

#  ifdef ONE_STEP
#    define g0_fun(x) ( mk_tab[0][byte(x,0)] ^ mk_tab[1][byte(x,1)] \
                      ^ mk_tab[2][byte(x,2)] ^ mk_tab[3][byte(x,3)] )
#    define g1_fun(x) ( mk_tab[0][byte(x,3)] ^ mk_tab[1][byte(x,0)] \
                      ^ mk_tab[2][byte(x,1)] ^ mk_tab[3][byte(x,2)] )
#  else
#    define g0_fun(x) ( mds(0, sb[0][byte(x,0)]) ^ mds(1, sb[1][byte(x,1)]) \
                      ^ mds(2, sb[2][byte(x,2)]) ^ mds(3, sb[3][byte(x,3)]) )
#    define g1_fun(x) ( mds(0, sb[0][byte(x,3)]) ^ mds(1, sb[1][byte(x,0)]) \
                      ^ mds(2, sb[2][byte(x,1)]) ^ mds(3, sb[3][byte(x,2)]) )
#  endif

#else

#define g0_fun(x)   h_fun(x,s_key)
#define g1_fun(x)   h_fun(rotl(x,8),s_key)

#endif

/* The (12,8) Reed Soloman code has the generator polynomial

  g(x) = x^4 + (a + 1/a) * x^3 + a * x^2 + (a + 1/a) * x + 1

where the coefficients are in the finite field GF(2^8) with a
modular polynomial a^8 + a^6 + a^3 + a^2 + 1. To generate the
remainder we have to start with a 12th order polynomial with our
eight input bytes as the coefficients of the 4th to 11th terms.
That is:

  m[7] * x^11 + m[6] * x^10 ... + m[0] * x^4 + 0 * x^3 +... + 0

We then multiply the generator polynomial by m[7] * x^7 and subtract
it - xor in GF(2^8) - from the above to eliminate the x^7 term (the
artihmetic on the coefficients is done in GF(2^8). We then multiply
the generator polynomial by x^6 * coeff(x^10) and use this to remove
the x^10 term. We carry on in this way until the x^4 term is removed
so that we are left with:

  r[3] * x^3 + r[2] * x^2 + r[1] 8 x^1 + r[0]

which give the resulting 4 bytes of the remainder. This is equivalent
to the matrix multiplication in the Twofish description but much faster
to implement.

*/

#define G_MOD   0x0000014d

u4byte mds_rem(u4byte p0, u4byte p1)
{   u4byte  i, t, u;

    for(i = 0; i < 8; ++i)
    {
        t = p1 >> 24;   // get most significant coefficient

        p1 = (p1 << 8) | (p0 >> 24); p0 <<= 8;  // shift others up

        // multiply t by a (the primitive element - i.e. left shift)

        u = (t << 1);

        if(t & 0x80)            // subtract modular polynomial on overflow

            u ^= G_MOD;

        p1 ^= t ^ (u << 16);    // remove t * (a * x^2 + 1)

        u ^= (t >> 1);          // form u = a * t + t / a = t * (a + 1 / a);

        if(t & 0x01)            // add the modular polynomial on underflow

            u ^= G_MOD >> 1;

        p1 ^= (u << 24) | (u << 8); // remove t * (a + 1/a) * (x^3 + x)
    }

    return p1;
};

/* initialise the key schedule from the user supplied key   */

__device__ u4byte *set_key(const u4byte in_key[], const u4byte key_len)
{   u4byte  i, a, b, me_key[4], mo_key[4];

#ifdef Q_TABLES
    [COLOR=red][B]if(!qt_gen)[/B]    {
        gen_qtab(); qt_gen = 1;
    }
#endif

#ifdef M_TABLE
    [COLOR=red][B]if(!mt_gen)[/B]    {
        gen_mtab(); mt_gen = 1;
    }
#endif

    k_len = key_len / 64;   /* 2, 3 or 4 */

    for(i = 0; i < k_len; ++i)
    {
        a = in_key[i + i];     me_key[i] = a;
        b = in_key[i + i + 1]; mo_key[i] = b;
        s_key[k_len - i - 1] = mds_rem(a, b);
    }

    for(i = 0; i < 40; i += 2)
    {
        a = 0x01010101 * i; b = a + 0x01010101;
        a = h_fun(a, me_key);
        b = rotl(h_fun(b, mo_key), 8);
        l_key[i] = a + b;
        l_key[i + 1] = rotl(a + 2 * b, 9);
    }

#ifdef MK_TABLE
    gen_mk_tab(s_key);
#endif

    return l_key;
};

/* encrypt a block of text  */

#define f_rnd(i)                                                    \
    t1 = g1_fun(blk[1]); t0 = g0_fun(blk[0]);                       \
    blk[2] = rotr(blk[2] ^ (t0 + t1 + l_key[4 * (i) + 8]), 1);      \
    blk[3] = rotl(blk[3], 1) ^ (t0 + 2 * t1 + l_key[4 * (i) + 9]);  \
    t1 = g1_fun(blk[3]); t0 = g0_fun(blk[2]);                       \
    blk[0] = rotr(blk[0] ^ (t0 + t1 + l_key[4 * (i) + 10]), 1);     \
    blk[1] = rotl(blk[1], 1) ^ (t0 + 2 * t1 + l_key[4 * (i) + 11])

__device__ void encrypt(const u4byte in_blk[4], u4byte out_blk[4])
{   u4byte  t0, t1, blk[4];

    blk[0] = in_blk[0] ^ l_key[0];
    blk[1] = in_blk[1] ^ l_key[1];
    blk[2] = in_blk[2] ^ l_key[2];
    blk[3] = in_blk[3] ^ l_key[3];

    [COLOR=red][B]f_rnd(0); f_rnd(1); f_rnd(2); f_rnd(3);[/B]    f_rnd(4); f_rnd(5); f_rnd(6); f_rnd(7);

    out_blk[0] = blk[2] ^ l_key[4];
    out_blk[1] = blk[3] ^ l_key[5];
    out_blk[2] = blk[0] ^ l_key[6];
    out_blk[3] = blk[1] ^ l_key[7];
};

/* decrypt a block of text  */

#define i_rnd(i)                                                        \
        t1 = g1_fun(blk[1]); t0 = g0_fun(blk[0]);                       \
        blk[2] = rotl(blk[2], 1) ^ (t0 + t1 + l_key[4 * (i) + 10]);     \
        blk[3] = rotr(blk[3] ^ (t0 + 2 * t1 + l_key[4 * (i) + 11]), 1); \
        t1 = g1_fun(blk[3]); t0 = g0_fun(blk[2]);                       \
        blk[0] = rotl(blk[0], 1) ^ (t0 + t1 + l_key[4 * (i) +  8]);     \
        blk[1] = rotr(blk[1] ^ (t0 + 2 * t1 + l_key[4 * (i) +  9]), 1)

__device__ void decrypt(const u4byte in_blk[4], u4byte out_blk[4])
{   u4byte  t0, t1, blk[4];

    blk[0] = in_blk[0] ^ l_key[4];
    blk[1] = in_blk[1] ^ l_key[5];
    blk[2] = in_blk[2] ^ l_key[6];
    blk[3] = in_blk[3] ^ l_key[7];

   [COLOR=red][B] i_rnd(7); i_rnd(6); i_rnd(5); i_rnd(4);[/B]    i_rnd(3); i_rnd(2); i_rnd(1); i_rnd(0);

    out_blk[0] = blk[2] ^ l_key[0];
    out_blk[1] = blk[3] ^ l_key[1];
    out_blk[2] = blk[0] ^ l_key[2];
    out_blk[3] = blk[1] ^ l_key[3];
};

__global__ void twofish_enc(const u4byte in_blk[4], u4byte out_blk[4]) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    encrypt(in_blk + 4*b, out_blk + 4*b);
}

__global__ void twofish_dec(const u4byte in_blk[4], u4byte out_blk[4]) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    decrypt(in_blk + 4*b, out_blk + 4*b);
}

inline void HandleErrorImpl(cudaError error, const char* file, int line) {
    if (error != cudaSuccess) {
        printf("%s: %d %s", file, line, cudaGetErrorString(error));
        exit(1);
    }
}

int main(void) {
    char* key = "meingeheimespasswort";
    int threads = 512;
    int blocks = 12800;
    set_key((const u4byte*)key, strlen(key));

    u4byte* in_blk;
    HandleError(cudaMalloc(&in_blk, 16*threads*blocks));
    HandleError(cudaMemset(in_blk, 0, 16*threads*blocks));

    u4byte* out_blk;
    HandleError(cudaMalloc(&out_blk, 16*threads*blocks));

    char *test1 = (char*)malloc(16*threads*blocks);
    HandleError(cudaMemcpy(test1, in_blk, 16*threads*blocks, cudaMemcpyDeviceToHost));

    printf("First 16 bytes of input data: ");
    for (int i=0; i<16; i++) {
        printf("%d, ", test1[i]);
    }

    cudaEvent_t start, stop;
    HandleError(cudaEventCreate(&start));
    HandleError(cudaEventCreate(&stop));
    HandleError(cudaEventRecord( start, 0 ));

    twofish_enc<<<blocks,threads>>>(in_blk, out_blk);
    twofish_dec<<<blocks,threads>>>(out_blk, in_blk);

    HandleError(cudaGetLastError());

    HandleError(cudaEventRecord( stop, 0 ));
    HandleError(cudaEventSynchronize( stop ));
    float elapsedTime;
    HandleError(cudaEventElapsedTime( &elapsedTime, start, stop ));

    printf("\nPerformance: %.6f MB/sec\n", 2*((16*threads*blocks)/1048576)/(elapsedTime/1000));

    HandleError(cudaEventDestroy( start ));
    HandleError(cudaEventDestroy( stop ));

    char *test2 = (char*)malloc(16*threads*blocks);
    HandleError(cudaMemcpy(test2, in_blk, 16*threads*blocks, cudaMemcpyDeviceToHost));

    printf("After decryption: ");
    for (int i=0; i<16; i++) {
        printf("%d, ", test2[i]);
    }

    free(test1);
    free(test2);

    HandleError(cudaFree(in_blk));
    HandleError(cudaFree(out_blk));

    return 0;
}
^^ In den 4 roten Zeilen heißt's jetzt, dass die jeweiligen identifier in device code undefined sind ...
 
AW: [CUDA] RC6 / Serpent /Twofish

Ja musste mal die Datei durchgehen und an alle globalen Variablen ein __device__ klatschen und hoffen, dass es geht.
 
AW: [CUDA] RC6 / Serpent /Twofish

Ich hoffe, ich habe alles richtig erwischt, aber bis auf zig warnings schaut's jetzt ganz gut aus:
Code:
/* This is an independent implementation of the encryption algorithm:   */
/*                                                                      */
/*         Twofish by Bruce Schneier and colleagues                     */
/*                                                                      */
/* which is a candidate algorithm in the Advanced Encryption Standard   */
/* programme of the US National Institute of Standards and Technology.  */
/*                                                                      */
/* Copyright in this implementation is held by Dr B R Gladman but I     */
/* hereby give permission for its free direct or derivative use subject */
/* to acknowledgment of its origin and compliance with any conditions   */
/* that the originators of t he algorithm place on its exploitation.    */
/*                                                                      */
/* My thanks to Doug Whiting and Niels Ferguson for comments that led   */
/* to improvements in this implementation.                              */
/*                                                                      */
/* Dr Brian Gladman (gladman@seven77.demon.co.uk) 14th January 1999     */

/* Timing data for Twofish (twofish.c)

128 bit key:
Key Setup:    8414 cycles
Encrypt:       376 cycles =    68.1 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          375 cycles =    68.3 mbits/sec

192 bit key:
Key Setup:   11628 cycles
Encrypt:       376 cycles =    68.1 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          375 cycles =    68.3 mbits/sec

256 bit key:
Key Setup:   15457 cycles
Encrypt:       381 cycles =    67.2 mbits/sec
Decrypt:       374 cycles =    68.4 mbits/sec
Mean:          378 cycles =    67.8 mbits/sec

*/

#include "std_defs.h"
#include <stdio.h>

#define Q_TABLES
#define M_TABLE
#define MK_TABLE
#define ONE_STEP

#define HandleError(x) HandleErrorImpl(x, __FILE__, __LINE__)

static char *alg_name[] = { "twofish", "twofish.c", "twofish" };

char **cipher_name()
{
    return alg_name;
}

__device__ u4byte  k_len;
__device__ u4byte  l_key[40];
__device__ u4byte  s_key[4];

/* finite field arithmetic for GF(2**8) with the modular    */
/* polynomial x^8 + x^6 + x^5 + x^3 + 1 (0x169)             */

#define G_M 0x0169

__device__ u1byte  tab_5b[4] = { 0, G_M >> 2, G_M >> 1, (G_M >> 1) ^ (G_M >> 2) };
__device__ u1byte  tab_ef[4] = { 0, (G_M >> 1) ^ (G_M >> 2), G_M >> 1, G_M >> 2 };

#define ffm_01(x)    (x)
#define ffm_5b(x)   ((x) ^ ((x) >> 2) ^ tab_5b[(x) & 3])
#define ffm_ef(x)   ((x) ^ ((x) >> 1) ^ ((x) >> 2) ^ tab_ef[(x) & 3])

__device__ u1byte ror4[16] = { 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15 };
__device__ u1byte ashx[16] = { 0, 9, 2, 11, 4, 13, 6, 15, 8, 1, 10, 3, 12, 5, 14, 7 };

__device__ u1byte qt0[2][16] =
{   { 8, 1, 7, 13, 6, 15, 3, 2, 0, 11, 5, 9, 14, 12, 10, 4 },
    { 2, 8, 11, 13, 15, 7, 6, 14, 3, 1, 9, 4, 0, 10, 12, 5 }
};

__device__ u1byte qt1[2][16] =
{   { 14, 12, 11, 8, 1, 2, 3, 5, 15, 4, 10, 6, 7, 0, 9, 13 },
    { 1, 14, 2, 11, 4, 12, 3, 7, 6, 13, 10, 5, 15, 9, 0, 8 }
};

__device__ u1byte qt2[2][16] =
{   { 11, 10, 5, 14, 6, 13, 9, 0, 12, 8, 15, 3, 2, 4, 7, 1 },
    { 4, 12, 7, 5, 1, 6, 9, 10, 0, 14, 13, 8, 2, 11, 3, 15 }
};

__device__ u1byte qt3[2][16] =
{   { 13, 7, 15, 4, 1, 2, 6, 14, 9, 11, 3, 0, 8, 5, 12, 10 },
    { 11, 9, 5, 1, 12, 3, 13, 14, 6, 4, 7, 15, 2, 0, 8, 10 }
};

u1byte qp(const u4byte n, const u1byte x)
{   u1byte  a0, a1, a2, a3, a4, b0, b1, b2, b3, b4;

    a0 = x >> 4; b0 = x & 15;
    a1 = a0 ^ b0; b1 = ror4[b0] ^ ashx[a0];
    a2 = qt0[n][a1]; b2 = qt1[n][b1];
    a3 = a2 ^ b2; b3 = ror4[b2] ^ ashx[a2];
    a4 = qt2[n][a3]; b4 = qt3[n][b3];
    return (b4 << 4) | a4;
};

#ifdef  Q_TABLES

__device__ u4byte  qt_gen = 0;
__device__ u1byte  q_tab[2][256];

#define q(n,x)  q_tab[n][x]

void gen_qtab(void)
{   u4byte  i;

    for(i = 0; i < 256; ++i)
    {
        q(0,i) = qp(0, (u1byte)i);
        q(1,i) = qp(1, (u1byte)i);
    }
};

#else

#define q(n,x)  qp(n, x)

#endif

#ifdef  M_TABLE

__device__ u4byte  mt_gen = 0;
__device__ u4byte  m_tab[4][256];

void gen_mtab(void)
{   u4byte  i, f01, f5b, fef;

    for(i = 0; i < 256; ++i)
    {
        f01 = q(1,i); f5b = ffm_5b(f01); fef = ffm_ef(f01);
        m_tab[0][i] = f01 + (f5b << 8) + (fef << 16) + (fef << 24);
        m_tab[2][i] = f5b + (fef << 8) + (f01 << 16) + (fef << 24);

        f01 = q(0,i); f5b = ffm_5b(f01); fef = ffm_ef(f01);
        m_tab[1][i] = fef + (fef << 8) + (f5b << 16) + (f01 << 24);
        m_tab[3][i] = f5b + (f01 << 8) + (fef << 16) + (f5b << 24);
    }
};

#define mds(n,x)    m_tab[n][x]

#else

#define fm_00   ffm_01
#define fm_10   ffm_5b
#define fm_20   ffm_ef
#define fm_30   ffm_ef
#define q_0(x)  q(1,x)

#define fm_01   ffm_ef
#define fm_11   ffm_ef
#define fm_21   ffm_5b
#define fm_31   ffm_01
#define q_1(x)  q(0,x)

#define fm_02   ffm_5b
#define fm_12   ffm_ef
#define fm_22   ffm_01
#define fm_32   ffm_ef
#define q_2(x)  q(1,x)

#define fm_03   ffm_5b
#define fm_13   ffm_01
#define fm_23   ffm_ef
#define fm_33   ffm_5b
#define q_3(x)  q(0,x)

#define f_0(n,x)    ((u4byte)fm_0##n(x))
#define f_1(n,x)    ((u4byte)fm_1##n(x) << 8)
#define f_2(n,x)    ((u4byte)fm_2##n(x) << 16)
#define f_3(n,x)    ((u4byte)fm_3##n(x) << 24)

#define mds(n,x)    f_0(n,q_##n(x)) ^ f_1(n,q_##n(x)) ^ f_2(n,q_##n(x)) ^ f_3(n,q_##n(x))

#endif

u4byte h_fun(const u4byte x, const u4byte key[])
{   u4byte  b0, b1, b2, b3;

#ifndef M_TABLE
    u4byte  m5b_b0, m5b_b1, m5b_b2, m5b_b3;
    u4byte  mef_b0, mef_b1, mef_b2, mef_b3;
#endif

    b0 = byte(x, 0); b1 = byte(x, 1); b2 = byte(x, 2); b3 = byte(x, 3);

    switch(k_len)
    {
    case 4: b0 = q(1, b0) ^ byte(key[3],0);
            b1 = q(0, b1) ^ byte(key[3],1);
            b2 = q(0, b2) ^ byte(key[3],2);
            b3 = q(1, b3) ^ byte(key[3],3);
    case 3: b0 = q(1, b0) ^ byte(key[2],0);
            b1 = q(1, b1) ^ byte(key[2],1);
            b2 = q(0, b2) ^ byte(key[2],2);
            b3 = q(0, b3) ^ byte(key[2],3);
    case 2: b0 = q(0,q(0,b0) ^ byte(key[1],0)) ^ byte(key[0],0);
            b1 = q(0,q(1,b1) ^ byte(key[1],1)) ^ byte(key[0],1);
            b2 = q(1,q(0,b2) ^ byte(key[1],2)) ^ byte(key[0],2);
            b3 = q(1,q(1,b3) ^ byte(key[1],3)) ^ byte(key[0],3);
    }
#ifdef  M_TABLE

    return  mds(0, b0) ^ mds(1, b1) ^ mds(2, b2) ^ mds(3, b3);

#else

    b0 = q(1, b0); b1 = q(0, b1); b2 = q(1, b2); b3 = q(0, b3);
    m5b_b0 = ffm_5b(b0); m5b_b1 = ffm_5b(b1); m5b_b2 = ffm_5b(b2); m5b_b3 = ffm_5b(b3);
    mef_b0 = ffm_ef(b0); mef_b1 = ffm_ef(b1); mef_b2 = ffm_ef(b2); mef_b3 = ffm_ef(b3);
    b0 ^= mef_b1 ^ m5b_b2 ^ m5b_b3; b3 ^= m5b_b0 ^ mef_b1 ^ mef_b2;
    b2 ^= mef_b0 ^ m5b_b1 ^ mef_b3; b1 ^= mef_b0 ^ mef_b2 ^ m5b_b3;

    return b0 | (b3 << 8) | (b2 << 16) | (b1 << 24);

#endif
};

#ifdef  MK_TABLE

#ifdef  ONE_STEP
__device__ u4byte  mk_tab[4][256];
#else
__device__ u1byte  sb[4][256];
#endif

#define q20(x)  q(0,q(0,x) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q21(x)  q(0,q(1,x) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q22(x)  q(1,q(0,x) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q23(x)  q(1,q(1,x) ^ byte(key[1],3)) ^ byte(key[0],3)

#define q30(x)  q(0,q(0,q(1, x) ^ byte(key[2],0)) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q31(x)  q(0,q(1,q(1, x) ^ byte(key[2],1)) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q32(x)  q(1,q(0,q(0, x) ^ byte(key[2],2)) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q33(x)  q(1,q(1,q(0, x) ^ byte(key[2],3)) ^ byte(key[1],3)) ^ byte(key[0],3)

#define q40(x)  q(0,q(0,q(1, q(1, x) ^ byte(key[3],0)) ^ byte(key[2],0)) ^ byte(key[1],0)) ^ byte(key[0],0)
#define q41(x)  q(0,q(1,q(1, q(0, x) ^ byte(key[3],1)) ^ byte(key[2],1)) ^ byte(key[1],1)) ^ byte(key[0],1)
#define q42(x)  q(1,q(0,q(0, q(0, x) ^ byte(key[3],2)) ^ byte(key[2],2)) ^ byte(key[1],2)) ^ byte(key[0],2)
#define q43(x)  q(1,q(1,q(0, q(1, x) ^ byte(key[3],3)) ^ byte(key[2],3)) ^ byte(key[1],3)) ^ byte(key[0],3)

void gen_mk_tab(u4byte key[])
{   u4byte  i;
    u1byte  by;

    switch(k_len)
    {
    case 2: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q20(by)); mk_tab[1][i] = mds(1, q21(by));
                mk_tab[2][i] = mds(2, q22(by)); mk_tab[3][i] = mds(3, q23(by));
#else
                sb[0][i] = q20(by); sb[1][i] = q21(by);
                sb[2][i] = q22(by); sb[3][i] = q23(by);
#endif
            }
            break;

    case 3: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q30(by)); mk_tab[1][i] = mds(1, q31(by));
                mk_tab[2][i] = mds(2, q32(by)); mk_tab[3][i] = mds(3, q33(by));
#else
                sb[0][i] = q30(by); sb[1][i] = q31(by);
                sb[2][i] = q32(by); sb[3][i] = q33(by);
#endif
            }
            break;

    case 4: for(i = 0; i < 256; ++i)
            {
                by = (u1byte)i;
#ifdef ONE_STEP
                mk_tab[0][i] = mds(0, q40(by)); mk_tab[1][i] = mds(1, q41(by));
                mk_tab[2][i] = mds(2, q42(by)); mk_tab[3][i] = mds(3, q43(by));
#else
                sb[0][i] = q40(by); sb[1][i] = q41(by);
                sb[2][i] = q42(by); sb[3][i] = q43(by);
#endif
            }
    }
};

#  ifdef ONE_STEP
#    define g0_fun(x) ( mk_tab[0][byte(x,0)] ^ mk_tab[1][byte(x,1)] \
                      ^ mk_tab[2][byte(x,2)] ^ mk_tab[3][byte(x,3)] )
#    define g1_fun(x) ( mk_tab[0][byte(x,3)] ^ mk_tab[1][byte(x,0)] \
                      ^ mk_tab[2][byte(x,1)] ^ mk_tab[3][byte(x,2)] )
#  else
#    define g0_fun(x) ( mds(0, sb[0][byte(x,0)]) ^ mds(1, sb[1][byte(x,1)]) \
                      ^ mds(2, sb[2][byte(x,2)]) ^ mds(3, sb[3][byte(x,3)]) )
#    define g1_fun(x) ( mds(0, sb[0][byte(x,3)]) ^ mds(1, sb[1][byte(x,0)]) \
                      ^ mds(2, sb[2][byte(x,1)]) ^ mds(3, sb[3][byte(x,2)]) )
#  endif

#else

#define g0_fun(x)   h_fun(x,s_key)
#define g1_fun(x)   h_fun(rotl(x,8),s_key)

#endif

/* The (12,8) Reed Soloman code has the generator polynomial

  g(x) = x^4 + (a + 1/a) * x^3 + a * x^2 + (a + 1/a) * x + 1

where the coefficients are in the finite field GF(2^8) with a
modular polynomial a^8 + a^6 + a^3 + a^2 + 1. To generate the
remainder we have to start with a 12th order polynomial with our
eight input bytes as the coefficients of the 4th to 11th terms.
That is:

  m[7] * x^11 + m[6] * x^10 ... + m[0] * x^4 + 0 * x^3 +... + 0

We then multiply the generator polynomial by m[7] * x^7 and subtract
it - xor in GF(2^8) - from the above to eliminate the x^7 term (the
artihmetic on the coefficients is done in GF(2^8). We then multiply
the generator polynomial by x^6 * coeff(x^10) and use this to remove
the x^10 term. We carry on in this way until the x^4 term is removed
so that we are left with:

  r[3] * x^3 + r[2] * x^2 + r[1] 8 x^1 + r[0]

which give the resulting 4 bytes of the remainder. This is equivalent
to the matrix multiplication in the Twofish description but much faster
to implement.

*/

#define G_MOD   0x0000014d

u4byte mds_rem(u4byte p0, u4byte p1)
{   u4byte  i, t, u;

    for(i = 0; i < 8; ++i)
    {
        t = p1 >> 24;   // get most significant coefficient

        p1 = (p1 << 8) | (p0 >> 24); p0 <<= 8;  // shift others up

        // multiply t by a (the primitive element - i.e. left shift)

        u = (t << 1);

        if(t & 0x80)            // subtract modular polynomial on overflow

            u ^= G_MOD;

        p1 ^= t ^ (u << 16);    // remove t * (a * x^2 + 1)

        u ^= (t >> 1);          // form u = a * t + t / a = t * (a + 1 / a);

        if(t & 0x01)            // add the modular polynomial on underflow

            u ^= G_MOD >> 1;

        p1 ^= (u << 24) | (u << 8); // remove t * (a + 1/a) * (x^3 + x)
    }

    return p1;
};

/* initialise the key schedule from the user supplied key   */

u4byte *set_key(const u4byte in_key[], const u4byte key_len)
{   u4byte  i, a, b, me_key[4], mo_key[4];

#ifdef Q_TABLES
    if(!qt_gen)
    {
        gen_qtab(); qt_gen = 1;
    }
#endif

#ifdef M_TABLE
    if(!mt_gen)
    {
        gen_mtab(); mt_gen = 1;
    }
#endif

    k_len = key_len / 64;   /* 2, 3 or 4 */

    for(i = 0; i < k_len; ++i)
    {
        a = in_key[i + i];     me_key[i] = a;
        b = in_key[i + i + 1]; mo_key[i] = b;
        s_key[k_len - i - 1] = mds_rem(a, b);
    }

    for(i = 0; i < 40; i += 2)
    {
        a = 0x01010101 * i; b = a + 0x01010101;
        a = h_fun(a, me_key);
        b = rotl(h_fun(b, mo_key), 8);
        l_key[i] = a + b;
        l_key[i + 1] = rotl(a + 2 * b, 9);
    }

#ifdef MK_TABLE
    gen_mk_tab(s_key);
#endif

    return l_key;
};

/* encrypt a block of text  */

#define f_rnd(i)                                                    \
    t1 = g1_fun(blk[1]); t0 = g0_fun(blk[0]);                       \
    blk[2] = rotr(blk[2] ^ (t0 + t1 + l_key[4 * (i) + 8]), 1);      \
    blk[3] = rotl(blk[3], 1) ^ (t0 + 2 * t1 + l_key[4 * (i) + 9]);  \
    t1 = g1_fun(blk[3]); t0 = g0_fun(blk[2]);                       \
    blk[0] = rotr(blk[0] ^ (t0 + t1 + l_key[4 * (i) + 10]), 1);     \
    blk[1] = rotl(blk[1], 1) ^ (t0 + 2 * t1 + l_key[4 * (i) + 11])

__device__ void encrypt(const u4byte in_blk[4], u4byte out_blk[4])
{   u4byte  t0, t1, blk[4];

    blk[0] = in_blk[0] ^ l_key[0];
    blk[1] = in_blk[1] ^ l_key[1];
    blk[2] = in_blk[2] ^ l_key[2];
    blk[3] = in_blk[3] ^ l_key[3];

    f_rnd(0); f_rnd(1); f_rnd(2); f_rnd(3);
    f_rnd(4); f_rnd(5); f_rnd(6); f_rnd(7);

    out_blk[0] = blk[2] ^ l_key[4];
    out_blk[1] = blk[3] ^ l_key[5];
    out_blk[2] = blk[0] ^ l_key[6];
    out_blk[3] = blk[1] ^ l_key[7];
};

/* decrypt a block of text  */

#define i_rnd(i)                                                        \
        t1 = g1_fun(blk[1]); t0 = g0_fun(blk[0]);                       \
        blk[2] = rotl(blk[2], 1) ^ (t0 + t1 + l_key[4 * (i) + 10]);     \
        blk[3] = rotr(blk[3] ^ (t0 + 2 * t1 + l_key[4 * (i) + 11]), 1); \
        t1 = g1_fun(blk[3]); t0 = g0_fun(blk[2]);                       \
        blk[0] = rotl(blk[0], 1) ^ (t0 + t1 + l_key[4 * (i) +  8]);     \
        blk[1] = rotr(blk[1] ^ (t0 + 2 * t1 + l_key[4 * (i) +  9]), 1)

__device__ void decrypt(const u4byte in_blk[4], u4byte out_blk[4])
{   u4byte  t0, t1, blk[4];

    blk[0] = in_blk[0] ^ l_key[4];
    blk[1] = in_blk[1] ^ l_key[5];
    blk[2] = in_blk[2] ^ l_key[6];
    blk[3] = in_blk[3] ^ l_key[7];

    i_rnd(7); i_rnd(6); i_rnd(5); i_rnd(4);
    i_rnd(3); i_rnd(2); i_rnd(1); i_rnd(0);

    out_blk[0] = blk[2] ^ l_key[0];
    out_blk[1] = blk[3] ^ l_key[1];
    out_blk[2] = blk[0] ^ l_key[2];
    out_blk[3] = blk[1] ^ l_key[3];
};

__global__ void twofish_enc(const u4byte in_blk[4], u4byte out_blk[4]) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    encrypt(in_blk + 4*b, out_blk + 4*b);
}

__global__ void twofish_dec(const u4byte in_blk[4], u4byte out_blk[4]) {
    int b = blockIdx.x*blockDim.x + threadIdx.x;
    decrypt(in_blk + 4*b, out_blk + 4*b);
}

inline void HandleErrorImpl(cudaError error, const char* file, int line) {
    if (error != cudaSuccess) {
        printf("%s: %d %s", file, line, cudaGetErrorString(error));
        exit(1);
    }
}

int main(void) {
    char* key = "meingeheimespasswort";
    int threads = 512;
    int blocks = 12800;
    set_key((const u4byte*)key, strlen(key));

    u4byte* in_blk;
    HandleError(cudaMalloc(&in_blk, 16*threads*blocks));
    HandleError(cudaMemset(in_blk, 0, 16*threads*blocks));

    u4byte* out_blk;
    HandleError(cudaMalloc(&out_blk, 16*threads*blocks));

    char *test1 = (char*)malloc(16*threads*blocks);
    HandleError(cudaMemcpy(test1, in_blk, 16*threads*blocks, cudaMemcpyDeviceToHost));

    printf("First 16 bytes of input data: ");
    for (int i=0; i<16; i++) {
        printf("%d, ", test1[i]);
    }

    cudaEvent_t start, stop;
    HandleError(cudaEventCreate(&start));
    HandleError(cudaEventCreate(&stop));
    HandleError(cudaEventRecord( start, 0 ));

    twofish_enc<<<blocks,threads>>>(in_blk, out_blk);
    twofish_dec<<<blocks,threads>>>(out_blk, in_blk);

    HandleError(cudaGetLastError());

    HandleError(cudaEventRecord( stop, 0 ));
    HandleError(cudaEventSynchronize( stop ));
    float elapsedTime;
    HandleError(cudaEventElapsedTime( &elapsedTime, start, stop ));

    printf("\nPerformance: %.6f MB/sec\n", 2*((16*threads*blocks)/1048576)/(elapsedTime/1000));

    HandleError(cudaEventDestroy( start ));
    HandleError(cudaEventDestroy( stop ));

    char *test2 = (char*)malloc(16*threads*blocks);
    HandleError(cudaMemcpy(test2, in_blk, 16*threads*blocks, cudaMemcpyDeviceToHost));

    printf("After decryption: ");
    for (int i=0; i<16; i++) {
        printf("%d, ", test2[i]);
    }

    free(test1);
    free(test2);

    HandleError(cudaFree(in_blk));
    HandleError(cudaFree(out_blk));

    return 0;
}
Brauche ich die Schlüssel-Initialisierung in der main jetzt noch in der Form, wie ich es habe?

Und in der Ausgabe kriege ich in Zeile 521 noch einen Fehler:

snapshot41.png

"Too many resources requested for launch"

[Edit]
Habe beim googlen noch rausgefunden, dass wir scheinbar zu viele Register und/oder zu viel shared memory ansprechen wollen. ???
 
Zuletzt bearbeitet:
Zurück