Monday, 30 March 2015

How Computer Chess Engines could run on GPUs

  1. One SIMD Unit - One Board

    To avoid thread divergence within a Warp, resp. Wavefront, the engine could couple, for example, 32 or 64 Work-Items of one Work-Group to work together on the same chess position, for instance to generate moves, sort a move list, or evaluate a position in parallel. The move generator of such a Work-Group could operate over pieces, directions, or simply all 64 squares in parallel. But in any of these cases, current GPU SIMD units will 'waste' some instructions compared to the more efficient, sequential processing of a CPU.
  2. Use of Local Memory* instead of Global Memory

    The more sequential threads are coupled into one Work-Group to work on one chess position in parallel, the more Local Memory* per Work-Group is available to store a move list, or a move list stack. By using the faster Local Memory, fewer Warps, resp. Wavefronts, are needed to hide Global Memory latency.
  3. Hundreds of Work-Groups instead of Thousands of Threads

    YBWC is a parallel game tree search algorithm used in today's chess engines, but the more workers the algorithm runs, the less efficiently it performs.
    So, by coupling sequentially operating threads into one Work-Group to work on one chess position in parallel, we lower the total number of workers and increase the efficiency of the parallel search.

* The OpenCL term Local Memory corresponds to the Nvidia CUDA term Shared Memory.
--
Srdja

Wednesday, 25 March 2015

Nvidia's next gen Pascal GPU with 2.5D stacked HBM in 2016

Nvidia plans to introduce 2.5D stacked High Bandwidth Memory with next gen architecture 'Pascal' in 2016.

http://www.fudzilla.com/news/graphics/37294-pascal-uses-2-5d-hbm-memory

http://www.fudzilla.com/news/graphics/37292-nvidia-pascal-comes-in-2016

--
Srdja

Rumours about AMD GPU 'Fiji' with 2.5D stacked High Bandwidth Memory

According to Fudzilla and Wikipedia, the next gen AMD GCN 1.3 GPU chip, 'Fiji', is going to use 2.5D stacked HBM (High Bandwidth Memory).

Rumours speak of 8 GB of memory with a bus width of 4096 bits, resulting in a bandwidth of 640 GB/s for the upcoming (2nd half of 2015) AMD R9 390X, twice as much as the R9 290X.

For GPU chess, the latency of Global Memory (SGRAM) is of interest, currently hundreds of clock cycles. If it decreased significantly, the program design could become more relaxed.

--
Srdja

Tuesday, 10 March 2015

Zeta Chess Projects

On this blog you will find information about my computer chess projects...

Zeta Dva, a classic, amateur level, computer chess program.

Zeta CL (Zeta OpenCL Chess), an experimental chess engine written in OpenCL, therefore able to run on GPUs.
 
Zeta Vintage, a chess engine for the Atari 800 XE (6502 processor).

Zeta AI, an attempt to make use of ANNs in computer chess.

--
Srdja

ZetaCL - a Chess Engine written in OpenCL

ZetaCL, (Zeta OpenCL Chess) is an experimental open source chess engine written in OpenCL by Srdja Matovic. The engine has been in development since 2010 and was first released June 23, 2011 under the GNU GPL. The latest version is 098c (August 2013). Zeta supports only some basic commands of the Chess Engine Communication Protocol (Winboard/Xboard). Main features are the use of Quad-Bitboards, a parallel best-first minimax search and its ability to run on a GPU. The ZetaCL project is currently inactive.



Details

Zeta is written in C99 (host) and OpenCL 1.1 (device). The host handles the IO, but all chess-related computations are done in OpenCL on the device. Zeta 097 and 098 were designed to run thousands of independent threads on a GPU; therefore they store the search tree, expanded via an MCαβ search, in memory.

NPS and Playing Strength

  • Zeta 098c on Nvidia GTX 480, ca. 5 Mnps, about 2000 Elo
  • Zeta 097x on Nvidia GTX 480, ca. 5 Mnps, about 1800 Elo
  • Zeta 098c on AMD R9 290, ca. 3.2 Mnps
  • Zeta 097x on AMD HD 7750, ca. 800 Knps


Source: http://chessprogramming.wikispaces.com/Zeta (03/10/15)

Tuesday, 22 April 2014

ZetaCL - v098 revisited

To make it short: ZetaCL v097 and v098 make excessive use of the slower Global Memory; therefore the computed nodes per second scale linearly with the memory bandwidth and not with the number of cores or their clock rates.

--
Srdja

Friday, 4 April 2014

Alternative Game Tree Search Algorithms

Here are some alternative algorithms to plain MiniMax Alpha-Beta search...


--
Srdja

Why Computer Chess Engines do not run on GPUs

  1. SIMT architecture of GPUs

    GPUs consist of tens to hundreds of SIMD or Vector Units that process multiple threads in multiple Warps or Wavefronts in SIMT fashion.
  2. Memory architecture of GPUs

    To hide the latency of Global Memory (SGRAM), GPUs can run multiple Warps or Wavefronts and prefer to do computation using Local or Private Memory. So, the more Work-Items and Work-Groups you run to hide latency, the less Local and Private Memory is available per thread.
  3. Thousands of threads on GPUs

    MiniMax search with Alpha-Beta pruning performs best serially, not in parallel.

# last edit on 03/30/15

--
Srdja

OpenCL and GPU Glossary

  • Compute Unit => OpenCL term for a set of Processing Elements, e.g. one core of a CPU, or a single SIMD unit or set of SIMD units of a GPU
  • Core => Term for one hardware element that can compute a program, e.g. one core of a CPU.
  • CPU => Central Processing Unit
  • CUDA Core => Nvidia term for one Processing Element
  • FLOP => Floating Point Operation
  • Functional Unit => Nvidia term for one Processing Element
  • Global Memory => OpenCL term for memory accessible by all Work-Items on all Compute Units, e.g. RAM of CPU or GPU
  • GPU => Graphics Processing Unit
  • GPGPU => General Purpose computing on Graphics Processing Units
  • Kernel => OpenCL term for a program executed on an OpenCL device
  • Local Memory => OpenCL term for memory accessible by all Work-Items of a Work-Group running on one Compute Unit
  • Local Memory => Nvidia term for OpenCL Private Memory, but mapped to Global Memory instead of registers
  • MIMD => Multiple Instruction Multiple Data
  • Multiprocessor => Nvidia term for Compute Unit
  • OpenCL => Open Computing Language, a programming language standard specified by the Khronos Group
  • Private Memory => OpenCL term for memory accessible by a single Work-Item, e.g. registers of a CPU or GPU
  • Processing Element => OpenCL term for one core of a Compute Unit; for example, a 16-wide SIMD Unit is usually counted as 16 cores. Special Function Units are usually not counted as cores, but are sometimes added to the total instruction throughput.
  • Shared Memory => Nvidia term for OpenCL Local Memory
  • SIMD => Single Instruction Multiple Data
  • SIMT => Single Instruction Multiple Thread
  • Streamcore => AMD and ATI term for one Processing Element
  • Thread => Generally, a sequence of instructions that may execute in parallel with others
  • Thread => Nvidia term for one Work-Item
  • Thread Block => Nvidia term for a Work-Group
  • Warp => Nvidia term for the preferred Work-Group-Size-Multiple of a Kernel *
  • Wavefront => AMD term for Warp **
  • Work-Group => OpenCL term for a Group of Work-Items to be computed on one Compute Unit
  • Work-Item => OpenCL term for one single thread in one Work-Group

* Warp => Current Nvidia GPUs process 32 threads in one Warp in lock-step and are able to run multiple Warps per Compute Unit (SIMT).

** Wavefront => Current AMD GPUs process 64 threads in one Wavefront in lock-step and are able to run multiple Wavefronts per Compute Unit (SIMT).


# last edit on 04/04/15

Thursday, 3 April 2014

Other GPU Chess Projects

--
Srdja

ZetaCL - Milestones

Here is an overview of what happened before...

Zeta 0900 to 0910

  •     Tested 32 bit 0x88 and 64 bit Magic Bitboard Move Generator
  •     Ported Heuristics, the Evaluation Function, from CPU engine 'ZetaDva' (~2000 Elo) to OpenCL

Zeta 0915-0918

  •     64 bit Magic Bitboard Move Generator running
  •     AlphaBeta search algorithm with 'SPPS' parallelization running 128 threads on one Compute Unit of the GPU

Zeta 093 to 096

  •     Tested Monte Carlo Tree Search without UCT across multiple Compute Units of the GPU
  •     Tested LIFO-Stack based load balancing for AlphaBeta search on one Compute Unit of the GPU
  •     Tested the 'Nagging' and 'Spam' parallelization approach for AlphaBeta search on one Compute Unit of the GPU
  •     Tested 'RBFMS', Randomized BestFirstMiniMax Search, a parallelized version of BestFirstMiniMax, across multiple Compute Units of the GPU

Zeta 097a-z

  •     Implementation of a BestFirstMiniMax search algorithm with UCT parameters for parallelization

Zeta 098a-c

  •     Improved heuristics, partly ported from the Stockfish chess engine
  •     AutoConfig for OpenCL devices
  •     Parameter tuning

ZetaCL - Source Code

ZetaCL and ZetaDva support only some basic Xboard protocol commands, and some users have reported problems with the configuration and interface of the last Zeta versions. So I will publish the source code again when these parts are designed in a more user-friendly way and tested with Windows chess GUIs like Winboard or Arena.

--
Srdja