Monday, 17 August 2015

Friday, 7 August 2015

Review of Papers on GPU Game Tree Search

  • It looks like Monte Carlo Tree Search gives the best speedups compared to an CPU implementation.
  • The Node Based Parallel Search is an hybrid approach that offloads computational tasks to the GPU.
  • MiniMax search can be parallelized on the GPU, but is inferior to AlphaBeta.
  • Speedup of parallel AlphaBeta implementations depend on the branching factor of the Game.

So far i have found nothing about an implementation that makes use of the recursive features of newer architectures.

Thursday, 30 July 2015

Papers on GPU Game Tree Search

Linklist of papers related to Game Tree Seach on GPUs for two-player zero-sum games...

Parallel Game Tree Search on SIMD Machines 
Holger Hopp and Peter Sanders (1995), citeseerx pdf
Note - Implementation of YBWC, parallel AlphaBeta, on an 16K SIMD machine for an synthetic game tree.

Efficiency of Parallel Minimax Algorithm for Game Tree Search
Plamenka Borovska, Milena Lazarova (2007), citeseerx pdf
Note - Efficiency of AlphaBeta search for 4x4 TicTacToe via MPI and OpenMP on an CPU cluster. 

GPU-Accelerated program to play Go
Zachary Clifford (2009), pdf
Note - Implementation of MCTS for Go. 

Playing Zero Sum Games on the GPU
Avi Bleiweiss (2010), pdf
Note - Multliple Game Tree Searches on GPU.

Parallel Minimax Tree Searching on GPU
Kamil Rocki and Reiji Suda (2010), pdf
Note - Implementation of MinixMax for Reversi.

Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks
Damian Sulewski (2011), link to pdf
Note - Chapter 4 - GPUSSD-BFS - A GPU and SSD supported Breadth-First Search.

Parallel Game Tree Search Using GPU
L’ubomír Lackovi (2011), pdf
Note - Implementation of parallel search for Czech Draughts.

Parallel Monte Carlo Tree Search on GPU
Kamil Rocki and Reiji Suda (2011) pdf
Note - Implementation of MCTS for Reversi.

Parallel alpha-beta algorithm on the GPU
Damjan Strnad and Nikola Guid (2011), scholar google pdf
Note - Implementation of PV-Split, parallel AlphaBeta, for Reversi.

A Node-based Parallel Game Tree Algorithm Using GPUs
Liang Li, Hong Liu, Peiyu Liu, Taoying Liu, Wei Li, Hao Wang (2012) IEEE
Note - Implementation of node-based parallelism for Connect6.

Parallel UCT Search on GPUs
Nicolas A. Barriga, Marius Stanescu, Michael Buro (2014), IEEE
Note - Implementation of MCTS with UCT for 8x8 Ataxx.

A Review on Parallelization of Node based Game Tree Search Algorithms on GPU
Ms. Rutuja U. Gosavi, Prof. Payal S. Kulkarni (2014), pdf

Parallelization of Node Based Game Tree Search Algorithm on GPU
Ms. Rutuja U. Gosavi, Mrs. Archana S. Vaidya (2015), pdf
Note - Implementation of node-based parallelism for Connect4/Connect6.

Friday, 10 July 2015

AMD FirePro S9170 with 32 GB of memory

AMD comes up with world's first server GPU with 32 GB on board....
“By combining the new AMD FirePro S9170 server GPU with our G250 Series of 2U eight GPU systems, we are able to deliver HPC servers with up to 256GB of GPU memory,” said Alex Liu, product marketing executive of the GIGABYTE Network & Comm. Division. “This unprecedented density of GDDR5 memory will help our customers offer very attractive machines to industries processing large datasets.”

Here the complete feature list:

Nvidia Starts Supplying Open-Source Hardware Reference Headers

"There's another step forward today in NVIDIA's open-source/Linux hardware support! NVIDIA will begin supplying hardware reference headers for the Nouveau DRM driver"
Full article:

Nvidia's Kepler and Maxwell with OpenCL 1.2 support

Looks like Nvidia added OpenCL 1.2 support for Kepler and Maxwell devices in May 2015.


Monday, 1 June 2015

OpenSource OpenCL Implementation on Linux

Recently searched for alternatives to Nvidia's proprietary GPU drivers (with OpenCL runtime) and found the Gallium Project (part of Mesa).
srdja@me:~$ dpkg -l |grep OpenCL
ii  clinfo                       Query OpenCL system information
ii  libclc-dev                OpenCL C language implementation - development files 
ii  libclc-ptx                 OpenCL C language implementation - ptx support
ii  libclc-r600               OpenCL C language implementation - r600 support
ii  mesa-opencl-icd      free implementation of the OpenCL API -- ICD runtime
ii  ocl-icd-libopencl1   Generic OpenCL ICD Loader
Current status:

...looks like OpenCL 1.0 and 1.1 for Nvidia G80 and AMD R600/R700/R800/R900 are work in progress.

Sunday, 3 May 2015

Overview of some GPGPU frameworks

  • BrookGPU Compiles from the C like Brook stream processing language to OpenGL or Directx.
  • C++ AMP Open standard by Microsoft which extends the C++ language and library to access hardware accelerators like GPUs. 
  • Close to Metal GPGPU programming interface by ATI.
  • CodeXL Software IDE by AMD with support for OpenCL debugging and profiling. 
  • CUDA Framework by Nvidia with support for C, C++, Fortran and third party language wrappers.
  • DirextX/DirectCompute Graphics and compute API by Microsoft. 
  • HSA Hardware architecture with features like unified virtual address space for all compute devices and an intermediate language (HSAIL). 
  • Metal Graphics and compute API by Apple for iOS and OS X.
  • OpenACC Set of compiler directives to access hardware accelerators.
  • OpenCL Framework with C like language for heterogeneous computing, specified by the Khronos Group. Implemented for APUs, CPUs, FPGAs, GPUs by Altera, AMD, ARM, IBM, Intel, Nvidia, etc.
  • OpenGL Graphics API specified by the Khronos Group with support for GLSL.
  • OpenHMPP Set of compiler directives to access hardware accelerators like GPUs in compilers by CAPS and PathScale.
  • PTX Intermediate, assembly like, language for Nvidia CUDA devices.
  • Renderscript General purpose compute API for Android devices.
  • SPIR-V Intermediate language, to be used in OpenCL 2.1 and Vulkan .
  • Vulkan Graphics and compute API specified by Khronos Group, planed successor to OpenGL.
* edit on 2015-07-28 *

Monday, 30 March 2015

How Computer Chess Engines could run on GPUs

  1. One SIMD Unit - One Board

    To avoid thread divergence in a Warp, resp. Wavefront, the engine could couple, for example, 32 or 64 Work-Items of one Work-Group to work together on the same chess position. For instance, to generate moves, sort a move list or do an board evaluation in parallel. A move generator of such an Work-Group could operate over pieces, directions, or simply 64 squares in parallel. But in any of these cases current GPU SIMD units will 'waste' some instructions compared to the more efficient, sequential, processing of an CPU.
  2. Use of Local Memory* instead of Global Memory

    The more sequential threads are coupled into one Work-Group to work on one chess position in parallel, the more Local Memory* per Work-Group could be available to store a move list, or a move list stack. By the use of faster Local Memory ,less Warps (resp. Wavefronts) are in need to hide Global Memory latency.
  3. Hundreds of Work-Groups instead Thousands of Threads

    YBWC is a parallel game tree search algorithm used in nowadays chess engines, but the more workers the algorithm runs, the less efficient he performs.
    So, by coupling sequential operating threads into one Work-Group, to work on one chess position in parallel, we lower the total number of workers and increase efficiency of the parallel search.

* Local Memory as OpenCL term is translated to Shared Memory as Nvidia Cuda term.

Wednesday, 25 March 2015

Nvidia's next gen Pascal GPU with 2.5D stacked HBM in 2016

Nvidia plans to introduce 2.5D stacked High Bandwidth Memory with next gen architecture 'Pascal' in 2016.


Rumours about AMD GPU 'Fiji' with 2.5D stacked High Bandwidth Memory

According to Fudzilla and Wikipedia the next gen AMD GCN 1.3 gpu chip, 'Fiji', is going to use 2.5D stacked HBM, High Bandwidth Memory.

Rumours are about 8GB of memory with a bus width of 4096 bit, resulting in a bandwidth of 640 GB/s for the upcoming (2nd half 2015) AMD R9 390X, twice as much as the R9 290X.

For gpuchess the latency of Global Memory (SGRAM) is of interest. Currently hundreds of clock cycles. If significantly decreased, the program design would become more relaxed.

* update 2015-07-10 *
Meanwhile AMD anounced the Fiji Pro and Fiji XT GPUs, the latter with 4096 stream processors, 4GB of RAM,  a bus width of 4096 bit and a memory bandwidth of 512 GB/s.


Tuesday, 10 March 2015

Zeta Chess Projects

On this blog you will find information about my computer chess projects...

ZetaDva, a classic, amateur level, computer chess program.

ZetaCL (Zeta OpenCL Chess), an experimental chess engine written in OpenCL, therefore able to run on GPUs.
ZetaVintage, a chess engine for the Atari 800 XE (6502 processor).

ZetaAI, an attempt to make use of ANNs in computer chess.

ZetaCL - a Chess Engine written in OpenCL

ZetaCL, (Zeta OpenCL Chess) is an experimental open source chess engine written in OpenCL by Srdja Matovic. The engine has been in development since 2010 and was first released June 23, 2011 under the GNU GPL. The latest version is 098c (August 2013). Zeta supports only some basic commands of the Chess Engine Communication Protocol (Winboard/Xboard). Main features are the use of Quad-Bitboards, a parallel best-first minimax search and its ability to run on a GPU. The ZetaCL project is currently inactive.


Zeta is written in C99 (host) and OpenCL 1.1 (device). The host handles the IO, but all chess related computations are done in OpenCL on the device. Zeta 097 and 098 were designed to run thousands of independent threads on a GPU, therefore they store the expanded search tree via an MCαβ search in memory .

NPS and Playing Strength

  • Zeta 098c on Nvidia GeForce GTX 480, ca. 5 Mnps, est. 2000 Elo
  • Zeta 097x on Nvidia GeForce GTX 480, ca. 5 Mnps, est. 1800 Elo
  • Zeta 098c on AMD Radeon R9 290, ca. 3.2 Mnps
  • Zeta 097x on AMD Radeon HD 7750, ca. 800 Knps

* edit on 2015-07-31 *

Tuesday, 22 April 2014

ZetaCL - v098 revisited

To make it short, ZetaCL v097 and v098 make excessive use of the slower Global Memory, therefore the computed nodes per second scale linear with the memory bandwidth and not with the amount of cores or their clock rates.

Friday, 4 April 2014

Alternative Game Tree Search Algorithms

Here some alternative algorithms to plain MiniMax AlphaBeta search...

Why Computer Chess Engines do not run on GPUs

  1. SIMT architecture of GPUs

    GPUs consists of tens to hundreds of SIMD or Vector Units that process multiple threads in multiple Warps or Wavefronts in SIMT fashion.
  2. Memory architecture of GPUs

    To hide latency of Global Memory (SGRAM) GPUs can run multiple Warps or Wavefronts and prefer to do computation by the use of Local or Private Memory. So, the more Work-Items and Work-Groups you run to hide latency, the less Local and Private Memory per thread will be available.
  3. Thousands of threads on GPUs

    MiniMax search with Alpha-beta pruning performs best serial, not parallel.

* edit on 2015-03-30 *

OpenCL and GPU Glossary

  • Compute Unit => OpenCL term for a set of Processing Elements, e.g. one core of an CPU, one single or set of SIMD units of a GPU
  • Core => Term for one hardware element that can compute a program, e.g. one core of an CPU
  • CPU => Central Processing Unit
  • CUDA =>  Compute Unified Device Architecture, GPGPU Framework by Nvidia
  • CUDA Core => Nvida term for one Processing Element
  • FLOP => Floating Point Operation
  • Functional Unit => Nvidia term for one Processing Element
  • Global Memory => OpenCL term for memory accessible by all Work-Items on all Compute Units, e.g. RAM of CPU or GPU
  • GPU => Graphics Processing Unit
  • GPGPU => General Purpose computing on Graphics Processing Units
  • Kernel => OpenCL term for a program executed on an OpenCL device
  • Local Memory => OpenCL term for memory accessible by all Work-Items of an Work-Group running on one Compute Unit
  • Local Memory => Nvidia term for OpenCL Private Memory, but mapped to Global Memory instead of registers
  • MIMD => Multiple Instruction Multiple Data
  • Multiprocessor => Nvidia term for Compute Unit
  • OpenCL => Open Computing Language, a programming language standard specified by the Khronos Group
  • Private Memory => OpenCL term for memory accessible by a single Work-Item, e.g. registers of a CPU or GPU
  • Processing Element => OpenCL term for one core of an Compute Unit, for example a 16 wide SIMD Unit is usually counted with 16 cores, Special Function Units are usually not counted as core, but sometimes added to the total instruction throughput callout
  • Shared Memory => Nvidia term for OpenCL Local Memory
  • SIMD => Single Instruction Multiple Data
  • SIMT => Single Instruction Multiple Thread
  • Streamcore => AMD and ATI term for one Processing Element
  • Thread => Generally, a sequence of instructions that may execute in parallel with others
  • Thread => Nvidia term for one Work-Item
  • Thread Block => Nvidia term for a Work-Group
  • Warp => Nvidia term for the preferred Work-Group-Size-Multiple of an Kernel *
  • Wavefront => AMD term for Warp **
  • Work-Group => OpenCL term for a Group of Work-Items to be computed on one Compute Unit
  • Work-Item => OpenCL term for one single thread in one Work-Group

* Warp => Current Nvidia GPUs process 32 threads in one Warp in lock-step and are able to run multiple Warps per Compute Unit (SIMT).

** Wavefront => Current AMD GPUs process 64 threads in one Wavefront in lock-step and are able to run multiple Wavefronts per Compute Unit (SIMT).

* edit on 2015-07-28 *

Thursday, 3 April 2014

Other GPU Chess Projects

ZetaCL - Milestones

Here an overview of what happened before....

Zeta 0900 to 0910

  • Tested 32 bit 0x88 and 64 bit Magic Bitboard Move Generator
  • Ported Heuristics, the Evaluation Function, from CPU engine 'ZetaDva' (~2000 Elo) to OpenCL

Zeta 0915 to 0918

  • 64 bit Magic Bitboard Move Generator running
  • AlphaBeta search algorithm with 'SPPS'-parallelization running 128 threads on one Compute Unit of the GPU

Zeta 093 to 096

  • Tested Monte Carlo Tree Search without UCT across multiple Compute Units of the GPU
  • Tested LIFO-Stack based load balancing for AlphaBeta search on one Compute Unit of the GPU
  • Tested the 'Nagging' and 'Spam' parallelization approach for AlphaBeta search on one Compute Unit of the GPU
  • Tested 'RBFMS', Randomized BestFirstMiniMax Search, a parallelized version of BestFirstMinixMax, across multiple Compute Units of the GPU

Zeta 097a to z

  • Implementation of an BestFirstMiniMax search algorithm with UCT parameters for parallelization

Zeta 098a to c

  • Improved heuristics, partly ported from the Stockfish chess engine
  • AutoConfig for OpenCL devices
  • Parameter tuning 

Zeta 099

  • to be continued ???

ZetaCL - Source Code

ZetaCL and ZetaDva support only some basic Xboard protocol commands and some users have reported problems with the configuration and interface of the last Zeta versions.‭ ‬So i will publish the source code again when these parts are more user friendly designed and tested for Windows Chess-GUIs like Winboard or Arena.