Latest Cudaminer release brings massive hashrate increase to Nvidia GPUs

Latest Cudaminer release with improved Nvidia Scrypt mining performance

The latest release of Cudaminer (18 December 2013) squeezes more performance out of Nvidia cards. Early testers report hashrate increases of as much as 40 percent over the previous version.

CUDA core is Nvidia's term for the shader processors in its GPUs. Cudaminer is a CUDA-accelerated mining application for Litecoin and other Scrypt-based altcoins, designed specifically for Nvidia GPUs. It should be noticeably faster than OpenCL-based miners on the same card.

By default, it detects and automatically uses all Nvidia GPUs found in the system, but the devices can also be selected manually. I managed to find one Nvidia GPU to test the latest Cudaminer release on, and I see a hashrate improvement of around 10 percent. The old GT 630 reached around 39 kHash/sec without overclocking, so the gain shows up even on a low-end card. Compute 3.0 devices will see a larger speed boost. So if you have any Nvidia GPU lying around, it is definitely worth giving this a try.

Some numbers on the improvements:

GTX 640: 89 kHash/sec (formerly 65 kHash/s)
GT 750M: 80 kHash/sec (formerly 55 kHash/s)
GTX 660Ti: 250 kHash/sec (formerly 186 kHash/s)
GTX 780Ti: 500 kHash/sec (formerly 450 kHash/s non overclocked)
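
To put the figures above side by side, here is a small sketch that turns each before/after pair into a percentage gain. The numbers are copied from the list; the function name percent_gain is only an illustration and is not part of Cudaminer.

```python
# kHash/s figures quoted in the list above, as (formerly, now).
results = {
    "GTX 640": (65, 89),
    "GT 750M": (55, 80),
    "GTX 660Ti": (186, 250),
    "GTX 780Ti": (450, 500),
}


def percent_gain(before, after):
    """Relative speedup of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


for card, (before, after) in results.items():
    print(f"{card}: {percent_gain(before, after):.1f}% faster")
```

By this arithmetic the quoted gains range from roughly 11 percent on the GTX 780Ti to roughly 45 percent on the GT 750M.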


Download

Cudaminer-2013-12-18.zip
or look for the latest version from the official link.

Command line options

--no-autotune

disables the built-in autotuning feature for maximizing CUDA kernel efficiency and falls back to heuristic guesswork, which might not be optimal.

--devices

[-d] gives a list of CUDA device IDs to operate on. Device IDs start counting from 0!

--launch-config

[-l] specifies the kernel launch configuration per device. This replaces autotune or heuristic selection. You can pass the string auto, or just a kernel prefix like L, F, K or T to autotune for a specific card generation, or a kernel prefix plus a launch configuration like F28x8 if you already know which kernel runs best (from a previous autotune).

--interactive

[-i] list of flags (0 or 1) to enable interactive desktop performance on individual cards. Use this to remove lag at the cost of some hashing performance. Do not use large launch configs for devices that shall run in interactive mode – it’s best to use autotune!

--texture-cache

[-C] list of flags (0 or 1 or 2) to enable use of the texture cache for reading from the scrypt scratchpad. 1 uses a 1D cache, whereas 2 uses a 2D texture layout. Cached operation has proven to be slightly faster than noncached operation on most GPUs.

--single-memory

[-m] list of flags (0 or 1) to make the devices allocate their scrypt scratchpad in a single, consecutive memory block. On Windows Vista, 7 and 8 this may lead to a smaller memory size being used. When using the texture cache this option is implied.

--hash-parallel

[-H] scrypt also has a small SHA256 component to it:

0 hashes this single-threaded on the CPU.
1 enables multithreaded hashing on the CPU.
2 offloads everything to the GPU (default).

Example of command line options:

-H 2 -d 0 -i 1 -l F16x2 -C 1 -m 0 -o stratum+tcp://coinotron.com:3334 -O [WORKERNAME]:[PASSWORD]

[-H] The option -H 2 uses the GPU for all hashing work, which puts very little load on the CPU. With this latest version, the computer stays very responsive even while mining runs in the background.
[-d,-i] I instruct cudaminer to use device 0, which is the only GPU in the machine. Because I have the display attached to device 0, I set that device to run in interactive mode so it is fully responsive for desktop use while mining.
[-l] You can set this to auto if you want cudaminer to perform autotune. I have set it to the kernel launch configuration F16x2 (the F prefix selects the Fermi kernel).
[-C] I set the texture cache to 1, which uses the 1D cache.
[-o,-O] The given -o/-O settings mine on the coinotron pool using the stratum protocol.
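
If you launch the miner from a script, the same command line can be assembled programmatically. The following is a minimal sketch, not part of Cudaminer: the function name build_cudaminer_args is hypothetical, the executable name cudaminer is an assumption about how the binary is named on your system, and joining per-device values with commas is an assumption as well, since the article only calls them "lists" and shows a single-device example.

```python
def build_cudaminer_args(pool_url, worker, password, devices, interactive,
                         launch_configs, texture_cache, single_memory,
                         hash_parallel=2, executable="cudaminer"):
    """Assemble an argument list using the short flags described above.

    `executable` and the comma separator for multi-device lists are
    assumptions made for this sketch.
    """
    per_device = {
        "-d": devices,          # CUDA device IDs, counted from 0
        "-i": interactive,      # 0 or 1 per device
        "-l": launch_configs,   # e.g. "auto", "F", or "F16x2"
        "-C": texture_cache,    # 0, 1 (1D) or 2 (2D) per device
        "-m": single_memory,    # 0 or 1 per device
    }
    args = [executable, "-H", str(hash_parallel)]
    for flag, values in per_device.items():
        # Every per-device list needs one entry per selected device.
        if len(values) != len(devices):
            raise ValueError(f"{flag} needs {len(devices)} value(s)")
        args += [flag, ",".join(str(v) for v in values)]
    args += ["-o", pool_url, "-O", f"{worker}:{password}"]
    return args


# Rebuild the single-GPU example from this article with placeholder credentials.
args = build_cudaminer_args(
    "stratum+tcp://coinotron.com:3334", "WORKERNAME", "PASSWORD",
    devices=[0], interactive=[1], launch_configs=["F16x2"],
    texture_cache=[1], single_memory=[0],
)
print(" ".join(args))
```

The resulting list can be handed to subprocess.run as is; keeping the arguments as a list avoids shell quoting problems with the pool URL and credentials.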

Additional notes

This tool is for Litecoin and other Scrypt based altcoins only.
Compute 1.0 through 1.3 devices seem to run faster on Windows XP or Linux because these operating systems use a more efficient driver model.
The 64-bit cudaminer sometimes mines a bit slower than the 32-bit binary (increased register pressure, as pointers take two registers in a 64-bit CUDA build!). Try both versions and compare!
This code should be fine on nVidia GPUs ranging from compute capability 1.1 up to compute capability 3.5.
To see what autotuning does, enable the debug option (-D) switch. You will get a table of kHash/s for a variety of launch configurations. You may only want to do this when running on a single GPU, otherwise the autotuning output of multiple cards will get all mixed up.
The December 18th milestone transitions cudaminer to CUDA 5.5, which unfortunately means it requires newer Nvidia drivers. However, users of Kepler devices will see a significant speed boost: around 30% for Compute 3.0 devices and around 10% for Compute 3.5 devices.

About CUDA Kernels

CUDA kernels do the computation. Which kernel is selected, and in which configuration it runs, greatly affects performance. CUDA kernel launch configurations are given as a character string, e.g. F16x2, in the following form:

prefix blocks x warps

Available kernel prefixes are:

L – Legacy cards (compute 1.x)
F – Fermi cards (Compute 2.x)
S – Kepler cards (currently compiled for Compute 1.2) – formerly best for Kepler
K – Kepler cards (Compute 3.0) – based on Dave Andersen’s work. Now best for Kepler.
T – Titan, GTX 780 and GK208 based cards (Compute 3.5)
X – Experimental kernel. Currently requires Compute 3.5

Examples:

L27x3 is a launch configuration that works well on GTX 260
F28x4 is a launch configuration that works on Geforce GTX 460
K290x2 is a launch configuration that works on Geforce GTX 660Ti
T30x16 is a launch configuration that works on GTX 780Ti.
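
To make the "prefix blocks x warps" format concrete, here is a minimal sketch of a parser for these strings. It accepts only the forms described in this article (auto, a bare prefix, or a prefix followed by blocks x warps); parse_launch_config is a hypothetical helper and says nothing about how Cudaminer itself parses the option.

```python
import re

# Kernel prefixes as listed in this article.
KERNEL_PREFIXES = {
    "L": "Legacy cards (compute 1.x)",
    "F": "Fermi cards (compute 2.x)",
    "S": "Kepler cards (older kernel)",
    "K": "Kepler cards (compute 3.0)",
    "T": "Titan, GTX 780 and GK208 based cards (compute 3.5)",
    "X": "Experimental kernel (compute 3.5)",
}

# One prefix letter, optionally followed by "<blocks>x<warps>".
_CONFIG_RE = re.compile(r"([LFSKTX])(?:(\d+)x(\d+))?")


def parse_launch_config(text):
    """Return (prefix, blocks, warps); parts that were not given are None."""
    if text == "auto":
        return (None, None, None)
    match = _CONFIG_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised launch configuration: {text!r}")
    prefix, blocks, warps = match.groups()
    if blocks is None:
        # Bare prefix: autotune within that kernel family.
        return (prefix, None, None)
    return (prefix, int(blocks), int(warps))


for example in ("L27x3", "F28x4", "K290x2", "T30x16", "K", "auto"):
    print(example, "->", parse_launch_config(example))
```

A check like this is handy in a wrapper script to catch a mistyped configuration before the miner starts.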

Let autotune run to completion to see which kernel it finds best for your current hardware configuration. You can also override autotune's automatic device generation selection, e.g. pass

-l L
or
-l F
or
-l K
or
-l T

in order to autotune the Legacy, Fermi, Kepler or Titan kernels overriding the automatic selection.

Update 2/3/2014 – Check out the latest update of cudaMiner with support for the Maxwell architecture. This will be the biggest performance change for mining with Nvidia cards.

Guest

December 22, 2013 at 10:36 pm

Could you please post the configuration for the GT750M card, with which you got that 80khps, or post a link to where you got that number from? Thanks in advance.

Karush Avagyan

February 8, 2014 at 3:45 pm

I get 614 kHash/sec on my Gigabyte GTX 780TI OC

Dennis Franssen

March 2, 2014 at 8:03 am

659 Kh/s on the Gainward Phantom 780 Ti

Jebus

May 16, 2014 at 7:48 am

Getting 695 Kh/s on a slightly overclocked 780 TI classified with some command line tweaking.