Discussion in 'Article Discussion' started by bit-tech, 14 Sep 2018 at 11:01.
Interested in whether the AI cores can be trained to do really good upscaling. We've had upscaling forever, but the new consoles show there's a significant difference in achievable quality: they upscale everything to 4K and do a surprisingly good job of it. It's not as good as real 4K, but it looks a lot better than you'd expect. In the same way the AI cores can be trained to do better AA, they should be able to be trained to do better upscaling. So for example you'd take a game, give the AI cores the un-anti-aliased 2560x1440 image and the supersampled 4K image, and get them to learn how best to turn one into the other for that particular game.
Then I can buy a 2070 (or for me more likely a 3070 later) and a 4K monitor. If it's an older game I play at 4K; if it's a new game I play at 2560x1440 upscaled and I still get great fps and details.
*edit* ah I misunderstood. That's what DLSS is doing - you want 4k, it renders at a lower res and upscales so well it looks like 4k + TAA.
There is also a DLSS 2x mode, which I guess actually renders at 4K but looks like 4K + 64x SSAA.
Tbh if DLSS works, it's probably more important than the ray tracing.
Going on Nvidia's announcement of what games/developers are on-board with DLSS it seems it's one of those things that has to be coded for on a game by game basis.
Back to the lifting of the NDA on the technical details of Turing: if Nvidia found that for every 100 FP ops there are 36 INT ops, why on earth have a 50/50 split of dedicated FP/INT 'cores'?
Also, Nvidia's marketing BS (they're by no means the only ones guilty of it) does a really good job of turning simple and/or downgraded hardware into something that sounds more palatable. I mean seriously, a warp scheduler is nothing more than a low-queue-depth pre-emptive hardware scheduler intended to get around the fact that you can't rearrange workloads once they've been submitted to their software-based scheduler.
And I'd bet my bottom dollar that the RT 'cores' are exactly the same as the CUDA 'cores', only segregated at the firmware level. It's even possible the Tensor 'cores' are the same but dedicated to calculating two identical 16-bit instructions on a 32-bit CUDA 'core' using rapid packed math.
I haven't read much on DLSS itself, but I'm surprised that it seems to be an imminent technology. Deep-learning based upscaling is pretty 'mature' tech by now, but it's still highly computationally expensive. We're talking seconds for standard (admittedly unoptimised) models to do a large image. There are definitely ways to optimise neural networks for computational efficiency, but these are more theoretical than practical currently (although things like distillation and pruning are becoming more common).
Also, surprised more noise hasn't been made about other uses of Deep Learning within games. For example, deep learning for character animation or the obvious application of Deep Learning for in-game AI (I saw a presentation on this at a conference recently, along with a disclaimer "this technology has no real-world military applications"). Unfortunately such tech is pretty finicky (bane of my life), but applications in limited environments are definitely possible, and have a huge potential value.
Technically DLSS is not related to upscaling, as it's just a form of anti-aliasing that's theoretically less computationally intensive. I'm guessing people are linking it to upscaling because theoretically you could use it to upscale images rendered at 1080p to 4K without the traditional overheads associated with doing so, while achieving similar or better results.
EDIT: Just to clarify, the anti-aliasing is being done by Nvidia on their supercomputer at something like 64x supersampling, and those images are compared to the input (before the 64x SS pass). They then ask the computer to recreate, using weighted inference matrices, something close to the 64x SS image from those same inputs without using traditional AA techniques.
If done right you get lots of theoretical 4x4 blocks of programs that if they individually reach whatever percentage threshold then as a collective block do X.
These things could put people on Mars whilst playing Crysis 3 at 120fps, for all I care.
I'm not paying £800-£1200 for a GPU.
Absolutely not. RT cores are dedicated to tree traversal, very different work to the CUDA cores (FP and INT units). They exist because the CUDA cores suck at tree traversal. Same with the Tensor cores: packing two FP16 operations into an FP32 operation and feeding it to a CUDA core gets you only 1/32 of the way to a Tensor op (64 FP16 and FP32 operations). The Tensor cores are fixed-function blocks for doing FMA on matrices vs. individual numbers, which is what they exist for (much faster - and more die-area efficient - than using the CUDA cores to do all the operations independently). Nvidia's documentation has suggested simultaneous use of CUDA and Tensor cores for performance speedups, and this has been implemented in practice.
Tensor and RT cores exist because being implemented as fixed-function blocks they are dramatically more area efficient than general-purpose float or int units. If the same result could be achieved with just 'firmware segmentation' Nvidia would have their cake and eat it too by swapping all those 'dark' units to CUDA cores when on traditional workloads for a free performance boost.
Nothing theoretical about it, that's exactly how DLSS is used. A 1080p (or other lower resolution image) is rendered as a source, and an NN (trained offline with a 64x supersampled final resolution image) uses this as an input to create a UHD output. The secondary 'DLSS 2x' mode does the same using a UHD input and a UHD output.
Yes and no:
What exactly does nvidia do with its supercomputer?
So while it may indeed be less computationally intensive on the GPUs we buy it isn't necessarily less computationally intensive overall as a truckload of work is done in advance.
So if RT and Tensor 'cores' (they're not really cores, but whatever) aren't doing integer and floating point calculations, what are they doing then?
Doing fused multiply-adds or rapid packed math doesn't (afaik) require differently designed units. At its basic level a unit can process 4, 8, 16, 32, or 64 bits of data per clock, be that with or without a floating point; how they're organised may change but the design doesn't.
RT 'cores' are doing a combination of INT and FP calculations on a data stream, Tensor cores are doing INT calculations on a grid of 4 or 8 bits of data.
It's completely theoretical as we don't know how customers and developers are going to use it yet.
DLSS isn't used exactly like that: firstly, the cards aren't being sold yet, and secondly, Nvidia have said it's used to improve performance. It's not taking a 1080p (or other lower resolution) image rendered as a source, because the rendering of outputs is performed at the final stage. Traditional AA is part of the post-processing done on images at the target resolution, in exactly the same way as DLSS.
DLSS is still part of the post-processing work but, like I said, it's theoretically less computationally intensive. It's not creating a UHD image; the creation of an image at X resolution is done much earlier in the pipeline. Super-sampling is the opposite of what you've described: it takes a source image created earlier in the pipeline and down-scales it to a target resolution as part of the post-processing work. That's not what DLSS is doing, at least not on customers' systems.
On a customer's system the image is not created at super high resolutions and down-scaled; it's either doing no scaling at all or up-scaling the source image by a factor of 2x.
True, but I wasn't really considering the overall computational effort as, to be blunt, I don't give two figs how much computational effort someone else has put into something.
I never said they were not doing operations, but that they're fixed-function blocks, not general purpose logic units.
The entire point of a Tensor core or RT core is that they cannot do X or Y or Z operation per clock (i.e. you cannot send them a piece of data and a command to perform an operation of choice on that data). Instead, they can do only one specific operation per clock and that operation is pre-set in hardware. That is why they can be more area-efficient than a general-purpose logic unit, and why it makes sense to include them in the first place.
No, it IS taking a 1080p image and producing a UHD output. That is what Nvidia have claimed in the document, demonstrated in the RTX unveil event, and that is what developers have implemented.
Here's the high-level version: Nvidia (or the developer, but Nvidia have an HPC cluster they're making free time available on) simultaneously render the game at the render target resolution (e.g. 1080p) and at UHD with 64xSSAA (basically rendering at 30720x17280 for samples). These frames are used to train a neural network to produce the high sample count output for the low sample count input. Once this NN is trained, it's deployed to the client systems. Here, the client renders the low sample count image in real-time, then feeds the output to the NN to produce the high sample count facsimile for display.
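To sanity-check the 30720x17280 figure, here's the arithmetic as a minimal sketch. It assumes the 64 samples per pixel sit on a regular 8x8 grid (real supersampling patterns may well differ), and `supersample_grid` is just a hypothetical helper name for illustration, not anything from Nvidia.

```python
import math


def supersample_grid(width, height, samples_per_pixel):
    """Equivalent sample-grid size, ASSUMING a square, regular sample layout."""
    side = math.isqrt(samples_per_pixel)
    if side * side != samples_per_pixel:
        raise ValueError("this sketch only handles square sample counts")
    return width * side, height * side


# UHD output with 64 samples per pixel (assumed 8x8 grid per pixel).
ss_width, ss_height = supersample_grid(3840, 2160, 64)
print(ss_width, ss_height)  # 30720 17280

# How many reference samples exist for every pixel of a 1080p input frame.
sample_ratio = (ss_width * ss_height) // (1920 * 1080)
print(sample_ratio)  # 256
```

So under that assumption each pixel the client actually renders at 1080p corresponds to 256 samples in the offline reference image.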
It's effectively upscaling the rendered image using a super-fancy algorithm rather than 'normal' algorithms like nearest-neighbour/bicubic/Lanczos/etc.
And I never said that it wasn't more area efficient or that rearranging the blocks doesn't result in a performance speedup. I even said that I suspect the RT and Tensor 'cores' are exactly the same as the CUDA 'cores' but arranged differently, either via firmware (logically) or by moving the blocks/units around (physically).
Says who? Who says you can't change the kernel that's running on the Tensor and RT 'cores'?
I'd find it shocking, and IIRC Nvidia even said as much, if you couldn't change the instructions running on the Tensor units from 4 to 8 bit, or choose whether to perform an integer or floating point calculation on a data stream.
The whole reason they've included dedicated units for processing combined INT and FP calculations and INT matrices is that loading another kernel onto a CUDA 'core' incurs a performance hit: it stalls the pipeline. Having dedicated units means you don't have to keep unloading and loading different kernels for different workloads.
That doesn't mean it's fixed in hardware, just that you've done a better job of dividing up your resources, either logically or physically, so you can work on different workloads simultaneously.
No it's not, Nvidia's own white paper (PDF) says...
Apologies for quoting the entire section.
They're rendering “ground truth” reference images with the gold standard method for perfect image quality, 64x supersampling (64xSS), and then training the NN to replicate the results on a lower resolution image. It's not effectively upscaling the rendered image using a super-fancy algorithm; it's attempting to replicate the result of an image that was shaded at 64 different offsets within the pixel without actually doing so (on the customer's computer, that is).
It's not designed to upscale; it's designed to replicate a computationally intensive post-processing AA technique without incurring the substantial performance hit that would come with that.
Which is why I explained exactly why that is not the case. A CUDA core is general-purpose, a Tensor or RT core is fixed-purpose. That you are unable to achieve the same perf/area through firmware or shuffling CUDA cores around is why Tensor and RT cores exist in the first place. A Tensor core performs 4x4 FMA matrix math. That's all it does, that's all it can do. Ask it to perform a scalar subtraction, and it will tell you to go suck a lemon. An RT core performs binary tree traversal; ask it to perform a matrix addition and it will tell you to go suck a lemon. If your workload is a set of matrix FMA operations, or lots of tree traversals, then Tensor and RT cores are the perfect choice. If your workload is literally anything else then they are worthless and you need to use the general purpose CUDA cores.
Client renders 1920x1080 image -> Client applies Bicubic filter to upsample to 3840x2160 -> client outputs 3840x2160 image
Client renders 1920x1080 image -> Client applies NN that outputs 3840x2160 -> client outputs 3840x2160 image
For all intents and purposes at runtime, it's a fancy and complex upsampling algorithm (and make no mistake, a fixed/non-SLNN NN is no less an algorithm than any other, just written in a different form). How that algorithm (NN) was created is immaterial to the client, regardless of how interesting the process of creating that algorithm is or how game-specific (or scene-specific) it is.
For now all the tensor cores are is a very efficient way of running a complex image processing algorithm generated by machine learning. It's very smart really. You use a machine learning supercomputer to generate an algorithm to do something (upscale, add AA, de-noise ray tracing output). It does this by essentially being shown before and perfect after screenshots from a game, and it learns how best to try and go from one to the other using maths. Out of that you get a compact (megabytes) program that is essentially a lot of compressed information about how to make bad images look better for a game. The tensor cores run that in real time as a post process and you get better visuals.
It's going to be interesting to see what other uses the Tensor and RT cores will be put to. Some of the secondary uses for raycasting are already known (e.g. audio propagation, bullet hitscan, direct visibility checks for AI, etc), and some of the NN applications are obvious: anything that uses pseudorandom noise for procedural generation can be 'trained' to better produce results resembling a training dataset, e.g. for procedural terrain generation (train on geospatial data), texture variation (grab the PBR dataset used to create the textures and re-use for training), smoke/cloud generation, particle motion (e.g. explosion deformation by environmental geometry, plus RT for fragment pathing). Plus a bunch of stuff that can be summed up as "produce the same or marginally better results but with much less effort" (like generation of cubemaps through path tracing for subsequent use in raster shading).
CUDA 'cores' are about as general purpose as my desktop calculator, as in they're not because they're little more than glorified calculators, they're just streaming multiprocessors.
Saying they're general-purpose would be like saying the FP units added to processors decades ago are general-purpose, or the AVX units, cryptographic units, etc, etc, they're not, they're designed to do specific tasks faster than a general-purpose processor (aka: a CPU).
And as I've already mentioned, I didn't say anything about perf/area. I said that I suspect the RT and Tensor 'cores' are exactly the same as the CUDA 'cores' but arranged differently, either via firmware (logically) or by moving the blocks/units around (physically). You seem to be arguing that CUDA 'cores' can't do what Tensor and RT 'cores' can do, and that's incorrect.
That's like saying that before FP, AVX, and cryptographic units were added to CPUs they couldn't do those things. And before you once again say it's about perf/area, I'll remind you I never said it wasn't.
That's like saying that because I took an image using a 1MP sensor, if I display it on a 64MP display it's a 64MP image. It's not. It only becomes a 64MP image if I upscale the image and apply a bucket load of post-processing so it doesn't still look like a 1MP image.
You seem to be conflating upsampling with interpolation (aka: anti-aliasing). As I said, just because you display an image that's made up of 10 million pixels on a display with 64 million pixels doesn't mean the image has suddenly gained 54 million pixels from nowhere. If you want to increase the resolution you need to run the image through some form of mathematical equation so you can guess what the missing pixels should be.
To use the calculator analogy:
A CUDA core is a pocket graphing calculator.
A Tensor core is a calculator with one button that has a +/x symbol on it, and that errors if you do not type in 48 numbers of the correct length before hitting that button. It also happens to be about 1/4 the size.
If you want to perform a 4x4 FMA operation, the Tensor calculator will get it done in one button press, while the CUDA calculator will have you hammering away for quite some time and having to jot down your intermediary answers on a notepad. If you want to do anything else at all, your Tensor calculator is completely useless.
There is a massive difference between fixed-function hardware and general-purpose hardware at the design level. You cannot just lop off the command units from 64 CUDA cores and jam them together to make a Tensor core (well you could, but you'd end up with a massive power hungry waste of space for no good reason).
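For anyone who wants to see what a single 4x4 matrix FMA (D = A x B + C) amounts to when done the "hammering away" way, here's a rough sketch. `matrix_fma` is a made-up name and this is plain Python purely for illustration; it says nothing about how the hardware or any Nvidia API is actually implemented, it just counts the scalar multiply-adds.

```python
def matrix_fma(a, b, c):
    """Return (D, count) where D = A x B + C for 4x4 nested lists.

    count is the number of scalar multiply-add steps performed.
    """
    n = 4
    d = [[0] * n for _ in range(n)]
    count = 0
    for i in range(n):
        for j in range(n):
            acc = c[i][j]  # start from the accumulator matrix C
            for k in range(n):
                acc += a[i][k] * b[k][j]  # one scalar multiply-add
                count += 1
            d[i][j] = acc
    return d, count


# Three 4x4 inputs = 48 values, matching the "48 numbers" in the analogy.
a = [[1 if i == j else 0 for j in range(4)] for i in range(4)]  # identity
b = [[i * 4 + j for j in range(4)] for i in range(4)]
c = [[1] * 4 for _ in range(4)]

d, count = matrix_fma(a, b, c)
print(count)  # 64
```

The 64 scalar multiply-adds line up with the "64 operations" figure mentioned earlier in the thread; whether a real Tensor core retires all of them in one clock is Nvidia's claim, not something this sketch demonstrates.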
DLSS is that bucket of post-processing. Plus, unless you have a little 1MP image sitting in the centre of your 64MP display, then some sort of upscaling must by definition have occurred (even if just nearest neighbour). Arrays do not magically change size without an operation being performed, and when an array holding an image becomes a larger array holding that image, then that operation is by definition upscaling.
You're the only one that has brought up AA. DLSS does not involve AA apart from at the training stage. At the client side where the real-time rendering is going on it's upscaling.
Which is what DLSS is. You keep proposing "it's not upscaling unless it does X" to which I've replied in about 5 different ways "X is exactly what DLSS is doing". You've even quoted the Nvidia document that says exactly what I've been trying to say.
DLSS takes a low-resolution rendered image, and scales it to a higher resolution for display. That's what it does. That's its purpose. That's what's been demonstrated. That's what developers are implementing. That be the thing that it do.
I've told you a few times now that I wasn't talking about the performance or area taken up, but for some reason you keep proposing that's what I'm implying. I'd appreciate it if you could stop with the strawman attempts.
Once again I'll return to the original proposition that I made: you tell me how a single CUDA 'core' is any different from a single Tensor or RT 'core' from a design POV. How is what Nvidia's marketing department have decided to call CUDA, Tensor, and RT 'cores' different from each other? And before you start talking about command units, power usage, performance, and other things not related to the 'cores', I'll remind you we're talking about individual 'cores' as Nvidia defines them.
Again you're conflating upscaling with interpolation (aka: anti-aliasing), do you know what you get if all you do is upscale an image? You get something like this...
You know how you get rid of those jagged lines? You use interpolation (aka: anti-aliasing) on the image after it has been up-scaled.
Do you know why I keep bringing up AA? It's because super sampling is one of the many types of anti-aliasing, and on the client side the Deep Learning algorithms are attempting to replicate that type of anti-aliasing without incurring the traditional computational overheads associated with doing so.
I mean it's literally in the name, Deep Learning Super Sampling (AKA: Deep Learning anti-aliasing).
I've not said it's not up-scaling unless it does X; I've repeatedly been saying that it's a type of anti-aliasing. Honestly IDK what you're trying to say, as it seems you don't know the difference between up-scaling an image and performing some form of interpolation on that up-scaled image to remove the jaggedness that results from blowing up a single pixel into an 8x8 pixel square block.
Oh my word, that's not what it's doing, even Nvidia has said that's not what it's doing, I mean seriously show me where Nvidia mentions up-scaling. (Hint: They mention it once in the entire whitepaper, and you know what they say?)
That's AI Super Rez, not DLSS.
If you don't believe me perhaps you'll believe Tony Tamasi, Senior Vice President of Content and Technology at Nvidia, when he says at the 5min mark in this YouTube video that DLSS is a combination of super sampling anti-aliasing and super resolution.
Or perhaps Nvidia's recent announcement that nine new games are going to add DLSS where they describe DLSS as...
Surely when you upscale you use some kind of interpolation method? Even if it's just nearest neighbour, as edzieba said?
And from what edzieba has said, an RT core can only do one thing, whereas a CUDA core can do a variety, so the RT cores are more efficient than a CUDA core at doing that thing, but can't be used for others?
I'd say you're not, as when you upscale you're just taking a single pixel and making it bigger, i.e. a single pixel becomes an 8x8 grid of identical pixels. When you use an interpolation method you're creating new data points based on a set of known data points, as in the grid of 8x8 new pixels may not be identical to the original as they're based on whatever mathematical model you decide to use.
Example: If we up-scale, a single black pixel (0,0,0) becomes a grid of black pixels. After we use an interpolation method on that now 8x8 grid of black pixels, parts of the grid adjacent to a white pixel (255,255,255) may end up being a fractional sum of those two colours, such as silver (192,192,192) or grey (128,128,128).
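The black-next-to-white example can be sketched in a few lines. This is a minimal one-dimensional illustration on a single greyscale channel, assuming a 4x scale factor; `upscale_nearest` and `upscale_linear` are hypothetical helper names, and the pixel-centre mapping used for the linear version is just one common convention.

```python
def upscale_nearest(row, factor):
    """Pixel replication: every source pixel is repeated `factor` times."""
    return [value for value in row for _ in range(factor)]


def upscale_linear(row, factor):
    """Linear interpolation between neighbouring source pixels."""
    n = len(row)
    out = []
    for i in range(n * factor):
        # Map the output pixel centre back into source coordinates,
        # clamping so the edges simply repeat the outermost pixel.
        src = (i + 0.5) / factor - 0.5
        src = min(max(src, 0.0), n - 1)
        left = int(src)
        right = min(left + 1, n - 1)
        t = src - left
        out.append(round(row[left] * (1 - t) + row[right] * t))
    return out


row = [0, 255]  # one black pixel next to one white pixel
nearest = upscale_nearest(row, 4)
linear = upscale_linear(row, 4)
print(nearest)  # [0, 0, 0, 0, 255, 255, 255, 255]
print(linear)   # [0, 0, 32, 96, 159, 223, 255, 255]
```

Both functions take a 2-pixel row and hand back an 8-pixel row; the only difference is how the new values are chosen, with the linear version producing the intermediate greys near the black/white boundary.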
And what I've been saying is that RT cores only doing one thing is a choice, not a limitation of the technology. Both CUDA and RT cores can do a variety of mathematical calculations; they're both stream multiprocessors. The only difference (afaik) is that Nvidia have designated those stream multiprocessors over there as CUDA 'cores' and those stream multiprocessors over there as RT 'cores'.
EDIT: Put it this way, we don't say because a CPU like Cannon Lake is only configured with 2C/4T and some Skylakes have 18C/36T that Cannon Lake can't do multi-threaded workloads.
I have a few times now: a CUDA core can perform multiple different operations (but one at a time), on command, on one piece or a small number (one or two) of pieces of data. A Tensor core takes a large fixed number of pieces of data (3x 4x4 matrices, so 48 values), no more no less, and performs only one possible predetermined operation on them all at once. A Tensor core cannot be broken down into a lot of little individual independent units that you can tell to do different things; they lack the front-end circuitry to do so and lack the ability to do anything other than the single operation they were designed to perform. Fundamentally:
Is false. CUDA cores can do multiple types of operation, Tensor and RT cores cannot. That's what differentiates them at a fundamental level, and why they exist in the first place.
Only if you use nearest-neighbour scaling. That's merely one of a vast number of upscaling (AKA resampling) techniques. Others include bilinear, bicubic, Lanczos, Sinc filtering, NN filtering (which encompasses a number of techniques including DLSS), etc. Fundamentally:
This is false, as it takes one very specific type of simple resampling (nearest-neighbour) and declares all other techniques to not be upscaling. Any time you change the size of an image, you are resampling. Any time the output image is larger, this is upsampling. There is no such thing as 'the one true upsampling technique'.
That's not how AA works for real-time graphics (and in general, that particular artefact shown in your image is not 'aliasing'. Aliasing is specifically what happens when you sample a signal with a sample rate below double the signal frequency. For real-time rendering, this means 'high frequency' detail like polygons close to each other at a distance or dense textures without proper mipmapping will be undersampled and alias. With upscaling beyond 2x like in that example, aliasing is not the source of the edge artefacts as you are above the Nyquist limit. Instead, the edge artefacts are just due to a poor choice of resampling algorithm). AA in that situation involves using additional samples per pixel. For SSAA these are 'real' samples, effectively rendering at a higher resolution and resampling ('downscaling') to the target resolution. MSAA adds extra coverage samples but only shades one sample. TXAA uses additional samples from previous frames (shifted using the depth buffer and optical flow). The closest to 'using interpolation on the image after it has been scaled' are post-process AA techniques, which are effectively targeted blurring. These are 'anti-aliasing' techniques in that they target the result of aliasing, but do so by trying to hide the artefacts rather than reduce the aliasing at the source. Notably, all techniques generally classed as 'AA' (including post-process AA) are applied to an image that is generally not rescaled: the rendered image size and the output image size are the same.
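As a rough illustration of the SSAA description (more 'real' samples per pixel, then averaged down to the target resolution), here's a toy sketch. The scene is an assumption made up for the example (everything below the line y = 0.5x is white), `inside` and `render` are hypothetical names, and the samples are placed on a regular grid with a plain box-filter average, which is only one of several possible choices.

```python
def inside(x, y):
    # Hypothetical scene: points below the line y = 0.5 * x are "white".
    return y < 0.5 * x


def render(width, height, grid):
    """Shade each pixel as the average of grid*grid evenly spaced samples."""
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            hits = 0
            for sy in range(grid):
                for sx in range(grid):
                    # Sample position inside the pixel (regular grid).
                    x = px + (sx + 0.5) / grid
                    y = py + (sy + 0.5) / grid
                    hits += inside(x, y)
            row.append(hits / (grid * grid))
        image.append(row)
    return image


one_sample = render(4, 2, 1)  # 1 sample per pixel: hard 0/1 edge
ssaa_64 = render(4, 2, 8)     # 64 samples per pixel: fractional coverage
print(one_sample)  # [[0.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
print(ssaa_64[0][0])  # 0.25
```

Note that both images are 4x2: the output size never changes, only the number of samples feeding each pixel, which is the distinction being drawn between AA and rescaling.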