The problem is twofold:
- NVIDIA had a bad run of GPU chips, for which Apple offered a replacement program (a full logic board swap-out). Sadly, that program has ended.
- In addition, the solder joints holding the chip to the logic board failed due to the excessive heat these chips generate when running heavy graphics workloads (e.g., gaming and video).
What happens is that the tin within the solder crystallizes, turning a conductor into a semiconductor. Some people have reheated the logic board until the solder reflows, but the problem with this fix is that the solder has already crystallized, so the failure recurs quite quickly (as you have discovered).
The only way to fix this correctly is to completely remove the old solder from both the chip and the logic board pads, then re-ball the chip with fresh solder. I would also cheat here and use high-temperature lead-based solder if you still have some, as the newer lead-free solder just can't handle the heat.
At this point it's getting hard to find good replacement logic boards, and the rev B chip is also hard to find new.
Re-balling is a tricky job; unless you know what you are doing and have the correct tools, I wouldn't try it. Many people kill their logic board in the process. Besides, you likely have the rev A chip, so you would still have the defect inside the chip itself.
OK, we've covered the chip side of the problem. Now let's talk about some of the other causes of the chip getting hot, setting aside applications that are heavy on the GPU.
There are things you can do to help your system. For starters, I would monitor the thermal sensors more aggressively using an app like Temperature Gauge Pro. I would also clean off the old thermal paste and re-apply a good one on the CPU, GPU, and hub controller chips, using Arctic Silver ArctiClean and Arctic Silver Thermal Paste. Finally, I would make sure the fans and heat sink fins are clean of any dust buildup, and that the rest of the system is clean internally.