Help me troubleshoot defective GPU (not always detected w/ VGA LED on)


    Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

    I'll try to keep this short. My aim is to determine whether the defect could be weak solder joints on my GPU (GTX 970) or defective components on the card (caps, chips, etc.).

    I have already done a lot of troubleshooting and can rule out the PSU and motherboard with reasonable certainty, so the problem is with the GPU. It is a roughly 3-year-old EVGA GTX 970.

    For about two weeks now, the GPU has often not been detected, even at the BIOS stage. The board shows the red VGA light, indicating a problem with the GPU (the internal GPU works without problems). The board doesn't boot and "freezes" VERY early during POST with the VGA light on.

    About once every 10-20 attempts to turn on the PC, the GPU is suddenly recognized.

    I can then boot, and the system works without a problem. I can even game and do whatever.

    Just as I need about 15 tries at switching on the PC until the card is seen, I can also disable/enable the GPU from the running system (Windows 10 x64) in Device Manager. Most of the time, when I re-enable it, the device reports a problem, but after 6-8 tries it finally "gets it", and then it works flawlessly again.

    What could explain this strange behaviour? Weak solder joints, which I could theoretically fix with some type of ghetto reflow, or would this rather indicate a defective cap or other defective component?

    What puzzles me is that the problem seems to affect only initialization/detection of the card, and there is no flakiness ONCE it's seen and the system is running. (Common sense tells me a flaky solder joint would result in a very unstable/flaky card, and I shouldn't be able to play games on it for hours etc.)


    (Note: this is not a software problem. The system worked without a hitch for 3+ years, and the problem appeared about 2 weeks ago. The system boots fine every time with the IGPU and with the problematic card removed. I have already tried everything I could think of, like re-seating the card many times, using another PCIe slot etc. Although there is a small chance it's the PSU, at this point I would say no, since everything else works. I just can't 100% exclude the PSU because I didn't test with another GPU in the slot, only with the internal IGPU. But here too, if the PSU for some reason no longer supplied adequate power to the card (this is only speculation!), I think I would see problems during normal use, and not just when initializing the card.)
    Last edited by flexy123; 08-03-2018, 09:33 PM.

    #2
    Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

    this type of symptom is 90% bad/failing bga on the gpu. the gpu works when hot and at high load cos the solder expands and closes the electrical connection. when cold, esp. when cold booting, the solder contracts and leaves an open electrical connection, so it's not detected or there are issues activating the gpu from cold.

    u did not specify what model of psu u are using. please specify the brand and model of your psu. there are problematic psus that some of the members here know about so please state what psu u are using to rule that out.



      #3
      Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

      A Thermaltake Toughpower 750W (?) or 800W, old, but good.
      It's the same PSU in which I replaced a bad cap years (!) ago on advice from this forum. It was a bad cap on the 3VSB rail; some guy here had the same problem with the same PSU and cap. But this time I am relatively certain the PSU is not the problem.
      Last edited by flexy123; 08-04-2018, 01:54 AM.



        #4
        Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

        I just did an attempt at a "ghetto reflow" using a heat gun (350 C), but same behaviour.

        (But I was very paranoid with the heat gun to not destroy the card, so it's possible I didn't heat the GPU long enough. I waited until I saw the flux/paste I had there started to smoke and saw it bubbling, holding the gun about 7-8" from the GPU for maybe a minute. Since the gun is 350 C at that setting (low), I was afraid to go closer. Anyway, I can't help but think this is not BGA related. I am also not sure whether the chances of the card being seen increase with heat, i.e. when the PC has been on for a while. The behaviour is always the same: it takes me 10 or so tries until the card is detected. Sometimes the card is also seen as a plain VGA adapter, but on some later reboot it is seen properly. And *if* it's seen and I boot, everything is fine and dandy, incl. games/benchmarks.)

        Anyway, while I was doing the ghetto "reflow", I also had a chance to connect an old GTX 670, which has about the same or even higher power requirements than the 970. It booted right away. This means we can 100% exclude the PSU now.
        Last edited by flexy123; 08-05-2018, 01:40 AM.



          #5
          Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

          Sorry to say, but the GPU chip on your video card is just on its way out. I bought a Radeon HD6850 on eBay that would do something similar - not detect at all on a cold boot until the card would reach the "right" temperature (not too hot and not too cold). Then it would boot and work in Windows (but not with a load like yours). In any case, a "reflow" (i.e. just a simple re-heating) fixed mine... at least for now. But I'm not using that card much as it is right now, so I can't tell how well it will hold up with time (being a newer AMD card, probably not too long).

          So in short, the issue is indeed with the GPU chip.

          Originally posted by flexy123 View Post
          I just did an attempt at a "ghetto reflow" using a heat gun (350 C), but same behaviour.
          It's possible that you didn't heat the GPU enough for the solder between the GPU die and substrate to melt (often the hardest part).
          But it is also possible that your GPU chip is just dead/dying beyond repair. I have many cards like that - different issues (mostly artifacts), but still remaining the same way after multiple reflows. At that point, I usually just salvage them for parts (ceramic caps, poly caps, MOSFETs, etc., when needed).

          Originally posted by flexy123 View Post
          (But I was very paranoid with the heat gun to not destroy the card, so it's possible I didn't heat the GPU long enough. I waited until I saw the flux/paste I had there started to smoke and saw it bubbling, holding the gun about 7-8" from the GPU for maybe a minute. Since the gun is 350 C at that setting (low), I was afraid to go closer. Anyway, I can't help but think this is not BGA related.
          First, it's worth noting that every flip-chip GPU or chipset actually has TWO sets of BGA: one between the chip package and the board it is soldered to, and another between the actual GPU/chipset die (the black silicon square) and the small square PCB (called the substrate) that the die is soldered to.

          With that said, most failures occur at the latter (i.e. between the chip die and the substrate). Because the BGA connections between the chip die and substrate are epoxy-filled (the rubber or plastic-like glue seen next to the chip die), a reflow will rarely make the chip die BGA reconnect to the substrate, because the chip die cannot move.

          This is why most reflows are a temporary fix only and also why some chips cannot be fixed at all.

          On that note, you actually don't need any flux at all when you do these "ghetto reflows", because 99.9% of the time, it's not the BGA between the board and the substrate that fails.

          Reflows aside, another way to keep your GPU working is to never turn off your computer or let it sleep/standby - just keep it on all the time. Also, turn up the GPU fan speed so that the card stays cooler, especially under load (you may need to bump the fan speed to 100% if the card's cooling is a little weak). Being a newer nVidia card, maybe see if you can limit the TDP to around half to 3/4 of what it is. This should help with the temperatures under load. Of course, the card's clocks may drop and the card may not perform as well as it did. But it's one way to keep that video card running.
          Last edited by momaka; 08-21-2018, 02:21 AM.



            #6
            Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

            Thanks for your reply!

            Yes, the GPU is so old that, to be honest, I am not bothering right now. If I make another attempt, then only with *at least* a thermocouple, so I have a rough idea of the temperatures. At this point, while it is quite annoying that it sometimes takes me 5 minutes until I can use my PC in the morning (after 10-12 restarts), I am still baffled THAT it works without a hitch once it does get detected. Even under full load.

            You are correct, I could save myself the hassles and just have the PC go to sleep. For now, I know it *will* come on at some point, and I also haven't seen the symptom worsen.

            On a related side-note:

            I am more mad that I virtually "baked" my wife's old notebook PCB to a crisp a month or so back, before I learned about the heat-gun method. (I never used the heat gun since I assumed it would blow off caps etc., which sure can happen.) I am 100% convinced the heat gun could have "fixed" it, since the notebook had the classic symptoms, e.g. artifacts at boot, indicating it's the GPU and that there was a good chance a "reflow" would fix it.

            (I did successful oven "reflows" already previously, but there is something wrong with this oven that it reaches a much higher temperature than selected. But of course I thought "nothing to lose", with the result that I knew I had effed up when the caps started popping off. When I took the PCB out of the oven, well, it was "very well done", to put it mildly...)
            Last edited by flexy123; 08-23-2018, 05:15 PM.



              #7
              Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

              Originally posted by flexy123 View Post
              You are correct, I could save myself the hassles and just have the PC go to sleep. For now, I know it *will* come on at some point, and I also haven't seen the symptom worsen.
              No, don't put the PC to sleep/standby or hibernation - just leave it always On.
              Just because the symptoms don't appear to worsen doesn't mean that they suddenly won't at some point (actually, most likely they will worsen suddenly and then you won't be able to use your PC anymore).

              Originally posted by flexy123 View Post
              I did successful oven "reflows" already previously, but there is something wrong with this oven that it reaches a much higher temperature than selected.
              I doubt your oven reaches a temperature much higher than dialed. Maybe 10C higher, tops.

              The problem with ovens is how quickly they heat up and what type of heating they use. Electric ovens can be the most problematic, as they do a lot of their heating through IR radiation. This can cause certain components on the board to heat faster than others (typically the darker, non-shiny ones). Gas is best when it comes to heating slowly to the right temperature. However, inside a gas oven, the byproducts of burning the gas (a fossil fuel) are CO2 and H2O (water vapor). This moisture, trapped in a tight, enclosed space, can penetrate PCBs and cause other problems.

              So for these reasons, I avoid oven reflows entirely. If I really had to choose between oven reflows though, I would do electric, then either pre-heat the oven before putting the PCB in there or raise the temperature slowly in two to three 5-minute intervals.
              Last edited by momaka; 09-01-2018, 02:07 PM.



                #8
                Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                i became an expert with the oven. even though i have a bga station, sometimes you just need a quick fix in 15 min.

                a 20L electric oven is the best. you only use the top element, and in order to make a successful reflow you preheat the board with the full tray by placing it at the lowest position.

                you set it to 200C, leave the door open and let it stay like that for 10 min. an ir gun or sensor needs to show that the PCB has reached 90C; then close the door and let it go for 5-6 min.

                top notch reflow. i have a GTX 780 that has been running for a year now. i think it will fail soon but hey, i got that card for £16.
                if you add FLUX then cut the time to 4-5 min, but you have to re-add flux before closing the door because it burns off/evaporates.

                AMD is shit. No oven can save them. You only make it worse.
                ONLY BAKE NVIDIA CARDS
                Just cook it! It's already broken.



                  #9
                  Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                  AMD cards can be baked as well, especially laptop ones (well, be it an MXM card or a GPU soldered to the motherboard). Decent "success" rate on the HD5000/HD6000 series. Desktop cards always have a lower success rate. They fail less to begin with, and I'd say it's not caused exclusively by thermal and physical stress.
                  Also works for defective AMD northbridges RS780 and RS880, both for desktop and laptop too. Again, desktops fail much less. Better success rate for the RS880.
                  Also, adding flux is unnecessary since you're not reflowing the GPU. If ever you are reflowing (almost always useless but whatever) you have to add flux. Just be careful: if someone has already reballed the chip, it will reflow below 200°C…

                  And again, it'll not last long and it's very unpredictable.
                  OpenBoardView — https://github.com/OpenBoardView/OpenBoardView



                    #10
                    Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                    Originally posted by dj_ricoh View Post
                    AMD is shit. No oven can save them. You only make it worse.
                    ONLY BAKE NVIDIA CARDS
                    I've had the exact opposite luck. Never was able to save an nVidia card (even the G92 8800GT/9800GT cards, which are supposedly the most revivable). Only was able to get a nForce 6150 chipset working twice (one desktop and one laptop).
                    AMD stuff, on the other hand, I've had fairly good luck with (but mostly older stuff like HD6000 series and older).

                    Originally posted by piernov View Post
                    Decent "success" rate on the HD5000/HD6000 series.
                    Add to that the HD3000 and HD4000. HD4850/4870 are actually probably the best when it comes to that. You just have to keep them very cool afterwards (no more than 50-55C for the core and memory IO temps).

                    Originally posted by piernov View Post
                    Also, adding flux is unnecessary since you're not reflowing the GPU. If ever you are reflowing (almost always useless but whatever) you have to add flux.
                    I think you meant re-balling there.
                    But yes, I agree that adding flux for a reflow is unnecessary.



                      #11
                      Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                      I AM ABSOLUTELY...MIND.BLOWN:

                      I've now had my GTX 970 in the oven several times, at various temps. One time, all the caps popped off. (Yes, I am an idiot, but well....)

                      So I spent considerable time desoldering caps from another GPU I had and soldering them onto the GTX 970. ("Pain in the ass" would be the understatement of the century.)

                      Following that, I treated my GPU several times with the heat gun, at various temps, first using the thermocouple and then without. Then I baked it again. Then I resoldered caps again, multiple times, whenever I had the impression that the caps were not in well. (With one cap, I had a pin stuck in the hole which I couldn't get out for several days, so I soldered the cap to the back side.)

                      It didn't do a thing; the PC wouldn't even boot. (Meaning, it looked like the GPU had finally eaten it, and no wonder, given what I did to the card. Before, I could get it to work after about 10 or so boots, but now, no dice at all. The card was deader than dead.)

                      Today, I came up with the idea to put it in again, but to run from the IGPU with the very toasted GTX 970 in the slot.

                      I see that the card is in Device Manager, but with Code 43, so I can't activate it in Windows.
                      I do some more testing and read the BIOS with NVflash, showing that the BIOS chip is OK and that I can save the BIOS. (A defective BIOS chip was one thing I had speculated about.) All looks alright in NVflash, and I see the card and the EEPROM listed.

                      I come up with the idea to re-flash the original BIOS. I reboot, and the effing card WORKS. I am absolutely baffled, not only that the card works now, but that it works at all after what I put it through. I could've sworn that I had long since baked the GPU dead, especially with the heat gun: I was 100% convinced the card was dead, so I no longer took much care over what I did to it. It also looks horrible from the thermal paste I used.

                      So whatever was wrong with the card, the BIOS re-flash fixed it. I have already run some tests, benchmarks etc. and it's fine. Mind blown.
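
                      One way to tell whether the ROM contents had actually changed before the re-flash would be to compare the dump saved with NVflash against a known-good copy of the original BIOS. Below is a minimal Python sketch of such a comparison. The function names are made up for illustration, and it only compares raw bytes; it knows nothing about the VBIOS format itself.

```python
import hashlib
from typing import Optional


def rom_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a VBIOS image."""
    return hashlib.sha256(data).hexdigest()


def roms_match(dump: bytes, reference: bytes) -> bool:
    """True if the dumped image has the same digest as the reference image."""
    return rom_digest(dump) == rom_digest(reference)


def first_difference(dump: bytes, reference: bytes) -> Optional[int]:
    """Offset of the first differing byte, or None if the images are identical."""
    for offset, (a, b) in enumerate(zip(dump, reference)):
        if a != b:
            return offset
    if len(dump) != len(reference):
        # One image is a prefix of the other; they diverge where the shorter ends.
        return min(len(dump), len(reference))
    return None
```

                      In practice the two byte strings would come from something like `Path("dump.rom").read_bytes()` and `Path("original.rom").read_bytes()` (hypothetical file names). A mismatch would point towards corrupted ROM contents; a match would suggest the re-flash fixed the card by some other route.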
                      Last edited by flexy123; 09-29-2018, 04:37 AM.



                        #12
                        Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                        Originally posted by flexy123 View Post
                        ...So whatever was wrong with the card, the BIOS re-flash fixed it. I have already run some tests, benchmarks etc. and it's fine. Mind blown.
                        I really hope that was the issue and nothing else. However, my gut is telling me your card just got warm enough to work. Give it a week or two (or maybe a month) and then report back. I honestly would be surprised if it's still working then. Most likely a cooling/power cycle or two and it will be back to broken.
                        Last edited by momaka; 09-30-2018, 08:11 PM.



                          #13
                          Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                          I've never seen a graphics card with VBIOS issue (well, except if it was messed with), but since it does happen for computer BIOS, maybe it can also happen on graphics cards.
                          OpenBoardView — https://github.com/OpenBoardView/OpenBoardView



                            #14
                            Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                            I have seen this when people mod the BIOS to overclock and push it to the limit; then it fails and the BIOS needs to be reset. Rare, but it happens.



                              #15
                              Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                              Yeah, I have no idea. The strange behaviour of the card, namely that when it ran, it ran flawlessly, makes me believe that the problem was never something a "reflow" could have fixed. I have a feeling I could have "fixed" it before I did anything with the oven or the heat gun.

                              There is *one* difference though: the card no longer runs at PCIe 3.0 x16 but at PCIe 3.0 x8. This is a non-issue, since a single card never uses that much bandwidth anyway.

                              The card is now in my wife's PC (lol), and it has worked fine for several days. (She games all the time.)

                              Edit: The memory also overclocks slightly lower. Before, I could overclock the memory by +370; now it does "only" +200. If I go over that, the card just freezes. (The EVGA always had this weird behaviour. It never artifacted like a "proper" card when overclocked, it just froze or crashed. I suspect the EVGA cards have bad components out of the box, especially issues with VRMs and general voltage regulation. It has had issues since the day I got it.)
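
                              For a rough sense of what the drop from x16 to x8 means, here is a small Python sketch of the theoretical per-direction bandwidth. It assumes the standard PCIe 3.0 figures (8 GT/s per lane, 128b/130b encoding) and ignores protocol overhead, so real-world throughput is lower.

```python
# Theoretical PCIe 3.0 bandwidth per direction, ignoring protocol overhead.
GT_PER_S_PER_LANE = 8.0        # PCIe 3.0 raw transfer rate per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Return the theoretical one-direction bandwidth in GB/s for a lane count."""
    gbit_per_s = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gbit_per_s / 8  # bits -> bytes


if __name__ == "__main__":
    for lanes in (16, 8):
        print(f"x{lanes}: {pcie3_bandwidth_gb_s(lanes):.2f} GB/s")
```

                              That works out to roughly 15.75 GB/s for x16 versus 7.88 GB/s for x8, so the link has half the headroom it had before.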

                              But since wife is not overclocking at all, this does not matter either.

                              PS. I replaced all caps with caps from a Gigabyte GTX670 I had, same uF values, but slightly higher V rating which should be ok. (Eg. original caps rated at 2.5V, the new ones 3V).
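
                              The substitution rule used here (same capacitance, equal or higher voltage rating) can be written as a small check. This is only a sketch: the `Cap` type and the 330 uF value in the usage note are illustrative (the capacitance is not stated above, only the voltage ratings), and it ignores ESR, ripple current rating and physical size, which also matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cap:
    capacitance_uf: float  # nominal capacitance in microfarads
    rated_voltage: float   # maximum rated voltage in volts


def is_acceptable_substitute(original: Cap, replacement: Cap) -> bool:
    """Same capacitance and a voltage rating at least as high as the original."""
    same_capacitance = replacement.capacitance_uf == original.capacitance_uf
    enough_voltage = replacement.rated_voltage >= original.rated_voltage
    return same_capacitance and enough_voltage
```

                              For example, `is_acceptable_substitute(Cap(330.0, 2.5), Cap(330.0, 3.0))` passes this check, while swapping the two arguments does not.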
                              Last edited by flexy123; 10-03-2018, 01:24 AM.



                                #16
                                Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                                Originally posted by brethin View Post
                                I have seen this when people mod the BIOS to overclock and push it to the limit; then it fails and the BIOS needs to be reset. Rare, but it happens.
                                I flashed the card *excessively*, since I was once big into BIOS modding. And by excessively I mean possibly 100+ times. It is possible that the frequent flashes were not good for it either. I am sure there are limits to how often you can flash an EEPROM?!
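
                                As a rough sanity check on the wear question, the arithmetic can be sketched in Python. The rated endurance used below is an ASSUMPTION: 100,000 program/erase cycles is a figure commonly quoted in SPI flash datasheets, but the actual chip on this card and its rating are not known here.

```python
# ASSUMPTION: typical datasheet endurance for SPI flash; not measured on this card.
ASSUMED_RATED_CYCLES = 100_000


def endurance_used(flash_count: int, rated_cycles: int = ASSUMED_RATED_CYCLES) -> float:
    """Return the fraction of the rated program/erase cycles consumed."""
    if rated_cycles <= 0:
        raise ValueError("rated_cycles must be positive")
    return flash_count / rated_cycles


if __name__ == "__main__":
    print(f"{endurance_used(100):.1%} of the assumed rating used")
```

                                Under that assumption, 100 flashes would be about 0.1% of the rating, so plain write wear on its own would not be the obvious explanation.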



                                  #17
                                  Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                                  Originally posted by momaka View Post
                                  I really hope that was the issue and nothing else. However, my gut is telling me your card just got warm enough to work. Give it a week or two (or maybe a month) and then report back. I honestly would be surprised if it's still working then. Most likely a cooling/power cycle or two and it will be back to broken.
                                  I watched for this previously but could not really confirm that the card getting warmer improved things. I actually hoped it would, since that would have explained things. But it took 10-15 boots to "catch", no matter what. It was, by all logic, very dead.

                                  But when I put it into the oven, and then the caps popped off and I replaced them, it wouldn't initialize at all anymore. That is, until I reflashed it.
                                  Last edited by flexy123; 10-03-2018, 01:34 AM.



                                    #18
                                    Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                                    Originally posted by flexy123 View Post
                                    I flashed the card *excessively* since I once was big into bios modding. And with excessively I mean, possibly 100+ times.
                                    Perhaps you should have put that in the original description too.
                                    So now there really is no telling if you just had some marginally flashed BIOS in there or not or if the card just had BGA issues. I suppose time will tell what's gonna happen with this one.



                                      #19
                                      Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                                      Yes, but the time when I flashed the card often was 3-something years ago. Afterwards I didn't flash it anymore and just used the card in my PC for a looong time, so it couldn't have been a flash gone wrong. What is possible (??) is that the frequent flashing somehow wore the EEPROM down, but this is only speculation. I have no idea whether something like this is even possible.

                                      That being said, card works wonderfully (again), no issues from what I see.
                                      Last edited by flexy123; 10-10-2018, 12:28 AM.



                                        #20
                                        Re: Help me troubleshoot defective GPU (not always detected w/ VGA LED on)

                                        it could be that the repeated heating and cooling cycles of the reflows caused charge leakage in the flash mem cells, causing data corruption. this has a higher likelihood of occurring in flash mem cells that have undergone many write-erase cycles.
