MSI GTX 1070Ti problem with PCIE?


    Hi All,

    I have a problem with my 1070Ti.

    Nine times out of ten the card fails to POST in the PCIe x16 slot. When it finally does POST, GPU-Z shows a random PCIe link width: rarely x16, most of the time x4 or x1, sometimes x8. Even then it still crashes with a BSOD in the x16 slot, sometimes after a minute and sometimes after an hour.

    Moving the card to the x4 PCIe slot improves things. It POSTs most of the time (I would say 8 times out of 10) and I can even play games for hours at full speed (I haven't spotted any slowdowns in FPS vs x16). However, it will still BSOD at random while I'm doing everyday things like browsing the Internet, so it's not the stress of the game.
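
    For a sense of what dropping from x16 to x4 costs on paper, here is a minimal sketch of the theoretical link bandwidth. It assumes a PCIe 3.0 link (8 GT/s per lane with 128b/130b encoding); the function name is only illustrative, and real-world throughput is lower than these figures.

```python
# Theoretical one-direction PCIe bandwidth per link width.
# Assumption: PCIe 3.0, i.e. 8 GT/s per lane with 128b/130b line encoding.
GT_PER_S = 8e9          # raw transfers per second per lane (PCIe 3.0)
ENCODING = 128 / 130    # usable fraction after 128b/130b encoding


def link_bandwidth_gb_s(lanes: int) -> float:
    """Return the theoretical bandwidth in GB/s for a link with `lanes` lanes."""
    if lanes not in (1, 2, 4, 8, 16):
        raise ValueError("unexpected PCIe link width")
    bits_per_s = GT_PER_S * ENCODING * lanes
    return bits_per_s / 8 / 1e9  # bits -> bytes -> GB


for width in (1, 4, 8, 16):
    print(f"x{width}: {link_bandwidth_gb_s(width):.2f} GB/s")
```

    On these numbers an x4 link has a quarter of the x16 bandwidth, yet many games never saturate it, which would be consistent with seeing no obvious FPS drop in the x4 slot.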

    What I tried so far:
    I tried a fresh Windows install and fresh drivers. That did not help.
    I ruled out the motherboard by swapping cards with a friend. His 1080 works perfectly fine in my PC, but my card wouldn't POST in his PC. He told me it did POST once, but it went to a black screen while the drivers were installing.

    I googled a lot and I tested PCIE lanes.

    Here are my results:

    Side A (address):
    Diode readings from GPU to caps: between 0.356 and 0.360 V
    None of the 32 caps is shorted; they all read between 205 and 222 nF

    Side B (data):
    Diode readings on all lanes: between 0.523 and 0.526 V
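
    A quick way to sanity-check a set of readings like these is to flag any value that strays far from the median of its group. The sketch below is only an illustration: the function name and the 10% tolerance are assumptions, and the lists contain just the minimum and maximum values reported above, not the individual measurements.

```python
from statistics import median


def find_outliers(readings, rel_tol=0.10):
    """Return indices of readings that deviate from the group median by more than rel_tol."""
    if not readings:
        return []
    mid = median(readings)
    if mid == 0:
        # A zero median would mean most lines read as shorted; flag whatever differs.
        return [i for i, r in enumerate(readings) if r != 0]
    return [i for i, r in enumerate(readings) if abs(r - mid) / abs(mid) > rel_tol]


# Only the reported range endpoints are used here (hypothetical stand-ins
# for the full list of per-lane measurements).
groups = {
    "side A diode (V)": [0.356, 0.360],
    "coupling caps (nF)": [205, 222],
    "side B diode (V)": [0.523, 0.526],
}

for name, values in groups.items():
    print(name, "outliers at indices:", find_outliers(values))
```

    With the reported ranges nothing is flagged, which fits the conclusion that the PCIe lanes themselves measure consistently.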

    I also ran MATS and all memory tested as correct (although mine did not show any colors while testing; the screen went blank, but maybe that's because my video cable was connected to the card I was testing).


    I am out of ideas. Could it be a damaged core? But like I said, I can even play games for hours in the x4 slot, and then all of a sudden it will crash while doing absolutely nothing demanding, like watching YouTube.

    Please help
    Last edited by ca-gamer; 10-27-2024, 11:14 PM.

    #2
    I'd guess this is a hardware issue. I would look at the following things in the given order:

    1. Remove the oxidation layer from the copper contacts of the card's edge connector. A fiberglass pen works very well for this. Wipe the contacts until they have a shiny copper color. Afterwards you can apply a little contact spray as protection. Also apply the contact spray to the 12 V ATX connector, the PCIe power cable connectors, and the PSU connector (if the PSU is modular). Visually check the inner contacts of the cable connectors to make sure they have not become loose (widened).

    2. Check the stability of the GPU power supply lines. Probe all the power rails with a scope; for PWM VRMs, also look at the shape of the signals before the chokes. If possible, do this while a load test is running. I remember someone describing such an issue somewhere here. To get access to the top of the card for these diagnostics, you'll need to MacGyver a small temporary heatsink.

    3. Reball the GPU and, if that does not help, reball the memory chips as a second step.

    4. Threaten the GPU that you will kill it, if it does not recover 😀.







    Last edited by DynaxSC; 10-28-2024, 07:04 PM.



      #3
      Thank you for your help.

      I don't have good news unfortunately.

      I decided to run another session of MATS (I ran the previous one, which showed no errors, a couple of weeks ago when the problem was less frequent). Here are some of my results. What's "funny" is that I can run a test showing absolutely no errors, rerun the same test 5 seconds later and get a bunch of errors, and then run another test showing completely different results. I tested multiple versions of MATS and all of them show random results.

      I am not an expert, but to me the only logical explanation is a core problem. I don't suspect the memory, because it makes no sense to see no errors, followed by all memory chips failing at once, and then only 2 of them, and so on...

      Attached Files



        #4
        To me it looks like a contact issue caused by thermal effects on the materials. Sure, it might be inside the GPU, but with luck it is in an external location. I would not give up so fast. To narrow down the cause, you can gently bend or press on the board during operation and see whether it reacts to mechanical stress. If there are cold solder joints, this might reveal them, but it is not guaranteed to.
        Last edited by DynaxSC; 10-30-2024, 05:02 PM.



          #5
          Originally posted by DynaxSC:
          To me it looks like a contact issue caused by thermal effects on the materials. Sure, it might be inside the GPU, but with luck it is in an external location. I would not give up so fast. To narrow down the cause, you can gently bend or press on the board during operation and see whether it reacts to mechanical stress. If there are cold solder joints, this might reveal them, but it is not guaranteed to.
          And you are absolutely correct!

          First of all, let me clarify my previous post, where I claimed that I was getting random write errors on each test.

          Thanks to this reddit post (https://old.reddit.com/r/GPURepair/c...s_detected_by/) I learned that this is normal: it is a false positive caused by testing the card that is currently driving the display.

          Running ./mats -b 60 -e 70 excludes the first 60 MB of memory from the test. After that all my memory chips showed 0 errors, even with ./mats -b 60 -e 150, so I knew the memory was good.

          Then, completely by accident, I discovered that my GPU had started working normally at full speed, x16 3.0. I did multiple reboots and the card kept working fine. Why? Because I was tired of constantly plugging and unplugging my card, I had laid my PC flat on the ground. This put the GPU in a vertical position, essentially removing any weight stress and any bending or sagging.

          So yeah, I'm 100% sure it's a solder ball issue under the GPU. The thing is that a couple of months ago I moved with a U-Haul truck. I drove for 2 days and did not know that it's better to remove the GPU from the motherboard for transport. I bet my GPU was constantly bouncing and bending. I killed it.

          Right now I am able to play fine, but the PC needs to lie flat.
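
          To make the tested range explicit, here is a small sketch that builds the two command lines used above and reports how much memory each one covers. It assumes that -b and -e are the begin and end offsets in MB, as the commands above suggest; the helper names are hypothetical and the script only prints the commands, it does not run MATS.

```python
def mats_command(begin_mb, end_mb, binary="./mats"):
    """Build the argument list for a MATS run limited to [begin_mb, end_mb).

    Assumption: -b is the start offset and -e the end offset, both in MB.
    """
    if begin_mb < 0 or end_mb <= begin_mb:
        raise ValueError("need 0 <= begin_mb < end_mb")
    return [binary, "-b", str(begin_mb), "-e", str(end_mb)]


def window_mb(begin_mb, end_mb):
    """Size in MB of the region the run would cover."""
    return end_mb - begin_mb


for begin, end in [(60, 70), (60, 150)]:
    cmd = " ".join(mats_command(begin, end))
    print(f"{cmd}  -> skips first {begin} MB, tests {window_mb(begin, end)} MB")
```

          Under that assumption the two runs cover 10 MB and 90 MB respectively, so a clean result says the tested window is fine rather than the whole of the VRAM.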

          Let this be a lesson for everybody:

          A. Remove the card from the motherboard if you transport your PC.
          B. For heavy GPUs, buy a support bracket or, ideally, install the card vertically using a PCIe riser.

          I screwed up this part.

          I know that a reflow would help, but I've never done one before and I prefer not to risk it.



            #6
            I wouldn't blame myself for the transport at all. The card gets much more stress from the high temperatures the GPU reaches under heavy load. The different materials have different expansion coefficients. The board material and the GPU stretch and shrink, and also bend, with every temperature change, and this happens again and again, hundreds of times, the same way as with the fan blades in jet engines. The distances here are very small, but the forces are very high, much higher than during transport. Jet engines fail sometimes, and so do the ball connections between the GPU pads and the board pads. The tin used for the balls is pretty brittle and not very elastic. Over time you get micro cracks that break the electrical connection. In my opinion this is the major reason for the failure. That is why CPUs on PC boards are mounted in sockets with elastic pins. GPU vendors save money on sockets and, in my opinion, use ball connections deliberately: the boards fail after some time, repair is difficult, expensive and somewhat risky, and the customer must buy a new card 😁 raising the shareholder value of the card and chip vendors - this is the world nowadays. Imagine you could easily replace a broken GPU on a graphics card with a socket, or even upgrade to a newer version. Why not? It is technically possible, but nobody is interested in doing so.
            Last edited by DynaxSC; 11-02-2024, 06:21 PM.



              #7
              I don't agree. Mechanical damage has become one of the most common failures on modern high-end graphics cards. They do get damaged by their own weight, even without external physical stress: the pads under the VRAM ICs and the GPU get pulled, and the PCB cracks near the back of the PCIe edge connector. Reflow cannot fix that and will typically create more damage. As has always been the case, damage limited to the BGA balls themselves is rarely (if ever, for desktop graphics cards) the problem.
              A 1070 Ti is not quite a 4090 in terms of dimensions and weight, but it can definitely get damaged if the tower is transported upright without any support material. It shouldn't be too bad if the tower is laid flat (so that the graphics card stands upright).
              OpenBoardView — https://github.com/OpenBoardView/OpenBoardView



                #8
                You are right about these modern, heavy 3-fan cards. That is why I never go for the long 3-fan cards; I prefer 2-fan cards, which do not have this issue to the same extent. From a cooling point of view, in my experience the 3-fan design is overkill, except maybe for the most powerful GPUs. And they definitely need a separate support at the other, unsupported edge of the card. 2-fan cards will usually be fine with the standard screw fixation to the case.
