Cascading Failures?

    Cascading Failures?

    I'm a technician for NCR. We've been chasing a lot of bad-cap failures, especially on Dell GX270 and GX280 models. My question is: has anyone seen cascading hardware failures from bad caps? That is, are CPUs, memory or other devices being damaged by the failed caps? On several boxes now we have replaced the motherboard, only to have the system still go into spontaneous shutdowns and mysterious "thermal event" failures. One was fixed with a power supply replacement, but another has now had both the motherboard and the power supply replaced and is still failing. We're at the point of replacing the CPU. The boards we initially replaced had blown caps on them. Has anyone else had problems like this?

    #2
    Re: Cascading Failures?


    I've never come across a CPU killed by bad caps, but since the caps are responsible for keeping Vcore clean and preventing droop under CPU load transients, cap failure could presumably cause RAM or CPU failure. What I've seen more often is roasted MOSFETs or perhaps the ATX connector.
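
To put a rough number on the droop point, here is a minimal first-order sketch. It assumes the instantaneous dip on a load step is about the step current times the parallel ESR of the caps still working, and it ignores inductance and the regulator's loop response. The function name and all the figures (20 A step, 12 milliohm per cap, 6 caps) are made up for illustration, not measurements from any Dell board.

```python
def transient_droop(load_step_amps, esr_per_cap_ohms, healthy_caps):
    """Approximate instantaneous Vcore droop in volts: load step x bank ESR."""
    if healthy_caps < 1:
        raise ValueError("need at least one working capacitor")
    # Identical capacitors in parallel divide the ESR by their count.
    bank_esr = esr_per_cap_ohms / healthy_caps
    return load_step_amps * bank_esr


# Hypothetical figures: 20 A load step, 12 milliohm ESR per capacitor.
for caps_left in (6, 4, 2):
    droop_mv = transient_droop(20.0, 0.012, caps_left) * 1000
    print(f"{caps_left} good caps: about {droop_mv:.0f} mV droop")
```

In practice a failing cap's own ESR climbs as well, so the real picture is likely worse than simply losing caps from the bank.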



      #3
      Re: Cascading Failures?

      I've seen a CPU destroyed by bad caps



        #4
        Re: Cascading Failures?

        Originally posted by Rainbow
        I've seen a CPU destroyed by bad caps
        So have I.
        Usually it's an AMD, and caused by a shorted VRM FET that was itself caused by bad caps.



          #5
          Re: Cascading Failures?

          I've seen it happen to a Coppermine Celeron too. The MOSFETs were not shorted; only the caps were bad. The owner probably turned it on again and again to the point where the caps totally failed and the spikes from the switching regulator killed the CPU.



            #6
            Re: Cascading Failures?

            Yeah, but the OP is not talking about a dead CPU. I have not seen a CPU remain partially unstable following a recapping. Usually it is some other, perhaps unrelated, issue like the PSU, RAM or hard disk failing that causes the instability to remain.

            If the PSU has been replaced, then that leaves RAM and hard disk.

            Another issue could be cooling, but I think it is a long shot. When the board is replaced, is the CPU cleaned and new thermal compound applied? Is the heatsink properly attached and are the fans working?

            It would be interesting if bad RAM or a bad PSU could be proven to be caused by a motherboard with bad caps. Even more interesting if the CPU went bad too. Keep us up to date with the progress.


              #7
              Re: Cascading Failures?

              Partially dead/unstable CPU is unlikely IMHO.



                #8
                Re: Cascading Failures?

                If the processor is bad, it wouldn't shut off, unless there's an overheat or a thermal diode malfunction. If the processor is bad, you would probably be getting crashes instead. Do you get an error message, a lock-up, or an application terminating without an error message whenever you run something demanding on the processor? If you do, check the voltages.
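
If you log readings with a multimeter or the BIOS hardware monitor, a small sketch like this can flag the out-of-range rails. It assumes the usual ATX tolerances (plus or minus 5% on the positive rails, plus or minus 10% on -12 V); the function name and the sample readings are hypothetical, not taken from the machines in this thread.

```python
# Nominal ATX rail voltages with their allowed fractional tolerance.
ATX_RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "-12V": (-12.0, 0.10),
}


def check_rails(readings):
    """Return {rail: (measured, low, high)} for every rail outside tolerance."""
    bad = {}
    for rail, measured in readings.items():
        nominal, tol = ATX_RAILS[rail]
        # sorted() keeps low <= high for the negative rail as well.
        low, high = sorted((nominal * (1 - tol), nominal * (1 + tol)))
        if not low <= measured <= high:
            bad[rail] = (measured, round(low, 3), round(high, 3))
    return bad


# Hypothetical readings taken under load.
sample = {"+3.3V": 3.28, "+5V": 4.68, "+12V": 12.1, "-12V": -11.5}
print(check_rails(sample))
```

Readings taken at idle can look fine, so it is worth measuring while something demanding is running.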


                  #9
                  Re: Cascading Failures?

                  Partially dead or unstable CPUs can occur due to bad capacitors, and they'll go fully dead eventually if the capacitors are left unattended.

                  I've lost a Thoroughbred B Athlon XP and a Coppermine PIII/866, and I have a flaky Coppermine PIII/800 which draws much more power than is typical and will not run on most motherboards. Ironically, I've never had a Celeron fail on me.



                    #10
                    Re: Cascading Failures?

                    We replaced the mainboard again and, lo and behold, the system is now stable. That makes twice in two weeks, on two different Dell models (an old GX150 and a GX270), that we've had to replace the mainboard a second time. I must admit the 150 was a spontaneous shutdown issue too, but of course with no thermal failure message, as I don't think the 150 had that feature. I should also clarify that the thermal message I refer to is captured in the BIOS event log, not in Windows.

                    I wonder where our replacement boards are coming from? They look fine, all caps flat and sealed. As to the question concerning the thermal interface, we do clean both CPU and heatsink and replace the thermal paste. It's not an actual thermal failure, but the Dell box thinks it is. Maybe the problem is caused by damage to the thermal sensor? One thing is for sure: we are starting to see a pattern.

                    Also, to clarify, the shutdowns happen about 10 minutes after a cold start, and after that the system will not complete a bootup all the way into Windows. The processor seemed very slow while it was running, so it was probably throttling down due to the false overheating detection. All shutdowns are instantaneous with no prior message. This is a very strange problem, since it definitely occurs on the replacement boards with good caps. How does the Dell detect thermal issues? Is there an SMT thermistor under the processor socket?



                      #11
                      Re: Cascading Failures?

                      It's probably using the CPU's own features: the P4 can tell you if its temperature is high. With bad caps, the power is bad and the CPU can do silly things.



                        #12
                        Re: Cascading Failures?

                        Hello all, and especially VyPeRR. I too am an NCR tech, now tasked with doing SBD Dell calls as well. I can tell you that in the retail space it is not uncommon to replace the PSU as well as the mainboard on GX270s. In the last 8 months I would estimate I have replaced over 2 dozen GX270s for this reason. Of those, only 2 needed PSUs on the spot and about 6 needed them afterwards. To elaborate on your symptom list, we have (in retail, non-Dell-warranty) one customer that consistently has the thermal event failure. The most common symptom is the failure to boot/amber power light issue. And by far the most catastrophic is "corrupted data on southbridge". I would normally blame this on the IDE controller itself, but I just recently happened upon one at a customer site with a RAID controller. It was a Promise, so I guess it could have been the controller :-o.
                        And on the thermal event, you are correct: the GX150 (I personally had one go) has no thermal management, while the 270 does, simply through the P4 thermal protection deallie. I have to run, but good to see a fellow tech with a clue!



                          #13
                          Re: Cascading Failures?

                          One more post-shower, pre-lunch post. NCR gets their boards from Dell, based on the contract we have with them. They are of the newer rev; I am fairly certain we scrubbed our stock of the old ones. Keep in mind that the GX270 flaw (and that of the associated Dimension series) was discovered long before the super widespread issues of today. I have only seen TWO proactive replacement roll-outs, and both were large national companies. No, they weren't 'the' super big national company.
                          Incidentally, as we speak I have 5 GX270 boards in my vehicle, all of them dead, all from the cap issue, all of them from the past week.
                          I have only had to replace boards once. I would bet that replacing the PSU in the affected systems would be a good course of action, but I don't rule out the idea of 2 DOA parts, or at least old/flawed ones; I'm sure there is still old stock floating around. I guess it's important to note that the customer I spoke of was not a single site: we've had 6 of the 8 within my territory go, all the same way (thermal event).



                            #14
                            Re: Cascading Failures?

                            Well, being in Canada, I wouldn't be surprised if all our replacement stock is old rev. The sad fact is that all McDonald's in Canada have these PCs. This is where the bulk of our failures are popping up. We don't actually support Dell desktops directly here, just through contracts with end users. We have the much more enjoyable contract of fixing laptops on the kitchen counters of university students.
