Strange reset during the day

Started by Alain Boulet, December 26, 2012, 09:15:51 PM


boB

Quote from: aroxburgh on May 10, 2013, 02:35:48 AM
Bob:

With a modicum of software ingenuity (perhaps with a small hardware hack on your resetting 150) you should be able to *thread* the Classic 150 watchdog, to see what was on the CPU stack just prior to the reset. This approach should end the guessing game as to the cause of the random resets...


This is exactly how the area was localized.   That was pretty easy to catch.

Now, it's just a matter of time until the cause is found.  That won't be until at least next week.

But you have changed my theory somewhat, as you have not talked with the Classic with
the Local App while it was also talking to My Midnite while in Static mode.  But maybe something
else was talking to (or tickled?) the Classic's IP address at just the right time to cause the
reset?

Nothing newer than the Local App or My Midnite for the moment though.  Sorry about that.
Hopefully this is the last of the reset problems for a while after it is figured out.
Most are having no issues.   Wish it was easier !

I will forward this info off.   Thank you for the informative and accurate input !

boB
K7IQ 🌛  He/She/Me

dgd

Quote from: boB on May 10, 2013, 12:17:10 AM
DGD, does that resetting Classic  have Web Access enabled (My Midnite) as well as being
accessed by the Local App ?

The reason I ask is that it looks like another way it can reset is to have (1) web access
enabled AND (2) one of "those" routers AND (3) accessing the Classic through the Local
App  AND  (4)  being in STATIC IP mode.


yes to all above. 3 Classics -> hub -> Linksys WET54G -> Linksys WRT54G ->internet
Normally all good, no resets, then a spate of several over a day on just the loaded PV Classic. And only when input watts is highly variable, like the sunny day yesterday with fast-moving clouds.  I can't make resets happen; they appear really random.
Thanks for your efforts to resolve this problem.

dgd
Classic 250, 150,  20 140w, 6 250w PVs, 2Kw turbine, MN ac Clipper, Epanel/MNdc, Trace SW3024E (1997), Century 1050Ah 24V FLA (1999). Arduino power monitoring and web server.  Off grid since 4/2000
West Auckland, New Zealand

offgridQLD

Quote
Am I the only one still seeing these resets? I suspect much of my reporting from this Classic is invalid, as a few times I see the date reset back to 2003, which is likely evidence of random resets

No, I am getting them randomly, usually in the evening between 6pm and 10pm while resting. But today I noticed my Classic was only showing 5kwh for the day, and it must have reset late morning as it's pumped out over 9kwh. So I am getting resets during the day as well, although less often.

All my off-line data logs are also showing year 2003 although the date in the live logs is showing year 2013.

So it's all a bit of a mess at the moment. I'm just kind of ignoring all the data from the Classic and keeping tabs on battery voltage and instantaneous wattage from the Local App, as they are reliable.

Any accumulative data logging on the Local App I don't trust at the moment. I am relying on my secondary data logger and shunts for the true accumulative data for the day until the bugs get ironed out in the Classic's logging.

  :( At least the Classic is still doing a great job at its primary function... charging the batteries each day.  ;)

It can't be an easy job getting the Classic and Apps perfect as the feature list grows and grows.

Kurt
Off grid system: 48v 16x400ah Calb lithium, Pv array one  NE facing  24 x 165w 3960w, Array two NW facing 21 x 200w 4200w total PV 8200w. Two x Classic 150,  Selectronic PS1 6000w inverter charger, Kubota J108 8kw diesel generator.

Wxboy

My Classic reset during the day a couple of days ago.  I was checking the logs on the mymidnite beta website that shows the logged data from my phone and it had about .4 or .5kwh showing when I left work.  When I got home the Classic showed 0.0 kwh so it reset somewhere late in the day.  This was probably a fluke occurrence but I will be keeping an eye on it.  I checked to make sure I didn't have the time set incorrectly like if it passed midnite and reset but the time was correct.
Midnite Classic 150, 765 watt array, Outback Radian GS4048A inverter, 200ah 48v agm battery bank

boB

Quote from: Wxboy on May 11, 2013, 11:41:03 PM
My Classic reset during the day a couple of days ago.  I was checking the logs on the mymidnite beta website that shows the logged data from my phone and it had about .4 or .5kwh showing when I left work.  When I got home the Classic showed 0.0 kwh so it reset somewhere late in the day.  This was probably a fluke occurrence but I will be keeping an eye on it.  I checked to make sure I didn't have the time set incorrectly like if it passed midnite and reset but the time was correct.


The recent history (Hourly, Minutely) logs should be able to tell you when it reset.  That will also tell you
what the kW-Hours were at the time of the reset since it logs every 5 or 10 minutes.
I like to go to the graph view and set it to look at battery voltage so I can see when it went to Float,
then I can move the cursor over to see what time it went back to bulk/absorb.  A resetting classic
will always come back up in bulk/absorb mode.
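
The same check can be roughed out in a script. This is only a minimal Python sketch: the sample layout (time, kWh so far today, charge stage) and the name `find_resets` are made up for illustration and are not the Classic's actual log format. It flags any point where the daily kWh counter goes down, and reports the last kWh value seen before the drop.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    """One logged reading (hypothetical layout, not the Classic's real format)."""
    time: str    # e.g. "12:55"
    kwh: float   # energy harvested so far today
    stage: str   # e.g. "Bulk", "Absorb", "Float"


def find_resets(samples: List[Sample]) -> List[dict]:
    """Return one entry per suspected reset.

    A reset is suspected when the daily kWh counter goes down between two
    consecutive samples, since the counter should only rise during the day.
    """
    resets = []
    for prev, cur in zip(samples, samples[1:]):
        if cur.kwh < prev.kwh:
            resets.append({
                "last_seen_before": prev.time,
                "first_seen_after": cur.time,
                "kwh_before_reset": prev.kwh,
                "stage_before": prev.stage,
                "stage_after": cur.stage,
            })
    return resets


# Made-up samples: the counter falls back to zero between 12:55 and 13:05.
log = [
    Sample("12:45", 4.9, "Float"),
    Sample("12:55", 5.0, "Float"),
    Sample("13:05", 0.0, "Bulk"),
    Sample("13:15", 0.1, "Bulk"),
]

for reset in find_resets(log):
    print(reset)
```

Note that the normal end-of-day rollover would show up the same way, so samples spanning midnight would need to be filtered out first.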

PS.  Not that knowing exactly when the reset happened is all that important.  It still needs to be fixed

boB
K7IQ 🌛  He/She/Me

justmeleep

It sounds like I can add myself to the group seeing this issue.  About 45 minutes ago made it twice (in about 5 days), and I can't tell that anything happened, or has happened, except that the kWh reset back to zero (midday, it was about 13:00 here) and the Classic stopped floating and went back into bulk/absorb for a while (going back through that cycle).  Both times so far, it happened during midday when the weather was ranging, often radically, from temporarily sunny to rather overcast.  I'll presumably have to leave it to the more experienced (that would be almost anyone...) to figure out, however.  I guess I just thought I'd "chime in".

aroxburgh

Quote from: offgridQLD on May 11, 2013, 02:46:58 AM
  :( At least the Classic is still doing a great job at its primary function... charging the batteries each day.  ;)
Kurt

Yes, at least the Classic still manages to do its main job, despite the resets!  ;D

Since watchdog (WD) timeouts have been confirmed by boB as the source of the resets, I'd like to ask him if he would give us chickens out here in Classic PV land some insight into how the WD is hooked into the Classic code.

Of course, the job of a WD is to time out and reset a digital system when the software or hardware (as the case may be) fails to "debark" or poke the WD within a timeout period (due to a hardware fault, a soft malfunction caused, e.g., by ESD, or a software bug).  The best WDs consist of an independent hardware timer, analog or digital (embedded in a CPU or discrete), that works regardless of what software is loaded into the CPU.

Some points that come to mind, related to my own experience in the embedded arena:

1) Simple WDs are poked once per main loop (applicable to old-style code that uses an RTI interrupt to do sampling and scheduling). The disadvantage of this approach is that often too much is happening in the main loop to easily know which part of it failed, and the various interrupt service routines are not protected.

2) Assuming that the Classic code is based on a modern threaded OS (perhaps even a real-time OS), the WD would normally be poked by a critical thread, but for debug we can move the WD poke from one OS thread to another. We can also "thread" the WD poke itself (a different use of the word "thread") by dynamically changing which OS thread does the poking. If we do this at a rate slower than the WD timeout, eventually (given enough time) we may get lucky and see a WD timeout-induced system reset. Suppose we had previously used our debugger to set up a pushdown stack (consisting, e.g., of a spare piece of RAM with the beginning location defining a soft stack pointer). Then, when we run our code, every time we poke the WD we increment this pointer and write a small stack frame which identifies which OS thread did the writing, plus a time stamp. After a reset, we can use our debugger (or debug mode) to go and inspect the contents of the last stack frame written before the reset. This approach requires zero (or at least few) hardware changes.

3) It goes almost without saying that a good WD is protected by a "debark" or poke unlock-sequence that is long enough to be unlikely to occur by chance.

4) There should also be a way to run with the WD disabled, which begs the question of what happens to one of these Classics that inherited this reset behavior coincident with a certain firmware update version, if the WD is disabled. (Perhaps the bug is in the WD circuit, or in the way in which the WD is now set up, compared to how it was set up, say, seven months ago, when, at least for me, with a factory fresh firmware, the reset behavior did not exist.)

5) A WD should ideally assert the system hardware reset line so that the system CPU always starts with the same known register state.

6) Since MidNite labs now have a Classic that has the resetting behavior, what happens when boB reverts it back to last year's firmware versions? Since identical resetting behavior has been confirmed by many Classic users now, I'd rather, boB, that you reverted your hardware, than I mine (mine is now too busy charging batteries).

7) Has MidNite Solar considered making the Classic firmware open source? This could open the door to a lot of free help!    ;)
One company I know of (Flex Radio) sells software-defined radio transceivers that are open source except for some firmware components that they consider to be their "secret sauce."
Their original product, the SDR-1000, first shipped with a complete open source software package written in Visual Basic 6.  Because the SDRConsole was released under open source GPL licensing, many radio amateurs have been able to contribute to the radio's ongoing enhancement.

In early 2004, the VB SDR-1000 software was replaced by a completely new version re-written in C#.NET and C. This was a turning point for FlexRadio Systems, enabling the full potential of the SDR-1000 to be realized. Years later, through several generations of improved radio hardware that shares the same technical base, this software continues to be enhanced and upgraded by Flex Radio's software development team, with many new features and capabilities, in addition to ideas contributed or suggested by customers.

To some degree Flex was able to protect its commercial interests against the prying eyes of competitors by keeping a layer of proprietary firmware (the "secret sauce"...which boB would undoubtedly call secret sores) inside the radio hardware, independent of the PC-based GPL software. This firmware is loaded completely separately from the GPL software, and is practically considered to be a part of the hardware.

A discussion of the Flex Radio's approach to open source software, which started out published under the GNU GPL License, can be found here: http://www.eham.net/ehamforum/smf/index.php?topic=69728.0;wap2

BTW, the "competitor" mentioned in the forum markets a small hobby radio receiver kit that does not begin to compare to Flex's own full-featured transceivers. In the end, as these hobby radio users begin to desire improved performance and features, I think that the competitor probably generates more Flex Radio hardware sales rather than fewer.

Of course the solar market is different from the hobby and amateur radio market, but I think there are still enough similarities to warrant a useful comparison.
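
Going back to point 2 in the list above, the "breadcrumb" idea might look roughly like the Python sketch below. It is a simulation only: `PokeTrail`, `poke_watchdog` and the thread names are invented for illustration, and a bounded deque stands in for the spare block of RAM that would have to survive the reset on real hardware. Nothing here is taken from the Classic firmware.

```python
from collections import deque


class PokeTrail:
    """Fixed-size trail of watchdog pokes, standing in for a spare block of
    RAM that would survive the reset on real hardware (an assumption)."""

    def __init__(self, depth: int = 8):
        # Oldest frames drop off automatically once the trail is full.
        self.frames = deque(maxlen=depth)

    def record(self, thread_id: str, timestamp_ms: int) -> None:
        """Write one small frame: who poked the watchdog, and when."""
        self.frames.append((thread_id, timestamp_ms))

    def last_frame(self):
        """Frame written most recently, or None if nothing was recorded."""
        return self.frames[-1] if self.frames else None


def poke_watchdog(trail: PokeTrail, thread_id: str, timestamp_ms: int) -> None:
    """Leave a breadcrumb, then poke the watchdog."""
    trail.record(thread_id, timestamp_ms)
    # On real hardware the actual poke/unlock sequence would go here.


trail = PokeTrail(depth=4)

# Rotate the poking duty between threads (hypothetical names).
schedule = ["charge_ctrl", "display", "network", "charge_ctrl", "display"]
for i, thread in enumerate(schedule):
    poke_watchdog(trail, thread, timestamp_ms=i * 1000)

# Pretend the watchdog fired at this point; after the restart we would
# inspect the trail to see who poked last, and when.
print(trail.last_frame())
```

The last frame only tells you which thread poked most recently before the timeout, so it narrows the search rather than naming the culprit outright.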

 
Best wishes and good luck!
Al Roxburgh
AJ4RF
Surveyor SV-235 travel trailer with 1.2 kW PV (6 x Grape Solar GS-3-195, Unirac Solarmount); MidNite Classic 150, MNBCM; 410 Ah @ 12 V (two Trojan L16RE-B); Magnum MS2812 2800 W pure sine inverter, ME-ARC50, BMK; Magnite E-Panel; power transfer cam switch; Dometic 459530 High Efficiency Aircon

dgd

Quote from: aroxburgh on May 21, 2013, 02:33:37 AM

Yes, at least the Classic still manages to do its main job, despite the resets!  ;D


Welllll, yes, overall yes but those zero down spikes take some 20 seconds each away from charging
But a small matter in the big scheme of classicdom

Quote

Some points that come to mind that related to my own experience in the embedded arena:
...
2) Assuming that the Classic code is based on a modern threaded OS (perhaps even a real-time OS),

I asked this some time ago, but boB indicated there is no OS in the Classic. This surprised me, but no further info about the Classic coding was forthcoming.
Quote

6) Since you now have a Classic that has the resetting behavior, what happens when you revert it back to last year's firmware versions? Since the resetting has been confirmed by many Classic users now, I'd rather you reverted your hardware, than I mine (mine is too busy charging batteries).

7) Has MidNite Solar considered making the Classic firmware open source? This could open the door to a lot of free help!   ;-)


Again this open source question has been asked many times and always elicits a negative response.
I think MN really see no value in their enthusiastic user base where more than a few users could offer software expertise in debugging and generally developing the MN software.
The local app debacle is just an example, a bug ridden application program that took years to slowly (and painfully) get to where it is now and still contains some whopper bugs eg '-In,Fin,ity' in an info box when it really should be 'No Data' :-\

Dgd


Halfcrazy

As far as open source goes, the issue is truly that we do not want to just give it to our competitors. However, we would entertain working with and/or paying someone (or several someones) who would be qualified to work on this. Anyone interested, please email me at ryan@midnitesolar.com

Ryan
Changing the way wind turbines operate one smoke filled box at a time

boB

No open sores.    But maybe something else eventually might be useful to be open ?

The WDT timeout period is 5 seconds.  The feeding of the WD timer happens in a semi-threaded
manner, meaning that the code must go through at least 2 spots in different areas of the code
before that time is up.
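
For anyone trying to picture that, here is a minimal Python simulation of a watchdog that only gets fed once every required spot in the code has checked in. The class name, the checkpoint names and the way time is passed in are all assumptions for illustration; only the 5 second timeout and the "at least 2 spots" rule come from the description above.

```python
class CheckpointWatchdog:
    """Simulated watchdog that is fed only after every checkpoint reports in."""

    def __init__(self, timeout_s: float, checkpoints: set):
        self.timeout_s = timeout_s
        self.required = set(checkpoints)
        self.seen = set()
        self.last_feed = 0.0

    def checkpoint(self, name: str, now: float) -> None:
        """Called from one spot in the code; feeds the timer once all spots hit."""
        self.seen.add(name)
        if self.seen >= self.required:   # every required spot has been reached
            self.last_feed = now
            self.seen.clear()

    def expired(self, now: float) -> bool:
        """True if the timer ran past the timeout, i.e. a reset would fire."""
        return now - self.last_feed > self.timeout_s


# Hypothetical checkpoint names; 5 s timeout as described above.
wd = CheckpointWatchdog(timeout_s=5.0, checkpoints={"main_loop", "network"})

# Healthy: both spots are reached every second, so the timer keeps being fed.
for t in range(1, 11):
    wd.checkpoint("main_loop", float(t))
    wd.checkpoint("network", float(t))
healthy_expired = wd.expired(10.0)

# Fault: the network code gets stuck after t=10; only main_loop still reports.
for t in range(11, 17):
    wd.checkpoint("main_loop", float(t))
stuck_expired = wd.expired(16.0)

print(healthy_expired, stuck_expired)
```

The point of requiring more than one checkpoint is that a single loop which keeps spinning cannot keep the timer alive on its own while another part of the code is hung.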

There are two main reasons for a WDT reset, which is basically a start-at-0 reset like a
hardware reset.  The first is the code getting stuck in a loop somewhere.  That obviously
keeps the WDT from being fed (reset to 0 seconds).  Even if one of the main loops (there are
a few of those) keeps running, code stuck elsewhere will keep the WDT from being fed and
will reset the Classic.

The second is the auto-reset.  If you have it enabled, this is basically what happens at
23:59 after the daily logs and such have been saved.  This will also set the WDT bit in the
info flags and will be indistinguishable from a crash WDT reset.

Now, what has been happening with the Classic running under certain routers and
DHCP IP number giver-outers is that the Classic gets a data abort and goes into
a data abort loop which, in five seconds, will reset the Classic due to WDT timeout.

The latest firmware (not out yet ?) has a few modbus registers that may be
useful to look at after one of these WDT resets due to the data abort.

DabtU32Debug02   which is register  4342, 4343    (4343, 4344 if using the usual +1 modbus spec)

I have only found this to hold 0x 0000 A4ED  and I do not expect to get a different
result until another network code release.  This has already narrowed it down
considerably.
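
If you read the two registers with your own Modbus tool, the raw values still need joining into one 32-bit number. The Python sketch below shows the arithmetic only; it makes no Modbus calls. The helper names are made up, and which register carries the low word is an assumption, so flip `low_word_first` if the result looks wrong.

```python
def to_modbus_address(register: int, one_based: bool = False) -> int:
    """Shift a register number by one for tools that use the usual +1 convention."""
    return register + 1 if one_based else register


def combine_u32(first: int, second: int, low_word_first: bool = True) -> int:
    """Join two 16-bit register values into one unsigned 32-bit value.

    Which register holds the low word is an assumption in this sketch.
    """
    for word in (first, second):
        if not 0 <= word <= 0xFFFF:
            raise ValueError("register values must fit in 16 bits")
    low, high = (first, second) if low_word_first else (second, first)
    return (high << 16) | low


regs = (4342, 4343)
shifted = [to_modbus_address(r, one_based=True) for r in regs]
print(shifted)                       # register numbers in the +1 convention

# Example raw readings chosen to reproduce the value quoted above.
value = combine_u32(0xA4ED, 0x0000)
print(f"0x{value:08X}")
```

With the example readings the result prints as 0x0000A4ED, matching the value quoted above.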

So you might keep those numbers handy for the next release so if you find something
happening, you can check that and give us a report that may very well help us out
in debugging.

Like I say, I have only found this problem with certain routers, so far as I can tell.
Mine at home is a Westell DSL router.  I know that Ross W in Australia has seen
this with his FreeBSD server.

boB
K7IQ 🌛  He/She/Me

TomW

boB;

I switched from a Linksys WRT54 (GL) to a Dlink router and still get the resets.

Just FYI

When I get a bit more caught up in a couple of days, I will try one of the other routers I have, probably the Belkin.

Tom

Do NOT mistake me for any kind of "expert".

( ͡° ͜ʖ ͡°)


24 Trina 310 watt modules, SMA SunnyBoy 7.7 KW Grid Tie inverter.

I thought that they were angels, but much to my surprise, We climbed aboard their starship and headed for the skies

dgd

Quote from: boB on May 21, 2013, 02:02:15 PM

The latest firmware (not out yet ?) has a few modbus registers that may be
useful to look at after one of these WDT resets due to the data abort.

DabtU32Debug02   which is register  4342, 4343    (4343, 4344 if using the usual +1 modbus spec)

I have only found this to hold 0x 0000 A4ED  and I do not expect to get a different
result until another network code release.  This has already narrowed it down
considerably.

With current release firmware these registers are hex 0000 0000 and after one of these WD resets are still 0000 0000 (as are all registers above 4327 which is 008e)

Dgd

boB

Quote from: dgd on May 21, 2013, 05:14:35 PM
Quote from: boB on May 21, 2013, 02:02:15 PM

The latest firmware (not out yet ?) has a few modbus registers that may be
useful to look at after one of these WDT resets due to the data abort.

DabtU32Debug02   which is register  4342, 4343    (4343, 4344 if using the usual +1 modbus spec)

I have only found this to hold 0x 0000 A4ED  and I do not expect to get a different
result until another network code release.  This has already narrowed it down
considerably.

With current release firmware these registers are hex 0000 0000 and after one of these WD resets are still 0000 0000 (as are all registers above 4327 which is 008e)

Dgd


OK, so it may have to be a beta just for you guys then.

boB

K7IQ 🌛  He/She/Me

aroxburgh

Quote from: boB on May 21, 2013, 02:02:15 PM
No open sores.    But maybe something else eventually might be useful to be open ?
Perhaps the whole "back end" processing could be open? I've seen some very effective and eye-catching PC-based consoles for solar PV, that make good use of the entire PC screen. A large remote console with data analysis can be a lot more effective as the "face of your product" in the market, than a built-in small LCD and keypad, even though the integrated small console is still very useful for initial setup and validation.

BTW, one of the reasons (apart from the art deco look, which I love) why I purchased the Classic 150 was MidNite's published networking vision, leveraging Ethernet for remote monitoring and control from anywhere. So, for me, you are headed in the right direction. Now you just have to improve the reliability (MTB soft F) by a factor of a million or so....I don't want more than one glitch per year!!!     ;)

Quote
The WDT timeout period is 5 seconds.  The feeding of the WD timer happens in a semi-threaded
manner, meaning that the code must go through at least 2 spots in different areas of the code
before that time is up.
...Now, what has been happening with the Classic running under certain routers and
DHCP IP number giver-outers is that the Classic gets a data abort and goes into
a data abort loop which, in five seconds, will reset the Classic due to WDT timeout.
...
boB

...so if you can handle the data abort error condition with some additional code, you can get out of the infinite loop and prevent the WD timeout?

Al
AJ4RF

aroxburgh

Quote from: boB on May 21, 2013, 02:02:15 PM
...
Now, what has been happening with the Classic running under certain routers and
DHCP IP number giver-outers is that the Classic gets a data abort...which, in five seconds, will reset the Classic due to WDT timeout.
...
boB

boB:

It is great to fix problems, but work-arounds can be effective too, at least as a stop-gap measure.

Therefore, can you give us a list of the offending "certain routers", as well as a list of the proven good models, if any?
Also, does the data you've received from Classic users indicate that there is any dependency on DHCP vs static IP, or any other router or Classic settings?

This way we should be able to determine if the problem only occurs with certain routers/settings, or actually does occur with all routers.

Al