The blackbox project

Started by zoneblue, September 15, 2013, 08:48:04 PM


zoneblue

Just to give an idea of where this is heading, here's a screenshot of the default view screen. It updates every minute. By clicking setup you can specify which datapoints you want to see where, and which series to plot on graphs. You can add your own templates to represent the data pretty much any way you want.

It takes the cubieboard 2 seconds to render that page, and that's with most of the heavy lifting having been done previously by a cronjob. The plan is to remove the database, because it isn't adding anything except overhead.

The code and a demo will be up in the next few days. At that point it's wireframe / proof of concept only, and anyone who wants to help shape its evolution is more than welcome to chip in.

PS. See the flat spot between 7 and 8am. That's when the Classic's ethernet locked up, only fixable by rebooting the Classic.

6x300W CSUN, ground mount, CL150Lite, 2V/400AhToyo AGM,  Outback VFX3024E, Steca Solarix PL1100
http://www.zoneblue.org/cms/page.php?view=off-grid-solar

atop8918

That looks fantastic! Excellent work. I'm sorry about the Classic Ethernet issues.

Halfcrazy

Zoneblue, what modbus library are you using? Also, I'm not sure how well it's documented, but the older code (1401 and back) had a 30-second inactivity timer on the TCP/IP stack: after no modbus reads for 30 seconds it will close the connection. The new beta code on the website has a 5-minute inactivity timer.

Ryan
Changing the way wind turbines operate one smoke filled box at a time

zoneblue

Project site is now up: http://code.google.com/p/theblackboxproject/

For now you'll need svn. Remember the code is still way, way pre-release.

Re firmware, I'm still using 1370. I will try 1549 if you think it will help.

zoneblue

Quote from: Halfcrazy on September 16, 2013, 07:26:08 AM
Zoneblue, what modbus library are you using?
I have tried phpmodbus, pymodbus and newmodbus. The last is by far the least affected by this issue. The reason is that it is the lightest and quickest, and the issue to my mind is a timing thing.

I don't know if you recall, but back with 1070 and the local app at the time it had these exact same symptoms: random disconnects, failures to connect, timeouts. Something about timing has been tweaked, because with 1370 and the local app of its era, the local app disconnects went away.

If you want to reproduce this issue, installing the black box on your lab rPi will most likely do it pretty quickly.

Quote
Also, I'm not sure how well it's documented, but the older code (1401 and back) had a 30-second inactivity timer on the TCP/IP stack: after no modbus reads for 30 seconds it will close the connection. The new beta code on the website has a 5-minute inactivity timer.
Ryan

Wasn't that for outgoing http posts, for dns timeouts etc? I'm struggling to see how that is related to incoming modbus connects. But I will try it if it helps progress this.

atop8918

The 30 second disconnect is a safety thing. TCP/IP has no built-in keep-alive mechanism (although some stacks implement it as a hack). That is left up to the application stack. MODBUS also does not specify a keep-alive mechanism. The Classic, since it only has one precious listening port, checks for activity on the port. If it is silent for more than 30 seconds then it automatically terminates the connection. This means you must ask for at least one register (valid or not) at least once every 30 seconds or your application will be disconnected.
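The "ask for at least one register" keep-alive can be as small as a single Modbus TCP read request. A minimal sketch of building such a frame, assuming function code 0x03 (Read Holding Registers) and unit id 1; the address 4100 is only a placeholder, and how register numbers such as 4101 map to wire addresses should be checked against the Classic's Modbus documentation:

```python
import struct

def build_read_request(transaction_id, start_address, quantity, unit_id=1):
    """Build a Modbus TCP 'Read Holding Registers' (function 0x03) request frame."""
    # PDU: function code, starting address, number of registers (all big-endian)
    pdu = struct.pack(">BHH", 0x03, start_address, quantity)
    # MBAP header: transaction id, protocol id (always 0 for Modbus),
    # count of bytes that follow the length field (unit id + PDU), unit id
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu

# Placeholder address and unit id, for illustration only.
frame = build_read_request(1, 4100, 1)
print(frame.hex())  # 000100000006010310040001
```

Sending a frame like this on an open connection more often than the inactivity timer (30 seconds on the older firmware, per the posts above) should count as activity.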

zoneblue

Interesting. An easy way for me to test that is to shorten my sample interval from 60s to 20s.

I'd assumed I was making brand new connections every 60 seconds. So, when 60s later you make a new request, what happens? I'm getting a mixture of connection refused and timeouts. What's causing those odd connection failures, and then later the ethernet port shutting down completely?

Out of interest, how frequently can you make modbus over ethernet requests without impacting the controller's resources?


RossW

Quote from: zoneblue on September 17, 2013, 06:24:34 PM
I'd assumed I was making brand new connections every 60 seconds. So, when 60s later you make a new request, what happens? I'm getting a mixture of connection refused and timeouts. What's causing those odd connection failures, and then later the ethernet port shutting down completely?

For what it's worth, my newmodbus when run in loop (continuous) mode will leave the connection open and just keep making queries at the appropriate time. You can force it to close and re-open the connection with the -r switch (Re-open) which instructs it to open, read/write, close. This was mainly intended to help it co-exist with a local-app on the same network, although I don't know if the local-app closes the connection or just hogs it :)
3600W on 6 tracking arrays.
7200W on 2 fixed array.
Midnite Classic 150
Outback Flexmax FM80
16 x LiFePO4 600AH cells
16 x LiFePO4 300AH cells
Selectronics SP-PRO 481 5kW inverter
Fronius 6kW AC coupled inverter
Home-brew 4-cyl propane powered 14kVa genset
2kW wind turbine

atop8918

The Local App is a resource hog! It grabs the connection and tries never to let go. If a communications error causes it to shut down the connection then it makes a grab for it again as soon as possible. Greedy, greedy, greedy. RossW's connect-poll-disconnect architecture is a much more altruistic method and definitely the one to emulate. I would attempt to do the same on the Local app but it would mean tearing it apart and leaving it scattered all over the yard for a few months. By then I'd be too lazy to put it all back together and I'd buy a used Miata kit instead.

As far as the Classic goes, you can poll for registers as fast as it can give them to you. I've run modscan down to 50ms. If you are opening a connection and leaving it open and then opening another, you might be running into problems on the PC side rather than the Classic. The Classic _should_ just ignore the new connection request, although different IP stacks sometimes employ sneaky ways to take over a connection, some of which the Classic's tiny stack can't currently cope with.

The recommended way to talk to the Classic is to open a connection to it, run your register poll, then close the connection.
You can also do what the Local App does, which is to open a connection and then poll for registers at least once every 30 seconds, down to once every 50ms if you need that resolution. You might start getting modbus errors at that speed, though, depending on how loaded the communications stack is.
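The open, poll, close pattern can be sketched with nothing but the standard socket module. This is an illustration, not the Local App's or newmodbus's actual code: `poll_once` and `recv_exact` are hypothetical names, `request` is assumed to be an already-built Modbus TCP frame, and port 502 is the standard Modbus TCP port (assumed here to be what the Classic listens on).

```python
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes, or raise ConnectionError if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("peer closed the connection")
        buf.extend(part)
    return bytes(buf)

def poll_once(host, request, port=502, timeout=5.0):
    """Open a connection, send one request, read one reply, then close."""
    # The 'with' block closes the socket on success and on any exception.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        header = recv_exact(sock, 7)                  # MBAP header
        length = struct.unpack(">H", header[4:6])[0]  # bytes after the length field
        body = recv_exact(sock, length - 1)           # unit id is already in the header
        return header + body
```

Refused connections and timeouts both surface as `OSError` subclasses, so a caller can wrap `poll_once(...)` in `try ... except OSError` and record the failure instead of writing to a half-open socket.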

If you need to write registers then the second method is a little easier since you will have to write the unlock registers every time you open a new connection to the Classic. You can disable this behavior by installing the "Lock" jumper, though.

Keep in mind that you should be doing full error checking when doing TCP/IP communications. There are a lot of things that can go wrong and if you aren't keeping track of IOExceptions/Network Errors/TCP Errors/etc, then you may be doing things like attempting to write data to a closed or partially open connection. This may have bad consequences on the Classic if your modbus engine is trying to alert you that there is an error on the bus but you are still trying to write data to a partially open connection.

For example if you open a connection to the Classic and then wait 60 seconds the Classic will timeout after 30 seconds and attempt to close the connection. If you ignore the close indication in your application and simply try to write to the existing connection then the behavior is undefined. Your PC stack _should_ close the connection as per the Classic's request but some leave it partially open as an optimization in case you want to re-open it later.

Case in point: the Local Application keeps its connection open partially because you cannot fully close a TCP socket under AIR SDK 2.5 without shutting down the application. What this means is that as far as the Classic is concerned a connection may be closed, but the PC still keeps it open in order to speed up the process of re-opening it. It does alert the application code by throwing an IOError event which indicates that the connection has been closed by the other end which I then use to mark the connection "Closed" in my application code and then stop writing to the socket.

If I ignore that "IOError" event then my application can still write to the socket, the stack assumes I know what I'm doing and then puts the data on the bus. If the Classic has not finished its internal socket cleanup code then some of the data may get processed which leaves the socket in a bad state. I have attempted to address this issue in the firmware, but there are still some small border cases in which the connection can still get confused. If your code is written in Java or uses .NET or Python (or AIR) then this is a possibility.

It is best for your application to honor and process the error signals coming from your modbus stack. If the modbus stack is not sending Close or error events then you should probably use a different stack.
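One way to honor those signals is to wrap the socket so that the first error or remote close marks the connection dead and blocks any further writes. A minimal sketch, assuming a plain TCP socket; `GuardedSocket` is a hypothetical name, not part of any modbus stack mentioned here:

```python
import socket

class GuardedSocket:
    """Wrap a connected socket; refuse further I/O once an error or remote close is seen."""

    def __init__(self, sock):
        self.sock = sock
        self.closed = False

    def _fail(self):
        # Remember the failure and release the socket so nothing else writes to it.
        self.closed = True
        self.sock.close()

    def send(self, data):
        if self.closed:
            raise ConnectionError("connection already marked closed")
        try:
            self.sock.sendall(data)
        except OSError:
            self._fail()
            raise

    def recv(self, n):
        if self.closed:
            raise ConnectionError("connection already marked closed")
        try:
            data = self.sock.recv(n)
        except OSError:
            self._fail()
            raise
        if not data:  # an empty read means the other end closed the connection
            self._fail()
            raise ConnectionError("peer closed the connection")
        return data
```

After any failure the caller would open a fresh connection rather than reuse the old one.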

zoneblue

Andy,

Thanks for your considered response on this. I see that for connections that are kept open, there are some factors to consider there for sure. Are you saying that in general we /should/ be aiming to keep a connection open?

The model I am using is the same cautious approach that newmodbus takes by default: open, get the data, close, repeat 60s later. Same reason: to allow the local app to work, for configuration purposes. Is there something about this methodology that is problematic for the Classic? Does it take a long time to reissue new connections?

Newmodbus is very, very fast, on the order of 10 times faster than any other method. The fact that the data doesn't hit the db until 2 or 3 seconds past the minute is caused solely by my post-processing.

It's not ideal that I lose data when the local app is running, but in the scheme of things it's something we can live with.

But something is going on somewhere. The current blackbox trunk code results, on the A10 cubieboard, in roughly 1 failed connect per hour, and a complete lockout about once every 30 hours. You can tell by the timestamp in the db whether it timed out or was refused. Lately they are more the refused type. Refused entries appear at 2 seconds past the minute; timed-out entries appear at 32 seconds past the minute, as newmodbus appears to retry for 30 seconds and then give up.
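That timestamp rule of thumb can be written down as a small helper. This is only a sketch of the heuristic described above: the function name is hypothetical, and the 30-second cut-off assumes newmodbus really does retry for about 30 seconds before giving up.

```python
from datetime import datetime

def classify_failure(date_created, retry_window_s=30):
    """Guess the failure type from how far past the minute the row was written."""
    ts = datetime.strptime(date_created, "%Y-%m-%d %H:%M:%S")
    # A row written after the retry window implies the read timed out;
    # a row written a few seconds past the minute implies an immediate refusal.
    return "timed out" if ts.second >= retry_window_s else "refused"

print(classify_failure("2013-09-19 09:33:04"))  # refused
print(classify_failure("2013-09-19 09:33:32"))  # timed out (hypothetical timestamp)
```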

As I have indicated, it very much depends on the load on the board as to how bad the problem is. I have, I believe, ruled out a bug in newmodbus (something like a connection left open), as my test using /modules/midnite_classic/readclassic.py exhibits refused connections at a rate about 4 times worse.

Here's the PHP from /modules/midnite_classic.php, line 501 ($binary is newmodbus, and that function is called from cron each minute, so it's a new process each time):


//invoke the binary and parse the results
if ($this->debug) print "\nInvoke binary: $dir/$binary $ip";
exec("$dir/$binary $ip 16385-16390 4101-4340", $lines,$ret);

if ($ret) {
    $this->error= true;
    if ($this->debug) print "\nRead device: FAIL";
    return false;
}
else {
    foreach ($lines as $line) {
        if (!preg_match("/^\d\d\d\d\d? /",$line)) continue;
        list($register,$value)= explode(" ",$line);
        $this->registers[$register]= $value;
    }
    if ($this->debug) print "\nRead registers: ".count($this->registers);
}
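For comparison, the same invoke-and-parse step could look like this in Python. It is a sketch only, not the project's readclassic.py: the function names are hypothetical, and the "register value" line format is inferred from the regex in the PHP above.

```python
import re
import subprocess

# Same test as the PHP regex: a 4 or 5 digit register number followed by a space.
LINE_RE = re.compile(r"^\d{4,5} ")

def parse_register_lines(lines):
    """Turn 'register value' lines into a dict, skipping anything else."""
    registers = {}
    for line in lines:
        if not LINE_RE.match(line):
            continue
        register, value = line.split(" ")[:2]
        registers[register] = value
    return registers

def read_device(binary, ip, ranges=("16385-16390", "4101-4340")):
    """Run the binary once; return the parsed registers, or None on a non-zero exit."""
    # Passing arguments as a list avoids going through a shell.
    result = subprocess.run([binary, ip, *ranges], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return parse_register_lines(result.stdout.splitlines())

# Made-up sample lines, for illustration only.
print(parse_register_lines(["4115 271", "some banner text", "16385 5"]))
```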


Just to repeat for clarity: on an Intel Atom dual core mini-ITX board, this problem does not occur at all. But I have previously seen issues with local app disconnects on netbook computers, which I believe are directly related. On the cubieboard, the failed connects occur when the board is loaded. If there are no web pages being served by lighttpd, for instance, the script talking to the Classic gets left alone to run more or less happily. The cronjob is the only script that talks to the Classic; the web pages deliver data that has previously been stored in the database. That, combined with the 2s per 60s cron constraint, means there is no chance of overlapping connect attempts.

I have not tried using newmodbus's regular write-to-disk function.

Edit: I just had a thought. The once per hour connect refused may be related to a rolling overlap between newmodbus running and a web browser page render. I.e. the view page refreshes every minute, but because of page delivery overhead it's going to take, say, 61 seconds, and therefore periodically overlap the 60s cron, about once an hour... That means, on top of the extra load on the CPU, extra *network traffic*. Could this be a clue? I'll leave the view screen off today to confirm.
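The "about once an hour" figure follows from the two periods: a 61-second cycle slips 1 second against a 60-second cycle each time round, so the two line up again after 60 page cycles. A quick check of that arithmetic, taking the 61-second page cycle as the guess it is:

```python
from fractions import Fraction

def overlap_period(period_a, period_b):
    """Seconds between successive coincidences of two periodic events."""
    a, b = Fraction(period_a), Fraction(period_b)
    if a == b:
        raise ValueError("equal periods never drift relative to each other")
    return a * b / abs(a - b)

seconds = overlap_period(60, 61)   # cron every 60 s, page refresh assumed to be 61 s
print(float(seconds) / 60)         # 61.0 (minutes)
```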

zoneblue

Bang goes that theory, it seems particularly bad this morning:


id state tbat tcc pout vpv ipv vout iout date_created infoflags code
----------------------------------------------------------------------------------------------------------
36140 4 10.1 34.0 478.0 76.2 6.1 27.2 17.4 2013-09-19 09:38:04 838873092 0
36139 4 10.1 33.8 229.0 83.6 2.5 27.1 8.4 2013-09-19 09:37:04 838873092 0
36138 4 10.1 33.6 523.0 76.6 6.7 27.2 19.1 2013-09-19 09:36:04 838873092 0
36137 4 10.0 33.3 276.0 82.1 3.2 26.9 10.2 2013-09-19 09:35:04 838873092 0
36136 4 10.0 33.0 342.0 76.9 4.3 26.9 12.7 2013-09-19 09:34:04 838873092 0
36135 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:33:04 0 1
36134 4 10.0 32.9 326.0 76.2 4.1 26.9 12.1 2013-09-19 09:32:03 838873092 0
36133 4 10.0 32.9 450.0 74.3 5.9 27.1 16.5 2013-09-19 09:31:04 838873092 0
36132 4 10.0 32.7 425.0 76.3 5.5 27.0 15.7 2013-09-19 09:30:04 838873092 0
36131 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:29:04 0 1
36130 4 10.0 32.2 297.0 74.6 3.8 26.6 11.1 2013-09-19 09:28:03 838873092 0
36129 4 10.0 32.3 235.0 75.7 2.9 26.6 8.8 2013-09-19 09:27:04 838873092 0
36128 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:26:04 0 1
36127 4 10.0 32.2 286.0 72.9 3.7 26.7 10.7 2013-09-19 09:25:03 838873092 0
36126 4 10.0 32.1 266.0 74.1 3.4 26.7 9.9 2013-09-19 09:24:04 838873092 0
36125 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:23:04 0 1
36124 4 10.0 32.1 427.0 74.9 5.5 27.0 15.7 2013-09-19 09:22:03 838873092 0
36123 4 10.0 32.0 406.0 75.5 5.2 26.9 15.0 2013-09-19 09:21:04 838873092 0
36122 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:20:04 0 1
36121 4 10.0 31.8 376.0 75.8 4.8 26.9 13.9 2013-09-19 09:19:03 838873092 0
36120 4 10.0 31.7 383.0 77.3 4.8 26.9 14.2 2013-09-19 09:18:04 838873092 0
36119 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:17:04 0 1
36118 4 10.0 31.2 329.0 69.3 0.0 26.8 0.0 2013-09-19 09:16:03 838873092 0
36117 4 10.0 31.2 356.0 74.2 4.6 26.8 13.2 2013-09-19 09:15:04 838873092 0
36116 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:14:04 0 1
36115 4 10.0 30.9 219.0 76.5 0.0 26.8 8.2 2013-09-19 09:13:03 838873092 0
36114 4 9.9 30.7 222.0 78.1 0.0 26.8 8.3 2013-09-19 09:12:03 838873092 0
36113 4 10.0 30.6 373.0 76.1 4.8 26.8 13.9 2013-09-19 09:11:04 838873092 0
36112 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:10:04 0 1
36111 4 9.9 30.1 127.0 82.8 0.0 26.5 0.0 2013-09-19 09:09:03 838873092 0
36110 4 10.0 30.0 327.0 73.7 4.2 26.5 12.3 2013-09-19 09:08:03 838873092 0
36109 4 9.9 29.8 94.0 81.6 0.0 26.3 3.6 2013-09-19 09:07:03 838873092 0
36108 4 9.9 29.6 89.0 83.0 1.1 26.3 3.4 2013-09-19 09:06:03 838873092 0
36107 4 9.9 29.4 81.0 82.1 0.9 26.1 3.1 2013-09-19 09:05:03 838873092 0
36106 4 9.9 29.3 78.0 83.4 0.8 26.0 3.0 2013-09-19 09:04:03 838873092 0
36105 4 9.9 29.0 65.0 82.9 0.6 25.9 2.4 2013-09-19 09:03:03 838873092 0
36104 4 9.8 28.9 62.0 81.6 0.5 25.8 2.4 2013-09-19 09:02:03 838873092 0
36103 4 9.9 28.8 51.0 80.9 0.4 25.7 2.0 2013-09-19 09:01:03 838873092 0
36102 4 9.9 28.8 72.0 80.6 0.6 25.8 2.8 2013-09-19 09:00:03 838873092 0
36101 4 9.8 28.8 145.0 72.5 1.8 25.8 5.6 2013-09-19 08:59:03 838873092 0
36100 4 9.9 28.8 150.0 72.5 1.9 25.7 5.8 2013-09-19 08:58:04 838873092 0
36099 4 9.8 28.8 145.0 73.0 1.7 25.7 5.6 2013-09-19 08:57:04 838873092 0
36098 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 08:56:04 0 1
36097 4 9.9 28.7 108.0 78.1 1.2 25.7 4.2 2013-09-19 08:55:03 838873092 0
36096 4 9.9 28.6 165.0 74.9 2.0 25.6 6.3 2013-09-19 08:54:04 838873092 0
36095 4 9.9 28.5 138.0 74.3 1.6 25.6 5.4 2013-09-19 08:53:04 838873092 0

atop8918

I'm not sure what is happening then. I wonder if RossW or Tom could chime in if they've seen similar issues?


RossW

Quote from: atop8918 on September 19, 2013, 03:53:33 AM
I'm not sure what is happening then. I wonder if RossW or Tom could chime in if they've seen similar issues?

I've been running newmodbus since shortly after I got my classic late last year. It polls my classic every 300 seconds from one of my FreeBSD boxes (a non-descript, low-power 1.2GHz box) but I have never, not once, experienced this problem without direct and deliberate provocation (like intentionally running multiple instances concurrently).

The only problem I've experienced has been the 2 to 9 times a day spontaneous reset, which "went away" after my classic blew up and we kludged it back into operation (a) at reduced power and (b) in legacy mode.

TomW

Quote from: atop8918 on September 19, 2013, 03:53:33 AM
I'm not sure what is happening then. I wonder if RossW or Tom could chime in if they've seen similar issues?

Wish I could offer insight. I have no idea what that bunch of data means without knowing what it represents. I left my crystal ball in my other pants.

When mine locks up comms it simply refuses any connections. Ping gets replies but no access otherwise. It doesn't do it X times, it's locked until reboot.

I will say it again in case folks missed it. I have one that loses comms regularly, say about every 24 hours, and one that has done it before but has not for a very long time. The units are on the same firmware, but the one that is apparently fine has a SN that is +6000 from the one that is a problem.

Sorry I have no clue on what this data means.

Tom


Do NOT mistake me for any kind of "expert".

( ͡° ͜ʖ ͡°)


24 Trina 310 watt modules, SMA SunnyBoy 7.7 KW Grid Tie inverter.

I thought that they were angels, but much to my surprise, We climbed aboard their starship and headed for the skies

zoneblue

In the table I posted above, the final column, "code", is the return value from newmodbus. It is 0 for ok, and 1 stands for any positive return value, indicating a failed connect. Here it'll drop occasional connects before closing completely.
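With that reading of the code column, the failed rows and the spacing between them can be pulled out of the table mechanically. A small sketch, assuming the whitespace-separated layout shown above (the function names are hypothetical, and only four rows copied from the table are used):

```python
from datetime import datetime

# date_created splits into two whitespace-separated tokens: date and time.
COLUMNS = ["id", "state", "tbat", "tcc", "pout", "vpv", "ipv", "vout", "iout",
           "date", "time", "infoflags", "code"]

def parse_row(line):
    """Split one table row into a dict keyed by column name."""
    return dict(zip(COLUMNS, line.split()))

def failure_gaps_minutes(lines):
    """Minutes between consecutive failed rows (code != 0), oldest first."""
    times = sorted(
        datetime.strptime(f"{row['date']} {row['time']}", "%Y-%m-%d %H:%M:%S")
        for row in map(parse_row, lines)
        if int(row["code"]) != 0
    )
    return [round((b - a).total_seconds() / 60) for a, b in zip(times, times[1:])]

ROWS = [
    "36135 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:33:04 0 1",
    "36134 4 10.0 32.9 326.0 76.2 4.1 26.9 12.1 2013-09-19 09:32:03 838873092 0",
    "36131 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:29:04 0 1",
    "36128 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2013-09-19 09:26:04 0 1",
]
print(failure_gaps_minutes(ROWS))  # [3, 4]
```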

Ross, do you have a list of return values for newmodbus?

As to why yesterday was worse, the only thing I can think of is that I've had other network traffic running through the main switch: not heavy, but ongoing.

I've been meaning to add a second network switch closer to the Classic, so I'll try that soon as well.

Ross, your pattern of non-problems matches what I see when I replicate this whole setup on a bigger computer. I had assumed yours was also a "bigger computer", but if it's 1.2GHz maybe not so much. What is it, how many cores, etc.? Also, are you running it the same way as Tom, newmodbus -wi? Your interval is longer, at 5 mins. I've pondered trying some other cron intervals, like 5 mins, and I guess trying Andrew's method of keeping the connection open. Other than that I'm out of ideas as well.

Bob mentioned something on the resets thread yesterday about another new firmware coming.
