/home/production/cvs/JSOC/doc/whattodolev0.txt 25Nov2008

------------------------------------------------
WARNING!! Some of this is outdated. 3Jun2010
Please see more recent what*.txt files, e.g.
whattodo_start_stop_lev1_0_sums.txt
------------------------------------------------

------------------------------------------------------
Running Datacapture & Pipeline Backend lev0 Processing
------------------------------------------------------


NOTE: For now, this is all done from the xim workstation (Jim's office).

Datacapture:
--------------------------

NOTE/IMPORTANT: Please keep in mind that each datacapture machine has its
own independent /home/production.

FORMERLY: 1. The datacapture system for AIA/HMI is by convention dcs0/dcs1
respectively. If the spare dcs2 is to be put in place, it is renamed dcs0
or dcs1, and the original machine is renamed dcs2.

1. The datacapture machine serving for AIA or HMI is determined by
the entries in:

/home/production/cvs/JSOC/proj/datacapture/scripts/dcstab.txt

This file is listed or edited with the program:

/home/production/cvs/JSOC/proj/datacapture/scripts> dcstab.pl -h
Display or change the datacapture system assignment file.
Usage: dcstab [-h][-l][-e]
-h = print this help message
-l = list the current file contents
-e = edit with vi the current file contents

For dcs3 the dcstab.txt would look like:
AIA=dcs3
HMI=dcs3
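
As a quick check, -l simply prints the current file contents; on the normal
onsite convention described above (dcs0 for AIA, dcs1 for HMI) the listing
would be expected to look like:

> dcstab.pl -l
AIA=dcs0
HMI=dcs1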


1a. The spare dcs2 normally serves as a backup destination for the postgres
running on dcs0 and dcs1. You should see this postgres cron job on dcs0
and dcs1, respectively:

0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs0_to_dcs2.pl
0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs1_to_dcs2.pl

For this to work, the following must be done on dcs0, dcs1 and dcs2, as user
postgres, after any reboot:

> ssh-agent | head -2 > /var/lib/pgsql/ssh-agent.env
> chmod 600 /var/lib/pgsql/ssh-agent.env
> source /var/lib/pgsql/ssh-agent.env
> ssh-add
(The password is written on my whiteboard (same as production's))
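
A quick sanity check that the agent is usable (from dcs0 or dcs1; dcs2 is
the backup destination used by the cron jobs above):

> source /var/lib/pgsql/ssh-agent.env
> ssh-add -l          (should list the key that was just added)
> ssh dcs2 hostname   (should print the remote hostname with no password prompt)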

2. Login as user production via j0 (the password is on Jim's whiteboard).

3. The Postgres must be running; it is started automatically on boot:

> ps -ef |grep pg
postgres 4631 1 0 Mar11 ? 00:06:21 /usr/bin/postmaster -D /var/lib/pgsql/data
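
If in doubt, a quick connectivity check (jsocdc is the SUMS database named
in step 5; this assumes the production account has its usual DB access):

> psql jsocdc -c 'select version();'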

4. The root of the datacapture tree is /home/production/cvs/JSOC.
Production runs as user id 388.

5. The sum_svc is normally running:

> ps -ef |grep sum_svc
388 26958 1 0 Jun09 pts/0 00:00:54 sum_svc jsocdc

Note the SUMS database is jsocdc. This is a separate DB on each dcs.

6. To start/restart the sum_svc and related programs (e.g. tape_svc) do:

> sum_start_dc
sum_start at 2008.06.16_13:32:23
** NOTE: "soc_pipe_scp jsocdc" still running
Do you want me to do a sum_stop followed by a sum_start for you (y or n):

You would normally answer 'y' here.
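
A quick way to confirm that SUMS came back up after the restart:

> ps -ef | egrep 'sum_svc|tape_svc'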

7. To run the datacapture gui that will display the data, mark it for archive,
optionally extract lev0 and send it on to the pipeline backend, do this:

> cd /home/production/cvs/JSOC/proj/datacapture/scripts
> ./socdc

All you would normally do is hit "Start Instances for HMI" (or AIA),
depending on which datacapture machine you are on.

8. To optionally extract lev0 do this:

> touch /usr/local/logs/soc/LEV0FILEON

To stop lev0:

> /bin/rm /usr/local/logs/soc/LEV0FILEON

The last 100 images for each VC are kept in /tmp/jim.

NOTE: If you turn lev0 on, processing becomes sensitive to the incoming data,
and you may see failures like the one below, in which case you have to
restart socdc:

ingest_tlm: /home/production/cvs/EGSE/src/libhmicomp.d/decompress.c:1385: decompress_undotransform: Assertion `N>=(6) && N<=(16)' failed.
kill: no process ID specified

9. The datacapture machines automatically copy DDS input data to the
pipeline backend on /dds/socdc living on d01. This is done by the program:

> ps -ef |grep soc_pipe_scp
388 21529 21479 0 Jun09 pts/0 00:00:13 soc_pipe_scp /dds/soc2pipe/hmi /dds/socdc/hmi d01i 30

This requires that an ssh-agent be running. If you reboot a dcs machine do:

> ssh-agent | head -2 > /var/tmp/ssh-agent.env
> chmod 600 /var/tmp/ssh-agent.env
> source /var/tmp/ssh-agent.env
> ssh-add (or for sonar: ssh-add /home/production/.ssh/id_rsa)
(The password is written on my whiteboard)

NOTE: cron jobs use this /var/tmp/ssh-agent.env file.

If you want another window to use the ssh-agent that is already running do:
> source /var/tmp/ssh-agent.env

NOTE: on any one machine for user production there should be just one
ssh-agent running.


If you see that a dcs has asked for a password, the ssh-agent has failed.
You can probably find an error msg on d01 like 'invalid user production'.
You should exit the socdc. Make sure there is no soc_pipe_scp still running.
Restart the socdc.
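
For example, to find and stop a leftover copy process before restarting:

> ps -ef | grep soc_pipe_scp
> kill <pid>       (pid taken from the ps listing above)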

If you find that there is a hostname for production that is not in the
/home/production/.ssh/authorized_keys file, then do this on the host that
you want to add:

Pick up the entry in /home/production/.ssh/id_rsa.pub
and put it in this file on the host that you want to have access to
(make sure that it's all one line):

/home/production/.ssh/authorized_keys
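
One way to do this in a single step from the dcs (newhost is a placeholder
for the host being granted access; it will prompt for that host's password
once):

> cat /home/production/.ssh/id_rsa.pub | ssh production@newhost 'cat >> ~/.ssh/authorized_keys'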

NOTE: DO NOT do a ssh-keygen or you will have to update all the hosts'
authorized_keys files with the new public key you just generated.

If not already active, then do what's shown above for the ssh-agent.


10. There should be a cron job running that will archive to the T50 tapes.
Note the names are asymmetric for dcs0 and dcs1:

30 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do

00 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do_dcs1
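
To confirm the job is installed on the machine you are on:

> crontab -l | grep tapearc_do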

In the beginning of the world, before any sum_start_dc, the T50 should have
a supply of blank tapes in its active slots (1-24). A cleaning tape must
be in slot 25. The imp/exp slots (26-30) must be vacant.
To see the contents of the T50 before startup do:

> mtx -f /dev/t50 status

Whenever sum_start_dc is called, all the tapes are inventoried and added
to the SUMS database if necessary.
When a tape is written full by the tapearc_do cron job, the t50view
display (see 11. and 12. below) 'Imp/Exp' button will increment its
count. Tapes should be exported before the count gets above 5.

11. The t50view program should be running to display/control the
tape operations:

> t50view -i jsocdc

The -i means interactive mode, which will allow you to change tapes.

12. Every 2 days, inspect the t50view display for the button on the top row
called 'Imp/Exp'. If it is non-zero (and yellow), then some full tapes can be
exported from the T50 and new tapes put in for further archiving.

Hit the 'Imp/Exp' button.
Follow all the directions explicitly.
The blank L4 tapes are in the tape room in the computer room.

When the tape drive needs cleaning, hit the "Start Cleaning" button on
the t50view gui.

13. There should be a cron job running as user production on both dcs0 and
dcs1 that will set the Offsite_Ack field in the sum_main DB table:

20 0 * * * /home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl

Where:
#/home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl
#
#This reads the .ver files produced by Tim's
#/home/production/tape_verify/scripts/run_remote_tape_verify.pl
#A .ver file looks like:
## Offsite verify offhost:dds/off2ds/HMI_2008.06.11_01:12:27.ver
## Tape 0=success 0=dcs0(aia)
#000684L4 0 1
#000701L4 0 1
##END
#For each tape that has been verified successfully, this program
#sets the Offsite_Ack to 'Y' in the sum_main for all entries
#with Arch_Tape = the given tape id.
#
#The machine names where AIA and HMI processing live
#is found in dcstab.txt which must be on either dcs0 or dcs1
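
In other words, for each tape id listed as verified in a .ver file the
script effectively does the equivalent of the following (run on the dcs,
where the SUMS database jsocdc lives; shown here with the first tape id
from the sample above):

> psql jsocdc -c "update sum_main set offsite_ack='Y' where arch_tape='000684L4'"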

14. Other background info is in:

http://hmi.stanford.edu/development/JSOC_Documents/Data_Capture_Documents/DataCapture.html

***************************dcs3*********************************************
NOTE: dcs3 (i.e. the offsite datacapture machine shipped to Goddard Nov 2008)

At Goddard the dcs3 host name will be changed. See the following for
how to accommodate this:

/home/production/cvs/JSOC/doc/dcs3_name_change.txt

This cron job must be run to clean out /dds/soc2pipe/[aia,hmi]:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/production/cvs/JSOC/proj/datacapture/scripts/rm_soc2pipe.pl

Also on dcs3, the offsite_ack check and safe tape check are not done in:
/home/production/cvs/JSOC/base/sums/libs/pg/SUMLIB_RmDo.pgc

Also on dcs3, because there is no pipeline backend, no .arc file is
ever made for the DDS.
***************************dcs3*********************************************

Level 0 Backend:
--------------------------

!! Make sure to run Phil's script for watchlev0 in the background on cl1n001:
/home/production/cvs/JSOC/base/sums/scripts/get_dcs_times.csh

1. As mentioned above, the datacapture machines automatically copy DDS input
data to the pipeline backend on /dds/socdc living on d01.

2. The lev0 code runs as ingest_lev0 on the cluster machine cl1n001,
which has d01:/dds mounted. cl1n001 can be accessed through j1.

3. All 4 instances of ingest_lev0 for the 4 VCs are controlled by
/home/production/cvs/JSOC/proj/lev0/apps/doingestlev0.pl

If you want to start afresh, kill any ingest_lev0 running (this will later be
automated). Then do:

> cd /home/production/cvs/JSOC/proj/lev0/apps
> doingestlev0.pl (actually a link to start_lev0.pl)

You will see 4 instances started and the log file names will be shown.
You will be advised that to cleanly stop the lev0 processing, run:

> stop_lev0.pl

It may take a while for all the ingest_lev0 processes to get to a point
where they can stop cleanly.
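
To see the running instances (there should be one per VC, i.e. 4):

> ps -ef | grep ingest_lev0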

For now, every hour, the ingest_lev0 processes are automatically restarted.


4. The output is for the series:

hmi.tlmd
hmi.lev0d
aia.tlmd
aia.lev0d

#It is all saved in DRMS and archived.
Only the tlmd is archived. (See below if you want to change the
archiving status of a dataseries.)
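
To check the current archive flag of a series (this queries the same table
that is updated in the "Change archiving status" section at the end of this
file):

> psql -h hmidb jsoc -c "select seriesname, archive from hmi.drms_series where seriesname='hmi.tlmd'"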

5. If something in the backend goes down such that you can't run
ingest_lev0, then you may want to enable this cron job, which will
periodically clean out the /dds/socdc dir of the files that are
coming in from the datacapture systems (it is currently commented out):

> crontab -l
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXVnxDO9 installed on Mon Jun 16 16:38:46 2008)
# (Cron version V5.0 -- $Id: whattodolev0.txt,v 1.8 2009/08/03 18:24:23 production Exp $)
#0,20,40 * * * * /home/jim/cvs/jsoc/scripts/pipefe_rm

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Starting and stopping SUMS on d02:

Login as production on d02 and run:

sum_start_d02

(If SUMS is already running it will ask you if you want to halt it;
you normally say 'y'.)

To just stop SUMS, run:

sum_stop_d02

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

SUMS archiving:

Currently SUMS is archiving continuously. The script is:

/home/production/cvs/JSOC/base/sums/scripts/tape_do_0.pl (and _1, _2, _3)

To halt it do:

touch /usr/local/logs/tapearc/TAPEARC_ABORT[0,1,2]

Try to keep it running, as there is still much to be archived.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Change archiving status of a dataseries:

> psql -h hmidb jsoc

jsoc=> update hmi.drms_series set archive=0 where seriesname='hmi.lev0c';
UPDATE 1
jsoc=> \q

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The modified dcs reboot procedure is in ~kehcheng/dcs.reboot.notes.