/home/production/cvs/JSOC/doc/whattodolev0.txt 25Nov2008

------------------------------------------------
WARNING!! Some of this is outdated. 3Jun2010
Please see more recent what*.txt files, e.g.
whattodo_start_stop_lev1_0_sums.txt
------------------------------------------------

------------------------------------------------------
Running Datacapture & Pipeline Backend lev0 Processing
------------------------------------------------------


NOTE: For now, this is all done from the xim w/s (Jim's office)

Datacapture:
--------------------------

NOTE: IMPORTANT: Please keep in mind that each data capture machine has its
own independent /home/production.

FORMERLY: 1. The Datacapture system for aia/hmi is by convention dcs0/dcs1
respectively. If the spare dcs2 is to be put in place, it is renamed dcs0
or dcs1, and the original machine is renamed dcs2.

1. The datacapture machine serving for AIA or HMI is determined by
the entries in:

/home/production/cvs/JSOC/proj/datacapture/scripts/dcstab.txt

This is edited or listed by the program:

/home/production/cvs/JSOC/proj/datacapture/scripts> dcstab.pl -h
Display or change the datacapture system assignment file.
Usage: dcstab [-h][-l][-e]
-h = print this help message
-l = list the current file contents
-e = edit with vi the current file contents

For dcs3 the dcstab.txt would look like:
AIA=dcs3
HMI=dcs3
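
For example, to check or change the current assignment (using only the -l
and -e options from the help text above):

> cd /home/production/cvs/JSOC/proj/datacapture/scripts
> dcstab.pl -l      (lists the current AIA=... and HMI=... lines)
> dcstab.pl -e      (opens the file in vi if you need to change it)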


1a. The spare dcs2 normally serves as a backup destination of the postgres
running on dcs0 and dcs1. You should see this postgres cron job on dcs0
and dcs1, respectively:

0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs0_to_dcs2.pl
0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs1_to_dcs2.pl

For this to work, this must be done on dcs0, dcs1 and dcs2, as user
postgres, after any reboot:

> ssh-agent | head -2 > /var/lib/pgsql/ssh-agent.env
> chmod 600 /var/lib/pgsql/ssh-agent.env
> source /var/lib/pgsql/ssh-agent.env
> ssh-add
(The password is the same as production's)
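
To double check that the agent is usable, ssh-add -l should list the loaded
key rather than report "The agent has no identities":

> source /var/lib/pgsql/ssh-agent.env
> ssh-add -l      (should show the key fingerprint)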

2. Log in as user production via j0 (the password is on Jim's whiteboard).

3. The Postgres must be running and is started automatically on boot:

#######OLD#########################
#> ps -ef |grep pg
#postgres 4631 1 0 Mar11 ? 00:06:21 /usr/bin/postmaster -D /var/lib/pgsql/data
###################################

dcs0:/home/production> px postgres
postgres 6545 1 0 May04 ? 00:09:50 /usr/local/pgsql-8.4/bin/postgres -D /var/lib/pgsql/dcs0_data
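
If you also want to confirm that the server is accepting connections
(assuming local psql access to the jsocdc database works for this account),
something like this should return the version string:

> psql -d jsocdc -c 'select version();'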

4. The root of the datacapture tree is /home/production/cvs/JSOC.
Production runs as user id 388.

5. The sum_svc is normally running:

> ps -ef |grep sum_svc
388 26958 1 0 Jun09 pts/0 00:00:54 sum_svc jsocdc

Note the SUMS database is jsocdc. This is a separate DB on each dcs.

6. To start/restart the sum_svc and related programs (e.g. tape_svc) do:

> sum_start_dc
sum_start at 2008.06.16_13:32:23
** NOTE: "soc_pipe_scp jsocdc" still running
Do you want me to do a sum_stop followed by a sum_start for you (y or n):

You would normally answer 'y' here.

7. To run the datacapture gui that will display the data, mark it for archive,
optionally extract lev0 and send it on to the pipeline backend, do this:

> cd /home/production/cvs/JSOC/proj/datacapture/scripts
> ./socdc

All you would normally do is hit "Start Instances for HMI" (or AIA) for
whichever datacapture machine you are on.

8. To optionally extract lev0 do this:

> touch /usr/local/logs/soc/LEV0FILEON

To stop lev0:

> /bin/rm /usr/local/logs/soc/LEV0FILEON

The last 100 images for each VC are kept in /tmp/jim.
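
To see whether lev0 extraction is currently on, just check for the flag file:

> ls -l /usr/local/logs/soc/LEV0FILEON      (present = lev0 on, absent = off)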

NOTE: If you turn lev0 on, the processing becomes sensitive to the incoming
data, and you may see things like this, in which case you have to restart
socdc:

ingest_tlm: /home/production/cvs/EGSE/src/libhmicomp.d/decompress.c:1385: decompress_undotransform: Assertion `N>=(6) && N<=(16)' failed.
kill: no process ID specified

9. The datacapture machines automatically copy DDS input data to the
pipeline backend on /dds/socdc living on d01. This is done by the program:

> ps -ef |grep soc_pipe_scp
388 21529 21479 0 Jun09 pts/0 00:00:13 soc_pipe_scp /dds/soc2pipe/hmi /dds/socdc/hmi d01i 30

This requires that an ssh-agent be running. If you reboot a dcs machine do:

> ssh-agent | head -2 > /var/tmp/ssh-agent.env
> chmod 600 /var/tmp/ssh-agent.env
> source /var/tmp/ssh-agent.env
> ssh-add (or for sonar: ssh-add /home/production/.ssh/id_rsa)
(The password is written on my whiteboard)

NOTE: on some machines you may have to put the user name in
/etc/ssh/allowed_users

NOTE: cron jobs use this /var/tmp/ssh-agent.env file

If you want another window to use the ssh-agent that is already running do:
> source /var/tmp/ssh-agent.env

NOTE: on any one machine there should be just one ssh-agent running for
user production.
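
A quick sanity check: there should be exactly one ssh-agent for production,
and it should have the key loaded:

> ps -ef | grep ssh-agent | grep -v grep    (expect a single agent owned by production)
> source /var/tmp/ssh-agent.env
> ssh-add -l                                (should list the key fingerprint)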


If you see that a dcs has asked for a password, the ssh-agent has failed.
You can probably find an error msg on d01 like 'invalid user production'.
You should exit the socdc. Make sure there is no soc_pipe_scp still running.
Restart the socdc.

If production's key from one host is missing from another host's
/home/production/.ssh/authorized_keys file, add it on the host you
want access to:

Pick up the entry in /home/production/.ssh/id_rsa.pub
and put it in this file on the host that you want to have access to
(make sure that it's all one line):

/home/production/.ssh/authorized_keys
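
One way to do the append in a single step (the target hostname below is just
a placeholder; run this from the host whose key you are adding):

> cat /home/production/.ssh/id_rsa.pub | ssh production@<target-host> 'cat >> ~/.ssh/authorized_keys'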

NOTE: DO NOT do a ssh-keygen or you will have to update every host's
authorized_keys with the new public key you just generated.

If not already active, then do what's shown above for the ssh-agent.


10. There should be a cron job running that will archive to the T50 tapes.
Note the script names are asymmetric for dcs0 and dcs1.

30 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do

00 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do_dcs1

In the beginning of the world, before any sum_start_dc, the T50 should have
a supply of blank tapes in its active slots (1-24). A cleaning tape must
be in slot 25. The imp/exp slots (26-30) must be vacant.
To see the contents of the T50 before startup do:

> mtx -f /dev/t50 status

Whenever sum_start_dc is called, all the tapes are inventoried and added
to the SUMS database if necessary.
When a tape is written full by the tapearc_do cron job, the t50view
display (see 11. and 12. below) 'Imp/Exp' button will increment its
count. Tapes should be exported before the count gets above 5.

11. The t50view program should be running to display/control the
tape operations.

> t50view -i jsocdc

The -i means interactive mode, which will allow you to change tapes.

12. Every 2 days, inspect the t50 display for the button on the top row
called 'Imp/Exp'. If it is non-zero (and yellow), then some full tapes can be
exported from the T50 and new tapes put in for further archiving.

Hit the 'Imp/Exp' button.
Follow all the directions exactly.
The blank L4 tapes are in the tape room in the computer room.

When the tape drive needs cleaning, hit the "Start Cleaning" button on
the t50view gui.

13. There should be a cron job running as user production on both dcs0 and
dcs1 that will set the Offsite_Ack field in the sum_main DB table:
20 0 * * * /home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl

Where:
#/home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl
#
#This reads the .ver files produced by Tim's
#/home/production/tape_verify/scripts/run_remote_tape_verify.pl
#A .ver file looks like:
## Offsite verify offhost:dds/off2ds/HMI_2008.06.11_01:12:27.ver
## Tape 0=success 0=dcs0(aia)
#000684L4 0 1
#000701L4 0 1
##END
#For each tape that has been verified successfully, this program
#sets the Offsite_Ack to 'Y' in the sum_main for all entries
#with Arch_Tape = the given tape id.
#
#The machine names where AIA and HMI processing live
#are found in dcstab.txt which must be on either dcs0 or dcs1
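
In effect, for each successfully verified tape the script does something
like the following (illustrative only; the exact column names/case in the
sum_main table may differ):

> psql jsocdc -c "update sum_main set offsite_ack='Y' where arch_tape='000684L4';"
(one such update per tape id listed in the .ver file)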

14. Other background info is in:

http://hmi.stanford.edu/development/JSOC_Documents/Data_Capture_Documents/DataCapture.html

***************************dcs3*********************************************
NOTE: dcs3 (i.e. offsite datacapture machine shipped to Goddard Nov 2008)

At Goddard the dcs3 host name will be changed. See the following for
how to accommodate this:

/home/production/cvs/JSOC/doc/dcs3_name_change.txt

This cron job must be run to clean out the /dds/soc2pipe/[aia,hmi] directories:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/production/cvs/JSOC/proj/datacapture/scripts/rm_soc2pipe.pl

Also on dcs3 the offsite_ack check and safe tape check are not done in:
/home/production/cvs/JSOC/base/sums/libs/pg/SUMLIB_RmDo.pgc

Also on dcs3, because there is no pipeline backend, no .arc file is ever
made for the DDS.
***************************dcs3*********************************************

Level 0 Backend:
--------------------------

!!Make sure to run Phil's script for watchlev0 in the background on cl1n001:
/home/production/cvs/JSOC/base/sums/scripts/get_dcs_times.csh
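
For example, to leave it running in the background on cl1n001:

> ssh cl1n001
> cd /home/production/cvs/JSOC/base/sums/scripts
> ./get_dcs_times.csh &      (check later with: ps -ef | grep get_dcs_times)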

1. As mentioned above, the datacapture machines automatically copy DDS input
data to the pipeline backend on /dds/socdc living on d01.

2. The lev0 code runs as ingest_lev0 on the cluster machine cl1n001,
which has d01:/dds mounted. cl1n001 can be accessed through j1.

3. All 4 instances of ingest_lev0 for the 4 VCs are controlled by
/home/production/cvs/JSOC/proj/lev0/apps/doingestlev0.pl

If you want to start afresh, kill any ingest_lev0 running (this will later
be automated). Then do:

> cd /home/production/cvs/JSOC/proj/lev0/apps
> doingestlev0.pl (actually a link to start_lev0.pl)

You will see 4 instances started, along with their log file names.
You will be advised that to cleanly stop the lev0 processing, run:

> stop_lev0.pl

It may take a while for all the ingest_lev0 processes to get to a point
where they can stop cleanly.

For now, every hour, the ingest_lev0 processes are automatically restarted.
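
To see which instances are currently up:

> ps -ef | grep ingest_lev0 | grep -v grep      (expect one process per VC, 4 total)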


4. The output is for the series:

hmi.tlmd
hmi.lev0d
aia.tlmd
aia.lev0d

#It is all saved in DRMS and archived.
Only the tlmd is archived. (See below if you want to change the
archiving status of a dataseries.)
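
To look at the current archive setting for these series, a query like the
one below should work (same database as in the 'Change archiving status'
section near the end of this file; the aia series presumably live in the
aia namespace's drms_series table):

> psql -h hmidb jsoc
jsoc=> select seriesname, archive from hmi.drms_series where seriesname in ('hmi.tlmd','hmi.lev0d');
jsoc=> \q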

5. If something in the backend goes down such that you can't run
ingest_lev0, then you may want to start this cron job that will
periodically clean out the /dds/socdc dir of the files that are
coming in from the datacapture systems.

> crontab -l
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXVnxDO9 installed on Mon Jun 16 16:38:46 2008)
# (Cron version V5.0 -- $Id: whattodolev0.txt,v 1.9 2010/12/17 18:34:28 production Exp $)
#0,20,40 * * * * /home/jim/cvs/jsoc/scripts/pipefe_rm
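
Note the pipefe_rm entry is shown commented out. To actually enable it,
edit the crontab and remove the leading '#':

> crontab -e      (then delete the '#' in front of the pipefe_rm line and save)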

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Starting and stopping SUMS on d02:

Login as production on d02
sum_start_d02

(If sums is already running it will ask you if you want to halt it.
You normally say 'y'.)

sum_stop_d02
if you just want to stop sums.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

SUMS archiving:

Currently SUMS is archiving continuously. The script is:

/home/production/cvs/JSOC/base/sums/scripts/tape_do_0.pl (and _1, _2, _3)

To halt it do:

touch /usr/local/logs/tapearc/TAPEARC_ABORT[0,1,2]

Try to keep it running, as there is still much to be archived.
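
The bracket notation presumably means one abort flag file per tape_do
instance, i.e. something like:

> touch /usr/local/logs/tapearc/TAPEARC_ABORT0
> touch /usr/local/logs/tapearc/TAPEARC_ABORT1
> touch /usr/local/logs/tapearc/TAPEARC_ABORT2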

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Change archiving status of a dataseries:

> psql -h hmidb jsoc

jsoc=> update hmi.drms_series set archive=0 where seriesname='hmi.lev0c';
UPDATE 1
jsoc=> \q
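
To turn archiving back on for a series, the same update with archive=1
should work, e.g.:

jsoc=> update hmi.drms_series set archive=1 where seriesname='hmi.lev0c';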

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The modified dcs reboot procedure is in ~kehcheng/dcs.reboot.notes.