/home/production/cvs/JSOC/doc/whattodolev0.txt   25Nov2008

------------------------------------------------
WARNING!! Some of this is outdated. 3Jun2010
Please see more recent what*.txt files, e.g.
whattodo_start_stop_lev1_0_sums.txt
------------------------------------------------

------------------------------------------------------
Running Datacapture & Pipeline Backend lev0 Processing
------------------------------------------------------


NOTE: For now, this is all done from the xim workstation (Jim's office).

Datacapture:
--------------------------

NOTE: IMPORTANT: Please keep in mind that each datacapture machine has its
own independent /home/production.

FORMERLY: 1. The Datacapture system for aia/hmi is by convention dcs0/dcs1
respectively. If the spare dcs2 is to be put in place, it is renamed dcs0
or dcs1, and the original machine is renamed dcs2.

1. The datacapture machine serving AIA or HMI is determined by
the entries in:

/home/production/cvs/JSOC/proj/datacapture/scripts/dcstab.txt

This is edited or listed by the program:

/home/production/cvs/JSOC/proj/datacapture/scripts> dcstab.pl -h
Display or change the datacapture system assignment file.
Usage: dcstab [-h][-l][-e]
-h = print this help message
-l = list the current file contents
-e = edit with vi the current file contents

For dcs3 the dcstab.txt would look like:
AIA=dcs3
HMI=dcs3
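
For example, to list the current assignments without editing (the
AIA=dcs0/HMI=dcs1 pairing below is just the conventional setup; your
output will reflect whatever dcstab.txt currently says):

> cd /home/production/cvs/JSOC/proj/datacapture/scripts
> ./dcstab.pl -l
AIA=dcs0
HMI=dcs1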


1a. The spare dcs2 normally serves as a backup destination for the Postgres
instances running on dcs0 and dcs1. You should see this Postgres cron job
on dcs0 and dcs1, respectively:

0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs0_to_dcs2.pl
0,20,40 * * * * /var/lib/pgsql/rsync_pg_dcs1_to_dcs2.pl

For this to work, the following must be done on dcs0, dcs1 and dcs2, as
user postgres, after any reboot:

> ssh-agent | head -2 > /var/lib/pgsql/ssh-agent.env
> chmod 600 /var/lib/pgsql/ssh-agent.env
> source /var/lib/pgsql/ssh-agent.env
> ssh-add
(The password is written on my whiteboard (same as production's))
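
To verify that the agent picked up the key, list the loaded identities
(ssh-add -l is standard OpenSSH; the key path shown is an assumption
based on the postgres home directory):

> ssh-add -l
2048 aa:bb:...:ff /var/lib/pgsql/.ssh/id_rsa (RSA)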

2. Log in as user production via j0. (The password is on Jim's whiteboard.)

3. Postgres must be running; it is started automatically at boot:

> ps -ef |grep pg
postgres  4631     1  0 Mar11 ?        00:06:21 /usr/bin/postmaster -D /var/lib/pgsql/data
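
If the postmaster is not there, it can usually be started with the init
script (assuming the stock Red Hat-style service script; run as root):

> /etc/init.d/postgresql start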

4. The root of the datacapture tree is /home/production/cvs/JSOC.
Production runs as user id 388.

5. The sum_svc is normally running:

> ps -ef |grep sum_svc
388      26958     1  0 Jun09 pts/0    00:00:54 sum_svc jsocdc

Note the SUMS database is jsocdc. This is a separate DB on each dcs.

6. To start/restart the sum_svc and related programs (e.g. tape_svc) do:

> sum_start_dc
sum_start at 2008.06.16_13:32:23
** NOTE: "soc_pipe_scp jsocdc" still running
Do you want me to do a sum_stop followed by a sum_start for you (y or n):

You would normally answer 'y' here.

7. To run the datacapture gui that will display the data, mark it for archive,
optionally extract lev0 and send it on to the pipeline backend, do this:

> cd /home/production/cvs/JSOC/proj/datacapture/scripts
> ./socdc

All you would normally do is hit "Start Instances for HMI" (or AIA),
depending on which datacapture machine you are on.

8. To optionally extract lev0 do this:

> touch /usr/local/logs/soc/LEV0FILEON

To stop lev0:

> /bin/rm /usr/local/logs/soc/LEV0FILEON

The last 100 images for each VC are kept in /tmp/jim.

NOTE: If you turn lev0 on, the processing becomes sensitive to the incoming
data, and you may see failures like the one below, in which case you have
to restart socdc:

ingest_tlm: /home/production/cvs/EGSE/src/libhmicomp.d/decompress.c:1385: decompress_undotransform: Assertion `N>=(6) && N<=(16)' failed.
kill: no process ID specified
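
Since the lev0 switch is just a flag file, a quick check of its current
state (a one-liner suggestion, not one of the delivered scripts) is:

> test -e /usr/local/logs/soc/LEV0FILEON && echo "lev0 ON" || echo "lev0 OFF"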

9. The datacapture machines automatically copy DDS input data to the
pipeline backend on /dds/socdc living on d01. This is done by the program:

> ps -ef |grep soc_pipe_scp
388      21529 21479  0 Jun09 pts/0    00:00:13 soc_pipe_scp /dds/soc2pipe/hmi /dds/socdc/hmi d01i 30

This requires that an ssh-agent be running. If you reboot a dcs machine do:

> ssh-agent | head -2 > /var/tmp/ssh-agent.env
> chmod 600 /var/tmp/ssh-agent.env
> source /var/tmp/ssh-agent.env
> ssh-add     (or for sonar: ssh-add /home/production/.ssh/id_rsa)
(The password is written on my whiteboard)

NOTE: cron jobs use this /var/tmp/ssh-agent.env file.

If you want another window to use the ssh-agent that is already running do:
> source /var/tmp/ssh-agent.env

NOTE: on any one machine there should be just one ssh-agent running for
user production.
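
A quick way to confirm that exactly one agent is running (pgrep is
standard procps; this check is a suggestion, not part of the delivered
scripts):

> pgrep -u production -x ssh-agent

One PID in the output is correct; more than one means stray agents
should be killed before re-sourcing ssh-agent.env.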


If you see that a dcs has asked for a password, the ssh-agent has failed.
You can probably find an error message on d01 like 'invalid user production'.
You should exit the socdc. Make sure there is no soc_pipe_scp still running.
Restart the socdc.

If you find that there is a host for production that is not in the
/home/production/.ssh/authorized_keys file, then do this on the host that
you want to add:

Pick up the entry in /home/production/.ssh/id_rsa.pub
and put it in this file on the host that you want to have access to
(make sure that it's all one line):

/home/production/.ssh/authorized_keys

NOTE: DO NOT do a ssh-keygen, or you will have to update every host's
authorized_keys with the new public key you just generated.
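
For example, to append the key from a dcs machine (the hostname
'somehost' is just a placeholder for the host you want to reach):

> scp /home/production/.ssh/id_rsa.pub somehost:/tmp/dcs_key.pub
then, on somehost:
> cat /tmp/dcs_key.pub >> /home/production/.ssh/authorized_keys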

If not already active, then do what's shown above for the ssh-agent.


10. There should be a cron job running that will archive to the T50 tapes.
Note that the script names are asymmetric for dcs0 and dcs1:

30 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do

00 0-23 * * * /home/production/cvs/jsoc/scripts/tapearc_do_dcs1

In the beginning of the world, before any sum_start_dc, the T50 should have
a supply of blank tapes in its active slots (1-24). A cleaning tape must
be in slot 25. The imp/exp slots (26-30) must be vacant.
To see the contents of the T50 before startup do:

> mtx -f /dev/t50 status
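
The output looks roughly like this (slot counts and volume tags are
illustrative):

  Storage Changer /dev/t50:1 Drives, 30 Slots ( 5 Import/Export )
Data Transfer Element 0:Empty
      Storage Element 1:Full :VolumeTag=000684L4
      Storage Element 2:Empty
      ...
      Storage Element 25:Full :VolumeTag=CLN001L1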

Whenever sum_start_dc is called, all the tapes are inventoried and added
to the SUMS database if necessary.
When a tape is written full by the tapearc_do cron job, the 'Imp/Exp'
button on the t50view display (see 11. and 12. below) will increment its
count. Tapes should be exported before the count gets above 5.

11. The t50view program should be running to display/control the
tape operations:

> t50view -i jsocdc

The -i means interactive mode, which will allow you to change tapes.

12. Every 2 days, inspect the t50view display for the button on the top row
called 'Imp/Exp'. If it is non-zero (and yellow), then some full tapes can
be exported from the T50 and new tapes put in for further archiving.

Hit the 'Imp/Exp' button.
Follow all the directions exactly.
The blank L4 tapes are in the tape room in the computer room.

When the tape drive needs cleaning, hit the "Start Cleaning" button on
the t50view gui.

13. There should be a cron job running as user production on both dcs0 and
dcs1 that will set the Offsite_Ack field in the sum_main DB table:

20 0 * * * /home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl

Where:
#/home/production/tape_verify/scripts/set_sum_main_offsite_ack.pl
#
#This reads the .ver files produced by Tim's
#/home/production/tape_verify/scripts/run_remote_tape_verify.pl
#A .ver file looks like:
## Offsite verify offhost:dds/off2ds/HMI_2008.06.11_01:12:27.ver
## Tape 0=success 0=dcs0(aia)
#000684L4 0 1
#000701L4 0 1
##END
#For each tape that has been verified successfully, this program
#sets the Offsite_Ack to 'Y' in the sum_main for all entries
#with Arch_Tape = the given tape id.
#
#The machine names where AIA and HMI processing live
#are found in dcstab.txt, which must be on either dcs0 or dcs1.
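
In SQL terms, for each tape that verified cleanly the script does the
equivalent of the following (illustrative; column name case may differ
in the jsocdc database):

jsocdc=> update sum_main set offsite_ack='Y' where arch_tape='000684L4';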

14. Other background info is in:

http://hmi.stanford.edu/development/JSOC_Documents/Data_Capture_Documents/DataCapture.html

***************************dcs3*********************************************
NOTE: dcs3 (i.e. the offsite datacapture machine shipped to Goddard Nov 2008)

At Goddard the dcs3 host name will be changed. See the following for
how to accommodate this:

/home/production/cvs/JSOC/doc/dcs3_name_change.txt

This cron job must be run to clean out /dds/soc2pipe/[aia,hmi]:
0,5,10,15,20,25,30,35,40,45,50,55 * * * *
/home/production/cvs/JSOC/proj/datacapture/scripts/rm_soc2pipe.pl

Also on dcs3, the offsite_ack check and safe tape check are not done in:
/home/production/cvs/JSOC/base/sums/libs/pg/SUMLIB_RmDo.pgc

Also on dcs3, because there is no pipeline backend, no .arc file is
ever made for the DDS.
***************************dcs3*********************************************

Level 0 Backend:
--------------------------

!!Make sure to run Phil's watchlev0 script in the background on cl1n001:
/home/production/cvs/JSOC/base/sums/scripts/get_dcs_times.csh

1. As mentioned above, the datacapture machines automatically copy DDS input
data to the pipeline backend on /dds/socdc living on d01.

2. The lev0 code runs as ingest_lev0 on the cluster machine cl1n001,
which has d01:/dds mounted. cl1n001 can be accessed through j1.

3. All 4 instances of ingest_lev0 for the 4 VCs are controlled by
/home/production/cvs/JSOC/proj/lev0/apps/doingestlev0.pl

If you want to start afresh, kill any running ingest_lev0 (this will later
be automated). Then do:

> cd /home/production/cvs/JSOC/proj/lev0/apps
> doingestlev0.pl     (actually a link to start_lev0.pl)

You will see 4 instances started, and the log file names will be shown.
You will be advised that to cleanly stop the lev0 processing, run:

> stop_lev0.pl

It may take a while for all the ingest_lev0 processes to get to a point
where they can stop cleanly.

For now, every hour, the ingest_lev0 processes are automatically restarted.
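
The usual ps idiom confirms that all 4 instances (one per VC) are up:

> ps -ef |grep ingest_lev0

You should see 4 ingest_lev0 processes; fewer than 4 means one or more
VCs are not being processed.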


4. The output goes to the series:

hmi.tlmd
hmi.lev0d
aia.tlmd
aia.lev0d

Only the tlmd is archived. (See below if you want to change the
archiving status of a dataseries.)

5. If something in the backend goes down such that you can't run
ingest_lev0, then you may want to start this cron job, which will
periodically clean out the /dds/socdc dir of the files that are
coming in from the datacapture systems:

> crontab -l
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXVnxDO9 installed on Mon Jun 16 16:38:46 2008)
# (Cron version V5.0)
#0,20,40 * * * * /home/jim/cvs/jsoc/scripts/pipefe_rm

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Starting and stopping SUMS on d02:

Log in as production on d02, then run:
sum_start_d02

(If SUMS is already running, it will ask you if you want to halt it.
You normally say 'y'.)

Run:
sum_stop_d02
if you just want to stop SUMS.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

SUMS archiving:

Currently SUMS is archiving continuously. The script is:

/home/production/cvs/JSOC/base/sums/scripts/tape_do_0.pl (and _1, _2, _3)

To halt it do:

touch /usr/local/logs/tapearc/TAPEARC_ABORT[0,1,2]

Try to keep it running, as there is still much to be archived.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Change the archiving status of a dataseries:

> psql -h hmidb jsoc

jsoc=> update hmi.drms_series set archive=0 where seriesname='hmi.lev0c';
UPDATE 1
jsoc=> \q
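
To check a series' current setting before or after such an update, the
same table can be queried read-only:

jsoc=> select seriesname, archive from hmi.drms_series where seriesname='hmi.lev0c';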

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The modified dcs reboot procedure is in ~kehcheng/dcs.reboot.notes.