PROFESSOR: NOW LET'S TURN TO PHYSICAL DATA STRUCTURES. WE START EARLY NOW. IN FACT WE ARE IN GOOD TIME. AND WE HAVE THREE LECTURES TO WORK ON THAT AND THEN LIKE I PROMISE YOU BEFORE THE MIDTERM WE'LL HAVE A REVIEW. BECAUSE SOME PEOPLE ASK WHETHER I WOULD GIVE TIPS AND SUMMARY AND ANSWER IS YES WE'LL DO THAT, WE'LL DO THAT TWO WEEKS FROM NOW. AND NOW WHAT WE'LL DO IS THAT WE'RE GOING TO TALK ABOUT THINGS THAT FOR SOME OF YOU MAY BE ALREADY WELL KNOWN. OKAY? BUT IT WON'T HURT THAT WE REVIEW THOSE CONCEPTS. AND WE STARTED FROM THE TOP. WE STARTED FROM CONCEPTUAL DESIGN OF THE DATABASE, NAMELY, WE LOOK AT USER REQUIREMENTS, WE CAPTURE THE USER REQUIREMENTS AND WE COME UP WITH THE CONCEPTUAL SCHEMA USING THE ER DIAGRAM OR AS YOU LEARN LAST TIME, IF YOU ARE USING THE OBJECT ORIENTED PARADIGM WE CAN ALSO USE UML,. THEN WE CAN DESIGN THE SCHEMA, SO AS FAR AS THE USER IS CONCERNED, HE'S PRETTY HAPPY. HE DESIGN A SCHEMA AND THEN POPULATE A DATABASE THEN HE CAN START QUERY INTO DATABASE. THAT'S WHAT YOU DID IN THE SECOND EXERCISE. BUT THEN WE'RE STILL UP AT THIS LEVEL, THE DATABASE MANAGEMENT SYSTEM LEVEL. THE INFORMATION HAS TO BE MAPPED, THE STORAGE NEEDED. THESE DAYS MOSTLY HARD DISK THEN OF COURSE WE WILL LOAD IT INTO THE MAIN STORAGE AND THEN WE CAN RETRIEVE INFORMATION THIS WAY. SO WHAT WE'RE GOING TO WORRY ABOUT NOW THIS PART. NAMELY, HOW TO WE PHYSICALLY STORAGE DATA, HOW DO WE MAP IT TO THE STORAGE MEDIA SO THAT, NUMBER ONE, WE CAN EFFICIENTLY MANIPULATE DATA, NUMBER TWO, WE CAN ALSO REORGANIZE THE INFORMATION FAIRLY QUICKLY. WHY DO WE REORGANIZE INFORMATION QUICKLY? WELL, BECAUSE AS DATA GROWS, THE STRUCTURE THAT WE IMPOSE AND THE -- AT THE VERY BEGINNING MAY BECOME INADEQUATE AS WE SOON WILL SEE. IF WE USE ALL KIND OF POINTERS THEN THOSE POINTERS MAY BECOME TOO COMPLICATED. THEN WE HAVE TO DO THE OBVIOUS THING, NAMELY WE UNLOAD THE DATABASE, RELOAD THE DATABASE SO WE COME BACK INTO A MORE EFFICIENT STRUCTURE. SO THIS IS THE PART THAT REALLY AFFECT THE EFFICIENCY OF THE SYSTEM. ALTHOUGH FROM END USER'S PERSPECTIVE HE DOESN'T SEE IT. HE DOESN'T EVEN WORRY ABOUT THAT, EXCEPT YOU ALREADY SEE THAT IN SQL LANGUAGE WHEN YOU DECLARE THE SCHEMA YOU CAN CREATE ALL KINDS OF INDEXES AND THOSE INDEXES ARE TO HELP US TO ACCESS THE INFORMATION. OKAY. SO, STARTING FROM THE APPLICATION PROGRAM POINT OF VIEW WE ACCESS INFORMATION THROUGH THE DATABASE AND THEN WE NEED TO USE ONE MORE ACCESS TO GET INFORMATION FROM PHYSICAL STORAGE DEVICE, DISK. AND IN EARLIER TIME THEN WE ALSO USE A LOT OF TAPES SO THAT WE MAP INFORMATION FROM THE TAPE TO THE STORAGE AND THEN WE CAN READ IT FROM THE MAIN STORAGE. NOW TAPE IS A SEQUENTIAL MEDIUM THEREFORE ON THE TAPE NORMALLY WHAT YOU HAVE TO DO IS LAY OUT ONE RECORD AFTER ANOTHER. AND YOU ALSO GROUP THE RECORDS SO THAT YOU CAN HAVE A BUNCH THE PHYSICAL RECORDS AND EACH PHYSICAL RECORD CONSIST OF A NUMBER OF LOGICAL RECORDS. THE REASON BEING THAT WHEN WE READING THE DATA FROM THE TAPE, OR FROM THE DISK IT'S THE SAME THING, BECAUSE IF YOU HAVE DISKS THAT ROTATE THEN YOU NEED TO HAVE SOME GAP BETWEEN THE PHYSICAL RECORD THEN YOU CAN ORGANIZE THE INFORMATION THIS WAY. NOW, EVEN THOUGH NOW THE TAPE IS ALL THE STORAGE MEDIA YOU STILL FIND THAT YOUR PRIMARY DATA OFTENTIMES IS ORGANIZED THIS WAY. LIKE I MENTIONED EARLIER FOR YOUR TERM PROJECT INSTEAD OF THE -- THIRD PROJECT, INSTEAD OF THE ONE YOU FOUND ON THE WEB RIGHT NOW WHICH IS TO DEVELOP A STUDENT EVALUATION SYSTEM, YOU'LL BE WORKING ON A "TOY" LIBRARY INFORMATION SYSTEM. BUT NOT QUITE A TOY. BECAUSE WE'RE GOING TO START FROM THE WHOLE THING. WE'RE GOING TO START FROM RAW DATA. THE RAW DATA FOR THE LIBRARY INFORMATION SYSTEM IS FROM AN ORGANIZATION CALLED OCLC. DON'T ASK ME WHAT THIS STANDS FOR. I THINK STAND FOR ORGANIZATION OF COMPUTER LIBRARY CENTERS. SOMETHING LIKE THAT. IT'S A NATIONAL CENTER LOCATED IN I THINK DAYTON, OHIO AND THEIR JOB IS TO PROVIDE DATA INFORMATION TO ALL OF THE LIBRARIES IN THIS COUNTRY. AS WELL AS OTHER COUNTRIES. IN FACT, THE WAY THAT OUR LIBRARY SYSTEM ORGANIZED IS OF COURSE LIBRARY OF CONGRESS IS THE TOP, RIGHT, AND ALL OF THE BOOKS AND ALL OF THE INFORMATION THAT WE ARE GOING TO ENCODE, THESE DAYS IT NOT ONLY INCLUDE BOOKS, ALSO INCLUDES CD'S AND OTHER TYPES OF MEDIA. THEY'RE IN A RECORD CALLED MARC RECORD. AGAIN, DON'T ASK ME WHAT MARC STANDS FOR. STANDS FOR SOMETHING. NOW, THIS IS THE RECORD WHERE WE ARE GOING TO RECORD ALL OF THE INFORMATION WE NEED ABOUT A BOOK SUCH AS TITLE, ISBN NUMBER WHICH IS UNIVERSAL ACCESS NUMBER, AUTHOR'S NAME, KEY WORDS, ET CETERA. OKAY? THE SO THE MARC RECORD IS ENCODED IN THIS SEQUENTIAL FORM. IN FACT, LET ME DIE VERT A LITTLE BIT. IT'S NOT JUST APPLIED TO LIBRARY INFORMATION SYSTEM. MANY OF THE SYSTEMS BEFORE HISTORICAL AND OTHER REASONS ALWAYS HAVE THIS KIND OF LINEAR STRUCTURE. SO WHEN YOU GO TO PLACES TO WORK, YOU'LL STILL FIND THE SAME THING. YOU WILL DEAL WITH RECORDS LIKE THIS. OKAY. SO, WHAT TYPICALLY HAPPENS WITH THOSE RECORD THAT YOU'RE GOING TO ACCESS AS YOUR RAW DATA IS THAT WE WILL HAVE SOME STRUCTURE INCLUDING A HEADER AND THEN DATA. SO THAT IF YOU WRITE A PROGRAM YOU NEED TO BE ABLE TO DECODE THE HEADER AND THE DATA INFORMATION. NOW, IF YOU THINK ABOUT IT, THINK ABOUT LIBRARY INFORMATION SYSTEM. THERE ARE MANY PIECES OF INFORMATION WHICH ARE CALLED TAGS. IN FACT, EVEN THIS TERMINOLOGY RINGS A BELL. NOW, ACTUALLY -- HTML ALSO CALL TAGS. IF YOU LOOK AT HTM, POSTSCRIPT, MARC RECORD THEY ARE ALL IN THE SAME GENERAL FAMILY OF DIFFERENT WAYS TO ENCODE STRUCTURE DATA. SO IN THE LIBRARY INFORMATION SYSTEM THESE ARE ALSO CALLED TAGS. SO SOME COULD BE NAME OF THE AUTHOR. SOME COULD BE THE TITLE, SOME COULD BE THE SET OF KEY WORDS. SOME COULD BE ISBN NUMBER AND MANY MANY, MANY. THEY ARE, I DON'T KNOW, MAYBE SEVERAL HUNDRED OF THESE TEXTS. SO ONE QUESTION IS THAT CAN THESE TAGS BE FIXED IN LESS? CAN THEY? WERE NOT? STUDENT: REALLY LONG. STUDENT: SET A MAXIMUM. JUST SET A MAX MUM. PROFESSOR: IF I SET A MAXIMUM, WHAT'S WRONG WITH THAT? STUDENT: YOU COULD LOSE INFORMATION. PROFESSOR: NOT ONLY LOSE INFORMATION I WASTE SPACE. IF I SET A FIXED SPACE I COULD BE -- ONE TITLE COULD BE VERY LONG, ANOTHER COULD BE SHORT. AND ALSO THE KEY WORDS COULD BE A LOT OF KEY WORDS, COULD BE A FEW KEY WORDS, NOT TO MENTION THESE DAYS STORE ABSTRACT ONLINE, IN OTHER WORDS , MAYBE IT'S ALREADY HAPPENING, COULD BE ONE TAG TOYS STORE ABSTRACT. SO THIS WILL BE IN GENERAL VARIABLE LENGTH RECORDS. OR VARIABLE LENGTH FIELD. WHICH MEANS THAT THE STRATEGY YOU JUST MENTIONED, NAMELY TO RESERVE A FIXED SPACE IS NOT APPLICABLE. MOREOVER, NOT ALL TAGS ARE PRESENT. WHICH MEANS IF YOU HAVE 500 TAGS AT ANY GIVEN MOMENT FOR A PARTICULAR BOOK YOU MAY BE USING, I DON'T KNOW, 30, SOMETHING. THE REST ARE FOR OTHER PURPOSES BECAUSE, FOR LIBRARY OF CONGRESS, WHICH ACTUALLY IS THE KEY OWNER OF THE MARC RECORD. THEY HAVE A THING ABOUT EVERYTHING. NO MATTER WHAT KIND OF TITLE YOU HAVE, THEY CAN ENCODE IT. BUT FOR YOUR PARTICULAR NEED FOR A PARTICULAR LIBRARY, THIS MAY NOT BET CASE. SO HERE I MAY HAVE TEXT 01, 10, 20, 501. THAT'S IT. AND ALSO VARIABLE. LENGTHS ARE ALSO VARIABLE. SO MY QUESTION IS THE FOLLOWING -- IMAGINE YOU WRITE A PROGRAM IN WHATEVER LANGUAGE, C, JAVA AND YOU TRY TO READ IN THIS RECORD. HOW DO YOU KNOW? HOW DO YOU KNOW WHERE TO FIND TAGS? STUDENT: THEY COULD PUT A TOKEN BETWEEN EACH TAG. LIKE A TOKEN CHARACTER. PROFESSOR: TOKEN CHARACTER. OKAY. ONE SUGGESTION TOYS USE SOME KIND OF MARKER. OKAY. SO WHAT YOU'RE SAYING IS THAT I USE SOME MARKER TO DISTINGUISH THE FIELD. ANYTHING WRONG WITH THIS APPROACH? YEAH? STUDENT: I DON'T KNOW. I WAS GOING TO SUGGEST ANOTHER WAY OF DOING IT. PROFESSOR: GO AHEAD. STUDENT: INSTEAD OF STORING DATA IN THAT RECORD YOU MAY WANT TO STORE POINTER TO THE DATA. PROFESSOR: GOOD THINKING. THE REASON THAT THIS MAY NOT BE A GOOD APPROACH IS THAT IF I DO THIS AND I WANT TO FIND FIELD 20, I HAVE TO DO WHAT? STUDENT: PUT IN 18 COMMAS. PROFESSOR: I HAVE TO SCAN THE WHOLE DAMN THING OF THE THAT'S NOT TOO COOL. IN OTHER WORDS EVERY TIME DO I THIS I HAVE TO DO SEQUENTIAL PROCESSING. I RATHER HAVE RANDOM ACCESS. GO TO THE FIELD. WHAT HE'S SUGGESTING IS WHAT? THE CONCEPT OF POINTERS. OR I CAN SAY AN INDEX. SO WHAT IS AN INDEX? INDEX IS A WAY TO REFER TO DATA. OKAY . SO, I HAVE FOUR FIELDS. SO WHAT I DO IS THAT I WILL HAVE POINTERS POINTING AT THE BEGINNING BYTE OF EACH FIELD. THAT'S THE INDEX STRUCTURE. SO THE INDEX IN A WAY IS A TABLE. YOU SAY THIS SEQUENTIAL MEDIA SO I HAVE TO STORE THAT THIS WAY BUT YOU CAN IMAGINE WHAT THIS LOOK LIKE SHOULD BE SOMETHING LIKE THAT, RIGHT? IT'S A TABLE WHICH POINT AT THE DIFFERENT TAGS OR DIFFERENT FIELDS. THAT'S THE PRIMARY PURPOSE OF INDEX, NAMELY TO REFER TO THINGS. OKAY . IN FACT, FOR THE LIBRARY OF CONGRESS MARC RECORD STRUCTURE WHICH I'M STILL DEBATING, I PROBABLY WILL ADD THIS AS AN EXERCISE, NO AS PART OF YOUR EXERCISE FOUR BECAUSE SOME OF YOU LIKE HANDS-ON AND THIS IS A PROGRAMMING EXERCISE. MAYBE I USE THAT IN LIEU -- I DON'T KNOW I'M STILL DEBATING. SO THESE ARE POINTING AT THE BEGINNING BYTE BUT ALSO NEED TO KNOW THE BASE, THE BASE ADDRESS. IN OTHER WORDS IT'S LIKE I NEED A BASE ADDRESS WHICH IS THE BEGINNING BYTE OF THE FIRST FIELD. AND THEN THESE ARE LIKE DISPLACEMENT, SO THESE ARE RELATIVE ADDRESSES. THESE ARE CALCULATED RELATIVE TO THE FIRST BASE ADDRESS. OKAY. SO IN THE ACTUAL HEADER PART AND THERE IS A SPECIFIC LOCATION. THAT I STORE THE BASE ADDRESS. SO LET ME CLEAN THIS UP SO -- IT'S A LITTLE MESSY NOW. SO BASICALLY WHAT I HAVE IS THAT I HAVE A BASE ADDRESS. I HAVE DISPLACEMENT. AND IN FACT THERE ARE A WHOLE OTHER BUNCH OF INFORMATION LIKE THE FIRST TWO BYTES IN THIS MARC RECORD IS THE TOTAL LENGTH OF THE RECORD. WHY DO I NEED THIS? STUDENT: SO YOU DON'T JUMP PAST THE END OF IT. PROFESSOR: YEAH. IF I WANT TO SKIP THIS I CAN ADD THIS AND JUMP TO THE NEXT RECORD. IN THE TAPE BASE SYSTEM. SO ACTUALLY THESE ORIGIN EIGHT FROM THE DAYS WHEN WE HAVE JUST SEQUENTIAL MEDIA BUT THINGS HAVE GRADUALLY CHANGED. NOW OF COURSE YOU CAN GET MARC RECORD YOU CAN IT FROM OCLC. I'LL GIVE YOU SAMPLES YOU CAN PROGRAM THAT REALLY STORE ON A DISK. BUT YOU CAN SEE HOW IT'S STRUCTURED. IN FACT, THIS IS VERY COMMON. VERY COMMON I WOULD SAY FOR THE LINEAR RECORD STRUCTURE AND WHAT I MAY DECIDE TO DO, IF YOU ENJOY THAT IS THAT I'M SURE YOU WILL, MAYBE GIVE YOU TWO OR THREE OF THIS MARC RECORD AND YOU WRITE A PROGRAM TO FIGURE OUT AND EXTRACT, MAYBE JUST THREE FIELDS, AUTHOR'S NAME, TITLE, STUFF LIKE THAT. AND I'M SURE IT'S VERY INTERESTING BECAUSE THEN YOU GET JOY FROM ALL THIS GRAY SCREEN WHICH YOU DO NOT KNOW ANYTHING ABOUT, YOU CAN ACTUALLY FIND OUT MEANINGFUL INFORMATION FROM THAT. OKAY. SO, TO SUMMARIZE, BASICALLY IN THE RECORD STRUCTURE HAS A HEADER AND INDEX AND THEN THE ACTUAL RECORD. AND THEN THE INPUT ROUTINE MUST DECODE IT, FIGURE OUT WHERE THE TAGS ARE AND THEN READ IT INTO THE STORAGE AND SET UP SOMETHING. AGAIN, JUST GO ONE STEP FARTHER. THESE DAYS WE TALK ABOUT XML AND WHAT ARE XML? THESE ARE ALSO A WHOLE BUNCH OF TAGS BUT THESE TAGS ARE MEANINGFUL TAGS NOW. I'M NOT SAYING THAT HTML TAGS ARE NOT MEANINGFUL BUT THEY'RE PREASSIGNED. YOU DON'T HAVE ABILITY TO CHANGE THAT. IT'S GIVEN TO YOU. BUT IN XML IF YOU HAVE SOME KIND OF APPLICATION YOU CAN SET UP YOUR OWN TAGS. OF COURSE YOU CANNOT DO THIS ARBITRARILY, RIGHT? IF YOU ARE ONLY PERSON USING IN THIS TAG STILL NOBODY CARES, RIGHT? BUT IF A WHOLE COMMUNITY LIKE A LIBRARY COMMUNITY SHARING INFORMATION, THEN YOU CAN DEFINE YOUR OWN TAG AND YOU KNOW HOW TO INTERPRET DATA. OKAY. SO WHAT I'M EXPLAINING IS THAT WE HAVE SEQUENTIAL MEDIA AND WE HAVE THE PRIMARY RECORD WHICH THEN WE CAN LOAD INTO OUR DATABASE, CONSTRUCT INDEX AND WAYS TO ACCESS INFORMATION. AND I JUST MENTIONED OUR FIRST INDEX, IT'S A KIND OF TABLE BUT A TABLE NEED NOT BE IN TABLE FORM. THIS YOU SHOULD IMAGINE IS ALSO A TABLE. ALSO LINEAR. AND SIMPLEST INDENT IS ONE LEVEL INDENT WHERE YOU HAVE SOME KEY OR TAG NAME, WHATEVER, AND THEN YOU POINT TO THE EITHER LOGICAL ADDRESS OR PHYSICAL ADDRESS. OKAY. NOW, WHAT I ASSUME YOU WILL BE ABLE TO DO IS TO CONSTRUCT MULTIPLE LEVEL INDEXES LIKE THIS. IN OTHER WORDS, IF I GIVE YOU A COLLECTION OF RECORDS SUCH AS THE MARC RECORD, YOU WILL BE ABLE TO LOAD THAT INTO THE DATABASE AND BE ABLE TO CONSTRUCT INDEXES. AND THESE ARE CALLED LOADERS WHICH LOAD INFORMATION INTO THE DATABASE. NOW LET'S JUST TAKE A LOOK AT THIS EXAMPLE. WHAT YOU WILL TRY TO DO OR YOUR LOADING PROGRAM TRY TO DO IS THAT YOU WILL ASSIGN THE ADDRESS INTO THE VARIOUS CELLS IN INDEX STRUCTURE WE JUST SAW ONE LEVEL INDEX BUT IT CAN ALSO BE MULTIPLE LEVELS. THE NUMBER OF LEVELS DEPENDS ON HOW LARGE IS YOUR DATA SET AND IF YOU CAN JUST DO VERY QUICK CALCULATION, IN THIS TREE STRUCTURE IF EVERY LEVEL HAVE THREE POINTERS, AND THEN TWO-LEVEL TREE I CAN POINT TO NINE, THREE LEVEL, 27, NO SON, SO FORTH. BUT EACH LEVEL YOU HAVE 200 POINTERS, RIGHT, PRETTY SOON YOU CAN POINT AT MILLIONS OF RECORDS USING SEVERAL LEVELS. WHAT THE LOADER NEEDS TO DO THEN IS TO FIGURE OUT A WAY TO AUTOMATICALLY CONSTRUCT THIS STRUCTURE. NOW, LET'S TAKE A LOOK AT THIS EXAMPLE. FIRST OF ALL, WE ASSUME THAT RECORDS ARE SORTED AND SORTED BY THE PRIMARY KEY. OKAY? SO IN THIS CASE THE PRIMARY KEY IS MAYBE THIS FIELD. AND THEN YOU LOAD IT INTO THE VARIOUS BLOCKS OR PAGES, WHATEVER YOU WANT TO CALL THAT. NOW THE BLOCKING FACTOR IN THIS CASE IS THREE SO I PUT THE FIRST THREE RECORD IN THE FIRST BLOCK, NEXT THREE IN THE NEXT BLOCK, NEXT THREE IN THE NEXT BLOCK. OKAY. NOW, I WANT TO CONSTRUCT THE HIGHER LEVEL INDEX. THE WAY I DO THAT IS THAT I WILL TAKE THE HIGHEST KEY IN EACH BLOCK. NOW SINCE I'M LEADING THIS SEQUENTIALLY THIS ALWAYS WILL BET LAST GUY. I'LL TAKE THIS GUY, THIS GUY AND THIS GUY AND PUT IT ON TO THE TOP LEVEL AND THEN HAVE A POINTER TO POINT AT THIS THREE BLOCKS. YOU CAN ALL DO THAT, RIGHT? NO PROBLEM. SO IT'S SIMPLE PROGRAM TO WRITE. OKAY. ONCE YOU LOAD IT, SUPPOSE YOU WANT TO SEARCH? LET'S SAY YOU WANT TO SEARCH A RECORD 00124. 00124. HOW DO WE SEARCH FOR A RECORD? STUDENT: YOU LOOK AT YOUR PRIMARY INDEX, YOUR TOP LEVEL. PROFESSOR: I HAVE TO ALWAYS START FROM HERE AND DO WHAT? STUDENT: YOU COMPARE THAT NUMBER WITH THE NUMBERS IN THAT LIST AND FIGURE OUT WHICH INTERVAL IT GOES IN. PROFESSOR: I COMPARE SEARCH KEY 00124 WITH ENTRIES IN PRIMARY INDEX AND WHAT CRITERIA DO I USE? STUDENT: LESS THAN. PROFESSOR: LESS THAN. OKAY. IF THE SEARCH KEY IS LESS THAN THE STORED KEY, THIS MEANS I SHOULD BRANCH OUT TO THAT PAGE. STUDENT: LESS THAN OR EQUAL TO. PROFESSOR: YEAH BECAUSE THE LAST ONE COULD BE STORED THERE AS WELL. SO IN THIS CASE 00124 HAPPENS TO BE LESS THAN 01235 SO I GO HERE AND AGAIN, I DO A SEQUENTIAL SCAN AND FOUND IT. THE REASON I CAN DO A SEQUENCE SCAN IS THESE ARE PRETTY SMALL. EVEN IF YOU HAVE 200 ENTRIES , THAT'S NOTHING I SCAN IT QUICKLY. THAT'S NOT A PROBLEM. EACH TIME DID A SEQUENTIAL SCAN OR FIND A MATCH OR IF IT'S IN THE A MATCH I CAN SEARCH FOR LOWER LEVEL. IF I HAVE A NONEXISTING KEY, LET'S SAY 00123, IF IN 00123 I STILL LESS THAN 01235, SO I GO HERE THEN WHAT HAPPENS IS THAT IN THE SEQUENTIAL SEARCH I FIND NO MATCH. OKAY? NOW IF THERE ARE MORE LEVELS, I WILL PROBABLY BRANCH OUT TO THE LEVEL. BUT THIS TURN OUT TO THE BE OF THE LAST LEVEL AND I HAVE TO DECLARE FAILURE. OKAY. SO YOU SHOULD BE ABLE TO DO THE LOAD MEANING CONSTRUCTING THE TREE. BE ABLE TO DO THE SEARCH. ONE HAS TO BE -- YOU WRITE A LITTLE PROGRAM TO DO THAT FOR ME. OKAY. I CAN DO THE LOAD, I CAN DO THE SEARCH. AND HOW ABOUT THE INSERT? SO IF I WANT TO INSERT RECORD WITH KEY 00123 -- YES IN. STUDENT: I WAS GOING ASK SOMETHING ABOUT THE SEARCH. ARE YOU SAYING OPERATOR IS IT LESS THAN OR LESS THAN OR EQUAL TO. PROFESSOR: LESS THAN, EQUAL TO. YOU SEE I'M TAKING THE HIGH KEY, PUT IT AT HIGHER LEVEL. STUDENT: YOU ALWAYS TAKE THE HIGH KEY? PROFESSOR: WELL, THIS IS WHAT WE USUALLY CALL THE DENSE TREE. NOT DANCING DANCE, DENSE. DO I THAT. BUT THE STRUCTURE WE ARE REALLY INTERESTED IN IS THE B TREE. THAT IS DIFFERENT. I'M JUST USING THIS TO MOTIVATE LATER DISCUSSION. OKAY. SO, WE TALK ABOUT SEARCH, NOW WE'LL TALK ABOUT INSERT. SO, IF I RECORD WITH KEY 00123 HOW DO I INSERT? I SHOULD FIRST START WITH A SEARCH, RIGHT? INSERT ALWAYS START WITH A SEARCH BECAUSE IF YOU CONFINE IT YOU DON'T HAVE TO INSERT IT. IT'S ACTUALLY AN ARROW. SO I START FROM HERE, TRYING TO FIND 00123. SO IT'S LESS THAN 01235 AND BRANCH TO THIS HERE, 00123 IS LESS THAN 00124. NOWHERE TO BRANCH, THIS MEANS THERE IS NO SUCH RECORD, I CAN INSERT IT. BUT WHERE? YES? STUDENT: WELL, DO YOU WANT TO KEEP IT STRUCTURED AS THREE -- PROFESSOR: I HAVE TO KEEP IT STRUCTURED. THIS IS THREE BY THREE KIND OF TREE. STUDENT: YOU -- CAN YOU CREATE A NEW ONE? PROFESSOR: THAT'S THE PROBLEM. STUDENT: START A NEW BRANCH FROM 00124. PROFESSOR: OKAY. AS YOU CAN SEE WE ARE FACING A DILEMMA. THIS IS WHAT WE CALL AWE FULL TREE. IT'S TOTALLY FULL. NO ROOM FOR ADDING A NEW RECORD. SO, EITHER I UNLOAD EVERYTHING, RELOAD CREATE A NEW STRUCTURE THAT WILL BE A THREE-LEVEL TREE. OR WHICH IS A SMARTER IDEA, IS THAT LET'S JUST BE A LITTLE BIT FLEXIBLE. I KNOW WE HAVE A TREE LIKE THIS AND I RUN OUT OF ROOM. IN PARTICULAR I RUN OUT OF ROOM HERE. WHY NOT ADD SOMETHING HERE? YOU SEE WHAT I'M DOING? IN OTHER WORDS, I'M USING WHAT WE CALL A TRAILER, A TRAILER RECORD. SO NOW ONCE I START DOING THAT, THERE IS NO LIMIT WHAT I CAN DO IF I INSERT ANOTHER ONE, I CAN STILL ADD IT TO HERE AND SO ON, SO FORTH. BUT WHEN YOU'RE SEARCHING -- YEAH? STUDENT: THAT'S WHAT I WAS ABOUT TO ASK. PROFESSOR: LET'S SAY YOU'RE SEARCHING FOR 00122 WHAT HAPPENED? FIRST IS THE SAME I GO HERE, GO HERE, PREVIOUSLY I SAY IF I DON'T FIND IT HERE, I STOP. NOW I DON'T STOP, RIGHT? I WILL HAVE A POINTER FIELD HERE AND IF THAT POINTER EXISTS -- IF THE POINTER IS NO, MEANING NO TRAILER RECORD, I STOP. BUT IF IT IS THERE IS SOMETHING, I GO HERE AND FOLLOW THE CHAIN, POINTER CHAIN TO SEARCH. IF IT'S NOT THERE THEN IT'S REALLY NOT THERE. YOU SEE WHAT I'M DOING? IN OTHER WORDS EVEN IF IT'S A FULL TREE STRUCTURE JUST BY BEING A LITTLE BIT FLEXIBLE AND ADD THOSE TRAILER RECORDS WE CAN HANDLE THAT BUT I PAY A PRICE. WHICH IS? STUDENT: THE LONGER YOUR TRAILER GETS THE MORE TIME IT'S GOING TO TAKE. PROFESSOR: RIGHT. BECAUSE THE TREE WILL GET HIGHLY IMBALANCED PRETTY SOON. IN YOUR ALGORITHM CLASS YOU LEARN ABOUT THOSE STRUCTURES. THIS CONSTRUCT A BALANCED TREE IS OPTIMUM. GUARANTEED TO GIVE US LONG SEARCH TIME. ONCE HAVE YOU THAT IT'S NOT THAT BALANCED. OKAY? SO WHICH LEADS US TO IN THE -- THE OTHER POSSIBILITY, THAT IS PREVIOUSLY I START WITH A FULL TREE AND RUN INTO TROUBLE. BECAUSE I'M TOO GREEDY. I PUT RECORDS IN EVERY NOTE THEN I SAY TO MYSELF WHY NOT LIKE THIS? LEAVE SOME SPACE HERE. SO IF IT IS FOUR BY FOUR TREE I WILL ONLY FILL THREE RECORDS PER NOTE LEAVING SOME SPACE THERE. NOW IF YOU HAVE INSERT, CHANCES ARE IT CAN GO INTO THIS AREA RATHER THAN INTO THE TRAILER. BUT OF COURSE AFTER A WHILE, THIS DEGENERATES. YOU STILL WILL GO INTO THE TRAILER IF YOU HAVE TOO MANY RECORDS. SO AT THAT TIME THIS IS WHAT WE CALL THE REORGANIZATION. ESSENTIALLY WHAT IT AMOUNTS TO IN REORGANIZATION IS THAT -- ONE IS UNLOAD DATABASE TO AN EXTERNAL MEDIUM. THIS MEANS TAPE OR SOMETHING. AND THEN YOU RELOAD THE DATABASE. WHEN YOU RELOAD OF COURSE NOW YOU CREATE MORE BALANCED STRUCTURE. YOU REBUILD IN EVERYTHING. IT IS THE DATABASE ADMINISTRATOR'S JOB TO DECIDE WHEN HE WANTS TO REORGANIZE DATABASE SO THAT SYSTEM WILL MAINTAIN NEAR OPTIMUM PERFORMANCE. IF YOU DON'T REORGANIZE YOU CAN SEE WHAT HAPPENED. THE STRUCTURE GETS MORE AND MORE OFF BALANCED. BUT YOU DON'T REORGANIZE TOO FREQUENTLY BECAUSE YOU'LL TAKE AWAY PRECIOUS USER TIME WHEN YOU RE-ORG. THAT'S A TRADE OFF THAT EVER DATABASE ADMINISTRATOR WILL PLAY IN HIS OR HER MIND. OKAY. STUDENT: WHILE DATABASE IS BEING REORGANIZED IS IT POSSIBLE TO CREATE A CLONE THAT USERS CAN ACCESS. PROFESSOR: THAT'S POSSIBLE. WHAT WE CAN DO IS THE FOLLOWING. THIS ESPECIALLY PERTAIN WHEN WE DISTRIBUTE DATABASE. I'M DEVIATING BUT I THINK THIS IS GOOD TOPIC. WELL, YOU KNOW, IN WEB WORLD WE CALL MIRROR SIGN. YOU CREATE A CLONE WHICH YOU DON'T EVEN UPDATE IT'S LIKE A SNAPSHOT YOU LET PEOPLE US THAT WHILE YOUR REAL DATABASE YOU'RE UPDATING, REORGANIZE. THAT HAPPEN ALL THE TIME. THIS IS USEFUL WHEN YOU'RE NOT TOO CONCERNED WITH WHAT WE CALL DATA FRESHNESS. NOW, FOR EXAMPLE IF I'M TALKING ABOUT STOCK, STOCK MARKET I CANNOT DID THAT BECAUSE IN STOCK THEY WANT UP TO THE MINUTE, UP TO THE SECOND, RIGHT? THE WAY MY WIFE PLAY, I DON'T WANT TO THINK ABOUT IT. I'M NOT A GAMBLER. BUT SHE MAKES MORE MONEY THAT I DO. WHAT CAN I SAY? BY PLAYING STOCK. ANYWAY, SO GOING BACK TO THIS, I WANT FRESH DATA THEN I CANNOT HAVE A SNAPSHOT THAT IS TOO OLD. I HAVE TO KEEP IT UP TO DATE. BUT IN MANY CASES YOU'RE NOT THAT CONCERNED. YOU CAN LIVE WITH IT. SO GOING BACK TO THIS KIND OF STRUCTURE, WHAT I MENTIONED IS THAT WE CAN CREATE A FULL TREE OR WE CAN LEAVE SOME SLACK IN THERE. ONCE YOU LEAVE SLACK, THAT'S ACTUALLY SPACE WASTED. BUT YOU GAIN? TERMS EFFICIENCY. OKAY. SO IF YOU GO TO THE NOTES YOU WILL FIND MANY VARIATIONS LIKE THE SO-CALLED INDEX SEQUENTIAL DATA STRUCTURE WHERE YOU HAVE PRIMARY INDEX POINT AT BASICALLY PHYSICAL RECORD CONSISTING OF A FEW LOGICAL RECORDS. OKAY. AND THERE ARE A LOT OF DETAILS WHICH I LEAVE YOU TO STUDY BUT LET ME COME TO THE REAL POINT TO OUR DISCUSSION AND THIS IS WHERE YOU CAN SEE THE LOGICAL PICTURE. NAMELY, IF YOU HAVE AGAIN I USE TAPE BUT THIS FILE COULD BE EASILY ON THE DISK, OKAY? SO IF YOU HAVE SOME PRIMARY INPUT DATA TO CONSTRUCT A DATABASE NORMALLY WE HAVE TO FIRST SORT IT BY PRIMARY KEY, THIS WILL HELP US TO BE ABLE TO CREATE THE PRIMARY INDEX. THEN WE HAVE A LOADER WHICH WILL LOAD THE INPUT DATA AND CONSTRUCT THE INDEX AND THE PRIMARY DATA FILE AND THEN THIS CAN ALSO LEAD US TO DIFFERENT WAYS OF CONSTRUCTING THE SO CALLED SECONDARY INDEX. FOR EVERY TABLE, TALK ABOUT RELATIONAL DATABASE, YOU CAN ALSO HAVE A PRIMARY KEY, RIGHT? AND FOR THE PRIMARY KEY YOU CAN CONSTRUCT A PRIMARY INDEX. NORMALLY WE STOP THERE. OKAY? YOU JUST USE THE PRIMARY KEY MOST OFTEN TO ACCESS DATA. BUT PER CHANCE YOU NEED MORE ACCESS THEN WE CAN CONSTRUCT SECONDARY INDEXES. THE SECONDARY INDEXES ARE INDEXES THAT ARE BUILT ANYTHING THAT'S OTHER THAN THE PRIMARY KEY. THIS IS THE PICTURE THAT YOU SHOULD LOOK AT CAREFULLY. BECAUSE THIS SUMMARIZE MANY OF THE DIFFERENT TYPE OF INDEXES. AGAIN, MANY OF THESE ARE JUST OLD HAT TO YOU SO MAIN THING IS THAT YOU NEED TO KNOW THE CONCEPT OF DENSE AND NONDENSE. DENSE MEANS THAT YOU HAVE END KEY -- HAVE YOU A KEY FOR EVERYTHING. SO YOU CAN POINT AT INDIVIDUAL RECORDS. NON DENSE MEANS THAT YOU ONLY CAN POINT AT A CLUSTER OR A BLOCK SUCH AS AGE GROUP. IF I TALK ABOUT AGE GROUP I'M NOT TALKING ABOUT INDIVIDUAL EMPLOYEE RECORD. I'M TALKING ABOUT A GROUP OF RECORDS. THAT'S A NONDENSE INDEX. NORMALLY YOU CAN CONSTRUCT ONE PRIMARY INDEX AND MULTIPLE SECONDARY INDEXES. BUT THE SECONDARY INDEXES ARE THERE ONLY IF YOU REALLY FIND IT USEFUL. IT'S NOT A CRIME IF YOU ARE DATABASE ADMINISTRATOR YOU DON'T CREATE A SECOND INDEX BECAUSE MOST LIKELY THE USER WILL USE A PRIMARY INDEX. BUT IF YOU FIND OUT FOR THIS PARTICULAR APPLICATION LIKE WHAT I JUST MENTIONED MAYBE FOR SOME STATISTIC REGION, THEY ALWAYS TALK AGE GROUP, THEN IT PACE TO CONSTRUCT AN ACCURATE -- EXTRA SECONDARY INDEX THAT WAY. AND THEN THEY ARE ALSO PARTIAL INDEX MEANING THAT YOU DON'T HAVE INDEX FOR EVERYTHING, ONLY FOR SOMETHING AND COMPOSITE INDEX WHICH MEANS YOU COMBINE TWO INDEXES LIKE IN THE CASE OF AGE GROUP, AS WELL AS GENDER TO CREATE INDEX STRUCTURE. WHAT ORACLE SUPPORT IS FOR YOU TO CREATE THE PRIMARY INDEX. EASILY DEFINE THAT AND THEN STOP WORRYING ABOUT IT. OKAY. NOW GOING BACK TO THIS PICTURE OF SECONDARY INDEXES, I KNOW I'M MOVING QUITE FAST BUT AGAIN, THESE ARE PROBABLY NEW MATERIALS SO WE HAVE A NAME THAT IS PRIMARY INDEX. YOU CAN SEE THAT IF THE NAME IS UNIQUE IT POINTS TO EACH INDIVIDUAL LOGICAL RECORD. WE HAVE MARITAL STATUS. THIS IS A SECONDARY INDEX, RIGHT? YOU SEE RECORD 1, 3, 5. HAVE THE SAME INDEX VALUE, M. 2, 4, 6 INDEX VALUE S, MARRIED AND SINGLE. IF I ONLY HAVE THIS, MAYBE FOR SOME REASON I'M ONLY INTERESTED IN MARRIED, THAT'S CALLED A PARTIAL INDEX. AND THIS IS A NONDENSE INDEX. NUMBER OF CHILDREN, IS ALSO NONDENSE INDEX. AGE, INCOME, ET CETERA. OKAY. SO YOU MAINTAIN THE PRIMARY AND THEN YOU CONSTRUCT THE SECONDARY INDEXES. ARE THESE ALL VERY FAMILIAR TO YOU? SORT OF, RIGHT? OKAY. FINALLY, THE LIST. AM I GOING TOO FAST? OKAY? IN INDEX YOU ARE TAKING THE POINTERS, YOU COLLECT THE POINTERS, PUT INTO THE TABLES, THE TABLES THAT YOU CALL INDEX. IN LIST ALSO CALLED MULTILIST WHAT YOU DO IS YOU DISTRIBUTE THE POINTERS. YOU PUT THEM BACK INTO THE RECORD. SO INSTEAD OF HAVING SEPARATE INDEXES, WHAT YOU DO IS HAVE EMBEDDED POINTERS IN THE PRIMARY RECORD STRUCTURE. NOW, THE GOOD THING IS THAT THIS WAY IS LESS CONFUSING, YOU HAVE VERY UNIFORM STRUCTURE SO THAT YOU CAN ACCESS THE RECORD BY FOLLOWING DIFFERENT LISTS. LET'S DO EXAMPLE. SUPPOSE I WANT TO ACCESS ALL MARRIED RECORD OR THE MARRIED RECORD. I GO TO THE LIST HEADER TABLE. IN THE LIST HEADER TABLE BASICALLY I WILL JUST STORE THE HEAD OF THE LIST, SO IN CASE OF M, I HAVE THREE RECORD AND THE FIRST POINTER IS HERE. SO TO ACCESS THOSE RECORDS, WHAT I DO IS TO FOLLOW THE POINTER CHAIN FROM HERE, I FOLLOW POINTER TO THE FIRST RECORD, THEN I HAVE ANOTHER POINTER TO THIS ONE, POINTER THIS ONE, NOW NO RECORD, THIS INDICATES END OF THIS CHAIN. SAME THING IF I WANT TO ACCESS THE SINGLE THE RECORDS WITH THE SINGLE STATUS. NOW, IF I WANT TO ACCESS MARRIED WITH CHILDREN WHAT DO I DO? MARRIED IS ONE, THREE, FIVE. WITH CHILDREN, 0, 1, 2, 3. ALL THESE CHAINED QUALIFY. SO I SHOULD GO THIS CHAIN, THREE AND FIVE, THIS CHAIN WHICH IS TWO, THIS CHAIN WHICH IS ONE AND FULL. STUDENT: SHOULDN'T YOU COMPARE IT WITH ZERO? PROFESSOR: HMM? STUDENT: COMPARE WITH ZERO CHILDREN. PROFESSOR: THIS IS ZERO. THIS IS WITH CHILDREN. I SAID MARRIED WITH CHILDREN. STUDENT: RIGHT. BUT YOU ONLY HAVE TO LOOK AT ONE LINK LIST INSTEAD OF THREE OR MORE. PROFESSOR: RIGHT. SO THIS IS -- IF I FOLLOW ALL THIS CHAIN AND FIRST I TAKE UNION AND TAKE INTERSECTION, THIS IS RATHER STUPID BECAUSE I'M FOLLOWING THREE POINTER CHAINS. WHAT I SHOULD DO IS WHAT? STUDENT: GO THROUGH THE MARRIED LIST AND THEN -- PROFESSOR: AND THEN. STUDENT: EVERY ONE THAT IS IN ZERO CHILDREN -- PROFESSOR: LOOK ACROSS. IN OTHER WORDS I USE ONLY ONE CHAIN. IN FACT, THE MARRIED CHAIN, OKAY? I GO HERE. SINCE I'M AT THIS RECORD, WHY DON'T I JUST READ THE RECORD AND CHECK WHETHER IT HAS CHILDREN OR NOT? GO HERE, CHECK HERE, GO HERE, CHECK HERE. OKAY? STUDENT: HOW WOULD YOU PHYSICALLY DO THAT? CHECK TO SEE IF THEY HAVE CHILDREN OR NOT? OR ARE THERE FOUR DIFFERENT -- IN THAT NO-CH IS THERE -- PROFESSOR: THESE ARE RECORDS, INDIVIDUAL RECORDS AND THIS FIELD IS NUMBER OF CHILDREN, ZERO, ONE, TWO, THREE, WHATEVER. STUDENT: OKAY. IS THERE DATA STORED IN THERE ALONG WITH THAT POINTER. PROFESSOR: YES, DEFINITELY. THAT'S THE POINT. IN OTHER WORDS, FOR THE MULTI LIST STRUCTURE THE BEAUTY IS THAT YOU HAVE ALL OF THE INFORMATION TOGETHER WITH THE POINTERS. SO IF I DRAW INDIVIDUAL ROW FOR YOU, SO WHAT I'M STORING THERE MAY BE BROWN AND THEN THE POINTER HERE AND NUMBER OF CHILDREN, POINTER HERE, WHAT OTHER FIELD? AGE, POINTER HERE, ET CETERA. YOU SEE? SO THE POINTER AND THE DATA VALUE ATTRIBUTE VALUE, ARE MIXED. IN FACT YOU HAVE VALUE, POINTER, VALUE, POINTER IN PAIRS. SO WE CAN ACCESS THIS STRUCTURE BY FOLLOWING THE POINTER AND THEN IF NEED BE I CAN LOOK ACROSS AND FIND THE VALUE OF ANOTHER FIELD. YEAH? STUDENT: WHY ISN'T IT PHYSICALLY SEPARATED AS PART OF THE HEADER OF THE RECORD? WHY IS IT MIXED IN WITH THE DATA PORTION OF THE RECORD? PROFESSOR: BECAUSE OF THE SPEED OF ACCESS. NOW THIS RECORD, I STORED IT -- LET'S SAY I'VE STORED IN ON DISK. IT WILL BE ON SAME DISK OF THE -- SAME TRACK OF THE DISK. WHEN I ACCESS I READ IT IN AT THE SAME TIME. I GET IT FOR FREE. IF I READ IT IT'S ALREADY THERE IN MY BUFFER. WHY NOT USE IT? STUDENT: ALSO STILL BE THAT SEPARATE HEADER TO THE POINTERS TO EACH FIELD, RIGHT? PROFESSOR: CORRECT. RIGHT. BOTH. OKAY. NOW, THE REASON I MENTION THIS IS THAT THESE ARE TWO DIFFERENT PHILOSOPHY YOU MAY SAY, EITHER YOU TAKE THE POINTERS OUT, PUT IT INTO TABLES, YOU CALL THEM INDEX AND YOU CAN SEE THE VERSION OF THAT OR WHY NOT PUT THE POINTERS BACK INTO THE PRIMARY STRUCTURE? I HOPE YOU SEE THAT THIS IS MORE ELEGANT THIS WAY. IT'S MORE UNIFORM. YOU DON'T CREATE WHOLE BUNCH OF DIFFERENT INDEX TABLES. DO YOU HAVE A QUESTION? STUDENT: NO. PROFESSOR: OKAY. YEAH? STUDENT: INDEXES INCREASING OVERHEAD, TOO, RIGHT BUT THIS DOESN'T? PROFESSOR: IT DOES. STUDENT: NOT AS MUCH? PROFESSOR: NO, IT'S THE SAME. BECAUSE FOR EVERY POINTER, EITHER STORE IT IN INDEX OR I STORE IT IN PRIMARY STRUCTURE. IT STILL OCCUPY, LET'S SAY I USE TWO BYTE. SAME TWO BYTE. NO SAVING. IN FACT, THE MAIN THING IS THAT THIS MAKES PROGRAMMING MORE ELEGANT, EASIER. WHERE AS IN THE INDEXING THING YOU TAKE THEM OUT. YOU HAVE TO DEAL WITH ALL DIFFERENT CASES. THEY ARE DIFFERENT PHILOSOPHIES DEALING WITH THINGS. BUT YOU HATE THIS, RIGHT WITH YOUR POINTERS IT LOOKS UGLY, YOU ALWAYS HAVE TO DEAL WITH HOW TO UPDATE POINTERS SOMETHING LIKE THAT. THAT'S WHY RELATIONAL DATABASE. JUST RELATIONAL DATABASE HAVE POINTERS? AT LEAST FROM USER'S MIND, NO POINTER. OKAY. IT'S A POINTERLESS DATABASE. NOT TO BE CONFUSED WITH A POINTLESS DATABASE. THERE IS PINT BUT THERE IS POINTERLESS. NO POINTER. THEN YOU USE OPERATORS. COMBINE THEM OR JOIN OPERATORS THAT'S THE NICE THING ABOUT IT. BUT STILL FROM THE IMPLEMENTATION POINT OF VIEW, I'M LYING TO YOU . IN RELATIONAL DATABASE THERE ARE POINTERS, ONLY YOU DON'T SEE THEM. NOW YOU SEE IT, NO NOW YOU DON'T. TODAY WE TALK MULTILIST AND THESE INDEX STRUCTURES. LIKE I SAY, I PROBABLY WILL UPDATE EXERCISE TO GIVE YOU OPTIONS OR YOU CAN DO THIS DECODING OF THE MARC RECORD. IT'S FUN AND USUALLY YOU GET EXPERIENCE OF SOME REAL RECORD DECODING. AND WHAT WE'LL DO NEXT WEEK IS THAT WE WILL QUICKLY TALK ABOUT THE B TREE AND EXTENDIBLE HASHING. NOW YOU NEED TO START WITH YOUR NEXT EXERCISE FAIRLY QUICKLY. OKAY? SO THIS WAY YOU CAN ASK ME QUESTIONS SENSIBLY COME REVIEW TIME BECAUSE THIS WILL DEFINITELY BE ON THE TEST, I MEAN B TREE AND HASHING. THAT'S IT. 24 25