Issues discussed in meeting on Monday, Sept. 7

Decisions to be made and Individual Issues
'''1. Should the files be split or saved in their entirety? Pros for splitting:'''

a. Large files can be downloaded faster by reading their parts simultaneously.
b. Streaming audio/video would be possible.
   i. No need to wait for the whole file to download.
   ii. If files are encrypted, again, the whole file needn't be downloaded - only a part of it.
   iii. It is possible to seek ahead in an audio/video stream - this would only require reading the particular block of the file.

   Cons for splitting:

a. Additional overhead of splitting and saving into different nodes.
b. Managing the read process - simultaneously downloaded parts must be combined in order.

   Compromise solution:

Establish a threshold for the file size. Only those files larger than the threshold must be split.
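A minimal sketch of the compromise, assuming fixed-size blocks. The names `split_file`, `join_blocks`, `threshold` and `block_size` are made up for illustration; the notes do not fix a block size or a threshold value.

```python
def split_file(data: bytes, threshold: int, block_size: int) -> list[bytes]:
    """Return the file whole if it is within the threshold, else as blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if len(data) <= threshold:
        return [data]  # small file: saved in its entirety
    # Large file: cut into consecutive blocks of at most block_size bytes.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def join_blocks(blocks: list[bytes]) -> bytes:
    """Recombine blocks in order, as the read process would have to do."""
    return b"".join(blocks)
```

The blocks can be fetched in any order or in parallel, but `join_blocks` must receive them in their original order.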

2. Create a new file system or use FUSE?

   Pros for new file system:

a. Would work for both Windows and Linux. FUSE is only available for Linux. '''You would have to write separate file systems for both OSes.'''
b. Since only the driver for the file system can read that partition, access is not possible through external means. '''Not really.'''
   i. This ensures that files on the partition cannot be tampered with by the user. '''Not really - root can read it.'''
   ii. Encryption of files will not be necessary. '''Why??'''

   Cons for new file system:

a. Working on FUSE is much simpler.
b. Would require creation of drivers for both Linux and Windows.
c. A partitioning mechanism needs to be incorporated. This also implies that the product cannot be used on the fly (it will have a complex installation).

3. Kademlia network management or in-house Mirror List ideology?

  Pros for Kademlia:

a. Complexities in management of replicas/mirrors, including their interactions, will be avoided.
b. A hash is used to search for the server of a file on the network - the owner need not know who's storing the file.
   i. This maintains complete anonymity.
   ii. Owner needn't maintain a list of mirrors.

   Cons for Kademlia:

a. Implementation is not clear.
b. Mirror List approach does not require any particular network architecture (node arrangement) such as circular, tree-based etc.
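To make the Kademlia option slightly more concrete: Kademlia measures closeness between a key and a node ID by XOR, and a file is looked up on the nodes whose IDs are closest to the file's hash. The sketch below only shows that metric; `node_id` and `closest_nodes` are illustrative names, and using 3 nodes per file is an assumption borrowed from the mirror discussion, not part of Kademlia.

```python
import hashlib


def node_id(name: str) -> int:
    """Derive a 160-bit ID from a name (SHA-1, as in the original Kademlia design)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


def closest_nodes(key: int, node_ids: list[int], count: int = 3) -> list[int]:
    """Return the `count` node IDs with the smallest XOR distance to `key`."""
    return sorted(node_ids, key=lambda n: n ^ key)[:count]
```

For example, `closest_nodes(node_id("f"), [node_id(n) for n in "ABCDE"])` picks the three nodes that would be asked for file `f`, without the owner keeping any list of mirrors.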

'''4. Communication protocol to use in case of own network management. Protocol here means not just the packet and header level details but also the mechanism to be used, such as Thrift, RPC, RMI, HTTP, etc. Even Google has one such library.'''

Issues
Issue 1: Mirror Interaction
A, B, C are the first three nodes in the owner O's mirror list. O will store a file f in A, B and C. Each mirror only interacts with its neighbours. For example, A hellos B, B hellos both A and C, and C hellos B.

If A goes down:
i. B finds out when hellos stop coming.
ii. B tells C to make a new mirror.
iii. C finds it is the last mirror. Therefore, it makes a new mirror in D.
iv. C interacts with both B and D, B interacts with C only, and D interacts with C only.

If B goes down:
i. Both A and C realize B is dead.
ii. A tries to find the third mirror, and so begins to interact with C.
iii. A tells C to make a new mirror. C already knows it is supposed to make a new mirror.
iv. C makes D a mirror.
v. A interacts with C, C with both A and D, and D with C only.

If C goes down:
i. B realizes this and asks D to be a mirror.
ii. A interacts with B, B with both A and D, and D with B only.

In general, if a node goes down:
i. The neighbours realize this.
ii. The left neighbour (if any), according to the order of the mirror list, will start helloing the right neighbour.
iii. The left neighbour tells the right neighbour to make a new mirror. The right neighbour also knows it has to make a new mirror.
iv. If the right neighbour is the right-most mirror, it makes a new mirror.
v. Else, it tells the next mirror to its right to make a new mirror (and so on).

If the dead node comes back up, it deletes its record of the file (and is no longer a mirror). A replacement mirror must already have been made, so its own copy is now redundant.
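The repair rule can be sketched as a function over the chain of active mirrors. This is only a model of the end result, not of the message exchange; `handle_failure` and `hello_pairs` are hypothetical names, and picking the first unused node of the mirror list as the replacement is an assumption consistent with the A, B, C, D examples.

```python
def handle_failure(active: list[str], mirror_list: list[str], dead: str) -> list[str]:
    """Return the chain of active mirrors after `dead` stops sending hellos."""
    if dead not in active:
        return list(active)
    # The survivors close the gap: the left neighbour starts helloing the right one.
    survivors = [m for m in active if m != dead]
    # The right-most survivor creates the replacement mirror. The dead node is
    # skipped because it was part of `active`.
    for candidate in mirror_list:
        if candidate not in active:
            return survivors + [candidate]
    return survivors  # no spare node available in the mirror list


def hello_pairs(active: list[str]) -> list[tuple[str, str]]:
    """Pairs of neighbouring mirrors that exchange hellos."""
    return list(zip(active, active[1:]))
```

With mirror list A B C D E and active mirrors A, B, C, a failure of B gives the chain A, C, D, and `hello_pairs` returns `[("A", "C"), ("C", "D")]`, matching the walkthrough.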

Hello: a hello may contain either of these:
i. A self-identifier to inform others of the sender. This simply tells the receiver that the sender is alive.
ii. A self-identifier as well as a hash of the file being mirrored. This not only informs the receiver of the existence of the

Issue 2: Check against file corruption
Each mirror maintains a metadata table listing all the external files it is hosting. The table contains the following information:

filename | hash | mirror list | pointer to other mirror 1 | pointer to other mirror 2
filename | hash | mirror list | pointer to other mirror 1 | pointer to other mirror 2
...

Periodically, the mirror will check if its metadata hash matches the hash computed from the file. If not, either the file or the metadata is corrupt.
a. It sends the metadata hash to the other current mirrors of the file.
b. If the hash matches, then the metadata is correct but the file is corrupt.
   i. Delete the file, remove the entry from the metadata.
   ii. Eventually, the other mirrors will make a new mirror.
c. If the hash doesn't match, then send the file's computed hash to the other current mirrors.
   i. If the hash matches, then the metadata file is corrupt.
      -> Should we fix the hash of the file? Maybe the entire metadata file has been tampered with.
      -> Should we not rather delete all entries and all hosted files? Eventually, the companion mirrors of each file will make new mirrors. (See Issue 11).
   ii. Else, both hashes are wrong. WTF?
      -> Either remove yourself as a mirror, as above. (See Issue 11).
      -> Or maybe the others are corrupt?
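The decision steps a-c can be sketched as a pure function over three hashes. This is a simplification: it assumes the other mirrors agree on a single hash (`peer_hash`), and the names `Verdict`, `file_hash` and `diagnose` are made up here. SHA-256 is also an assumption, since the notes do not name a hash function.

```python
import hashlib
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    FILE_CORRUPT = "file corrupt"          # step b
    METADATA_CORRUPT = "metadata corrupt"  # step c.i
    UNDECIDED = "undecided"                # step c.ii


def file_hash(data: bytes) -> str:
    """Hash computed from the stored file contents."""
    return hashlib.sha256(data).hexdigest()


def diagnose(metadata_hash: str, computed_hash: str, peer_hash: str) -> Verdict:
    """Compare the local metadata hash and computed hash with the peers' hash."""
    if metadata_hash == computed_hash:
        return Verdict.OK
    if metadata_hash == peer_hash:
        return Verdict.FILE_CORRUPT
    if computed_hash == peer_hash:
        return Verdict.METADATA_CORRUPT
    return Verdict.UNDECIDED
```

What to do on `METADATA_CORRUPT` and `UNDECIDED` is exactly the open question in step c, so the sketch only classifies the case and leaves the reaction to the caller.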

'''lol, don't bother too much about this. Use an in-process DB like SQLite or Berkeley DB, which will almost guarantee that the hash is correct. If they don't match, assume that the file is corrupt.'''

Issue 3: Hello with or without hash?
Should a regular hello between mirrors contain the file's hash or not? As noted in the previous issue, it is only necessary to share hashes when a mirror finds its internal hash entries mismatching.

Issue 4: Updating files
When a file is edited, it is the owner's responsibility to make the changes in each of the mirrors. The sequence of interactions is as follows:
a. Owner sends updated file to A.
b. Owner asks A to update the hash in its metadata file.
c. Owner asks A to compute and send the file's hash as well as the hash contained in the metadata file.
d. If the two hashes are correct, owner gives an acknowledgement.
e. Else, if the hash in the metadata is incorrect, owner asks A to replace it again with the file's computed hash.
f. Else, if the hash of the file is itself incorrect, the full procedure of sending the file is repeated.
g. Process is repeated for mirrors B and C.

Problem: when A is updated along with the hash in its metadata table, how will you prevent it from thinking its file is corrupted?
Solution: the self-maintenance process of checking if a file has been corrupted (described in Issue 2) needs to be paused during an update. Or, a flag can be set to skip checking the file being currently updated.

Issue 5: Saving on the hellos with hashes
Note: This issue is only applicable if all hellos include hashes.
To prevent too many hellos and hash transfers between mirrors, we can assign 3 mirrors to each owner (just 3, not a whole mirror list). Each owner picks up only the first 3 mirrors from its list each time a file is to be saved (unlike Issue 6, where it dynamically decides the mirrors from the list for each file). The following configuration will develop eventually:

A     B     C
f1    f1    f1
f2    f2    f2
f3    f3    f3
f4    f4    f4
...   ...   ...

(where f1, f2, f3 and f4 belong to the same owner O).
For mirror interaction (hellos) and hash transfer, instead of sending an individual packet for each of the 4 files, a common hash of all the 4 files' hashes will be shared. This means only one packet per owner, instead of one packet per file.
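One way the common hash could be computed, as a sketch: hash the per-file hashes in a fixed order so that every mirror arrives at the same value. `combined_hash` is a hypothetical name, and SHA-256 and ordering by file name are assumptions.

```python
import hashlib


def combined_hash(file_hashes: dict[str, str]) -> str:
    """Hash of all per-file hashes of one owner, taken in file-name order."""
    digest = hashlib.sha256()
    for name in sorted(file_hashes):
        digest.update(file_hashes[name].encode())
        digest.update(b"\n")  # separator so adjacent hashes cannot run together
    return digest.hexdigest()
```

A mismatch in the combined hash only says that some file of that owner differs; the mirrors would still have to exchange the individual hashes to find out which one.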

Issue 6: Dynamic Mirror List
To choose candidate mirrors for storing a file, the owner always picks up the first 3 mirrors from the mirror list. The order is changed after each file is saved.
a. A shift factor k is established.
   Mirror list: A B C D E F G H I J K L ...
b. The first 3 mirrors are chosen for the first file. Depending on the file size (and/or other parameters), the first mirror is shifted by k*size positions.
   Mirror list: B C D A E F G H I J K L ... (for example, assuming a shift of 3)
c. Now, B is the first mirror, and A could be anywhere in the list. Again, for the next file, the current first 3 mirrors are chosen, and the first mirror B is shifted as previously.
d. This procedure goes on for all files.
e. Periodically, the mirror list is itself updated, adding fresh mirrors found in the system and removing mirrors whose connections are lost.

The following configuration may develop eventually:
A   B    C    D    E   f1   f1   f1                f2   f2    f2      f3         f3          f3         f4          f4   f4               f5    f5   f5   ...   ...   ...    ...   ...
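Steps a-c can be sketched as follows. The function name `pick_mirrors_and_rotate` is made up, and `shift` stands for the already computed value of k*size. Capping the shift at the end of the list is an assumption, since the notes do not say what happens when the shift exceeds the list length.

```python
def pick_mirrors_and_rotate(mirror_list: list[str], shift: int,
                            count: int = 3) -> tuple[list[str], list[str]]:
    """Return (mirrors chosen for this file, mirror list for the next file)."""
    if not mirror_list:
        return [], []
    chosen = mirror_list[:count]
    first, rest = mirror_list[0], mirror_list[1:]
    # Move the first mirror `shift` positions down the list (capped at the end).
    position = min(max(shift, 0), len(rest))
    rotated = rest[:position] + [first] + rest[position:]
    return chosen, rotated
```

With the list A B C D E F G H I J K L and a shift of 3, the chosen mirrors are A, B, C and the next list starts B C D A E, as in step b.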

Issue 7: Encryption
If encryption is used, every owner maintains a unique key, which can be used to both encrypt and decrypt files.

Issue 8: Parameters to be tracked by/stored on the client per file
File name, size, date modified, date created, file type, file permissions (as per the client's PC), list of mirrors, pointers to the current 3 mirrors, file/directory flag, hash.
a. Using Kademlia, no mirror list need be maintained.
b. Hash is stored for handy reference. This will:
   i. Avoid recomputation of the hash for each read request.
   ii. Introduce the extra overhead of periodic hash checks to see whether the file has been changed by accident or inadvertent corruption (same as Issue 2).
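The per-file parameters of Issue 8 could be held in a record like the one below. All field names and types are illustrative (timestamps as seconds, permissions as a mode integer), and SHA-256 is an assumed hash function.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    name: str
    size: int
    modified: float        # date modified, seconds since the epoch
    created: float         # date created, seconds since the epoch
    file_type: str
    permissions: int       # as per the client's PC, e.g. 0o644
    is_directory: bool     # file/directory flag
    content_hash: str      # stored to avoid recomputation on each read
    mirror_list: list[str] = field(default_factory=list)      # unused with Kademlia
    current_mirrors: list[str] = field(default_factory=list)  # the current 3 mirrors

    def matches(self, data: bytes) -> bool:
        """Periodic check: does the stored hash still match the contents?"""
        return hashlib.sha256(data).hexdigest() == self.content_hash
```

With Kademlia (point a), `mirror_list` would simply stay empty.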

Issue 9: Local Writing
When the owner is writing to a file, a copy of it is saved on the local machine as well as on the mirrors. This is done because the user may make frequent updates to a file in a short span of time (for example, writing a text file), and saving each version on the network may take a long time. Hence, the file will be saved locally immediately, but only updated on the network every 1-3 minutes.
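A sketch of the local-write idea: writes land in a local store at once, and only the latest version of each changed file is pushed when the sync interval has elapsed. `WriteBuffer` and its methods are hypothetical, the 120 second default is just a value inside the 1-3 minute range, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Callable


class WriteBuffer:
    def __init__(self, sync_interval: float = 120.0) -> None:
        self.sync_interval = sync_interval
        self.local: dict[str, bytes] = {}  # local copies, always up to date
        self.dirty: set[str] = set()       # files changed since the last sync
        self.last_sync = 0.0

    def write(self, name: str, data: bytes) -> None:
        """Save locally immediately and mark the file for the next sync."""
        self.local[name] = data
        self.dirty.add(name)

    def sync(self, now: float, upload: Callable[[str, bytes], None]) -> list[str]:
        """Upload changed files if the interval has elapsed; return their names."""
        if not self.dirty or now - self.last_sync < self.sync_interval:
            return []
        pushed = sorted(self.dirty)
        for name in pushed:
            upload(name, self.local[name])  # only the latest version is sent
        self.dirty.clear()
        self.last_sync = now
        return pushed
```

Several quick edits to the same text file therefore cost one network update per interval instead of one per save.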

Issue 10: Disk space restrictions
The client's disk space allotted to the cloud defines the largest file it can save on it. Because:
a. When reading a file, it must be downloaded from a mirror.
b. Therefore, the space on the local disk must be large enough to hold it.
c. Conversely, the local space available/allocated provides an upper limit on the size of a file that can be downloaded.
d. Also, from Issue 9, a file to be written must fit the local space available.
Ordinarily, the local disk will contain lots of other nodes' files (maybe the owner's too). Therefore, for reading/writing a large file that is within the above defined limit, some files must be removed from the disk. Issue 11 handles this.

Issue 11: Deleting a file from a node
a. Only if at least one more mirror is available for the file.
b. Send a special notice to the neighbour mirrors.
   i. The neighbour mirrors remove the node or shift it to the end of the list.
   ii. The neighbour mirrors establish a new mirror.
   iii. The neighbour mirrors send an ACK to the node.
   iv. The node safely removes the file and associated metadata entries from the disk.
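The deletion rule of Issue 11 can be sketched as a check plus the resulting mirror set. The message exchange (notice, ACK) is not modelled. `release_file` is a hypothetical name, and proceeding with fewer mirrors when no spare candidate exists is an assumption the notes do not cover.

```python
def release_file(node: str, mirrors: list[str], candidates: list[str]) -> list[str]:
    """Return the mirrors of the file after `node` gives up its copy."""
    if node not in mirrors:
        raise ValueError(f"{node} is not a mirror of this file")
    others = [m for m in mirrors if m != node]
    if not others:
        # Rule a: never delete when no other mirror holds the file.
        raise RuntimeError("refusing to delete the last copy")
    # Rule b.ii: the neighbours establish a replacement from unused candidates.
    replacement = next((c for c in candidates if c not in mirrors), None)
    if replacement is not None:
        others.append(replacement)
    return others
```

Only after this returns (standing in for the ACK of step iii) would the node remove the file and its metadata entries from disk.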