Under review
Under review
Rejected:
Reviewer 1: The manuscript effectively addresses the challenges of a proteomic data storage system with bioinformatics utilizing a blockchain. In general, I find the manuscript clearly written and structured. There is not much technical criticism. Some minor comments/queries:
1. In the abstract, the result of this work must be described briefly with data. The result of this work is not clear.
2. In the Introduction, provide the motivation and objectives under a separate subtitle.
3. In the Related Work section, it is advisable to state the research gap and mention how you addressed it.
4. Some good work could be discussed in the introduction/literature, such as: a) Enhancing security in electronic health records using an adaptive feature-centric polynomial data security model with blockchain integration, b) AFCP Data Security Model for EHR Data Using Blockchain, c) Multiple Precision Arithmetic with Blowfish Crypto Method for Medical Data Storage Using Blockchain Technology, and d) Leveraging blockchain for transparency in agriculture supply chain management using IoT and machine learning.
5. The meaning of the variables is not clear, and readers may be confused. To aid readers' understanding, the authors should add a notation list.
6. The paper contains a few grammar mistakes, which should be corrected in the final version.
7. Provide a table of the system configurations and parameters used.
8. Consider including a subsection titled 'Limitations' in the Results and Discussion section.
Reviewer 2: This promises to be a useful piece of work, and I can identify the following contributions:
- The authors have tackled the problem of reconciling and managing proteomic data (i.e., biological data pertaining to proteins expressed by organisms and cells) created by researchers and stored in diverse, heterogeneous, and centralized databases.
Technically, the systems maintaining the data suffer from a lack of scalability, privacy controls, interoperability of data formats, and real-time data synchronization. This, together with the intrinsic diversity of the data sources, hampers data integration and makes research collaboration and information-seeking very challenging.
- As a solution, the authors have proposed a design for managing and updating such proteomic data using decentralized blockchain systems, which they call ProtChain. They have implemented a proof-of-concept of ProtChain built on microservices running in Docker containers. Further, they have presented a comprehensive performance analysis of their implementation, highlighting what works and what needs improvement.
- The authors' attempt at technically analyzing the shortcomings of existing solutions and attempting to overcome them is practically very useful for biological researchers working on proteomic data. Also, the attempt to use blockchain technology for such purposes is very interesting for blockchain researchers too.
- The related work coverage is quite comprehensive and, in my opinion, the best part of the paper.
But overall, the paper has several fundamental issues that make it not publishable in its current form:
- The claim, made in several places, that the solution enables seamless data integration and interoperability is not proven by the solution design or the implementation. The paper dwells a lot on the implementation of a record-keeping system on blockchain and its associated performance, but does not talk about data formats, standardization, or reconciliation of data and metadata across diverse and heterogeneous data sources. A claim seems to be made that integrating existing data sources with blockchain networks will automatically enable integration, but this is not justified or explained.
In the proof-of-concept implementation, only one data source (PDB) is used, which by definition does not face the challenges listed in the paper's motivation.
- It's unclear how a centralized data management system has been decentralized using the presented solution design. Such decentralization is not achieved automatically by using a blockchain network, as the devil lies in the details. In the authors' design, the blockchain network is maintaining all the records of the centralized data source (e.g., PDB), which means that there is no data distribution among different stakeholders. The Hyperledger Fabric peers do manage their network's data in a decentralized manner through consensus, but the data is ultimately a shared replicated ledger, not a distributed database. Further, the repository of raw encrypted data is an IPFS instance, which is a centralized repository. Only for data validation purposes is the blockchain actually used, so decentralized operation forms only a part of the entire proposed system. The design also seems to indicate that the Fabric network being used consists of a single channel, which means that a single instance of the blockchain is managing all the data. Therefore, the claim of the authors' system converting a centralized system to a decentralized system is untenable, given the evidence presented in this paper.
- Relevant details of the Hyperledger Fabric network used in the ProtChain design or implementation are not given. The key decentralizing factor in a Fabric network is the number of organizations, each organization representing a different stakeholder in the network who determines ledger state updates through consensus with other organizations (via an endorsement policy). But the list of organizations, or what the organizations represent, is not given. Also, as mentioned above, the number of channels is not mentioned, though from the chaincode design it seems like there is only one channel.
Further, the ordering service seems to consist of a single node, which is very strange. The consensus protocols supported by Fabric (Kafka, Raft, BFT) are used solely in the ordering service, for the different ordering nodes to come to a consensus on the transaction ordering within a block. Having a single ordering service node completely defeats the purpose of having a consensus protocol, and further makes the design and implementation completely centralized. Such details need to be called out when using Hyperledger Fabric in a system; otherwise, one can end up using a Fabric network as a glorified (and slow) database.
- The authors never justify the need for decentralization in their problem domain, at least of the kind that requires blockchains. Centralized databases, with the benefit of decades of practice, are much faster and deliver much higher throughputs than blockchain or DLT networks do. Blockchains don't really do anything to address the problems of seamless interoperability across data sources, formats, and protocols. That problem, in situations where blockchains are used, is largely solved at layers higher than the blockchain peer network, with the smart contracts (chaincodes in Hyperledger Fabric) playing a small role.
- The performance measurements, as far as I can see, deal with the blockchain parts and the API server parts separately. Why are there no end-to-end measurements and analysis, from PDB down to the blockchain and IPFS? Is it because Hyperledger Caliper only measures TPS and latency for the blockchain network and does not cover other parts of the application stack? It's hard to get an appreciation for the entire system's performance just by looking at the blockchain, on which the authors spend several pages. This seems excessive, since there have been several studies of Fabric performance in the literature going back to 2017 (check out the "Blockbench" paper), with the following among the latest: https://eprint.iacr.org/2023/1717.pdf.
- Overall, even if the solution were much better thought out and justified, the notion of storing data off-chain and hash fingerprints on-chain is a standard technique that's been used for a long time. There's nothing novel in this technique and system for a blockchain or systems practitioner, though it might be novel in the biology/proteomics domain.
Specific page-by-page comments:
- General comment: font sizes in the images, charts, and screenshots are too small to read when the paper is printed out.
- Page 1: Column 2: Lines 55-57: "This technology........enhances the interoperability", and Page 2: Column 1: Lines 33-35: "Blockchain can automate...through its smart contract functionality". These are examples of assertions made without proper evidence. It may indeed be the case that blockchains and smart contracts can enhance compatibility and interoperability, but it's not obvious how.
- Page 2: Table 1: first row, Security column: It's not clear how "semantic reasoning" and "controlled vocabularies" are security techniques.
- Page 2: Column 1: Lines 50-52: "ProtChain also facilitates.....ensuring data format compatibility....". I couldn't find anything in the design that ensures such compatibility. The blockchain in the authors' design stores data record hashes via a smart contract. How exactly does this by itself ensure data compatibility, which is a problem that needs to be solved off-blockchain, in my opinion?
- Page 4: Table 2: Hyperledger Fabric supports BFT consensus (in addition to Raft and Kafka) starting with version 3.0.
- Page 4: Section 3.1: The detail about Hyperledger Fabric's consensus protocol is incorrect. Fabric version 2.x supports consensus using Raft, but Raft is not a BFT protocol. It provides crash fault tolerance, which is a strictly weaker fault model than the Byzantine fault model defined in inequality (1). Raft, a Paxos-family protocol, can actually tolerate just under half the nodes failing.
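For context on the fault-tolerance distinction the reviewer draws, the standard bounds (general background, not figures from the manuscript) are:

```latex
% Crash fault tolerance (Raft): liveness requires a surviving majority,
% so a cluster of n ordering nodes tolerates
f_{\text{crash}} \le \left\lfloor \frac{n-1}{2} \right\rfloor
% Byzantine fault tolerance requires n \ge 3f + 1, i.e.
f_{\text{byz}} \le \left\lfloor \frac{n-1}{3} \right\rfloor
% Example: with n = 7 ordering nodes, Raft tolerates 3 crashed nodes,
% while a BFT protocol tolerates at most 2 Byzantine nodes.
```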
- Page 5: Column 1: Lines 38-39: You say "AES-256", which is an encryption algorithm. Did you mean "SHA-256"?
- Page 5: Column 1: Lines 49-50: I think you mean "E_k is the encryption key"?
- Page 5: Column 2: I didn't get the reference to a transaction in equation (6).
- Page 5: Column 2: Lines 27-30: "....executing the chaincode to validate....": this is also incorrect about Fabric. The chaincode is only executed in the endorsement phase, which occurs before the ordering. After the ordering, the peers only validate the transactions within the blocks before committing; the chaincode is not touched in this stage.
- Page 6: Figure 4: IPFS is not part of Hyperledger Fabric, so it should not be included in a box marked "Hyperledger Fabric network".
- Page 6: Column 1: Lines 36-38: If the services are configured to fetch data already in standardized formats, exactly what is the value proposition of ProtChain?
- Page 6: Column 1: Line 60: "The contracts also interact with the encryption...". It's unclear how this interaction occurs.
- Page 6: Column 2: "SubmitData" chaincode function: this seems to take 'rawData' as input, which is a large string that's supposed to be stored in IPFS, right? Note that anything passed to a chaincode interface ends up being stored in a Fabric block, so passing a long string here sounds very inefficient to me. I'd instead compute the hash off-chain and pass it to "SubmitData". Later validations against IPFS records should still work.
- Page 7: Again, in the "SubmitData" function, is there any access control allowing/preventing the recording of data? It just seems as if a new policy is recorded using this function and is enforced only during queries.
- Page 7: Algorithm 1: Lines 21-24: It looks like the hash is computed off-chain here, which is different from the chaincode implementation shown earlier.
- Pages 9-10: Please add a citation to "Locust" in the text.
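The off-chain hashing the reviewer recommends can be sketched as follows. This is a minimal illustration only: the function names, the record ID, and the two-argument SubmitData(recordID, dataHash) shape are my assumptions, not the paper's actual API, and the IPFS upload step is omitted.

```python
import hashlib

def prepare_submission(raw_data: bytes, record_id: str) -> dict:
    """Hash the payload off-chain so only a small fingerprint is passed
    to the chaincode (and hence stored in the Fabric block); the raw
    bytes would go to IPFS separately (IPFS step omitted here)."""
    digest = hashlib.sha256(raw_data).hexdigest()
    # Arguments for a hypothetical SubmitData(recordID, dataHash)
    # chaincode call -- the hash-only form the reviewer suggests,
    # not the rawData-carrying form described in the paper.
    return {"recordID": record_id, "dataHash": digest}

def validate_against_ipfs(fetched_bytes: bytes, on_chain_hash: str) -> bool:
    """Later validation: re-hash what IPFS returns and compare it
    against the fingerprint recorded on-chain."""
    return hashlib.sha256(fetched_bytes).hexdigest() == on_chain_hash

payload = b"ATOM      1  N   MET A   1 ..."
args = prepare_submission(payload, "PDB-1ABC")
ok = validate_against_ipfs(payload, args["dataHash"])
```

Because only the 64-character hex digest reaches the chaincode, the Fabric block stays small regardless of the size of the raw record, and tampering with the IPFS copy is still detectable.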
- Page 10: I took a look at your GitHub code, and it looks like the test network used for Fabric is a direct copy of a toy example from the "fabric-samples" GitHub repository with minimal changes. As I mentioned earlier, a Fabric network needs careful design and shouldn't just be used as a glorified database with default settings.
- Page 11: I'm surprised at the query latencies and throughputs being comparable in order of magnitude to those of the storage operations. A query, or a read operation in Fabric, ought to be MUCH faster than a write operation, or a transaction. A query simply involves sending an API request to a Fabric peer, whereas a transaction involves sending an API request for endorsement to a peer, followed by an ordering stage, followed by a validation-and-commitment stage. A transaction therefore involves many more, and heavier, operations than a simple read. Can you review your Caliper measurement scripts to ensure you got the configurations right?
- Section 4.1: I think there's too much analysis of Fabric performance, which could have been truncated in favor of explaining more about how your solution actually ensures data compatibility and seamless interoperability. Some sample figures for TPS, latency, CPU usage, etc. would have sufficed here, as Fabric performance is a well-studied topic (see my earlier comment).
- Page 15: Section 4.3: No performance numbers are given. It's just asserted that secure and reliable communications occur within the blockchain network thanks to containerization (which is hardly sufficient).
Finally, despite my harsh review, I don't want to discourage the authors from pursuing this line of research and engineering, which I think is quite promising and ought to be practically useful.
But they will need to (1) understand the nuances of blockchains (especially Hyperledger Fabric) better, (2) use blockchains wisely and for purposes they are suited for and not as a blunt instrument for decentralized record-keeping, and (3) go back to the drawing board and design a solution to their problem that solves (or at least mitigates) the various challenges they laid out in the opening section.
No planned progress recorded.
Prepared the slides for the online presentation.
Need to record my video presentation and upload it.
Submitted
No further actions
Completed the methodology section. Started expanding the results section.
To complete the paper and share with my research mentor for reviews before submission.