I have been using open-source projects for more than a decade. More recently, I have been analysing several such projects, especially on GitHub [1], GitLab [2], etc., and in particular their (code) repositories. Most of these repositories contain the source code of the application, with occasional links to official websites and to the complete documentation. It is quite rare to find a repository that is self-contained, i.e., one where a user who clones the repository finds all the information needed to understand its context, purpose and relevant documentation.
In most projects, the code is on one website, the documentation on another, any application programming interface (API) description on a third, and the continuous integration details on yet another. Feature requests and bug reports live on still another website. Then there are discussions around a new feature or a bug that can only be found in some mailing lists. In short, one has to search multiple places just to understand what is going on, what the project is all about or why a particular decision was made. Some repositories do provide a single page giving links to all of the above, both within and outside the repository. However, as external websites evolve, a major problem in these repositories is link rot [3], with links pointing to unavailable resources.
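To get a feel for how exposed a repository is to link rot, one can start by listing every external link its documentation depends on. The following is a minimal sketch, not a complete link checker: the function name `find_external_links`, the set of file suffixes and the URL pattern are all assumptions made for illustration, and the sketch only collects the links without testing whether they still resolve.

```python
import re
from pathlib import Path

# Simplified pattern (an assumption): an http(s) URL runs until whitespace
# or a common closing delimiter such as ")", ">", "]" or a quote.
URL_PATTERN = re.compile(r"https?://[^\s)>\]\"']+")


def find_external_links(repo_root, suffixes=(".md", ".txt", ".html")):
    """Map each documentation file (relative path) to the external URLs it mentions."""
    root = Path(repo_root)
    links = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Strip trailing sentence punctuation that the pattern may have swallowed.
        urls = sorted({url.rstrip(".,;:") for url in URL_PATTERN.findall(text)})
        if urls:
            links[path.relative_to(root).as_posix()] = urls
    return links


if __name__ == "__main__":
    # Report the external links of the repository in the current directory.
    for filename, urls in find_external_links(".").items():
        print(filename)
        for url in urls:
            print("   ", url)
```

Every URL in such a report is a place where the repository depends on a resource it does not control.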
One may argue that this could be deliberate or even a business model, yet I find it contrary to the principles of the open-source movement. Many services, including the ones mentioned above, have understood this problem and are trying to integrate more information with the code repositories, such as documentation wikis, issue tracking, continuous integration, etc. This is indeed a good step, but the repositories then become completely tied to these services, with no easy way to switch to other services.
Self-contained Repositories
What is a self-contained repository? It is a repository where everything concerning a project is available in one single place, within the project repository itself. In it, users can find all discussions, the complete documentation for both end-users and developers, screenshots, and so on.
Checklist
What could a self-contained repository contain, apart from the source code? In other words, what is the checklist for a self-contained repository? Given below is a non-exhaustive list that attempts to cover the key aspects of a project.
- Source code
  - Main source code
  - Debugging options
  - Logging options
  - Dependencies: package dependencies, if any
  - Unit tests, integration tests, functional tests, stress tests, performance tests
- Specifications
  - Requirement analysis (platform requirements, hardware, software, memory constraints, etc.)
  - Design: UML diagrams, flow charts, data flow diagrams
  - Verification and validation
- Logo
  - Logos in different sizes (especially logos in SVG format)
- Graphical User Interface (GUI)
  - Discussions and resolutions on different design decisions
  - Desktop solution
  - Web-browser based solution
  - Screenshots on different devices (screen sizes)
- Command-line Interface
  - Arguments and options
  - Usage
- Application Programming Interface (API)
  - REST or other web service API architecture
  - SDKs (software development kits) in various programming languages
- Todo List
  - Feature requests (e.g. user interface mockups)
  - Status
- Issues List
  - Bugs
  - Status
- Discussions
  - Meetings, discussions, and resolutions on issues, todo lists, etc.
- Code Review
  - Code reviews, discussions
  - Associated code changes
- Packaging
  - Packaging for various operating systems and distributions
  - Containers (Docker containers, Linux containers, etc.)
- Version Control
  - Code evolution history on the (distributed) version control system
  - Files to be ignored from the repositories, like .gitignore
- Profile
  - Development details: continuous integration, code coverage, system and memory profile
- Releases
  - Feature releases, issues resolved
  - Dependency: new package dependency, if any
- Verification and Validation
  - Test cases
- Licence
  - Source code
  - Documentation
- Code of Conduct
  - Contributors
  - Programming and documentation style
- Documentation (wiki and manpages)
  - Quick start, compilation
  - Application/program description
  - Application compilation, installation, development and testing
  - UI screenshots
  - Documentation, command-line options, GUI, man pages
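Part of this checklist can be verified mechanically. The sketch below is only an illustration under stated assumptions: the mapping from checklist items to file and directory names (`src`, `tests`, `docs/specifications`, `issues`, and so on) is hypothetical, since the checklist does not prescribe any layout, and the function names `audit_repository` and `missing_items` are made up for this example. It checks only that something exists at the expected path, not that the content is complete.

```python
from pathlib import Path

# Hypothetical layout: each checklist item maps to candidate paths,
# any one of which counts as "present". Adapt this to the project.
CHECKLIST = {
    "Source code": ["src"],
    "Tests": ["tests"],
    "Specifications": ["docs/specifications"],
    "Issues list": ["issues"],
    "Discussions": ["discussions"],
    "Packaging": ["packaging", "Dockerfile"],
    "Version control ignore rules": [".gitignore"],
    "Licence": ["LICENSE", "LICENCE", "COPYING"],
    "Code of conduct": ["CODE_OF_CONDUCT.md"],
    "Documentation": ["docs", "README.md"],
}


def audit_repository(repo_root, checklist=CHECKLIST):
    """Return {item: bool}; an item passes if any of its candidate paths exists."""
    root = Path(repo_root)
    return {
        item: any((root / candidate).exists() for candidate in candidates)
        for item, candidates in checklist.items()
    }


def missing_items(repo_root, checklist=CHECKLIST):
    """List the checklist items for which no candidate path was found."""
    report = audit_repository(repo_root, checklist)
    return [item for item, present in report.items() if not present]


if __name__ == "__main__":
    for item in missing_items("."):
        print("missing:", item)
```

Such a check could run as part of continuous integration, so that a repository that drifts away from self-containment is noticed early.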
Existing works
An interesting work that ensures self-containment is CDE [5] (code, data, environment), which enables the automatic creation of portable Linux applications. CDE [6, 7] ensures that packages are built in such a manner that the user can run them on any Linux distribution, irrespective of its version. The generated packages are self-contained, with all the code, data and environment required to run them anywhere. It is available as the command-line utility cde in many Linux distributions.
Another, more recent approach is to use containers [8] for creating portable packages. Dockerfiles [9] used for this purpose contain all the instructions that can be run to build a package with the desired code, environment and data on any machine, including virtual machines. However, the use of Dockerfiles differs from that of CDE. CDE builds a package containing all the necessary environment, whereas Dockerfiles are just instructions to build such a package. For our discussion on self-contained repositories, CDE is the closer work. But self-contained repositories contain more than just code, data and environment. They contain the complete context of a project, as described in the checklist above.
Challenges
As mentioned above, many current software repository services do integrate key concepts surrounding a project and are not limited to providing version control for source code, but one major problem is their data portability [10]. Different services have been popular from time to time, but when these services shut down, for reasons including the lack of a sound financial sustenance model, the data goes with them. Hence, it is important to use open standard data formats like JSON, XML, Markdown, etc., or formats based on them. It is also important to know whether the application owners wish to make the repository completely accessible from a web browser. In that case, open web standards like HTML and CSS can be used for documentation. Standards like SVG and Canvas can be used to document user interface mockups along with the screenshots of the developed application.
Behind every application there is a motivation. This could be based on an understanding of market demands or arise from a personal need. Such background is usually found on the personal blogs of repository contributors. Contributors like designers, developers and bug reporters have their own behind-the-scenes narratives of the application. This personal touch to a repository, especially for applications with a very small contributor base, helps others understand whether the application can indeed be a solution to their problems.
Another challenge is how to manage self-contained repositories as they evolve and the community size increases. As the community grows, it may not be feasible to document everything in textual file formats. Large projects may need to store and search bugs, feature requests, todos and issues. This may require proper data storage solutions like relational and non-relational databases. Here again, self-contained repositories do not constrain the use of any such solution. The repository can still be called self-contained if it provides the complete data dump of these databases and, if necessary, even the executable of the solution used. Yet, if the data dumps are in open data formats like HTML, there may not be a need to include the executable.
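As one illustration of such a data dump, the sketch below exports every table of a database into a single JSON file that can be committed to the repository. It assumes, purely for the example, that the project keeps its issues in an SQLite database; the function name `dump_database_to_json` and the file names in the usage lines are hypothetical. Columns holding binary data (BLOBs) are not handled, because JSON cannot represent them directly.

```python
import json
import sqlite3


def dump_database_to_json(db_path, out_path):
    """Write all user tables of an SQLite database to one JSON file and return the data."""
    connection = sqlite3.connect(db_path)
    connection.row_factory = sqlite3.Row  # rows become addressable by column name
    try:
        tables = [
            row["name"]
            for row in connection.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        dump = {}
        for table in tables:
            # Table names come from sqlite_master, not from user input.
            rows = connection.execute(f'SELECT * FROM "{table}"').fetchall()
            dump[table] = [dict(row) for row in rows]
    finally:
        connection.close()

    with open(out_path, "w", encoding="utf-8") as handle:
        # Sorted keys and indentation keep diffs between dumps readable.
        json.dump(dump, handle, indent=2, ensure_ascii=False, sort_keys=True)
    return dump


if __name__ == "__main__":
    # Hypothetical file names, chosen for illustration only.
    dump_database_to_json("issues.db", "issues.json")
```

Because the output is plain JSON with a stable key order, successive dumps can be compared with the usual version control tools.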
When a self-contained repository has a very large and diverse community base, the relevant documentation cannot be limited to just one language (like English), but must have multilingual support. Multilingual support has the additional advantage that the application may be accessible to users across the world. Layered SVG diagrams for mockups can not only provide coarse-grained to fine-grained information about an application, but can also be used for multilingual documentation. Certainly, maintaining such a multilingual repository is a tough exercise, especially considering the manual effort involved.
Finally, is there a limit on the repository size? A self-contained repository requires more storage space than regular repositories.
Conclusion
Building a well-documented self-contained repository is clearly a difficult task, especially since many of our current practices depend on different solutions. Yet self-containment does not restrict the use of any external solution; it only requires that all the data associated with these solutions be present in the repository in an accessible manner. Openness and transparency are the pillars of the open-source movement. Hence, any information associated with an open-source project must be part of the repository.
References
1. GitHub
2. GitLab, https://about.gitlab.com/
3. Link rot
4. Repository
5. Philip J. Guo and Dawson Engler. 2011. CDE: using system call interposition to automatically create portable software packages. In Proceedings of the 2011 USENIX conference on USENIX annual technical conference (USENIXATC’11). USENIX Association, USA, 21.
6. CDE: Automatically create portable Linux applications
7. CDE
8. Docker: Using Linux Containers to Support Portable Application Deployment
9. Docker
10. Data portability