To achieve the goal of Earth Reader, its design need to resolve the following subproblems:
All data libearth deals with are based on (de facto) standard formats. For example, it stores subscription list and its category hierarchy to an OPML file. OPML have been a de facto standard format to exchange subscription list by feed readers. It also stores all feed data to Atom format (RFC 4287).
Actually the most technologies related to RSS/syndication formats are from early 00’s, and it means they had used XML instead of JSON today we use for the same purpose. OPML is an (though poorly structured) XML format, and Atom also is an XML format.
Since we need to deal with several XML data and not need any other formats, we decided to make something first-class model objects to XML like ORM to relational databases. You can find how it can be used for designing model objects at libearth/feed.py and libearth/subscribe.py. It looks similar to Django ORM and SQLAlchemy, and makes you to deal with XML documents in the same way you use plain Python objects.
Under the hood it does incremental parsing using SAX instead of DOM to reduce memory usage when the document is larger than a hundred megabytes.
Earth Reader data can be shared by multiple installations e.g. desktop apps, mobile apps, web apps. So there must be simultaneous updates between them that could conflict. An important constraint we have is synchronization isn’t done by Earth Reader. We can’t lock files nor do atomic operations on them.
Our solution to this is read-time merge. All data are not shared between installations at least in filesystem level. They have isolated files for the same entities, and libearth merges all of them when it’s loaded into memory. Merged result doesn’t affect to all replicas but only a replica that corresponds to the installation. You can understand the approach similar to DVCS (although there are actually many differences): installations are branches, and updates from others can be pulled to mine. If there are simultaneous changes, these are merged and then committed to mine. If there’s no change for me, simply pull changes from others without merge. A big difference is that there’s no push. You can only do pull others, or wait others to pull yours. It’s because the most of existing synchronization utilities like Dropbox passively works in background. Moreover there could be offline.
Repository abstracts storage backend e.g. filesystem. There might be platforms that have no chance to directly access file system e.g. iOS, and in that case the concept of repository makes you to store data directly to Dropbox or Google Drive instead of filesystem. However in the most cases we will simply use FileSystemRepository even if data are synchronized using Dropbox or rsync.
Session abstracts installations. Every installation has its own session identifier. To be more exact it purposes to distinguish processes, hence every process has its unique identifier even if they are child processes of the same installation e.g. prefork workers.
Every session makes its own file for a document, for example, if there are two sessions identified a and b, two files for a document e.g. doc.xml will be made doc.a.xml and doc.b.xml respectively.
For each change a session merges all changes from other sessions when a document is being loaded (read-time merge).
Stage is a unit of changes i.e. an atomic changes to be merged. It provides transactions for multi threaded environment. If there are simultaneous changes from other sessions or other transactions, these are automatically merged when the currently ongoing transaction is committed.
Stage also provides Route, a convenient interface to access documents. For example, you can read the subscription list by stage.subscriptions, and write it by stage.subscriptions = new_subscriptions. In the similar way you can read a feed by stage.feeds[feed_id], and write it by stage.feeds[feed_id] = new_feed.