Opened 12 years ago

Closed 9 years ago

#385 closed task (fixed)

faster copyright update

Reported by: Peter Johansson
Owned by: Peter Johansson
Priority: major
Milestone: svndigest 0.10
Component: svncopyright
Version: trunk
Keywords:
Cc:

Description

This is just an idea.

Requires ticket:307

Currently, copyright update uses AddStats to retrieve the information it needs. When a cache is available this is fine, but when it is not we need to parse the repository, which is expensive: it involves a blame call for every revision in the log. For the sake of copyright, this is not really needed. For example, suppose we have the following log:

r2 | jari | 2005-12-20 15:54:03 -0500 (Tue, 20 Dec 2005) | 1 line
r5 | peter | 2005-12-29 15:52:24 -0500 (Thu, 29 Dec 2005) | 1 line
r6 | peter | 2005-12-29 16:00:00 -0500 (Thu, 29 Dec 2005) | 1 line
r8 | jari | 2005-12-30 02:09:36 -0500 (Fri, 30 Dec 2005) | 1 line
r12 | peter | 2005-12-30 08:33:47 -0500 (Fri, 30 Dec 2005) | 1 line
r21 | jari | 2005-12-30 18:20:00 -0500 (Fri, 30 Dec 2005) | 1 line
r81 | jari | 2006-03-11 18:47:40 -0500 (Sat, 11 Mar 2006) | 1 line
r84 | jari | 2006-03-13 17:04:34 -0500 (Mon, 13 Mar 2006) | 1 line
r96 | jari | 2006-03-28 18:00:51 -0500 (Tue, 28 Mar 2006) | 1 line
r138 | jari | 2006-08-03 17:40:15 -0400 (Thu, 03 Aug 2006) | 1 line
r146 | jari | 2006-08-10 04:14:58 -0400 (Thu, 10 Aug 2006) | 1 line
r149 | jari | 2006-08-12 05:11:46 -0400 (Sat, 12 Aug 2006) | 1 line
r153 | jari | 2006-08-13 18:15:55 -0400 (Sun, 13 Aug 2006) | 1 line
r156 | jari | 2006-08-14 14:49:36 -0400 (Mon, 14 Aug 2006) | 1 line
r157 | jari | 2006-08-14 14:53:21 -0400 (Mon, 14 Aug 2006) | 1 line
r160 | jari | 2006-08-19 17:19:06 -0400 (Sat, 19 Aug 2006) | 2 lines
r161 | jari | 2006-08-19 17:26:53 -0400 (Sat, 19 Aug 2006) | 1 line
r162 | jari | 2006-08-19 19:01:59 -0400 (Sat, 19 Aug 2006) | 1 line
r164 | jari | 2006-08-23 18:51:01 -0400 (Wed, 23 Aug 2006) | 1 line
r167 | jari | 2006-08-25 18:38:52 -0400 (Fri, 25 Aug 2006) | 1 line
r168 | jari | 2006-08-25 18:40:10 -0400 (Fri, 25 Aug 2006) | 1 line
r171 | jari | 2006-08-26 17:47:43 -0400 (Sat, 26 Aug 2006) | 3 lines
r172 | jari | 2006-08-26 18:15:31 -0400 (Sat, 26 Aug 2006) | 1 line
r192 | jari | 2006-09-07 07:27:50 -0400 (Thu, 07 Sep 2006) | 1 line
r195 | jari | 2006-09-07 09:44:35 -0400 (Thu, 07 Sep 2006) | 1 line
r219 | jari | 2006-12-24 16:39:35 -0500 (Sun, 24 Dec 2006) | 1 line
r220 | jari | 2006-12-24 17:53:05 -0500 (Sun, 24 Dec 2006) | 1 line
r279 | peter | 2007-05-06 05:39:47 -0400 (Sun, 06 May 2007) | 1 line
r358 | peter | 2007-06-06 18:42:50 -0400 (Wed, 06 Jun 2007) | 1 line
r368 | peter | 2007-06-12 15:05:14 -0400 (Tue, 12 Jun 2007) | 1 line
r383 | jari | 2007-06-24 18:04:58 -0400 (Sun, 24 Jun 2007) | 1 line

We can start by analyzing r2, where we find that jari added something in that changeset, and add him to the copyright for year 2005. Then we analyze r5 and conclude the same thing about peter. The next revision in the log is r6 by peter, but peter is already added to year 2005, so there is no need to analyze that revision. The same goes for r8, r12, and r21. The next revision of interest is r81, where we find that we should add jari to the copyright for year 2006. Then we can skip all remaining revisions for 2006 because they all belong to jari and he is already added. The next revision to analyze is then r279, et cetera.
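The skipping rule described above can be sketched roughly as follows. This is only an illustration in Python, not svndigest code: the function name `copyright_holders`, the `(revision, author, year)` tuples, and the `adds_content` callback (standing in for the expensive per-revision analysis) are all assumptions made for the example.

```python
def copyright_holders(log, adds_content):
    """Collect copyright holders per year while analyzing as few revisions as possible.

    log          -- iterable of (revision, author, year), oldest first (hypothetical format)
    adds_content -- callable(revision) -> bool; stands in for the expensive analysis
    Returns (holders, analyzed): holders maps year -> set of authors,
    analyzed lists the revisions that actually had to be analyzed.
    """
    holders = {}
    analyzed = []
    for rev, author, year in log:
        # Author already credited for this year: nothing new to learn, skip.
        if author in holders.get(year, set()):
            continue
        analyzed.append(rev)
        # Only credit the author if the changeset really added something.
        if adds_content(rev):
            holders.setdefault(year, set()).add(author)
    return holders, analyzed


# A shortened version of the log from this ticket.
example_log = [
    (2, "jari", 2005), (5, "peter", 2005), (6, "peter", 2005),
    (8, "jari", 2005), (12, "peter", 2005), (21, "jari", 2005),
    (81, "jari", 2006), (84, "jari", 2006), (96, "jari", 2006),
    (279, "peter", 2007), (358, "peter", 2007), (383, "jari", 2007),
]

# Assume, for the example only, that every changeset adds something.
holders, analyzed = copyright_holders(example_log, lambda rev: True)
```

Note that a revision is skipped only when its author is already credited for that year. If the analysis finds that a changeset added nothing, the author is not credited and his next revision in the same year still has to be analyzed.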

Obviously, this would be much faster than parsing every revision. We could also skip parsing the BlameStats, although that alone only gains a factor of 2.

There are some problems, or rather design decisions, we need to deal with. First, I put ticket:307 as a requirement. Otherwise we have to take into account the case where we update copyright and output a report in the same run.

Second is how to deal with the cache. We obviously want to allow a cache for this kind of statistics too, but since it is a different kind of statistics we need to keep its cache separate from the other stats cache.

  1. If there is copyright cache available, use it (that is obvious).
  2. No copyright cache available, but there is a cache for AddStats: should we then translate that into copyright info?
  3. Then comes the question about partial cache. Say that we have partial cache for both AddStats and copyright. Which one should we then utilize?

Perhaps the best way is to keep Stats and Copyright separate, i.e., no crosstalk between the caches. Then the answers to the two questions above would be no and no. You lose some speed doing so, but the benefits are numerous. First, the applications are more separate, which is probably clearer for the end user. Having two sources for the copyright info (AddStats and the algorithm described above) would sooner or later become a maintenance nightmare: when modifying something we would need to ensure that the modification takes place in both places. That smells like bad design. Second, by not mixing the two caches, it would be possible to configure the stats differently for the report and for the copyright, for example how you want to treat follow-copy. Perhaps you don't want to follow-copy in a report, but for the copyright perhaps you do. This is a bit hypothetical as the follow-copy config is not available yet, but still... I guess I'm drifting into a discussion that belongs in ticket:307, so...
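To make the "separate copyright cache" idea concrete, here is a small Python sketch of a cache that belongs to the copyright analysis alone and can be partial, i.e., cover the log only up to some revision. Everything in it is an assumption for illustration: the JSON file layout, the function names `load_cache`, `save_cache`, and `update_copyright`, and the `adds_content` callback do not describe svndigest's actual cache format or API.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return (last_revision, holders); an empty result if no cache file exists."""
    p = Path(path)
    if not p.exists():
        return 0, {}
    data = json.loads(p.read_text())
    # JSON keys are strings, so convert years back to int and authors to sets.
    holders = {int(year): set(authors) for year, authors in data["holders"].items()}
    return data["last_revision"], holders


def save_cache(path, last_revision, holders):
    """Write the copyright cache (hypothetical JSON layout)."""
    data = {
        "last_revision": last_revision,
        "holders": {str(year): sorted(authors) for year, authors in holders.items()},
    }
    Path(path).write_text(json.dumps(data))


def update_copyright(path, log, adds_content):
    """Resume from a possibly partial cache and analyze only newer revisions.

    log          -- iterable of (revision, author, year), oldest first
    adds_content -- callable(revision) -> bool, the expensive analysis
    """
    last, holders = load_cache(path)
    for rev, author, year in log:
        if rev <= last:
            continue  # already covered by the cache
        # adds_content is only called when the author is not yet credited.
        if author not in holders.get(year, set()) and adds_content(rev):
            holders.setdefault(year, set()).add(author)
        last = rev
    save_cache(path, last, holders)
    return holders
```

Because this cache never reads the AddStats cache, a change to one of them cannot silently invalidate the other, at the cost of analyzing some revisions twice when both tools are run.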

Change History (6)

comment:1 Changed 10 years ago by Peter Johansson

(In [1226]) refs #476 and #385. Remove --copyright switch in svndigest program.

comment:2 Changed 10 years ago by Peter Johansson

(In [1227]) refs #476 and #385. Move parsing to CopyrightVisitor. Only parse Files because Directories have no copyright to update.

comment:3 Changed 10 years ago by Peter Johansson

Milestone: changed from svndigest 0.x to svndigest 0.10
Owner: changed from Jari Häkkinen to Peter Johansson
Status: changed from new to assigned

comment:4 Changed 10 years ago by Peter Johansson

(In [1358]) rewrite of svncopyright to use its own Stats class, which means it's no longer using the cache. This version is slow, there is lots of room for speed-ups, and there is no cache functionality yet. refs #385

comment:5 Changed 9 years ago by Peter Johansson

(In [1364]) adding test for copyright cache. refs #385.

comment:6 Changed 9 years ago by Peter Johansson

Resolution: set to fixed
Status: changed from assigned to closed

(In [1376]) closes #385. implementing cache for svncopyright