Opened 14 years ago

Last modified 9 years ago

#102 new enhancement

Counting characters instead of lines

Reported by: Peter Johansson
Owned by: Jari Häkkinen
Priority: major
Milestone: svndigest 1.0
Component: core
Version: trunk
Keywords:
Cc:


Would it be possible to be character based rather than line based?

This would be fairer to developers because the atom of writing is not the line but the character. It also makes the copyright functionality more reasonable to configure. You can no longer lose the ownership of a line but will still own the characters you have contributed. It would still make sense to distinguish between code, doc, and empty (even though empty probably should be renamed).

How expensive would this be?

Change History (7)

comment:1 Changed 12 years ago by Jari Häkkinen

Maybe we should have statistics on statements rather than lines?

comment:2 Changed 11 years ago by Jari Häkkinen

The current ownership of code is based on lines, i.e., if you change one character in a line you get the ownership of that line. The ownership should be based on segments of text (strings) rather than lines. The smallest unit is a one character string. Utilizing strings rather than lines would enable statistics based on character count.

The segments would be defined by commits like

Revision 1

cout << "Hello World" << endl;


Revision 2

cout << "Hello World!" << endl;

After revision 1 there will be one string (segment): the complete line. After revision 2 there would be three strings (omitting whitespace characters, since we'd like to do that):

cout<<"Hello World
!
"<<endl;

And so on. (The blame information pages will be interesting to create with code segments coloured in shades of gray to indicate age of code.)
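
As a rough illustration of the segment idea, here is a minimal sketch. It assumes a deliberately simplified diff that only trims the common prefix and suffix of the old and new text, so a real implementation would need a proper character diff. The names Segment and update_ownership are hypothetical and not part of svndigest. Unlike the example above, whitespace is counted here to keep the sketch short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type: a run of `length` characters last written in `revision`.
struct Segment {
  std::size_t length;
  int revision;
};

// Return the parts of `segs` that cover the range [start, start + count).
std::vector<Segment> slice(const std::vector<Segment>& segs,
                           std::size_t start, std::size_t count) {
  std::vector<Segment> out;
  std::size_t pos = 0;
  for (const Segment& s : segs) {
    const std::size_t lo = std::max(pos, start);
    const std::size_t hi = std::min(pos + s.length, start + count);
    if (lo < hi) out.push_back({hi - lo, s.revision});
    pos += s.length;
  }
  return out;
}

// Append a segment, merging it with the previous one if the revision matches.
void append(std::vector<Segment>& out, Segment s) {
  if (s.length == 0) return;
  if (!out.empty() && out.back().revision == s.revision)
    out.back().length += s.length;
  else
    out.push_back(s);
}

// Simplified ownership update: characters in the common prefix and suffix
// keep their old owner, everything in between is credited to `new_rev`.
std::vector<Segment> update_ownership(const std::string& old_text,
                                      const std::vector<Segment>& old_segs,
                                      const std::string& new_text,
                                      int new_rev) {
  std::size_t total = 0;
  for (const Segment& s : old_segs) total += s.length;
  assert(total == old_text.size());  // segments must cover the old text

  const std::size_t max_common = std::min(old_text.size(), new_text.size());
  std::size_t prefix = 0;
  while (prefix < max_common && old_text[prefix] == new_text[prefix]) ++prefix;
  std::size_t suffix = 0;
  while (suffix < max_common - prefix &&
         old_text[old_text.size() - 1 - suffix] ==
             new_text[new_text.size() - 1 - suffix])
    ++suffix;

  std::vector<Segment> out;
  for (const Segment& s : slice(old_segs, 0, prefix)) append(out, s);
  append(out, {new_text.size() - prefix - suffix, new_rev});
  for (const Segment& s : slice(old_segs, old_text.size() - suffix, suffix))
    append(out, s);
  return out;
}
```

Applied to the two revisions above (with whitespace included), this gives 20 characters from revision 1, 1 character from revision 2, and 10 characters from revision 1.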

The cache system needs to be reworked, of course. The string comparisons might be a nightmare, but hey, they could be based on line diffs, so only the lines that differ need to be analyzed. trac already does the line diff analysis, so let's borrow code.

comment:3 Changed 11 years ago by Peter Johansson

IIRC trac's line diff is sub-optimal and only allows a limited number of break points, but it might have been improved since I noticed that.

We already have the ssearch function in yat, which essentially does a diff. It depends on GSL, however, so we'd need to re-write it.

comment:4 Changed 11 years ago by Peter Johansson

Ran into this article which might be useful

comment:5 Changed 11 years ago by Peter Johansson

About the cache: First, we need to know which revision each segment comes from. This could be saved as 100:3, 53:5, 2:3, which means the first 100 chars belong to rev 3, then come 53 chars from rev 5, and then 2 chars from rev 3. The good thing about this format is that it is local, in the sense that if we later insert chars, values change only for the modified segments.
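
A minimal sketch of reading and writing this length:revision format, plus an insertion that touches only the segment it lands in. All names (Segment, parse_segments, format_segments, insert_chars) are hypothetical, and the exact on-disk syntax is an assumption based on the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type: `length` consecutive characters owned by `revision`.
struct Segment {
  std::size_t length;
  int revision;
};

// Parse a list such as "100:3, 53:5, 2:3".
std::vector<Segment> parse_segments(const std::string& text) {
  std::vector<Segment> segs;
  std::istringstream in(text);
  Segment s;
  char sep;
  while (in >> s.length >> sep >> s.revision) {
    if (sep != ':') throw std::invalid_argument("expected length:revision");
    segs.push_back(s);
    in >> sep;  // consume the ',' between entries, if present
  }
  return segs;
}

// Write the segments back in the same format.
std::string format_segments(const std::vector<Segment>& segs) {
  std::ostringstream out;
  for (std::size_t i = 0; i < segs.size(); ++i) {
    if (i > 0) out << ", ";
    out << segs[i].length << ':' << segs[i].revision;
  }
  return out.str();
}

// Insert `count` chars written in `rev` at character offset `pos`.
// Only the segment containing `pos` is split; all others are copied as-is.
std::vector<Segment> insert_chars(const std::vector<Segment>& segs,
                                  std::size_t pos, std::size_t count,
                                  int rev) {
  std::size_t total = 0;
  for (const Segment& s : segs) total += s.length;
  assert(pos <= total);  // the offset must lie inside the file

  std::vector<Segment> out;
  std::size_t start = 0;
  bool placed = false;
  for (const Segment& s : segs) {
    if (!placed && pos <= start + s.length) {
      const std::size_t head = pos - start;
      if (head > 0) out.push_back({head, s.revision});
      out.push_back({count, rev});
      if (head < s.length) out.push_back({s.length - head, s.revision});
      placed = true;
    } else {
      out.push_back(s);
    }
    start += s.length;
  }
  if (!placed) out.push_back({count, rev});  // empty file
  return out;
}
```

For example, inserting 4 chars from a hypothetical rev 7 at offset 120 turns 100:3, 53:5, 2:3 into 100:3, 20:5, 4:7, 33:5, 2:3, leaving the first and last entries untouched.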

Second, we need to know who each revision belongs to (rev 3 etc.); this could be retrieved from svn log, but it might be worthwhile to cache the (partial) log as 3:jari, 5:peter so we don't contact the repository when not needed.
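
Combining the two caches could look roughly like the sketch below. It assumes the revision-to-author cache is already loaded into a map; the names Segment and chars_per_author are hypothetical, and the "(unknown)" key is a placeholder for revisions missing from the cached log, which is where a lookup in svn log would be needed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical type: `length` consecutive characters owned by `revision`.
struct Segment {
  std::size_t length;
  int revision;
};

// Sum the characters per author using a cached revision -> author map.
// Revisions missing from the cache are collected under "(unknown)".
std::map<std::string, std::size_t> chars_per_author(
    const std::vector<Segment>& segs,
    const std::map<int, std::string>& author_of) {
  std::map<std::string, std::size_t> counts;
  std::size_t total = 0;
  for (const Segment& s : segs) {
    const std::map<int, std::string>::const_iterator it =
        author_of.find(s.revision);
    const std::string author =
        (it != author_of.end()) ? it->second : "(unknown)";
    counts[author] += s.length;
    total += s.length;
  }

  // Sanity check: every character is attributed exactly once.
  std::size_t attributed = 0;
  for (const auto& entry : counts) attributed += entry.second;
  assert(attributed == total);
  return counts;
}
```

With the segments 100:3, 53:5, 2:3 and the log 3:jari, 5:peter, this credits 102 chars to jari and 53 chars to peter.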

Third, we need to know whether chars are code or comment. We need to parse this out for every rev (as we do now). It is tempting (to me) to parse code/comment only from the HEAD rev, but it won't work, because what is code in one rev may be a comment in the next rev if the coder (commenter) decided to comment out the code. Instead, when retrieving stats for code, we should ignore comments just like we ignore white-space chars.
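
A minimal sketch of the counting step, assuming a naive classifier that only knows // line comments and ignores string literals and block comments. The real parser would be used instead. Kind, classify, and code_chars_per_revision are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical type: `length` consecutive characters owned by `revision`.
struct Segment {
  std::size_t length;
  int revision;
};

enum class Kind { Code, Comment, Blank };

// Naive per-character classification: whitespace is Blank, everything from
// "//" to the end of the line is Comment, the rest is Code.
std::vector<Kind> classify(const std::string& text) {
  std::vector<Kind> kinds(text.size(), Kind::Code);
  bool in_comment = false;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const char c = text[i];
    if (c == '\n')
      in_comment = false;
    else if (!in_comment && c == '/' && i + 1 < text.size() &&
             text[i + 1] == '/')
      in_comment = true;
    if (std::isspace(static_cast<unsigned char>(c)))
      kinds[i] = Kind::Blank;
    else if (in_comment)
      kinds[i] = Kind::Comment;
  }
  return kinds;
}

// Count code characters per revision, skipping comment and whitespace
// characters. Ownership of the skipped characters is left unchanged.
std::map<int, std::size_t> code_chars_per_revision(
    const std::string& text, const std::vector<Segment>& segs) {
  std::size_t total = 0;
  for (const Segment& s : segs) total += s.length;
  assert(total == text.size());  // segments must cover the whole text

  const std::vector<Kind> kinds = classify(text);
  std::map<int, std::size_t> counts;
  std::size_t pos = 0;
  for (const Segment& s : segs)
    for (std::size_t i = 0; i < s.length; ++i, ++pos)
      if (kinds[pos] == Kind::Code) ++counts[s.revision];
  return counts;
}
```

The classification is applied only when stats are retrieved, so commenting out a line changes what is counted as code without changing who owns the characters.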

comment:6 Changed 9 years ago by Peter Johansson

ticket #518 was marked as related

comment:7 in reply to:  3 Changed 9 years ago by Peter Johansson

Replying to peter:

We already have the ssearch function in yat, which essentially does a diff. It depends on GSL, however, so we'd need to re-write it.

ssearch in yat uses NW (Needleman-Wunsch). A better choice is probably Hirschberg's algorithm because it uses less space (linear rather than quadratic).
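
For reference, a minimal sketch of Hirschberg's divide-and-conquer idea, here applied to the longest common subsequence of two strings rather than to a scored alignment. It has no GSL dependency. The function names are hypothetical, and the sketch copies substrings for brevity, so it is not tuned for memory, but it never stores the full quadratic table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Last row of the LCS length table for `a` against every prefix of `b`,
// computed with two rows only.
std::vector<int> lcs_last_row(const std::string& a, const std::string& b) {
  std::vector<int> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
  for (const char ca : a) {
    for (std::size_t j = 0; j < b.size(); ++j)
      cur[j + 1] = (ca == b[j]) ? prev[j] + 1 : std::max(prev[j + 1], cur[j]);
    std::swap(prev, cur);
  }
  return prev;
}

// Hirschberg-style recursion: split `a` in the middle, find the best
// matching split point in `b`, and solve the two halves independently.
std::string hirschberg_lcs(const std::string& a, const std::string& b) {
  if (a.empty() || b.empty()) return "";
  if (a.size() == 1)
    return b.find(a[0]) != std::string::npos ? a : "";

  const std::size_t mid = a.size() / 2;
  const std::string head = a.substr(0, mid);
  const std::string tail = a.substr(mid);

  // Forward pass on the first half, backward pass on the second half.
  const std::vector<int> left = lcs_last_row(head, b);
  const std::string tail_rev(tail.rbegin(), tail.rend());
  const std::string b_rev(b.rbegin(), b.rend());
  const std::vector<int> right = lcs_last_row(tail_rev, b_rev);
  assert(left.size() == right.size());

  // Choose the split of `b` that maximizes the combined LCS length.
  std::size_t best_k = 0;
  int best = -1;
  for (std::size_t k = 0; k <= b.size(); ++k) {
    const int score = left[k] + right[b.size() - k];
    if (score > best) {
      best = score;
      best_k = k;
    }
  }
  return hirschberg_lcs(head, b.substr(0, best_k)) +
         hirschberg_lcs(tail, b.substr(best_k));
}
```

The characters of the newer revision that are not part of the common subsequence would be the ones credited to that revision.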
