flowgraph-0.0.22 -->

mainly stuff to get ABfrag *cough* graphed a little..

objdump changes to use -D -m i386 -b binary --prefix-addresses --adjust-vma=
which fixes the total reliance on bfd being able to recognize elf binaries etc.
this makes us able to disassemble abfrag.
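
a rough sketch of what that command line looks like when built from python.
objdump_command is a made-up helper name and the 0x08048000 load address is
only an assumed example value; the flags are the ones listed above
(--prefix-addresses being objdump's full spelling of that option).

```python
def objdump_command(path, load_address):
    """Build an objdump command line for a raw i386 image (hypothetical helper)."""
    return [
        "objdump",
        "-D",                      # disassemble everything, not just code sections
        "-m", "i386",              # force the architecture
        "-b", "binary",            # treat the file as a flat blob, no ELF parsing
        "--prefix-addresses",      # print the full address on every line
        "--adjust-vma=0x%x" % load_address,
        path,
    ]

# 0x08048000 is an assumed example address, not taken from the original code.
print(" ".join(objdump_command("abfrag", 0x08048000)))
```

the list form can be handed straight to subprocess.run() without any shell
quoting worries.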

bug fix for call/pop sequences: the call was ignored and treated like a nop,
instead of as an unconditional jump (bug still in the www abfrag graphs).

use Image's (shitty name i know) for loadable segments in a binary (some of
it is unused a bit, and semi broken in ways) - which are basically
greps for phdr's of type PT_LOAD.
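
for illustration, here is one way to pull the PT_LOAD phdr's out of a 32-bit
little-endian ELF image.  this is an assumption-heavy sketch: the original
most likely greps tool output, while this reads the headers directly with
struct, and load_segments is a hypothetical name.

```python
import struct

PT_LOAD = 1  # program header type for loadable segments

def load_segments(data):
    """Return (offset, vaddr, filesz, memsz) for each PT_LOAD phdr (ELF32, LE)."""
    # In the ELF32 header, e_phoff sits at byte 28 and
    # e_phentsize / e_phnum sit at bytes 42 and 44.
    (e_phoff,) = struct.unpack_from("<I", data, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 42)
    segments = []
    for i in range(e_phnum):
        # Elf32_Phdr is eight 32-bit words: type, offset, vaddr, paddr,
        # filesz, memsz, flags, align.
        fields = struct.unpack_from("<8I", data, e_phoff + i * e_phentsize)
        p_type, p_offset, p_vaddr, _p_paddr, p_filesz, p_memsz = fields[:6]
        if p_type == PT_LOAD:
            segments.append((p_offset, p_vaddr, p_filesz, p_memsz))
    return segments
```

usage would be something like load_segments(open("gg", "rb").read()).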

the addition of ProcPrologue, which is a straight line disassembly scanner
for i386 frame setup which marks procedure prologues.  we use this now
as a source of flow control in conjunction with the standard entry point.
this enables us to get most of the control flow in abfrag which obviously
is not really controlled by the entry point.
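
a minimal sketch of the prologue scanning idea.  note the assumptions: this
one matches raw opcode bytes rather than disassembly text, find_prologues is
a hypothetical name, and a byte scan like this will also hit data that just
happens to look like a frame setup.

```python
def find_prologues(code, base=0):
    """Return addresses where `push %ebp; mov %esp,%ebp` starts in raw i386 bytes."""
    # 0x55 is push %ebp; mov %esp,%ebp encodes as 89 e5 (or 8b ec).
    patterns = (b"\x55\x89\xe5", b"\x55\x8b\xec")
    hits = []
    for pattern in patterns:
        pos = code.find(pattern)
        while pos != -1:
            hits.append(base + pos)
            pos = code.find(pattern, pos + 1)
    return sorted(hits)

# nop; push %ebp; mov %esp,%ebp; sub $8,%esp; ret; push %ebp; mov %esp,%ebp; ret
sample = b"\x90\x55\x89\xe5\x83\xec\x08\xc3\x55\x8b\xec\xc3"
print([hex(addr) for addr in find_prologues(sample, base=0x1000)])
```

each hit can then be fed to the grapher as an extra starting point next to
the entry point.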

some hacks because i use string.atoi() and it doesnt like > 0x80000000 and
other random broken hacks to get stuff to work for the time being.
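
for what its worth, string.atoi() is gone in current python, and int() there
has no 32-bit ceiling.  so assuming a modern interpreter, the hack collapses
to something like this (parse_address is a hypothetical name):

```python
def parse_address(text):
    """Parse a hex address such as '0xbffffa10' or '80484f0' into an int."""
    # int() with base 16 accepts an optional 0x prefix and never overflows.
    return int(text, 16)

print(hex(parse_address("0xbffffa10")))
```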

use -Nfontsize=6 in the callgraph to get it displayable ;-)  also make
names not full PROCEDURE_0x etc, since it uses up too much space in the
graphs.
makes it so we can view larger graphs (abfrag callgraphs in particular).

add some new pure graph analysis.. divide the main callgraph (Graph) into
subgraphs consisting of disjoint graphs (DisjointGraph does this)
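
the splitting itself is plain connected components.  a sketch, assuming the
callgraph is available as (caller, callee) pairs; disjoint_subgraphs is a
hypothetical stand-in and not the actual DisjointGraph code.

```python
from collections import defaultdict

def disjoint_subgraphs(edges):
    """Split a call graph, given as (caller, callee) pairs, into disjoint node sets."""
    neighbours = defaultdict(set)
    for caller, callee in edges:
        # Direction is ignored: two nodes belong together if any edge joins them.
        neighbours[caller].add(callee)
        neighbours[callee].add(caller)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

edges = [("main", "f"), ("f", "g"), ("PROCEDURE_0x8048500", "h")]
for component in disjoint_subgraphs(edges):
    print(sorted(component))
```

each component can then be dumped as its own callgraph page.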

also add a new level of hyperlinking, so we have the list of callgraphs
on the main page which go into each individual callgraph.

i will add more things for better graphing of ABFrag and to also help manual
analysis (hehe. who would have thought it may actually be useful!), but
i will put them in 0.0.23

flowgraph-0.0.21 -->

lol.  ok - now we have the graphs in .gif being used by html ;-) we have
links going from the callgraph to procedures, with links to basic block
graphs and cfg's etc :)

i disabled some of the bug checks and features in this version, so i
can get the html shit on my www now :) i will add the rest of the stuff by
next version again i think.

some general cleanup in a few places.. ProcCall gets a lot more useful now
with introduction of the ProcArgs stored in it.. plus for later -
stack correction and alignment.  this all is now being used for all the
procedure calling stuff such as the prototype bug checker.. which gets its
own class now - ProcProto.  it also goes into the src/bug/ directory
now along with Format, which is a very naive format string checker, which
should generate a nice array of FP's and FN's :)

the format checker is very silly.. if a format string is being used in anything
as an immediate value when calling a procedure.. it'll be flagged as a bug.
yah.. its sorta shitty.

seems basic block graphs are f* again.. but i think this is a problem
elsewhere in the code - only happens with some binaries that it doesnt work.

well.. it took about 1500 lines of code, for me to start getting into
a mess with python :)  pretty bad i guess.. i would hope for something like
15000 at least before some whack bugs like this.  code is at about 1800
lines now.. not so bad i suppose for what its doing.

new classes in new directory graph/ since i cant find decent graph'ing
libraries etc (i'm sure there are some for python!).  but BBlockGraph is a new
class now.  added some graph traversal code to use later on - think i can
do some of that code better with anonymous functions etc (lambda).

the bug/LocalVarOverrun code does not work at all currently, so dont think
it does anything :)  needs some dataflow analysis to work properly - since
you cant push 0x08(%ebp) directly etc, but need intermediate registers. so
that means tracking etc :(

the etc/ directory goes up one level and out of src/

flowgraph-0.0.20 -->

added workaround/h4ck for call/pop sequences.. the graphs are now reasonably
correct as i see it.. cept for _start not having the intra->inter procedural
call paths being done properly - only the cg.

OK MOTHERFUCKER! (hehe).  the basic block graphs after battling with
them like i was a medi-evil knight.. are finally working.  seriously, this
code caused me MAJOR headaches.  I learnt some new python though.. id()
is pretty damn useful to look for copy by value/ref fuckups.
basic block graphing is not necessarily complex, i just did it to myself when
trying to actually do it here :(
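
a tiny illustration of the id() trick for spotting shared vs copied objects.
the blocks list is made-up sample data, not anything from the real code.

```python
import copy

blocks = [["push %ebp"], ["ret"]]
alias = blocks                  # same list object, nothing copied
shallow = list(blocks)          # new outer list, but the inner lists are shared
deep = copy.deepcopy(blocks)    # fully independent copy

print(id(alias) == id(blocks))            # True
print(id(shallow) == id(blocks))          # False
print(id(shallow[0]) == id(blocks[0]))    # True: editing shallow[0] edits blocks[0]
print(id(deep[0]) == id(blocks[0]))       # False
```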

NOW.. finally the graphs are looking more useful.  and the small addition
*cough hack* for call/pop, make it rather accurate cept as stated
above - for standard, non complex things.

i will now leave the basic block graphs in the .ps i think, since it took
me so long to finally get working properly.  the code is shit, but
centralized shit that is easy to work on or replace.  and it gets the job
done for now.

flowgraph-0.0.19 -->

erm.. fixed bugs in Graph to do the basic block graphing.. also for the
DataDep, Data etc.  Basic block graphs look like they work now at least.. i've
spent at least the past 18 hours on the dataflow.. its pissing me off that
the addressing modes are making it hard to do good dataflow analysis atm,
without crazy logic :(  a possibility that will get me some quick results is
peephole optimisation style techniques.  this can be used for instruction
re-ordering, optimisations (constant folding etc).  i will start implementing
that in the morning i think.. bed time!  i expect to have very quick
results with that once the configs are done, because there is almost no
logic behind it.  it will all be hardcoded rules based on addressing modes,
and variable matches/non matches (for peepholes of something like 2 instructions
at a time).  fuq it! i sleep!

fixed a bug in the caller bug checking :)  need to fix it to ignore
rtlinkage procedures, or use a config file for this (best idea).

proc/ directory created, and Proc functionality/data split into different
classes, to better reflect how everything is organized.  also makes
life easier in the long run.  the class hierarchy etc, and data relationships
in the code here, should also reflect a 'binary program' hierarchy and
relationships (of sorts) imo.

also note that we arent creating Proc's directly in the CFG construction
now.. this is to stop the progress of a bug ;-)  when you have cfg's
creating duplicate instances of the same procedure its calling *sigh*.

python has some ConfigParser libs as standard that i might check out soon
also.

we also now use ProcCall to describe inter to intra procedural flow.
this is to replace some hacks that did it before, and make things a bit
easier and better represented.

also removed the cfg for externally linked functions.. because it breaks
the graphs right now!  it doesnt really handle the PLT in the general
case.. means that RTLibraryLinkage is a preprocessing step.

really need a symtab to do more than just print names now..  needs to
store p_val in general (proc, instr, external proc etc).  then i can
use some polymorphism to handle the name() case.

proc_external/ created where RTLibraryLinkage goes.. it does go through
the global symtab, but its erm.. hacky at the moment.. it really is. its
only there to go through the symtab. not really to be super clean atm, the
way i've done it (with Proc's being created adhocly).

i might go look for a public cvs shortly.. the code is still just beginning
to grow, but enough functionality is in it to go into a cvs i think.

flowgraph-0.0.18 -->

i DARE someone to comprehend etc/IA32.modes :-)

"(%.*)":1:"%(.*)":"RD";
"(\(%.*\))":1:"\(%(.*)\)":"RI";
"(\*%.*)":1:"\*%(.*)":"RIA";
"(\$0x.*)":1:"\$(0x.*)":"IM";
"(\*0x[0-9a-fA-F]+)":1:"\*(0x[0-9a-fA-F]+)":"IA";
"(0x[0-9a-fA-F]+)":1:"(0x[0-9a-fA-F]+)":"I";
"(0x.*?\(%.*?\))":2:"(0x.*?)\(%(.*?)\)":"N";
"(0x.*?\(%.*?,[^%].*\))":3:"(0x.*?)\(%(.*?),([^%].*)\)":"NO";
"(0x.*?\(%.*?,%.*?,.*?\))":4:"(0x.*?)\(%(.*?),%(.*?),(.*?)\)":"NO2";
"(0x.*?\(,%.*?,.*?\))":3:"(0x.*?)\(,%(.*?),(.*?)\)":"NO1";

yes.. this is a config file ;-)
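
for anyone taking the dare: a sketch of how lines like these could be read
and applied.  it assumes each line is match-regex : field count :
extract-regex : mode name, and that the number is how many fields the second
regex captures.  how the real code picks between several matching rules is
not stated, so classify() here simply returns every rule that matches; all
names are hypothetical and only three of the rules are used.

```python
import re

RULES_TEXT = r'''
"(%.*)":1:"%(.*)":"RD";
"(\$0x.*)":1:"\$(0x.*)":"IM";
"(0x.*?\(%.*?\))":2:"(0x.*?)\(%(.*?)\)":"N";
'''

LINE = re.compile(r'^"(.*)":(\d+):"(.*)":"(\w+)";$')

def parse_rules(text):
    """Turn config lines into (match_regex, field_count, extract_regex, name) tuples."""
    rules = []
    for line in text.strip().splitlines():
        match_re, count, extract_re, name = LINE.match(line).groups()
        rules.append((re.compile(match_re), int(count), re.compile(extract_re), name))
    return rules

def classify(operand, rules):
    """Return (name, fields) for every rule whose match regex covers the whole operand."""
    results = []
    for match_re, _count, extract_re, name in rules:
        if match_re.fullmatch(operand):
            results.append((name, extract_re.fullmatch(operand).groups()))
    return results

rules = parse_rules(RULES_TEXT)
for operand in ("%eax", "$0x10", "0x8(%ebp)"):
    print(operand, classify(operand, rules))
```

with the full rule set some operands match more than one pattern, so a real
implementation needs an ordering or priority that this sketch leaves out.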

DataDep classes are now here.. and DataDepNode.  plus a few classes for
instruction/data/arg helpers.

also new config etc/IA32.asm

2:mov:USE:MOD;
2:add:USE:MOD;
2:sub:USE:MOD;
2:add:USE:MOD;

its not very complete yet.. but its only test time :)
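
a sketch of reading that file, assuming the fields mean operand count,
mnemonic, then one USE/DEF/MOD role per operand.  that reading is a guess
from the sample lines, and parse_asm_rules is a hypothetical name.

```python
def parse_asm_rules(text):
    """Map mnemonic -> list of operand roles, from lines like '2:mov:USE:MOD;'."""
    table = {}
    for line in text.split():
        fields = line.rstrip(";").split(":")
        count, mnemonic, roles = int(fields[0]), fields[1], fields[2:]
        if len(roles) != count:
            raise ValueError("operand count mismatch in %r" % line)
        table[mnemonic] = roles
    return table

rules = parse_asm_rules("2:mov:USE:MOD;\n2:add:USE:MOD;\n2:sub:USE:MOD;")
print(rules["mov"])
```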

flowgraph-0.0.17 -->

work on Arch.. now we have IA32.asm and IA32.modes config files - can
you guess what we are aiming to do shortly (look at the USE/DEF/MOD stuff).

flowgraph-0.0.16 -->

time to get this f* up disassembler crap doing something useful.. ok,
Instr and Disassem get a workover.. stores shit better. parses stuff better
*objdump cough*, so this is going to be useful now it seems.  the
mnemonics get parsed/stored directly. the params get extracted, then separated
into individual parameters.  the params also get identified for their
addressing mode (1 case so far obviously doesnt work, but its a prob with
the mnemonics, not the addressing). btw, the m_npc in Instr was completely broken
before :)  its poorly coded in this version.. but at least it is correct.  i
will simplify the bug checking now with the better instruction decoding/parsing
*objdump cough*.

new class Arch that stores regexps etc for various addressing modes of
assembly.  might be useful for other stuff also..

one day (soon i hope) it will be doing something with this disassembly..

i'm not going to bother with another .ps/.dot for this round, as nothing
has changed in the output (or at least.. it shouldnt have changed!)

flowgraph-0.0.15 -->

ok kids!  I have now implemented my 1st class of IA32 BUG CHECKING through
static analysis of binaries!

bug checker is simple (fp/fn's will happen btw with current code).

does the number of arguments a procedure uses match the number of arguments
that it gets passed? :)

look in the .dot annotations (comments), grep for ^BUG, and see what it
says ;-)

flowgraph-0.0.14 -->

now we get into a little bit more code analysis.. determine the number of
arguments being used in a procedure (not fully robust. but works ok).  it
then tries to figure out the size of each argument (the offset diff from
the next one).
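
the offset-diff idea in a few lines.  this is a sketch that assumes the
argument offsets (relative to %ebp) have already been collected from the
disassembly; argument_sizes is a hypothetical name.

```python
def argument_sizes(offsets):
    """Given %ebp-relative argument offsets seen in a procedure, guess each size."""
    ordered = sorted(set(offsets))
    sizes = {}
    for current, following in zip(ordered, ordered[1:]):
        sizes[current] = following - current
    if ordered:
        sizes[ordered[-1]] = None   # nothing after the last argument to measure against
    return sizes

# 0x8(%ebp), 0xc(%ebp) and 0x14(%ebp) were referenced somewhere in the procedure
print(argument_sizes([0x8, 0xc, 0x14]))
```

an argument that is never referenced leaves a gap, which then shows up as an
oversized neighbour.  that is one reason this is not fully robust.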

this isnt dataflow yet, but data reconstruction :)  but hey.. why not have
data reconstruction at this point ;-)

introduction of new classes like Var, ProcArgs, and ProcLocalVars etc.

gg.c is changed again to reflect new analysis.. it uses 2 functions, one
with 1 arg, the other with 4 args.

ok.. dumping this data as a comment in the .dot file, so people can see
what it does :)

flowgraph-0.0.13 -->

new class is SymTab, which is a global symbol table for the binary, that
gets filled in more as analysis continues.  this fixes the whack bug
from the last version also btw ;-)  it also makes things a lot easier to work
with now.  in this version we also fill in the symbolic name for
_init/_fini/main when we find them through static code recognition relocs, but
this is not directly related i guess :)

ALSO.. it now uses symbol information if the binary is not stripped. so if
it is not, you pretty much get all of the symbol names of procedures!

check out the 2 .ps's now, 1 is stripped. 1 is not.

flowgraph-0.0.12 -->

initial implementation of a basic block graph.. the Graph class gets
the code to make these graphs, in a generic manner.  Its not working 100%
yet in the sense of getting a final .dot graph out, but the basic meat
of the code is there and working.  just some fine tuning - with things like
conditional nodes occupying their own node.. and also calling procedures
within the cfg isnt handled properly, since its not being stored in the cfg.
oh yah, for the overlapping procedure problem. it totally breaks the
basic block graphing :)  so this has to be fixed also.

also added is the 1st code recognition.. in the case it's

push %ebp
movl %esp,%ebp
subl $XXXX,%esp

^^ the XXXX we store (if the sequence matches at the start of a procedure),
to record as the total size of (auto) local variables.
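
roughly what that recognition could look like as a regexp over the listing
text.  a sketch only: local_frame_size is a hypothetical name, and it
assumes one instruction per line with the mnemonics written as above
(with or without the l suffix).

```python
import re

PROLOGUE = re.compile(
    r"push\s+%ebp\s*\n"
    r"\s*movl?\s+%esp,\s*%ebp\s*\n"
    r"\s*subl?\s+\$(0x[0-9a-fA-F]+|\d+),\s*%esp"
)

def local_frame_size(listing):
    """Return the stack space reserved for locals, or None if the prologue differs."""
    found = PROLOGUE.match(listing.lstrip())
    if found is None:
        return None
    return int(found.group(1), 0)   # base 0 accepts both 0x18 and 24

print(local_frame_size("push %ebp\nmovl %esp,%ebp\nsubl $0x18,%esp\n"))
```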

its weird how much u can do with naive code recognition.  u can write simple
decompilers like this even!

flowgraph-0.0.11 -->

Graph class gets some algorithms added to it for path traversal and
subgraph/connectivity construction.. initially I wasn't planning on doing
this for this code, but now I am considering a 'metal' style bug
checker for double free's/locks/unlocks etc.. I've done this before with
my own languages (which arent really real in any sense), and it appears that
most of the necessary code to do this is here now.. the state machine, is
not written yet.. but fuck. i used to work on state machine style
programs for years - and from experience in previous bug checkers, its not
that complicated to code up.. there may be some issues with pointer
aliasing and some data flow.. i am not sure how it will go with a real
architecture.. but i think it will work quite well ;-)  if you look
at creating edges in the graph, i have a commented line using reverse.. this
is helpful for bottom up graph problems. you just keep the reverse control
flow.  sometimes its useful.  i do need to get the use/def implemented
prior to all this though :(

flowgraph-0.0.10 -->

this is the final removal of graph output dumping during graph construction.
everything is now being stored in the Program class (in Graph's + Proc's),
then a new class Printer, will dump that to .dot format.  its soonish that new
functionality can be added again, as now what can happen.. is that we
actually do graph analysis, and not just construction! :)  there are some
slight changes to how proc's and instructions get stored etc.  some more
things in the future will change such as symbol mechanisms in relation
to procedures and the rest.

flowgraph-0.09 -->

time to introduce the Graph class, which turns out to be remarkably quick
to do with python :)  this is going to be the formal data representation for
the graphs being generated (eg, cfg and cg).  with this, we are slowly
moving away from outputting data as soon as its generated, and instead
storing it for later output.  i avoided this earlier, cause sometimes storing
it in a graph tends to screw up how far you can go with specific types
of representation.  eg, a basic block graph, a cfg, and a cg.. do they
all get stored in the same type of graph representation?  ideally yes, but
depending on the language used, it becomes a pain in the arse.
also Proc gets some cleanup in the cfg construction that takes advantage
of the new abstractions.

also the Reloc class gets separated now from linker.py . Relocation
information is a standardized concept for an object format (as are symbols),
however backend implementations of course vary considerably.

it would be nice to actually have an abstraction that holds a procedure,
a basic block, and an instruction.. because this is basically an abstraction
of what code is.  sometimes its easier to think of a program in terms
of small bits, or large bits. but they are still things that get executed
in some manner.  and they maintain flow control and data flow concepts
etc.  that way we can do some funky things later. we'll see how it goes.

i made a post to unix-virus, discussing some data flow analysis for
use in instruction re-ordering.  i'm getting to this :) - this was one of
the reasons for better Disassem/Instr abstractions remember.

flowgraph-0.08 -->

1 functionality change.. procedure exit points are not orange nodes
anymore.. but in fact, orange edges back to the root of the procedure.
with the Proc abstraction, this works very nicely.

interesting thing is the graph layouts have changed.. i need to make
the green edges as short as possible.  will check dot docs to see how to
do this.

MORE code cleanup :)  moved 2/3 of the ELF handling from linker.py into
Binary.py (Shdr's + Symbols so far + Entry point).  also Sym class
is introduced now as a separate abstraction (Binary knows how to create
some of these from in-binary data).  This is a much
better approach.. the goal is to make linker.py quite slim, and not have
it even know that its an ELF binary behind it..  This allows in the future
such things as multi binary format handling.. (if we pass a Binary instance
to the linker, this can then be polymorphic, and Binary is a base class
for things like ELFBinary).  It is soon time to incorporate something like
a graph class, to handle the control flow more abstractly. then print out
the graph at the end of the analysis - right now, we still dump most
of the information as soon as we get it.  This is one of the reasons why
its still hard to do decent PLT clustering..  because it relies upon
updating the information, or having access to previously collected
information.  by delaying the output.. we can do multiple passes, and have
in a later pass, symbol naming etc, which is adhocish atm (it relies upon
the fact, that the pc/"symbol" has not in fact been named with an intermediate
representation).  also note in this version, readelf isnt getting called
as many times as before, but the data is being fully maintained in
the Binary class now ;-)  one of the reasons before to introduce the
Instr class, was to help in dataflow analysis that will hopefully appear
soonish ;-) the Instr class will maintain more information about each
instruction, such as addressing modes, registers involved, def/use status
on operands etc.  This is a bitch to do with regexps :(

flowgraph-0.07 -->

again no functional changes :) however.. cfg.py is gone now!  Program
handles the callgraph, and Proc handles the control flow graph.
it works out pretty nice..  also included a Binary class, which will
get rid of the one in linker.py *cough* (name conflicts. erm)

if you're taking notes.. i'm changing the code in a couple ways.. its
getting near the point where the graph data is retained in the classes,
and not just dumped out (some might say this was possible before, but it would
have been completely broken since no classes represented anything)

also note i'm setting up possibility of polymorphism here (see Disassem
being passed now, and m_flowgraph() is the same prototype in Proc
and Program; though i dont know if this is useful right now). note
that the cfg is done depth first. the cg is done breadth.  i was able
(in c), to make generic breadth/depth first searches and use the
same backend.. i'd love to merge in proc/program to use a generic type
of flowcontrol grapher.. and have different backends for cg's or cfg's.
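
the shared-backend idea, sketched in python rather than c.  traverse and the
sample graph are hypothetical; the point is that the only difference between
the two searches is which end of the worklist gets popped.

```python
from collections import deque

def traverse(graph, start, depth_first):
    """Visit nodes reachable from start; only the pop end of the worklist differs."""
    worklist, seen, order = deque([start]), set(), []
    while worklist:
        # A stack gives depth first, a queue gives breadth first.
        node = worklist.pop() if depth_first else worklist.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        worklist.extend(graph.get(node, ()))
    return order

graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(traverse(graph, "a", depth_first=False))
print(traverse(graph, "a", depth_first=True))
```

a cg or cfg backend would then only need to supply the successors of a node.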

an interesting thing cropped up.. i wasnt sharing the instruction tagging
(to say it had been visited) across procedures.. so when u have funky
stuff thinking that 2 procedures overlap. then it was doing multiple
passes ;-)  anyway.. i share the disassembly, and tag it as visited etc
now more formally.  the ilist code in Proc is *erm* bad.  i'll change
this to be a Instr later.. (how does python handle this? hope its not
pass by value!)

flowgraph-0.06 -->

no function changes.. addition of Instr class, and moved all real
disassembly code out of cfg.py and into Disassem and Instr classes.  this
is the best approach here.. erm. its starting to take shape of my
c code now for these types of abstractions ;-)  there is still a mad
amount of cruft, and the python implementation itself i've done is
still total shit :)  at least its still in the tiny code size stage so its
still hardish to get too lost with bad implementation and structuring.

flowgraph-0.05 -->

no functional changes.. code cleanup time since this code is actually looking
useful and more than just a small test program ;-)

so new classes like Disassem, Proc, Program etc.  converted cfg.py to
use Proc class for procedures etc.  Disassem just moved the disassembler
code (objdump!) from cfg to its own class.  Program is just a wrapper for
the rest.. main uses Program() to run etc.

--

flowgraph-0.03 lasted for about 15 minutes i think..  flowgraph-0.04 is
the current one now ;-)

flowgraph-0.04.2 -->

use 'ret|hlt' for procedure exit points instead of just 'ret'
this will graph correctly now from this point of view (of sample gg binary)

--

also show the instruction on conditional branches, as an edge
label.  eg, a branch label may be "je" or "jne" etc

i added this after i put up 0.04.2 on the web.. but its a 1line change, so
fuck adding a new dir.  this isnt like 'real' versioning, but more
for the sake of it.

also changed procedure exit nodes to be of colour orange ;-)

flowgraph-0.04.1 -->

fix shitty bug in linker.py which didnt correctly handle multiple entries
in linker.rel (bad keylookup for inserts!).
anyway i now include _init/_fini in linker.rel, so you should have a
pretty full view of control flow at this point.

flowgraph-0.04 -->

includes direct rt linking requirements.. it does not calculate any
further dependencies currently (i will pull in some other code to do
this).  the edge colour is purple connecting to PROGRAM :)

flowgraph-0.03 -->

Latest addition, is the use of plt information for symbol resolution.
Its not in nice form atm.. I want the PLT entries to be clustered, but it
means organizing the earlier construction to delay outputting graph
data.. this is being worked on.

        call graph (cg)
        control flow graph (cfg)
        static code (library) recognition (.o and .so should work)
        function pointer gathering from static code recognition (configurable)
                --> as feedback into the grapher
        symbol resolution for plt entries

        clustering (bounding boxes) of procedures.
        red edges are inter procedural control flow.
        black lines are intra procedural control flow.
        green lines are for information nodes (ie, procedures, entry point)
        yellow lines indicate control flow discovered through a function ptr.
        brown nodes are resolved symbols
        blue nodes are procedure entry points (present for a while, just noted)
        purple lines indicate runtime linking dependencies
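
a sketch of how the edge colours above map onto dot output.  the kind names
(inter, intra, ...) and both helper functions are made up for illustration;
only the colours come from the list.

```python
EDGE_COLOURS = {
    "inter": "red",        # inter procedural control flow
    "intra": "black",      # intra procedural control flow
    "info": "green",       # information nodes (procedures, entry point)
    "funcptr": "yellow",   # flow discovered through a function pointer
    "rtlink": "purple",    # runtime linking dependencies
}

def dot_edge(source, target, kind):
    """Format one edge as a line of dot, coloured by its kind."""
    return '"%s" -> "%s" [color=%s];' % (source, target, EDGE_COLOURS[kind])

def dot_graph(name, edges):
    """Wrap a list of (source, target, kind) edges in a digraph block."""
    lines = ["digraph %s {" % name]
    lines += ["    " + dot_edge(*edge) for edge in edges]
    lines.append("}")
    return "\n".join(lines)

print(dot_graph("gg", [("_start", "main", "inter"), ("PROGRAM", "libc.so.6", "rtlink")]))
```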

example usage.. 

flowgraph.py binary library1 [ library2 ... ]

	$ main.py gg /usr/lib/crt1.o > gg.dot
	$ dot -Tps gg.dot -ogg.ps

eg, crt1.o is where it pulls in _start.  change linker.rel to configure
what symbols represent function pointers.
dot is by at&t bell labs.

--
Silvio
