flowgraph-0.0.15 -->

ok kids!  I have now implemented my 1st class of IA32 BUG CHECKING through
static analysis of binaries!

the bug checker is simple (false positives/negatives will happen btw with the current code).

does the number of arguments a procedure uses match the number of arguments
that it gets passed? :)

look in the .dot annotations (comments), grep for ^BUG, and see what it
says ;-)
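
a rough sketch of the check above, not the actual flowgraph code. the
function name and the data layout are made up for illustration, and it
assumes the used/passed counts were already recovered by earlier analysis.
each report line starts with BUG so that grepping the .dot comments finds it.

```python
# hypothetical sketch: the function name and data layout are assumptions,
# not flowgraph's real code.

def check_call_args(used_args, call_sites):
    """Compare arguments used by a callee with arguments passed at each call.

    used_args:  {procedure name: number of arguments its body uses}
    call_sites: list of (caller, callee, number of arguments passed)
    Returns a list of report lines, one per mismatch.
    """
    reports = []
    for caller, callee, passed in call_sites:
        used = used_args.get(callee)
        if used is None:
            # callee was never analysed (eg, a library call), so skip it
            continue
        if used != passed:
            reports.append(
                "BUG: %s calls %s with %d arg(s), but %s uses %d"
                % (caller, callee, passed, callee, used)
            )
    return reports
```

skipping procedures that were never analysed is one obvious source of false
negatives here.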

flowgraph-0.0.14 -->

now we get into a little bit more code analysis.. determine the number of
arguments being used in a procedure (not fully robust. but works ok).  it
then tries to figure out the size of each argument (the offset diff from
the next one).
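
the size guess can be sketched like this. it is only an illustration under
assumptions: the input is taken to be the %ebp-relative offsets at which a
procedure touches its arguments (the name arg_sizes and the input format are
made up), and the last argument has no neighbour to diff against, so its
size stays unknown.

```python
# hypothetical sketch: arg_sizes and its input format are assumptions.

def arg_sizes(offsets):
    """Guess argument sizes from the gaps between argument offsets.

    offsets: %ebp-relative offsets seen for argument accesses, eg [8, 12, 20]
    Returns a list of (offset, size); size is None for the last argument
    because there is no following offset to subtract from.
    """
    uniq = sorted(set(offsets))
    sizes = []
    for i, off in enumerate(uniq):
        if i + 1 < len(uniq):
            sizes.append((off, uniq[i + 1] - off))
        else:
            sizes.append((off, None))
    return sizes
```

an argument that is passed but never touched makes its neighbour look
bigger than it is, which is one reason this is not fully robust.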

this isnt dataflow yet, but data reconstruction :)  but hey.. why not have
data reconstruction at this point ;-)

introduction of new classes like Var, ProcArgs, and ProcLocalVars etc.

gg.c is changed again to reflect new analysis.. it uses 2 functions, one
with 1 arg, the other with 4 args.

ok.. dumping this data as a comment in the .dot file, so people can see
what it does :)

flowgraph-0.0.13 -->

new class is SymTab, which is a global symbol table for the binary, that
gets filled in more as analysis continues.  this fixes the whack bug
from the last version also btw ;-)  it also makes things a lot easier to work
with now.  in this version we also fill in the symbolic name for
_init/_fini/main when we find them through static code recognition relocs, but
this is not directly related i guess :)

ALSO.. it now uses symbol information if the binary is not stripped. so if
it is not, you pretty much get all of the symbol names of procedures!
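
a minimal sketch of what a global symbol table like this could look like.
this is not the real SymTab: the method names, the "first name wins" rule
and the sub_ fallback naming are all assumptions for illustration.

```python
# hypothetical sketch: method names, the first-name-wins rule and the
# "sub_" fallback are assumptions, not flowgraph's real SymTab.

class SymTab:
    """Address -> name table that gets filled in as analysis continues."""

    def __init__(self):
        self._by_addr = {}

    def add(self, addr, name, source):
        """Record a name for addr unless one is already known.

        source is a free-form tag saying where the name came from,
        eg "symtab", "plt" or "code-recognition".
        """
        if addr not in self._by_addr:
            self._by_addr[addr] = (name, source)

    def name(self, addr):
        """Return the known name, or a made-up placeholder for addr."""
        entry = self._by_addr.get(addr)
        if entry is not None:
            return entry[0]
        return "sub_%x" % addr
```

with a stripped binary only the recognised names (_init/_fini/main) would
get added, and everything else falls back to the placeholder.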

check out the 2 .ps's now, 1 is stripped. 1 is not.

flowgraph-0.0.12 -->

initial implementation of a basic block graph.. the Graph class gets
the code to make these graphs, in a generic manner.  Its not working 100%
yet in the sense of getting a final .dot graph out, but the basic meat
of the code is there and working.  just some fine tuning - with things like
conditional nodes occupying their own node.. and also calling procedures
within the cfg isnt handled properly, since its not being stored in the cfg.
oh yah, for the overlapping procedure problem. it totally breaks the
basic block graphing :)  so this has to be fixed also.

also added is the 1st code recognition.. in the case its

push %ebp
movl %esp,%ebp
subl $XXXX,%esp

^^ the XXXX we store (if the sequence matches at the start of a procedure),
to record as the total size of (auto) local variables.
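
a sketch of that prologue match with regexps. it assumes the instructions
arrive as AT&T syntax text (the way objdump prints them) with one string per
instruction; the names PROLOGUE and local_frame_size are made up here.

```python
import re

# hypothetical sketch: PROLOGUE and local_frame_size are illustrative names.
# each pattern matches one instruction of the standard frame setup.
PROLOGUE = [
    re.compile(r"^pushl?\s+%ebp$"),
    re.compile(r"^movl?\s+%esp,\s*%ebp$"),
    re.compile(r"^subl?\s+\$(0x[0-9a-fA-F]+|\d+),\s*%esp$"),
]

def local_frame_size(instrs):
    """Return XXXX from 'subl $XXXX,%esp' if instrs starts with the prologue.

    instrs: list of instruction strings for one procedure, in order.
    Returns the size as an int, or None if the sequence does not match.
    """
    if len(instrs) < len(PROLOGUE):
        return None
    match = None
    for pattern, text in zip(PROLOGUE, instrs):
        match = pattern.match(text.strip())
        if match is None:
            return None
    value = match.group(1)  # immediate captured from the subl line
    return int(value, 16) if value.startswith("0x") else int(value)
```

procedures compiled without a frame pointer will not match, so None just
means "unknown", not "no locals".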

its weird how much u can do with naive code recognition.  u can write simple
decompilers like this even!

flowgraph-0.0.11 -->

Graph class gets some algorithms added to it for path traversal and
subgraph/connectivity construction.. initially I wasn't planning on doing
this for this code, but now I am considering a 'metal' style bug
checker for double free's/locks/unlocks etc.. I've done this before with
my own languages (which arent really real in any sense), and it appears that
most of the necessary code to do this is here now.. the state machine, is
not written yet.. but fuck. i used to work on state machine style
programs for years - and from experience in previous bug checkers, its not
that complicated to code up.. there may be some issues with pointer
aliasing and some data flow.. i am not sure how it will go with a real
architecture.. but i think it will work quite well ;-)  if you look
at creating edges in the graph, i have a commented line using reverse.. this
is helpful for bottom up graph problems. you just keep the reverse control
flow.  sometimes its useful.  i do need to get the use/def implemented
prior to all this though :(
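
the reverse edge idea can be sketched like this. it is a toy Graph, not the
one in flowgraph: the attribute and method names are assumptions. keeping
the predecessor sets next to the successor sets is what makes bottom up
questions ("what can reach this node?") cheap.

```python
from collections import defaultdict

# hypothetical sketch: a toy graph, not flowgraph's Graph class.

class Graph:
    def __init__(self):
        self.succ = defaultdict(set)  # forward control flow
        self.pred = defaultdict(set)  # reverse control flow

    def add_edge(self, src, dst):
        self.succ[src].add(dst)
        self.pred[dst].add(src)  # the "reverse" edge

    def reaches(self, target):
        """Return every node with a path to target, walking reverse edges."""
        seen = set()
        stack = [target]
        while stack:
            node = stack.pop()
            for prev in self.pred.get(node, ()):
                if prev not in seen:
                    seen.add(prev)
                    stack.append(prev)
        return seen
```

for a double free style checker, something like reaches() on the node of a
free call would give the set of nodes worth looking at for an earlier free.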

flowgraph-0.0.10 -->

this is the final removal of graph output dumping during graph construction.
everything is now being stored in the Program class (in Graph's + Proc's),
then a new class Printer, will dump that to .dot format.  its soonish that new
functionality can be added again, as now what can happen.. is that we
actually do graph analysis, and not just construction! :)  there are some
slight changes to how proc's and instructions get stored etc.  some more
things in the future will change such as symbol mechanisms in relation
to procedures and the rest.

flowgraph-0.09 -->

time to introduce the Graph class, which turns out to be remarkably quick
to do with python :)  this is going to be the formal data representation for
the graphs being generated (eg, cfg and cg).  with this, we slowly stop
outputting data as soon as its generated, and instead store it for later
output.  i avoided this earlier, cause sometimes storing
it in a graph tends to screw up how far you can go with specific types
of representation.  eg, a basic block graph, a cfg, and a cg.. do they
all get stored in the same type of graph representation?  ideally yes, but
depending on the language used, it becomes a pain in the arse.
also Proc gets some cleanup in the cfg construction that takes advantage
of the new abstractions.

also the Reloc class gets separated now from linker.py . Relocation
information is a standardized concept for an object format (as are symbols),
however backend implementations of course vary considerably.

it would be nice to actually have an abstraction that holds both a procedure,
a basic block, and an instruction.. because this is basically an abstraction
of what code is.  sometimes its easier to think of a program in terms
of small bits, or large bits. but they are still things that get executed
in some manner.  and they maintain flow control and data flow concepts
etc.  that way we can do some funky things later. we'll see how it goes.
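
one way such an abstraction could look, purely as a sketch. none of this is
in flowgraph: the class names Code and Group are assumptions. the idea is
that an instruction, a basic block and a procedure all answer the same
question ("which instructions do you execute?"), so code that walks one can
walk any of them.

```python
# hypothetical sketch: Code and Group are made-up names for illustration.

class Code:
    """Anything that gets executed: an instruction, a block, a procedure."""

    def instructions(self):
        raise NotImplementedError

    def count(self):
        return sum(1 for _ in self.instructions())

class Instr(Code):
    def __init__(self, addr, text):
        self.addr = addr
        self.text = text

    def instructions(self):
        yield self  # a single instruction is the smallest bit of code

class Group(Code):
    """A basic block (group of Instr) or a procedure (group of blocks)."""

    def __init__(self, parts):
        self.parts = list(parts)

    def instructions(self):
        for part in self.parts:
            yield from part.instructions()
```

a procedure is then just Group([block1, block2, ...]) and each block is
Group([instr, ...]).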

i made a post to unix-virus, discussing some data flow analysis for
use in instruction re-ordering.  i'm getting to this :) - this was one of
the reasons for better Disassem/Instr abstractions remember.

flowgraph-0.08 -->

1 functionality change.. procedure exit points are not orange nodes
anymore.. but in fact, orange edges back to the root of the procedure.
with the Proc abstraction, this works very nicely.

interesting thing is the graph layouts have changed.. i need to make
the green edges as short as possible.  will check dot docs to see how to
do this.

MORE code cleanup :)  moved 2/3 of the ELF handling from linker.py into
Binary.py (Shdr's + Symbols so far + Entry point).  also Sym class
is introduced now as a separate abstraction (Binary knows how to create
some of these from in-binary data).  This is a much
better approach.. the goal is to make linker.py quite slim, and not have
it even know that its an ELF binary behind it..  This allows in the future
such things as multi binary format handling.. (if we pass a Binary instance
to the linker, this can then be polymorphic, and Binary is a base class
for things like ELFBinary).  It is soon time to incorporate something like
a graph class, to handle the control flow more abstractly. then print out
the graph at the end of the analysis - right now, we still dump most
of the information as soon as we get it.  This is one of the reasons why
its still hard to do decent PLT clustering..  because it relies upon
updating the information, or having access to previously collected
information.  by delaying the output.. we can do multiple passes, and have
in a later pass, symbol naming etc, which is adhocish atm (it relies upon
the fact, that the pc/"symbol" has not in fact been named with an intermediate
representation).  also note in this version, readelf isnt getting called
as many times as before, but the data is being fully maintained in
the Binary class now ;-)  one of the reasons before to introduce the
Instr class, was to help in dataflow analysis that will hopefully appear
soonish ;-) the Instr class will maintain more information about each
instruction, such as addressing modes, registers involved, def/use status
on operands etc.  This is a bitch to do with regexps :(
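
a sketch of the Binary / ELFBinary split, to show what "polymorphic" means
here. note this swaps in a different technique: flowgraph gets its data
through readelf, while this toy version reads e_entry straight out of a
32-bit little-endian ELF header. the method names and the describe() helper
are assumptions.

```python
import struct

# hypothetical sketch: method names and describe() are assumptions, and
# parsing the header directly is not how flowgraph does it (it uses readelf).

class Binary:
    """Format-independent view of a binary; the linker only sees this."""

    def __init__(self, data):
        self.data = data

    def entry_point(self):
        raise NotImplementedError

class ELFBinary(Binary):
    MAGIC = b"\x7fELF"

    def entry_point(self):
        if self.data[:4] != self.MAGIC:
            raise ValueError("not an ELF image")
        # in a 32-bit ELF header, e_entry is the 4 bytes at offset 24
        (entry,) = struct.unpack_from("<I", self.data, 24)
        return entry

def describe(binary):
    """Stand-in for linker code: works on any Binary subclass."""
    return "entry point at 0x%x" % binary.entry_point()
```

describe() never mentions ELF, so another format would only need its own
Binary subclass.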

flowgraph-0.07 -->

again no functional changes :) however.. cfg.py is gone now!  Program
handles the callgraph, and Proc handles the control flow graph.
it works out pretty nice..  also included a Binary class, which will
get rid of the one in linker.py *cough* (name conflicts. erm)

if you're taking notes.. i'm changing the code in a couple ways.. its
getting near the point where the graph data is retained in the classes,
and not just dumped out (some might say this was possible before, but it would
have been completely broken since no classes represented anything)

also note i'm setting up possibility of polymorphism here (see Disassem
being passed now, and m_flowgraph() is the same prototype in Proc
and Program; though i dont know if this is useful right now). note
that the cfg is done depth first. the cg is done breadth.  i was able
(in c), to make generic breadth/depth first searches and use the
same backend.. i'd love to merge in proc/program to use a generic type
of flowcontrol grapher.. and have different backends for cg's or cfg's.
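
the generic breadth/depth first idea, sketched in python. this is not the
c code mentioned above and not what Proc/Program do now; traverse and its
parameters are made-up names. the only difference between the two searches
is which end of the worklist the next node comes from.

```python
from collections import deque

# hypothetical sketch: traverse() and its parameters are illustrative names.

def traverse(start, successors, depth_first):
    """Visit every node reachable from start and return the visit order.

    successors:  function mapping a node to an iterable of next nodes
    depth_first: True pops the newest work item (stack, like the cfg walk),
                 False pops the oldest (queue, like the cg walk)
    """
    work = deque([start])
    seen = set()
    order = []
    while work:
        node = work.pop() if depth_first else work.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        work.extend(successors(node))
    return order
```

a cg or cfg backend would then just be a different successors function
(call targets vs branch targets).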

an interesting thing cropped up.. i wasnt sharing the instruction tagging
(to say it had been visited) across procedures.. so when u have funky
stuff thinking that 2 procedures overlap. then it was doing multiple
passes ;-)  anyway.. i share the disassembly, and tag it as visited etc
now more formally.  the ilist code in Proc is *erm* bad.  i'll change
this to be an Instr later.. (how does python handle this? hope its not
pass by value!)
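
on the python question: arguments are passed as references to objects, so a
list of instruction objects is not copied when it gets handed to another
procedure, and a visited tag set in one place is seen everywhere. a small
sketch (this Instr and walk() are made up for the example, not the real
classes):

```python
# hypothetical sketch: this Instr and walk() exist only for the example.

class Instr:
    def __init__(self, addr, text):
        self.addr = addr
        self.text = text
        self.visited = False

def walk(ilist):
    """Tag unvisited instructions and return the addresses newly visited."""
    fresh = []
    for ins in ilist:
        if not ins.visited:
            ins.visited = True  # mutates the shared object, not a copy
            fresh.append(ins.addr)
    return fresh
```

slicing the list (disasm[1:]) copies the list itself, but the Instr objects
inside are still the same ones, so two overlapping procedures share the tags.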

flowgraph-0.06 -->

no functional changes.. addition of Instr class, and moved all real
disassembly code out of cfg.py and into Disassem and Instr classes.  this
is the best approach here.. erm. its starting to take shape of my
c code now for these types of abstractions ;-)  there is still a mad
amount of cruft, and the python implementation itself i've done is
still total shit :)  at least its still in the tiny code size stage so its
still hardish to get too lost with bad implementation and structuring.

flowgraph-0.05 -->

no functional changes.. code cleanup time since this code is actually looking
useful and more than just a small test program ;-)

so new classes like Disassem, Proc, Program etc.  converted cfg.py to
use Proc class for procedures etc.  Disassem just moved the disassembler
code (objdump!) from cfg to its own class.  Program is just a wrapper for
the rest.. main uses Program() to run etc.

--

flowgraph-0.03 lasted for about 15 minutes i think..  flowgraph-0.04 is
the current one now ;-)

flowgraph-0.04.2 -->

use 'ret|hlt' for procedure exit points instead of just 'ret'
this will graph correctly now from this point of view (of sample gg binary)
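
as a sketch, the exit test could be a regexp like this on the instruction
text. the names EXIT_RE and is_exit are made up, and the input is assumed to
be the mnemonic plus operands as printed by the disassembler.

```python
import re

# hypothetical sketch: EXIT_RE and is_exit are illustrative names.
# \b keeps the match to whole mnemonics, so "ret $0x8" matches too.
EXIT_RE = re.compile(r"^(ret|hlt)\b")

def is_exit(text):
    """True if the instruction text ends a procedure (ret or hlt)."""
    return EXIT_RE.match(text.strip()) is not None
```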

--

also show the instruction on conditional branches, as an edge
label.  eg, a branch label may be "je" or "jne" etc
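
a sketch of how a labelled edge line for the .dot file could be built.
dot_edge is a made-up helper, not the code in flowgraph; the attribute
syntax ([label=..., color=...]) is plain dot.

```python
# hypothetical sketch: dot_edge is an illustrative helper.

def dot_edge(src, dst, label=None, color=None):
    """Return one dot edge statement, with optional label and colour."""
    attrs = []
    if label is not None:
        attrs.append('label="%s"' % label)
    if color is not None:
        attrs.append('color="%s"' % color)
    line = '"%s" -> "%s"' % (src, dst)
    if attrs:
        line += " [" + ", ".join(attrs) + "]"
    return line + ";"
```

eg, dot_edge("0x8048400", "0x8048420", label="jne") gives an edge whose
label is the branch instruction.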

i added this after i put up 0.04.2 on the web.. but its a 1line change, so
fuck adding a new dir.  this isnt like 'real' versioning, but more
for the sake of it.

also changed procedure exit nodes to be of colour orange ;-)

flowgraph-0.04.1 -->

fix shitty bug in linker.py which didnt correctly handle multiple entries
in linker.rel (bad keylookup for inserts!).
anyway i now include _init/_fini in linker.rel, so you should have a
pretty full view of control flow at this point.

flowgraph-0.04 -->

includes direct rt linking requirements.. it does not calculate any
further dependencies currently (i will pull in some other code to do
this).  the edge colour is purple connecting to PROGRAM :)

flowgraph-0.03 -->

Latest addition, is the use of plt information for symbol resolution.
Its not in nice form atm.. I want the PLT entries to be clustered, but it
means organizing the earlier construction to delay outputting graph
data.. this is being worked on.

	call graph (cg)
	control flow graph (cfg)
	static code (library) recognition (.o and .so should work)
	function pointer gathering from static code recognition (configurable)
		--> as feedback into the grapher
	symbol resolution for plt entries

	clustering (bounding boxes) of procedures.
	red edges are inter procedural control flow.
	black lines are intra procedural control flow.
	green lines are for information nodes (ie, procedures, entry point)
	yellow lines indicate control flow discovered through a function ptr.
	brown nodes are resolved symbols
	blue nodes are procedure entry points (present for a while, just noted)
	purple lines indicate runtime linking dependencies

example usage.. 

flowgraph.py binary library1 [ library2 ... ]

	$ main.py gg /usr/lib/crt1.o > gg.dot
	$ dot -Tps gg.dot -ogg.ps

eg, crt1.o is where it pulls in _start.  change linking.rel to configure
what symbols represent function pointers.
dot is by at&t bell labs.

--
Silvio
