Expense Classification

Hey @jchiu,

I use ledger for expense categorization. I have a predefined set of categories, as you can imagine, and a whole bunch of past expenses that have already been categorized. I want to run new expenses through a categorization algorithm that does a decent job of finding the right categories.

Note that a lot of expenses have only a few words, so categorizing them can be challenging. Also, I’d like to attach a few words to categories myself, and those should ideally carry a higher weight than the other trained words. Alternatively, I could just hard-code them as the top result and let a non-weighted algorithm run its course, picking the top few categories from it as well.
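For illustration, the hard-coded alternative could look roughly like the Python sketch below. The names `suggest`, `overrides` and `fake_rank` are made up for this example, and `fake_rank` stands in for whatever classifier returns categories best-first; the sample categories are invented too.

```python
def suggest(description, overrides, rank, limit=3):
    """Hand-picked categories come first, then the classifier's ranking fills the rest."""
    lowered = description.lower()
    picks = []
    # Hard-coded keywords always win, in the order they were defined.
    for keyword, category in overrides.items():
        if keyword in lowered and category not in picks:
            picks.append(category)
    # Then append the classifier's suggestions, skipping duplicates.
    for category in rank(description):
        if category not in picks:
            picks.append(category)
    return picks[:limit]


# Hypothetical stand-in for a trained, non-weighted classifier.
def fake_rank(description):
    return ["Expenses:Dining", "Expenses:Transport", "Expenses:Groceries", "Expenses:Misc"]


overrides = {"uber": "Expenses:Transport"}
result = suggest("UBER EATS 1234", overrides, fake_rank)
```

The classifier stays untouched this way, so the manual keywords never distort what it learns from past expenses.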

Which algorithm is best suited for this job? Naive Bayesian Classification with TF-IDF support? Any others?

Your features could be unigrams and bigrams. You can try TF-IDF, but for a start I think you can just dump the raw features in as is. It should be quite convenient to use scikit. It has preprocessors like this. Naive Bayes is a good one to try first.
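To make that concrete, here is a minimal sketch of unigram and bigram features fed into a multinomial Naive Bayes with add-one smoothing. It is plain Python with only the standard library, not scikit, and the class, function names and sample expenses are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def features(description):
    """Lowercased unigrams plus adjacent-word bigrams."""
    words = description.lower().split()
    bigrams = [a + " " + b for a, b in zip(words, words[1:])]
    return words + bigrams


class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()              # category -> number of expenses
        self.feat_counts = defaultdict(Counter)  # category -> feature -> count
        self.vocab = set()                       # every feature seen in training

    def train(self, description, category):
        self.doc_counts[category] += 1
        for f in features(description):
            self.feat_counts[category][f] += 1
            self.vocab.add(f)

    def rank(self, description):
        """Return categories sorted from most to least likely."""
        total_docs = sum(self.doc_counts.values())
        # Features never seen in training carry no information, so skip them.
        known = [f for f in features(description) if f in self.vocab]
        scores = {}
        for cat, n_docs in self.doc_counts.items():
            counts = self.feat_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in known:
                score += math.log((counts[f] + 1) / denom)  # add-one smoothing
            scores[cat] = score
        return sorted(scores, key=scores.get, reverse=True)


# Made-up sample data, for illustration only.
model = NaiveBayes()
model.train("whole foods market", "Expenses:Groceries")
model.train("trader joes", "Expenses:Groceries")
model.train("shell gas station", "Expenses:Auto:Fuel")
model.train("chevron gas", "Expenses:Auto:Fuel")
top = model.rank("whole foods")
```

Returning the full ranking rather than a single label makes it easy to show the top few categories for short, ambiguous descriptions.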

To add words to categories, you can augment your dataset. You can give extra weight to the rows you add.
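One way to read "extra weight" is that a hand-added row counts as several ordinary rows. A rough sketch in plain Python, assuming a simple word-count Naive Bayes (the class name, the `build` helper, the weight of 10 and the sample rows are all invented for this example):

```python
import math
from collections import defaultdict


class WeightedNaiveBayes:
    """Word-count Naive Bayes where each training row carries a weight."""

    def __init__(self):
        self.row_weight = defaultdict(float)                        # category -> summed row weights
        self.word_weight = defaultdict(lambda: defaultdict(float))  # category -> word -> summed weight
        self.vocab = set()

    def train(self, description, category, weight=1.0):
        self.row_weight[category] += weight
        for word in description.lower().split():
            self.word_weight[category][word] += weight
            self.vocab.add(word)

    def best(self, description):
        total = sum(self.row_weight.values())
        words = [w for w in description.lower().split() if w in self.vocab]
        scores = {}
        for cat, cat_weight in self.row_weight.items():
            counts = self.word_weight[cat]
            denom = sum(counts.values()) + len(self.vocab)
            scores[cat] = math.log(cat_weight / total) + sum(
                math.log((counts.get(w, 0.0) + 1.0) / denom) for w in words
            )
        return max(scores, key=scores.get)


def build(manual_weight):
    """Train on made-up history plus one hand-added row with the given weight."""
    model = WeightedNaiveBayes()
    model.train("costco wholesale", "Expenses:Groceries")
    model.train("costco wholesale", "Expenses:Groceries")
    model.train("costco gas", "Expenses:Auto:Fuel")
    # The augmented row: "costco" should map to Household.
    model.train("costco", "Expenses:Household", weight=manual_weight)
    return model


unweighted = build(1.0).best("costco")
weighted = build(10.0).best("costco")
```

With weight 1 the historical rows still win for "costco"; with weight 10 the hand-added row takes over. Note that the weight also raises the category's prior, so very large weights can pull unrelated expenses toward that category.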

Sounds like a fun project! Hope this helps.

I’m not using Python. I wonder if there is an equivalent of scikit in Go. There’s a library in Go that does TF-IDF:

This lib does Naive Bayes + TF-IDF. Any thoughts about this? Should I also generate bigrams?


Update: Actually, it looks like this is working really well. In a couple of hours, with a crying baby in hand, I was able to build a much better expense categorization tool than this thing that I had been using and banging my head against for months! It’s annoying that people don’t improve their tools.

I think as long as it works fine, there is no need to go for bigrams.

I know you don’t like Python. I agree it is really bad for production (just like JavaScript), but it is great for one-off, quick-and-dirty jobs such as “build the model, save it to a file”, after which you can use Go / C++ to apply the model.

I’m building the model and just keeping it in memory. The time it takes to build the model is almost negligible. The other thing I’m going to try (not sure if the library supports it) is to keep training the model as new expenses are being categorized.
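Since a count-based Naive Bayes model is just a set of counters, training it as you go amounts to incrementing those counters each time a category is confirmed. A minimal Python sketch of the idea (not the Go library’s API; the class name and sample expenses are made up):

```python
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """In-memory Naive Bayes that can learn one expense at a time."""

    def __init__(self):
        self.rows = Counter()              # category -> expenses seen
        self.words = defaultdict(Counter)  # category -> word -> count
        self.vocab = set()

    def learn(self, description, category):
        # Learning is only counter updates, so it is cheap to do after every expense.
        self.rows[category] += 1
        for word in description.lower().split():
            self.words[category][word] += 1
            self.vocab.add(word)

    def predict(self, description):
        if not self.rows:
            return None  # nothing learned yet
        total = sum(self.rows.values())
        known = [w for w in description.lower().split() if w in self.vocab]

        def score(cat):
            counts = self.words[cat]
            denom = sum(counts.values()) + len(self.vocab)
            return math.log(self.rows[cat] / total) + sum(
                math.log((counts[w] + 1) / denom) for w in known
            )

        return max(self.rows, key=score)


model = OnlineNaiveBayes()
model.learn("safeway store", "Expenses:Groceries")

# A new expense arrives: the model guesses, the user corrects it, the model learns.
before = model.predict("lyft ride")
model.learn("lyft ride", "Expenses:Transport")
after = model.predict("lyft ride airport")
```

The first guess can only be Groceries because that is the one category seen so far; after the correction is fed back, a similar expense lands in Transport.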

I used to use Python for scripts, but have largely moved away from that. I think Go is a great language for building long-lasting, maintainable command-line tools. Compilation is a great help to me when coding: mostly, by the time my code compiles, it works. That’s never the case with Python. Also, refactoring in Python sucks, and I refactor these tools quite a lot as things become clear and I think of better ways to achieve the same functionality and add more on top. Overall, I’m no longer using Python for anything.

Hi,

Over the Christmas holiday I finally got my Python script up and running that I’d been thinking about for a while. It does pretty much what you want to do. For the transaction description I use one-hot encoding with sklearn naive Bayes against existing expenses. The probabilities for each output class are then fed as features into a random forest classifier, along with transaction features such as amount, debit/credit, weekday (0-6) and whether it took place on a weekend.
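As a rough illustration of the second stage, a feature row could be assembled as below. This only shows how the inputs are combined, not the actual script; the function name, the field order and the Monday-is-0 weekday convention are assumptions, and the probabilities are made-up numbers. The resulting rows would then go to a random forest (in scikit-learn that would be `RandomForestClassifier`).

```python
from datetime import date


def transaction_features(class_probs, amount, is_debit, when):
    """Combine first-stage class probabilities with simple transaction features."""
    weekday = when.weekday()  # Python convention: Monday is 0, Sunday is 6
    return list(class_probs) + [
        abs(amount),                      # transaction amount
        1.0 if is_debit else 0.0,         # debit/credit flag
        float(weekday),                   # weekday, 0-6
        1.0 if weekday >= 5 else 0.0,     # weekend flag (Saturday or Sunday)
    ]


# Made-up first-stage probabilities for three categories.
probs = [0.7, 0.2, 0.1]
row = transaction_features(probs, 42.5, True, date(2018, 1, 6))  # a Saturday
```

Feeding probabilities rather than the raw text into the second model keeps its input small and lets it weigh the text evidence against amount and timing.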

For my expenses this usually gets me some 100% matches, and usually a suitable account in the top 2-3. Input is OFX files; output is entries written directly into the GnuCash SQLite database, but text output is planned.