
[{"content":"","date":"27 March 2026","externalUrl":null,"permalink":"/","section":"\u003e kin stash","summary":"","title":"\u003e kin stash","type":"page"},{"content":"","date":"27 March 2026","externalUrl":null,"permalink":"/tags/gannett/","section":"Tags","summary":"","title":"Gannett","type":"tags"},{"content":"","date":"27 March 2026","externalUrl":null,"permalink":"/tags/ocr/","section":"Tags","summary":"","title":"OCR","type":"tags"},{"content":"","date":"27 March 2026","externalUrl":null,"permalink":"/tags/okapi/","section":"Tags","summary":"","title":"Okapi","type":"tags"},{"content":" I needed to fix scannos in tens of thousands of line-based text files, so I built a tool called Okapi on top of ripgrep to let me find them in context and fix them in bulk using my text editor. Install it with homebrew.\nThe project is digitizing tens of thousands of pages of US Government employee data. It\u0026rsquo;s called the Official Register, and there are over 100 volumes spanning 150 years. I\u0026rsquo;ve had great success with olmOCR, as it\u0026rsquo;s far more accurate than vanilla Tesseract. But it still generates many, many scannos.\nDouble-U Double-U III # An example of a really common character sequence produced by the OCR which is almost never right is III. In a few rare cases, that really does mean that the person has the same name as his father and grandfather. But the text I was working on looked like this:\nThat means that Richard worked for the Isthmian Canal Commission as a Painter for 68¢/hr. He was born in Illinois, was appointed to his position (hired) in the 11th Congressional District of Illinois, county of Will, and stationed in the Panama Canal Zone. (This is from page 144 of the 1909 edition, volume 1.)\nAs you can see, there can be an awful lot of \u0026ldquo;vertical line\u0026rdquo; characters! This is a bit of a stress test for any OCR, and the image quality didn\u0026rsquo;t help matters. 
Suffice it to say, III in my dataset is almost always a scanno. Here are just a few examples:\nOCR text Corrected text RosedaleIII RosedaleIll Rock IslandArsnIllIII RockIslandArsnlIll RockIsland ArsmllIII RockIslandArsnlIll IIIllaSalle IllLaSalle 2IIIISangamon 21IllSangamon IIIpr Hlpr NIIIVS NHDVS And there are many more permutations. Really, III is an indicator that something is wrong with the text. Just blindly replacing it with Ill would fix some cases, but it would often generate incorrect text and would actually serve to make other scannos lurking nearby that much harder to find.\nFinding strings of interest by regex using ripgrep was good, but it was still much too slow. Sometimes there are hundreds of matches, I couldn\u0026rsquo;t be opening these files one at a time to figure out the context and edit/save a line or two. I needed the precision of regex combined with the power of a text editor.\nEnter Okapi:\n$ okapi III $ okapi \u0026#34;Dan[^l ]\\b\u0026#34; # Should probably be the abbreviation \u0026#34;Danl\u0026#34; $ okapi \u0026#34;Mich\\wl\u0026#34; -e \u0026#34;Michel\u0026#34; # Pass an exclude pattern $ okapi Fli -c ..15 # Restrict matches to the first 15 chars Now I can edit similar errors across files without needing to know exactly what the replacement string is. I can have a page of matches visible at a glance. I can also use the full power of multi-select. Asciinema demos notwithstanding, I am using Sublime\u0026rsquo;s multi-select features to grab every instance of some subset pattern and change them all at once. Then I can surgically find and change another subset. 
Once I\u0026rsquo;m done, everything is saved back to disk.\nHere\u0026rsquo;s what the edit buffer looks like:\n# --- Begin editable lines --- A 76 ▓ — Richd G, IsthCnlCmsn Pntr $0.65ph Ill 11Ill Will CnlZ B 22 ░ Richd G, IsthCnlCmsn Clk $125pm Eng 5GaFulton CnlZ B 40 ░ Mrs Rosa J, Treas PrtnrsAsst Engrv\u0026amp;Prntg $1.50pd DC DC DC # --- File Aliases --- # A = /Users/nick/offreg/olmocr/1909/150-column_0.md # B = /Users/nick/offreg/olmocr/1909/708-column_1.md I was inspired by git\u0026rsquo;s interactive rebase interface. The alias letters in the first column tie the line to its file. Then there\u0026rsquo;s the line number and a separator character which also visually breaks the lines up by file. The aliases are limited to 3 uppercase alpha characters right now, but that still gives one over 18,000 files. That ought to be enough for anybody!\nText, Meet Image # Having all the matching lines in a single buffer is a major step forward, but in my case it wasn\u0026rsquo;t enough. I also needed a way to show the original image right under the text for a given line. That would let me see the ground truth with maximum efficiency. For example, Wrn should clearly be Wm. But something like the prefix Fli is ambiguous. In some cases, this should be Eli (Elizabeth). But equally correct might be Fli (Flin) or Ell (Ella). Context is key.\nHere\u0026rsquo;s the strategy I used:\nEach column image (OR pages have 2 columns) is run through Tesseract. Yes, this is a second, lower-quality OCR. But it has one major advantage: the line results come with bounding boxes. The Tesseract data is precomputed and saved to a JSON file next to the image. I\u0026rsquo;ve built another Rust tool which, given a text file in the corpus and a line number, does a fuzzy match between the contents of that line and all the Tesseract lines. It normalizes the strings first and uses a trigram match, identifying the best match. 
Once it has the bounding box, it loads the column image and returns the region of interest as a UUEncoded string. I have a Sublime plugin which calls the Rust tool every time the cursor changes line in a Markdown file in the target directory, or an Okapi buffer file. The plugin displays the image in an HTML overlay. And guess what? It works!\nFeedback welcome # If you\u0026rsquo;d like more information, please get in touch. I\u0026rsquo;m happy to accept Issues or PRs in the Okapi repo. The database of names isn\u0026rsquo;t online yet, but that is coming soon!\n","date":"27 March 2026","externalUrl":null,"permalink":"/post/6/okapi-or-what-if-ripgrep-could-edit/","section":"Posts","summary":"","title":"Okapi, or “What if \u003ccode\u003eripgrep\u003c/code\u003e Could Edit?”","type":"post"},{"content":"","date":"27 March 2026","externalUrl":null,"permalink":"/post/","section":"Posts","summary":"","title":"Posts","type":"post"},{"content":"","date":"27 March 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":" Memo-what? # The term memoization can sound intimidating at first. But it boils down to a very simple concept: only do hard things once.\nImagine you have a function that takes awhile to complete. 
This might be because it requires many CPU cycles or because it needs to fetch something from the network.\nuse log::{info, LevelFilter}; use simple_logging; fn main() { simple_logging::log_to_stderr(LevelFilter::Info); for _ in 1..=5 { let res = slow_loris(); info!(\u0026#34;Result: {:?}\u0026#34;, res) } } fn slow_loris() -\u0026gt; i32 { // For brevity, I\u0026#39;ve made a macro to sleep for n seconds sleep!(3); 42 } With 5 calls, each lasting 3 seconds, this takes 15 seconds to fetch all the results:\n[00:00:03.005] (1f6388c00) INFO Result: 42 [00:00:06.010] (1f6388c00) INFO Result: 42 [00:00:09.014] (1f6388c00) INFO Result: 42 [00:00:12.015] (1f6388c00) INFO Result: 42 [00:00:15.018] (1f6388c00) INFO Result: 42 But we know that the function will always return the same result! We don\u0026rsquo;t want to have to wait for it to calculate the same answer again. How can we speed this up?\nWell, we could save the result, right? Wrap the function in some kind of if check and create an Optional variable that we populate the first time through. Then, when we enter the function, we can check the variable and, if it has a value, just skip the function entirely and return the cached value.\nCongratulations! You now understand function memoization.\nThe cached crate # Let\u0026rsquo;s install the cached crate and add a macro to our slow_loris function:\nuse cached::proc_macro::once; // ... #[once] fn slow_loris() -\u0026gt; i32 { sleep!(3); 42 } And that\u0026rsquo;s all it takes to bring our 15 second runtime down to 3 seconds:\n[00:00:03.005] (1f6388c00) INFO Result: 42 [00:00:03.005] (1f6388c00) INFO Result: 42 [00:00:03.005] (1f6388c00) INFO Result: 42 [00:00:03.005] (1f6388c00) INFO Result: 42 [00:00:03.005] (1f6388c00) INFO Result: 42 The first time the function is called, it will be run and the result will be cached. 
After that, the cached result will always be returned.\nHandling arguments # But often, your function will have one or more arguments which affect the result. Memoization can still be useful if the function is called frequently with repeated arguments. Enter the #[cached] macro. By default, this macro will generate a cache key using the value of all function arguments. So let\u0026rsquo;s tweak our sample program slightly and pass an argument.\nfn main() { simple_logging::log_to_stderr(LevelFilter::Info); for _name in [\u0026#34;a\u0026#34;, \u0026#34;b\u0026#34;, \u0026#34;c\u0026#34;] { for i in 1..=5 { let result = process(i); info!(\u0026#34;Result: {:?}\u0026#34;, result) } } } #[cached] fn process(i: i32) -\u0026gt; i32 { sleep!(i); 15 * i } Now, cached assumes that the argument will affect the return value and inserts each result in the cache under the value of the argument.\n[00:00:01.003] (1f6388c00) INFO Result: 15 [00:00:03.005] (1f6388c00) INFO Result: 30 [00:00:06.010] (1f6388c00) INFO Result: 45 [00:00:10.011] (1f6388c00) INFO Result: 60 [00:00:15.014] (1f6388c00) INFO Result: 75 [00:00:15.014] (1f6388c00) INFO Result: 15 \u0026lt;--- cache kicks in [00:00:15.014] (1f6388c00) INFO Result: 30 [00:00:15.014] (1f6388c00) INFO Result: 45 [00:00:15.014] (1f6388c00) INFO Result: 60 [00:00:15.014] (1f6388c00) INFO Result: 75 [00:00:15.015] (1f6388c00) INFO Result: 15 [00:00:15.015] (1f6388c00) INFO Result: 30 [00:00:15.015] (1f6388c00) INFO Result: 45 [00:00:15.015] (1f6388c00) INFO Result: 60 [00:00:15.015] (1f6388c00) INFO Result: 75 And what if we want to key results on some of the function arguments but not others? We can do that too, using the macro\u0026rsquo;s convert attribute. 
Now the function will only be run if it\u0026rsquo;s called with a value for i that it has never seen before:\nfn main() { simple_logging::log_to_stderr(LevelFilter::Info); for name in [\u0026#34;a\u0026#34;, \u0026#34;b\u0026#34;, \u0026#34;c\u0026#34;] { for i in 1..=5 { let result = process(i, name); info!(\u0026#34;Result: {:?}\u0026#34;, result) } } } #[cached(key = \u0026#34;i32\u0026#34;, convert = \u0026#34;{i}\u0026#34;)] fn process(i: i32, name: \u0026amp;str) -\u0026gt; i32 { println!(\u0026#34;Processing: {} \u0026#39;{}\u0026#39;\u0026#34;, i, name); sleep!(i); 15 * i } Working with functions that return Result # By default, cached saves the return value of your function exactly. But if your function returns a Result that you want to unwrap, use the result = true attribute. This will check return values before they get added to the cache and discard Err results. That way, if a given set of arguments generates an Err, the function will try again the next time through.\nOther ways to use cached # All of the above caches are in-memory, but you can also have cached store its cache on disk or in a Redis server. And the cache is by default unbounded, meaning it will store anything you give it until the program exits—or runs out of memory, falls over, and then exits! To set a size limit on the cache, use a SizedCache like so:\n#[cached( ty = \u0026#34;SizedCache\u0026lt;(i32, i32), i32\u0026gt;\u0026#34;, create = \u0026#34;{ SizedCache::with_size(10) }\u0026#34; )] fn multiply(i: i32, j: i32) -\u0026gt; i32 { println!(\u0026#34;Processing: {}*{}\u0026#34;, i, j); sleep!(j); i * j } Handling multiple threads or tasks # If you use #[cached] on a function but then call it from multiple threads or tasks, you\u0026rsquo;ll likely get incorrect results. Execution can be interrupted by a second call to the function, and then there will be a race condition to enter the result in the cache. 
Fortunately, making sure that results are entered in the order the function was called just requires using the sync_writes attribute:\n#[cached( key = \u0026#34;i32\u0026#34;, convert = \u0026#34;{i}\u0026#34;, sync_writes = true )] fn process(i: i32, name: \u0026amp;str) -\u0026gt; i32 { println!(\u0026#34;Processing: {} \u0026#39;{}\u0026#39;\u0026#34;, i, name); sleep!(i); 15 * i } When not to use memoization # Note that doing this sort of caching could be a performance bottleneck if the function takes a long time to complete and is called with many different inputs (generating many cache misses). The locking needed for sync_writes could also be slower than doing it yourself if the actual critical region is smaller than the whole function.\nIn the end, performance is an empirical art. Humans aren\u0026rsquo;t very good judges of what a computer will be able to do quickly, so it\u0026rsquo;s important to use benchmarking to figure out how changes are affecting your code\u0026rsquo;s performance. Here is a good introduction to the criterion benchmarking library.\nThe good news is that cached makes memoizing a function very quick. If your benchmarks tell you it didn\u0026rsquo;t help, or actually slowed things down, you won\u0026rsquo;t have wasted very long setting it up!\n","date":"20 March 2026","externalUrl":null,"permalink":"/post/5/caching-expensive-functions-in-rust/","section":"Posts","summary":"","title":"Caching Expensive Functions in Rust","type":"post"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/performance/","section":"Tags","summary":"","title":"Performance","type":"tags"},{"content":"","date":"20 March 2026","externalUrl":null,"permalink":"/tags/rust/","section":"Tags","summary":"","title":"Rust","type":"tags"},{"content":"I needed to add a custom font to a project today and had real trouble finding how to do it. There were lots and lots of iOS tutorials, with this being the best of them. 
But I was doing a Mac OS app, not an iOS app, and so the tutorial didn\u0026rsquo;t quite fit my needs.\nIn the end, a coworker helped me figure out what I was doing wrong. So here\u0026rsquo;s a quick set of steps, based on the Code with Chris link above.\nInclude the fonts in your Xcode project.\nMake sure they\u0026rsquo;re included in the target you\u0026rsquo;re building.\nDouble-check that they\u0026rsquo;re being copied as resources into the product bundle.\nInclude the folder with your fonts in the app Plist. On Mac OS you need to use ATSApplicationFontsPath instead of UIAppFonts. Important: Don\u0026rsquo;t think you can leave this out just because your fonts aren\u0026rsquo;t in a subfolder! Even if they\u0026rsquo;re loose in Resources, you need to supply a path. Just use \u0026ldquo;.\u0026rdquo; in that case.\nFind the name of the font. On OS X, you can use this line:\nNSLog(@\u0026#34;%@\u0026#34;,[[NSFontManager sharedFontManager] availableFontFamilies]); Use NSFont and NSAttributedString to create a string using the font:\nNSFont *font = [NSFont fontWithName:@\u0026#34;MyFont\u0026#34; size:20.0]; NSDictionary *attributes = @{NSFontAttributeName : font}; NSAttributedString *attString = [[NSAttributedString alloc] initWithString:@\u0026#34;But a virgin Wurlitzer heart never once had a song\u0026#34; attributes:attributes]; ","date":"13 August 2014","externalUrl":null,"permalink":"/post/4/adding-a-custom-font-to-an-xcode-project-for-mac-os/","section":"Posts","summary":"","title":"Adding a custom font to an Xcode project for Mac OS","type":"post"},{"content":"","date":"13 August 2014","externalUrl":null,"permalink":"/tags/fonts/","section":"Tags","summary":"","title":"Fonts","type":"tags"},{"content":"","date":"13 August 2014","externalUrl":null,"permalink":"/tags/xcode/","section":"Tags","summary":"","title":"Xcode","type":"tags"},{"content":"Today I wanted to call a subprocess and get its output. 
Something like this:\narguments = [\u0026#39;git\u0026#39;, \u0026#39;log\u0026#39;, \u0026#39;-1\u0026#39;, \u0026#39;--pretty=format:\u0026#34;%ct\u0026#34;\u0026#39;, self.path] timestamp = check_output(arguments) This worked great when I was staying in the source tree of a single git project. However, as soon as I asked git to get the log of a file outside the current source tree, it returned nothing. Clearly, I needed to change the CWD first.\nIn looking at the docs, I found a cwd argument on Popen, but not check_output. But this helpful post suggested that I could still pass the cwd argument because check_output and friends use Popen underneath. And it works! Thanks to Shrikant for the tip. So the final code looks like this:\narguments = [\u0026#39;git\u0026#39;, \u0026#39;log\u0026#39;, \u0026#39;-1\u0026#39;, \u0026#39;--pretty=format:\u0026#34;%ct\u0026#34;\u0026#39;, os.path.basename(self.path)] timestamp = check_output(arguments, cwd=os.path.dirname(self.path)) ","date":"14 April 2014","externalUrl":null,"permalink":"/post/3/changing-cwd-while-using-subprocess-convenience-methods/","section":"Posts","summary":"","title":"Changing \u003ccode\u003eCWD\u003c/code\u003e while using \u003ccode\u003esubprocess\u003c/code\u003e convenience methods","type":"post"},{"content":"","date":"14 April 2014","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"Recently I was looking for a Python class that would let me easily store and manipulate a list of file paths. I pretty quickly found distutils.filelist, which sounded like exactly what I needed, so I read the docs… and came away with no clear understanding of how to actually USE the class. Furthermore, Googling turned up nothing either.\nThe good news is that the class is quite compact. (Here is the source.) And since it\u0026rsquo;s so simple, I was quickly able to work out how to use it. 
The key thing to know is that there are two variables: allfiles and files. The first contains everything, and you fill it up either using findall() or set_allfiles(). Once you have some files in allfiles, you can include them in the reduced list of files via include_pattern(). After you\u0026rsquo;ve selected some items, just read the files property.\nIf instead you only want to remove files/paths matching a pattern, you can put items into files directly and then cull the ones you don\u0026rsquo;t want with exclude_pattern(). Of course, you can also combine the two methods.\nSo this is the FileList example that the docs should contain, but don\u0026rsquo;t:\nimport distutils.filelist as fl fileList = fl.FileList() fileList.findall() # By default, uses CWD fileList.include_pattern(\u0026#39;*.xib\u0026#39;, anchor=False) print fileList.files This will recursively look through all the files in the current directory and pull out the .xibs. What if the files aren\u0026rsquo;t on disk, but instead have come from a zip file? No problem:\nimport distutils.filelist as fl fileList = fl.FileList() zipfile = ZipFile(StringIO(someZipData)) fileList.extend(zipfile.namelist()) # Extends \u0026#39;files\u0026#39;, not \u0026#39;allfiles\u0026#39; fileList.exclude_pattern(\u0026#39;__MACOSX/*\u0026#39;) # Ignore OS X attribute directories fileList.exclude_pattern(\u0026#39;\u0026#39;, prefix=\u0026#39;.\u0026#39;) # Ignore hidden files print fileList.files I hope these examples save you some time!\n","date":"21 March 2014","externalUrl":null,"permalink":"/post/2/using-pythons-filelist-class/","section":"Posts","summary":"","title":"Using Python's \u003ccode\u003eFileList\u003c/code\u003e class","type":"post"},{"content":"I admit it, I\u0026rsquo;m a stickler for pretty code. One thing that offends my sense of aesthetics is when the #import block at the top of a .m file is all ragged. Much better to have them sorted by line length. But what about comments? 
And shouldn\u0026rsquo;t lines of the same length be sorted alphabetically?\nI had a script to do this in Xcode 3, but then Xcode 4 came along and did away with scripts. But that\u0026rsquo;s because there\u0026rsquo;s now a way to do it with Automator. Follow these instructions:\nhttp://stackoverflow.com/questions/8103971/sort-lines-in-selection-for-xcode-4\nAnd then set the Shell to /usr/bin/perl and put this code in the Automator \u0026ldquo;Run Shell Script\u0026rdquo; action:\nmy @l; my $chomped = 0; sub trim { ($trimmed, @drop) = split(q-//-, $_[0]); $trimmed =~ s/\\s+$//; return $trimmed; } while (\u0026lt;\u0026gt;) { $l[$.] = $_; } # Remove the last line if it\u0026#39;s just a newline if (length($l[$#l]) == 1) { $chomped = 1; pop(@l); } @sorted = sort { length trim($a) \u0026lt;=\u0026gt; length trim($b) or lc($a) cmp lc($b) } @l; print @sorted; if ($chomped) { print \u0026#34;\\n\u0026#34;; } Finally, visit  \u0026gt; System Preferences… \u0026gt; Keyboard \u0026gt; Keyboard Shortcuts \u0026gt; Services and assign your new service a shortcut. (I chose Cmd-Opt-Ctrl-S.) And away you go.\n","date":"16 October 2012","externalUrl":null,"permalink":"/post/1/sort-lines-by-length-and-then-alphabet-in-xcode/","section":"Posts","summary":"","title":"Sort lines by length and then alphabet in Xcode","type":"post"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}]