Nerd post.
I can now run LTX-2 (a new open-source video generator released 5 days ago) locally on my aging laptop with 6 GB of VRAM. On paper, this shouldn’t be possible, as the model is designed to run on a 24 GB card.
Open source is the way. Fuck those data centers.
Is it Sora or Veo? No—but doing it this way avoids censorship, which is important to me. And yes, I vibe-coded the batch file to get this running on this potato. Yes, I could have manually installed every dependency and banged my head over every error, but this literally took 15 minutes to get up and running. This is not supposed to run on my system. Like, at all. The very minimum experimental profile is Profile 5, which is:
Profile 5 – VeryLowRAM_LowVRAM (Fail Safe):
At least 24 GB of RAM and 10 GB of VRAM. If you don’t have much, it
won’t be fast, but maybe it will work.
I don’t even have 24 GB of RAM (only 16), and my 6 GB of VRAM falls well short of the 10 GB floor.
Speaking of which, my specs are a Ryzen 7 3750H with an NVIDIA GTX 1660 Ti Max-Q. Not even an RTX card, which doubly shouldn’t run this.
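For the curious, the trick that makes a 24 GB-class model limp along on a 6 GB card is offloading: the weights sit in system RAM (or stream from disk) and only the piece currently computing gets shuttled onto the GPU. I honestly don’t know exactly what my vibe-coded batch file does under the hood, but here’s a minimal sketch of the idea using the older LTX-Video pipeline in diffusers as a stand-in; the model ID, frame count, and settings are assumptions, and LTX-2 itself may ship with a different interface.

```python
# Minimal sketch of low-VRAM offloading, NOT my exact setup.
# Uses diffusers' LTX-Video pipeline as a stand-in for LTX-2.
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",      # assumed repo; LTX-2 weights may live elsewhere
    torch_dtype=torch.bfloat16,  # as in the docs; older cards may prefer float16
)
# Stream each submodule onto the GPU only while it runs; everything else
# waits in system RAM. Slow, but it's what lets 6 GB of VRAM survive.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()         # decode the video in tiles to cap VRAM spikes

frames = pipe(
    prompt="A cartoon rabbit strolls into a tiny shop, toon-shaded 3D render",
    width=512,
    height=512,
    num_frames=121,              # ~5 s at 24 fps (LTX likes 8k+1 frame counts)
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "clip.mp4", fps=24)
```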
The first video was a test run. The audio is an actual clip I uploaded to see how well it would sync—pretty well, I’d say. The eyes are a little weird, but overall it’s impressive. It’s a 3-second video that was originally 512×512 at 24 fps; I upscaled it to 1024×1024 at 30 fps with Topaz (which took less than a minute). Total generation time: 28 minutes.
The second video was to see if it could handle something longer. It’s 5 seconds, 512×512 at 24 fps, same generation time: 28 minutes. That’s actually an improvement, since the clip is longer; I suspect the model still being loaded in memory saved some time. This was supposed to be Bugs walking into a pot dispensary—it kind of works. I should’ve focused more on the camera moving in close on the buds to see if it would censor anything. What I find promising so far is how close it got to Bugs’ voice. What’s even crazier is that it pulled all of this not from the internet, but locally, on my 5-year-old laptop, from less than 40 GB of models. If the apocalypse comes, I can still crank out new episodes of Looney Tunes.
Before I go any further: I am not some stoner dude-bro who always talks about pot. This was purely for *cough* research.
The third video was a test of cussing censorship. None of that here, lol—and it sounds very distinctly like Bugs. Maybe not 100%, but passable, I’d say. The reason these videos look so similar so far is that the prompt hasn’t changed much. It’s a detailed prompt, too, which is one thing about this model: it wants specifics. This one also took 28 minutes.
For the fourth video, I tried to push out a 10-second clip. This one extends the cussing to make up for the longer runtime. I also wanted a shot focusing on bud containers, and I put a label on one jar that says “White Widow.” This was a good test for text handling. It handled the long video just fine, and impressively at a generation time of 34 minutes—double the length for only 6 more minutes of render time. That seems like the way to go for efficiency. The model can go up to 30 seconds, so I’ll have to test that eventually. Unfortunately, it totally failed the text test. The jar label looks like Cheez-Its, and the jars themselves only kind of pass as dispensary-type containers. I think I just need to be clearer in the prompt. I also didn’t specify a style for the Bugs Bunny scene, so the Pixar vibe is fine.
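Quick back-of-envelope on that efficiency point, assuming render time is roughly linear in clip length (two data points, so take it loosely):

```python
# Two data points from my runs: 5 s took 28 min, 10 s took 34 min.
# Assume time = fixed overhead + per-second cost (a rough guess).
marginal = (34 - 28) / (10 - 5)   # ~1.2 min per extra second of video
fixed = 28 - marginal * 5         # ~22 min of per-run overhead
print(f"overhead ~= {fixed:.0f} min, then ~{marginal:.1f} min per second of clip")
```

If that rough fit holds, most of a run on this card is overhead rather than per-second cost, which is why longer clips look like the efficient play.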
The fifth video used an actual image of George W. Bush giving an unlikely speech. I uploaded the image and gave a prompt describing what I wanted. 10 seconds, generation time 38 minutes, though I did close out and restart the generator. I’m not happy with the result—it sounds like if George W. and Obama had a baby. Facial animation is off, there’s weird overlay text, and the joint looks more like a banana.
For the sixth video, I ran the exact same prompt, length, and dimensions to test whether generation would be faster with the model already loaded. It came in at 37 minutes, so not much difference. I suspect uploaded images increase render time. The result was very strange. The source image isn’t great—it kind of looks like two faces—and I could probably improve the prompt. It literally starts sounding like Trump, and Bush’s facial expressions start mimicking Trump too. Kind of an interesting hallucination.
For the seventh video, I did another audio upload and let it generate the video—probably the most interesting feature. I made a custom track with some cheap online audio generators and ended up with a decent SpongeBob clip. I figured if something as IP-heavy as SpongeBob gets through, anything can. I liked the result overall, aside from sketchy eyes and wrong character voices. The voice mix-ups happen often across all major models. There is a prompt trick to fix this that I didn’t use here. This was the longest generation so far at 40 minutes.
For the eighth lol-wtf video, I ran the SpongeBob prompt again, this time using the voice-mismatch fix and messing with weights for optimization. Still had dialogue mismatch, so I’ll need another approach. The tweaking helped, though—it took 33 minutes (ding-ding). I’ll adjust further.
For the ninth video, I used the same prompt content but structured it very differently, with the same audio. I tweaked the weights more and enabled “Video Motion Ignores Background Music” to improve lip-sync, which added about a minute to render time since it separated the vocal stem. Halfway through, I realized the prompt was completely messed up—basically no prompt at all (operator error). So yeah, dialogue mismatch again. And of course, it added Patrick instead of Squidward. Still, given it was working almost entirely from audio, it wasn’t terrible. It generated in 34 minutes, including audio separation. I’m done messing with the weights—they’re probably as good as they’ll get.
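Side note on that stem separation: I have no idea what the tool actually runs internally, but splitting dialogue away from background music before lip-syncing is a standard trick, and something like Demucs does it in one command. Purely illustrative; the names below are my assumption, not what the generator necessarily uses.

```python
# Illustrative only: peel the vocal stem off a custom audio track so the
# lip-sync step only sees dialogue. Uses Demucs' two-stem mode; the video
# generator may use a different separator entirely.
import subprocess

subprocess.run(
    ["demucs", "--two-stems", "vocals", "custom_track.wav"],
    check=True,
)
# Demucs writes ./separated/<model>/custom_track/vocals.wav (dialogue)
# and no_vocals.wav (music + SFX).
```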
I fixed the prompt for video 10 (are we really doing this? God, I have no life). Hopefully this works, or we’ll be limited to one-character scenes. A half hour per run is too costly for mistakes. I was browsing the web during this one to see if it affected timing. It took 36 minutes, not much slower despite some heavier browsing. Dialogue was still busted. I think the model assigns lines of dialogue to characters left to right, so I’m going to flip their positions and test that.
Video 11: I changed the prompt so Squidward is on the right. It started correctly, then failed again. I dunno. I probably won’t use custom-audio video generation for more than two characters anyway. 34-minute generation time.
JSMTV (Just Shoot Me Twelfth Video): I grabbed the prompt from my most popular slop video (War Shiba Shoots Drone) that I used in Sora, fancied it up, and tested it here. All prompt—no images or audio uploaded. Resolution 624×832 (3:4). Unfortunately, it hung and had to be restarted. It then glitched and used my old prompt. It claims 2 hours and 24 minutes, but I can’t trust that due to the hang-up. Biggest fuck-up: it’s a similar scene, but the human was supposed to be a Shiba, and he completely misses the drone. I’ll give the model credit—the guy definitely looked Slavic.
If it weren’t for the mess-up, it would’ve been a damn good generation. The hang-up probably screwed it. It resumed generation on restart, which likely caused issues. I won’t be testing this resolution again for a while. Now that I think about it, 512 may be the max resolution this card can handle: 624×832 is roughly double the pixels of 512×512, so some slowdown is expected, but nowhere near 2+ hours for 10 seconds. That wait time is absurd. I think it thought it was rendering a longer video when it only output 10 seconds—a weird bug.
Video 13: Don’t worry, I left this running and went to touch more grass. I attempted a 30-second video with a prompt about old-school WWF wrestlers fighting over cocaine. It warned me about creating a sliding window (two videos stitched together), so maybe 15 seconds is the real limit. Woke up to see it took 3 hours and 45 minutes. Not viable. I’ll definitely keep it at 10 seconds. The video is hilarious and looks decent, but it didn’t nail any of the wrestlers despite being specific. The Undertaker kind of looks like the Undertaker… but not really. It’s clearly two similar videos stitched together. Meh.
I know these times sound terrible, and you can’t really do anything on your computer besides type in Notepad (which is why this post is so long). But honestly, you can queue videos and let them run while you’re at work or asleep—you just need planning. There’s also a continue video option, and I believe you can extend your own uploaded videos as well.
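If you go the overnight route, the planning can be as dumb as a folder of prompt files run in order. Here’s a hypothetical sketch; generate_clip is a placeholder for however you actually drive your install (batch file, UI queue, API), not a real LTX-2 entry point.

```python
# Hypothetical overnight queue: drop one .txt prompt per file into ./queue,
# kick this off before bed, collect clips in ./renders in the morning.
from pathlib import Path

def generate_clip(prompt: str, out_path: Path) -> None:
    # Placeholder: wire this up to your own launcher (not a real LTX-2 API).
    raise NotImplementedError("hook this up to however you run the model")

Path("renders").mkdir(exist_ok=True)
for prompt_file in sorted(Path("queue").glob("*.txt")):
    prompt = prompt_file.read_text().strip()
    out = Path("renders") / f"{prompt_file.stem}.mp4"
    print(f"rendering {prompt_file.name} -> {out}")
    generate_clip(prompt, out)   # ~30-40 min each on my card, so plan accordingly
```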
Overall, I’m pretty happy with the results, despite definite limitations. Stability is surprisingly good—only one hang-up out of 13 runs on this card. I think this is a great model with a lot of potential. With a better GPU, render times would drop dramatically. What I like most compared to the closed, commercial models is the level of control. Those are easier to prompt but can be hit or miss; this one demands precision but gives you far more say over the result.
I think I smell my video card.