```java
@Service
public class IngestionPipeline {

    // Split documents into ~500-token chunks with 100 tokens of overlap
    private final TokenTextSplitter splitter = new TokenTextSplitter(500, 100);

    private final VectorStore vectorStore;
    private final EmbeddingClient embeddingClient;

    @Autowired
    public IngestionPipeline(VectorStore vectorStore, EmbeddingClient embeddingClient) {
        this.vectorStore = vectorStore;
        this.embeddingClient = embeddingClient;
    }
    // (ingestion method shown further below)
```
```java
@Service
public class PdfDocumentService {

    public List<Document> parsePdfs(List<byte[]> pdfBytesList) {
        return pdfBytesList.stream()
            .flatMap(bytes -> {
                // TikaDocumentReader takes a Spring Resource, so wrap the raw bytes
                TikaDocumentReader reader = new TikaDocumentReader(new ByteArrayResource(bytes));
                return reader.get().stream(); // get() returns List<Document>
            })
            .collect(Collectors.toList());
    }
}
```
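The `GitHubPdfFetcher` referenced later in the pipeline is not shown in this section. Below is a minimal sketch, assuming public repositories served through GitHub's raw-content endpoint; the class layout and the `rawUrl`/`fetchPdf` names are illustrative assumptions, not the article's actual implementation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: fetch a single PDF from a public GitHub repo via the
// raw-content endpoint (no auth; private repos would need the REST API + token).
public class GitHubPdfFetcher {

    private static final String RAW_BASE = "https://raw.githubusercontent.com";

    // Builds the raw-content URL, e.g. "owner/repo" + "main" + "docs/guide.pdf"
    static String rawUrl(String repo, String branch, String path) {
        return RAW_BASE + "/" + repo + "/" + branch + "/" + path;
    }

    // Downloads the file's bytes; the caller parses them with Tika
    byte[] fetchPdf(String repo, String branch, String path) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(rawUrl(repo, branch, path)))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofByteArray()).body();
    }
}
```

For private repositories you would instead call the GitHub contents API with an `Authorization` header carrying the configured token.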
```java
@RestController
public class ChatController {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public ChatController(ChatClient chatClient, VectorStore vectorStore) {
        this.chatClient = chatClient;
        this.vectorStore = vectorStore;
    }

    @GetMapping("/ask")
    public String askAboutGitHubPdfs(@RequestParam String question) {
        // Retrieve the PDF chunks most similar to the question
        List<Document> relevantDocs = vectorStore.similaritySearch(question);

        // Build a system prompt that grounds the model in the retrieved context
        String context = relevantDocs.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n---\n"));

        return chatClient.call(new Prompt(
                List.of(new SystemMessage("Answer based only on: " + context),
                        new UserMessage(question))))
            .getResult().getOutput().getText();
    }
}
```
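The grounding step in the controller can be isolated into a plain-Java sketch so the context-joining behaviour is visible (and testable) without any Spring AI types; the `PromptAssembler` name is an assumption for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative helper: concatenate retrieved chunks into the grounding
// instruction used as the system message above.
public class PromptAssembler {

    static String systemPrompt(List<String> chunks) {
        // "---" separators keep chunk boundaries visible to the model
        String context = chunks.stream().collect(Collectors.joining("\n---\n"));
        return "Answer based only on: " + context;
    }
}
```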
This is an excellent topic, as it sits at the intersection of a popular framework (Spring AI), a specific resource format (PDF), and a vital developer platform (GitHub).
Below is a structured, actionable "paper" on the topic "Spring AI in Action: Leveraging PDF Data via GitHub Repositories."
```java
// Continues IngestionPipeline from above
public void indexPdfsFromGitHub(String repo, String pdfPath) {
    List<byte[]> pdfs = gitHubPdfFetcher.fetchPdfsFromRepo(repo, pdfPath);
    List<Document> rawDocs = pdfDocumentService.parsePdfs(pdfs);
    List<Document> chunkedDocs = splitter.apply(rawDocs);

    // Store the embedded chunks in the vector DB
    vectorStore.add(chunkedDocs);
}
```
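To see what the splitter's chunk-size and overlap parameters do, here is an illustrative stand-in that slides a fixed-size window with overlap over whitespace-separated words. This is an approximation for the sketch: the real `TokenTextSplitter` works on encoder tokens, not words, and the `OverlapChunker` name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sliding-window chunking with overlap, using whitespace "tokens" as a
// simplified stand-in for real tokenizer output.
public class OverlapChunker {

    static List<String> chunk(String text, int chunkSize, int overlap) {
        String[] tokens = text.split("\\s+");
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap; // each window starts `step` tokens later
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + chunkSize, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) break; // last window reached the end
        }
        return chunks;
    }
}
```

With a chunk size of 4 and an overlap of 2, each chunk repeats the last two tokens of the previous one, which is what preserves context across chunk boundaries at query time.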
```
spring-ai-pdf-github-demo/
├── src/main/java/com/example/
│   ├── config/VectorStoreConfig.java
│   ├── service/GitHubPdfFetcher.java
│   ├── service/PdfDocumentService.java
│   ├── pipeline/IngestionPipeline.java
│   └── controller/ChatController.java
├── src/main/resources/application.yml
├── docker-compose.yml    (for PGVector)
├── README.md
└── sample-pdfs/          (for testing)
```

```yaml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-ada-002
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb

github:
  token: ${GITHUB_TOKEN}
```

5. Best Practices & Troubleshooting

| Challenge | Solution |
|-----------|----------|
| Large PDFs (> 10 MB) | Use GitHub's blob API with range requests. |
| Rate limiting (GitHub API) | Implement `RetryTemplate` with exponential backoff. |
| PDFs with scanned images | Use `TikaDocumentReader` with an OCR plugin (Tesseract). |
| Token limit exceeded | Use `TokenTextSplitter` with an overlap of 100 tokens. |
| Metadata tracking | Add `Document` metadata, e.g. `put("source", pdfUrl)`, for provenance. |

6. Conclusion

The combination of Spring AI (abstractions for LLM workflows), GitHub as a document source, and PDF parsing creates a powerful enterprise knowledge-retrieval system. By following the ingestion and query patterns shown here, developers can build secure, context-aware AI applications that leverage existing documentation stored in GitHub repositories.